Umlauts in Timezones and %x/popen output on Windows Ruby


Ruby doesn’t tag ordinary strings and time zone name strings with the same encoding. With English-only text this doesn’t matter, because Latin-1 (ISO-8859-1), Windows-1252, and IBM437 all share the ASCII byte values, which are also valid UTF-8. However, once the text moves past the 26 letters of the English alphabet, things can get tricky. One such case arose with umlauts in time zone names on Windows.

Ruby strings vs Ruby timezone strings

All strings in Ruby are “UTF-8” by default, right? Well, not the time zone names. On Windows it’s:

irb(main):004:0> Time.now.zone.encoding
=> #<Encoding:IBM437>
irb(main):005:0> "Random string".encoding
=> #<Encoding:UTF-8>

Umlaut characters in similar character sets

Looking at the Latin-1, IBM 437, and Windows-1252 character encodings, Latin-1 and Windows-1252 encode the German characters of interest with the same single bytes. IBM 437 uses different byte values for them, and UTF-8 needs two bytes per character:

Character  Latin-1/ISO-8859-1  IBM 437  Windows-1252  UTF-8
Ä          0xC4                0x8E     0xC4          0xC3 0x84
Ö          0xD6                0x99     0xD6          0xC3 0x96
Ü          0xDC                0x9A     0xDC          0xC3 0x9C
ä          0xE4                0x84     0xE4          0xC3 0xA4
ö          0xF6                0x94     0xF6          0xC3 0xB6
ß          0xDF                0xE1     0xDF          0xC3 0x9F
ü          0xFC                0x81     0xFC          0xC3 0xBC

Latin-1, IBM 437, Windows-1252, and UTF-8 byte values
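
The byte values can be checked from Ruby itself. The following is a minimal sketch that starts from UTF-8 literals and transcodes each character into the other code pages. The names `UMLAUTS`, `CODE_PAGES`, and `byte_table` are made up for this illustration.

```ruby
# Characters of interest, written as UTF-8 literals.
UMLAUTS = %w[Ä Ö Ü ä ö ß ü].freeze

# Encodings to compare; all four names are known to Ruby's Encoding class.
CODE_PAGES = %w[ISO-8859-1 IBM437 Windows-1252 UTF-8].freeze

# Returns a nested hash: character => { encoding name => hex bytes }.
def byte_table(chars = UMLAUTS, encodings = CODE_PAGES)
  chars.to_h do |char|
    row = encodings.to_h do |name|
      hex = char.encode(name).bytes.map { |b| format("0x%02X", b) }.join(" ")
      [name, hex]
    end
    [char, row]
  end
end

# Print one row per character, columns in CODE_PAGES order.
byte_table.each do |char, row|
  puts "#{char}  #{row.values.join('  ')}"
end
```

Comparing the columns of the printed rows shows which encodings agree on a given character and which do not.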

Differences in Reencoding

Ruby tags the time zone string as IBM437 by default, and transcoding from IBM437 to UTF-8 (whether the encoding is forced explicitly or left implicit) does not preserve the umlaut. Relabeling the string as ISO-8859-1 (Latin-1) or Windows-1252 first does. Compare the byte sequences below, produced by converting “Mitteleuropäische Zeit” from each source encoding. The sequence [195, 164] is the correct UTF-8 encoding of “ä”. The IBM437 conversions produce [206, 163] instead, which is “Σ”, the character that code page 437 assigns to byte 228 (0xE4):

irb(main):034:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):035:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):036:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):037:0> Time.now.zone.encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]

irb(main):045:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):046:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):047:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
# no encoding force (implicit IBM437)
irb(main):048:0> Time.now.zone.encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
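
The difference between `force_encoding` and `encode` is what drives these results: `force_encoding` only changes the label on the existing bytes, while `encode` rewrites the bytes for the target encoding. Here is a minimal sketch that reproduces the session above without a German Windows machine. It assumes the zone name arrives with byte 0xE4 (228) for the umlaut, as the byte dumps suggest. `RAW_ZONE` and `decode_as` are hypothetical names.

```ruby
# Raw bytes for the zone name, with 0xE4 standing in for the umlaut.
RAW_ZONE = "Mitteleurop\xE4ische Zeit".b.freeze

# Label a copy of the raw bytes with `source`, then transcode to UTF-8.
def decode_as(raw, source)
  raw.dup.force_encoding(source).encode(Encoding::UTF_8)
end

[Encoding::IBM437, Encoding::Windows_1252, Encoding::ISO_8859_1].each do |source|
  relabeled = RAW_ZONE.dup.force_encoding(source)
  # force_encoding leaves the bytes alone; only encode changes them.
  same_bytes = relabeled.bytes == RAW_ZONE.bytes
  puts "#{source}: bytes unchanged by force_encoding? #{same_bytes}"
  puts "  as UTF-8: #{decode_as(RAW_ZONE, source)}"
end
```

The label chosen before `encode` decides which character byte 228 is taken to mean, so the same input yields different UTF-8 text.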
 

https://en.wikipedia.org/wiki/Windows-1252

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

https://en.wikipedia.org/wiki/Code_page_437

https://github.com/chef/ohai/pull/1781/files

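The workaround from the ohai pull request above relabels the string as Windows-1252 before converting it to UTF-8:

if time[:timezone].encoding == Encoding::IBM437
  time[:timezone] = time[:timezone].force_encoding(Encoding::WINDOWS_1252)
  time[:timezone] = time[:timezone].encode(Encoding::UTF_8)
end

To see how each encoding Ruby knows interprets the same bytes, try every one of them as the source:

# "Mitteleuropäische Zeit" with the umlaut as the single byte 228 (0xE4)
bytes = [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 228, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]

Encoding.list.each do |source|
  begin
    puts source
    puts bytes.pack("C*").force_encoding(source).encode("UTF-8")
  rescue EncodingError
    # this source encoding cannot be converted to UTF-8; move on
    # puts source, "failed"
  end
end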
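
The ohai workaround shown earlier can also be wrapped in a small helper so that it is applied in one place. This is only a sketch: `normalize_zone_name` is a hypothetical name, not part of ohai or Ruby, and it assumes, as the workaround does, that a zone name tagged IBM437 really contains Windows-1252 bytes.

```ruby
# Hypothetical helper: returns a UTF-8 copy of a time zone name.
# Assumption: a string tagged IBM437 actually holds Windows-1252 bytes.
def normalize_zone_name(name)
  return name if name.nil?

  source = name.encoding == Encoding::IBM437 ? Encoding::Windows_1252 : name.encoding
  # dup so the caller's string keeps its original encoding label.
  name.dup.force_encoding(source).encode(Encoding::UTF_8)
end

# Simulate what Time.now.zone returns on the affected Windows setup.
simulated = "Mitteleurop\xE4ische Zeit".b.force_encoding(Encoding::IBM437)
puts normalize_zone_name(simulated)
```

Working on a copy keeps the helper free of side effects, unlike calling `force_encoding` directly on the original string.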

Troubleshooting the environment

Running chcp.com with the German display language set prints "Aktive Codepage: 437." (with English it is the same, "Active code page: 437"), which corresponds to Encoding::IBM437.

In the Ruby console, running Encoding.locale_charmap outputs "CP437".
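
Ruby exposes several related settings, and it helps to look at them together when output looks wrong. Below is a minimal sketch that collects them. `encoding_report` is a hypothetical helper name, while every call inside it is part of Ruby's core `Encoding` API.

```ruby
# Collect the encoding settings Ruby derived from the environment.
def encoding_report
  {
    locale_charmap: Encoding.locale_charmap,     # a String, "CP437" on the setup above
    locale: Encoding.find("locale"),             # Encoding object for the locale
    default_external: Encoding.default_external, # applied to data read from IO
    default_internal: Encoding.default_internal, # nil unless configured
    filesystem: Encoding.find("filesystem"),     # used for file names
    literal: "".encoding                         # encoding of string literals here
  }
end

encoding_report.each do |setting, value|
  puts format("%-17s %s", "#{setting}:", value.inspect)
end
```

Comparing `literal` with `locale` makes a mismatch between string literals and environment-derived strings visible at a glance.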

An opposite problem with %x

A similar problem arises with the output from %x and IO.popen, but in reverse. Ruby tags the output as #<Encoding:UTF-8>, but if the output includes any of the characters “ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ” (bytes 128..154 in code page 437, which lie outside 7-bit ASCII), the string won’t match the same characters written as UTF-8. A quick way to test this is the following Ruby script:

# create a file named "ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ", which is bytes 128..154 in code page 437
FILENAME = (128..154).to_a.pack('C*').force_encoding(Encoding::IBM437)
UTF_8_FILENAME = FILENAME.encode(Encoding::UTF_8)

Dir.mkdir('tmp') unless Dir.exist?('tmp')
File.open(File.join('tmp', UTF_8_FILENAME), 'wt') do |f|
  f.puts "test"
end


# report which encodings make the command output match the UTF-8 file name
def check_encodings(output)
  puts "output encoding: #{output.encoding.inspect}"
  if output.include?(UTF_8_FILENAME)
    puts "output matches without forcing encoding"
  end

  puts "Output can be made to match by forcing the following encodings:"
  Encoding.list.each do |encoding|
    # dup so every attempt starts from the original bytes and label
    if output.dup.force_encoding(encoding).encode(Encoding::UTF_8).include?(UTF_8_FILENAME)
      puts encoding
    end
  rescue EncodingError
    # this encoding cannot be converted to UTF-8; try the next one
  end
end


command = RUBY_PLATFORM =~ /mingw|mswin/ ? 'dir tmp' : 'ls tmp'
puts
puts "%x|#{command}|"
puts "-" * (command.length + 4)
check_encodings(%x|#{command}|)
puts
puts "IO.popen(#{command}).read"
puts "-" * (command.length + 15)
check_encodings(IO.popen(command).read)

The output of this is:

 C:\Ruby31\bin\ruby.exe .\directory_test.rb

%x|dir tmp|
-----------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865

IO.popen(dir tmp).read
----------------------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865

I’m using .force_encoding to relabel the output with each encoding in Encoding.list in turn, including UTF-8, but only IBM437, CP850, and IBM865 produce a match.
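
Building on that result, one possible workaround is to relabel command output with the console's code page before converting it to UTF-8. Here is a minimal sketch, assuming the console code page matches `Encoding.locale_charmap` (CP437 in the environment above). `decode_console_output` is a hypothetical helper, not a Ruby API.

```ruby
# Hypothetical helper: treat `raw` as text in the console code page and
# return it as real UTF-8. Assumption: the console code page is the one
# Ruby reports for the locale (Encoding.find("locale")).
def decode_console_output(raw, console_encoding = Encoding.find("locale"))
  raw.dup.force_encoding(console_encoding).encode(Encoding::UTF_8)
end

# Simulate `dir` output: code page 437 bytes 128..154 mislabeled as UTF-8.
simulated = (128..154).to_a.pack("C*").force_encoding(Encoding::UTF_8)
puts simulated.valid_encoding? # these bytes are not well-formed UTF-8
puts decode_console_output(simulated, Encoding::IBM437)

# On the affected Windows setup the call might look like:
#   decode_console_output(%x|dir tmp|)
```

Passing the encoding explicitly, as in the simulation, avoids relying on the locale when the console code page is known to differ.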

Ruby bugs opened

After working through the issues to confirm they weren’t weird configuration issues, I opened two bugs for the above problems.

