Umlauts in Timezones and %x/popen output on Windows Ruby

On Windows, ordinary Ruby strings and the time zone name strings Ruby returns aren’t encoded the same way. In the US this rarely matters, because the printable ASCII characters are encoded identically in Latin-1/ISO-8859-1, Windows-1252, IBM437, and UTF-8. Once text moves past the 26 English letters of the alphabet, though, things get tricky. One such case arose with umlauts in time zone names.

Ruby strings vs Ruby timezone strings

All strings in Ruby are “UTF-8” by default, right? Well, not the time zone names. On Windows, Time.now.zone comes back tagged as IBM437:

irb(main):004:0> Time.now.zone.encoding
=> #<Encoding:IBM437>
irb(main):005:0> "Random string".encoding
=> #<Encoding:UTF-8>
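The encoding tag is only half the story, so it helps to look at the raw bytes as well. Here is a small sketch of a helper for that. The name describe_string is made up for this post and is not a Ruby API; it only uses String#encoding, String#valid_encoding? and String#bytes.

```ruby
# Hypothetical helper (not part of Ruby): summarize how Ruby sees a string.
# Returns the encoding tag, whether the bytes are valid for that tag,
# and the bytes themselves in hex.
def describe_string(str)
  hex = str.bytes.map { |b| format("%02X", b) }.join(" ")
  "#{str.encoding.name} valid=#{str.valid_encoding?} bytes=#{hex}"
end

# The zone name as reported by the operating system.
puts describe_string(Time.now.zone.to_s)

# A UTF-8 string containing U+00E4 (a with umlaut), written as an escape.
puts describe_string("\u00E4")
```

On the Windows machine from this post, the first line would be expected to start with IBM437; on other systems it will show whatever that platform reports.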

Umlaut characters in similar character sets

Looking at the Latin-1, IBM 437, and Windows-1252 character encodings, Latin-1 and Windows-1252 encode the German characters of interest identically. IBM 437 also uses a single byte for each, but a different one, and UTF-8 uses two bytes. Note that byte 0xE4, which is “ä” in Latin-1 and Windows-1252, is “Σ” in IBM 437:

Character	Latin-1/ISO-8859-1	IBM 437	Windows-1252	UTF-8
Ä	0xC4	0x8E	0xC4	0xC3 0x84
Ö	0xD6	0x99	0xD6	0xC3 0x96
Ü	0xDC	0x9A	0xDC	0xC3 0x9C
ä	0xE4	0x84	0xE4	0xC3 0xA4
ö	0xF6	0x94	0xF6	0xC3 0xB6
ß	0xDF	0xE1	0xDF	0xC3 0x9F
ü	0xFC	0x81	0xFC	0xC3 0xBC

Latin-1, IBM 437, Windows-1252, and UTF-8 byte values
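One way to check byte values like these is to let Ruby transcode each character into each character set and print the resulting bytes. This is a sketch; the constant and method names (UMLAUTS, CHARSETS, hex_bytes) are mine, and the characters are written as \u escapes so the script does not depend on the source file's encoding.

```ruby
# The seven characters of interest as Unicode escapes:
# U+00C4, U+00D6, U+00DC, U+00E4, U+00F6, U+00DF, U+00FC.
UMLAUTS = ["\u00C4", "\u00D6", "\u00DC", "\u00E4", "\u00F6", "\u00DF", "\u00FC"].freeze
CHARSETS = %w[ISO-8859-1 IBM437 Windows-1252 UTF-8].freeze

# Transcode one character into the given charset and return its bytes
# as a hex string such as "0xC3 0xA4".
def hex_bytes(char, charset)
  char.encode(charset).bytes.map { |b| format("0x%02X", b) }.join(" ")
end

puts (["Character"] + CHARSETS).join("\t")
UMLAUTS.each do |char|
  row = [char] + CHARSETS.map { |charset| hex_bytes(char, charset) }
  puts row.join("\t")
end
```

Running it prints one tab-separated row per character, in the same column order as the table.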

Differences in Reencoding

Time.now.zone is tagged IBM437, but transcoding it to UTF-8 under that tag (whether implicit or explicitly forced) does not preserve the umlaut. Relabelling the string as ISO-8859-1 (Latin-1) or Windows-1252 first does, which suggests the bytes are really in one of those encodings. Notice the byte sequences below, produced by transcoding “Mitteleuropäische Zeit” to UTF-8 from each source encoding. [195, 164] is the correct UTF-8 sequence for “ä”. The implicit and explicit IBM437 conversions produce [206, 163] instead, which is “Σ”:

irb(main):034:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):035:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):036:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):037:0> Time.now.zone.encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]

irb(main):045:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):046:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):047:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
# no encoding force (implicit IBM437)
irb(main):048:0> Time.now.zone.encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
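The difference comes down to what force_encoding and encode each do: the first only changes the label on the bytes, the second looks the bytes up under the current label and rewrites them. A minimal sketch, using a one-byte string as a stand-in for the “ä” in the zone name (the variable names are mine):

```ruby
# A single byte 0xE4 (228), built as a binary string.
raw = [0xE4].pack("C*")

# force_encoding relabels the string; the byte itself is untouched.
as_cp1252 = raw.dup.force_encoding(Encoding::Windows_1252)
as_ibm437 = raw.dup.force_encoding(Encoding::IBM437)
p as_cp1252.bytes  # => [228]
p as_ibm437.bytes  # => [228]

# encode transcodes: it interprets 0xE4 under the current label and
# writes the resulting character out as UTF-8.
p as_cp1252.encode(Encoding::UTF_8).bytes  # => [195, 164]
p as_ibm437.encode(Encoding::UTF_8).bytes  # => [206, 163]
```

The two results are the same byte pairs that show up at positions 12 and 13 of the arrays in the irb session.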

https://en.wikipedia.org/wiki/Windows-1252

https://en.wikipedia.org/wiki/ISO/IEC_8859-1

https://en.wikipedia.org/wiki/Code_page_437

https://github.com/chef/ohai/pull/1781/files

if time[:timezone].encoding == Encoding::IBM437
  time[:timezone] = time[:timezone].force_encoding(Encoding::WINDOWS_1252)
  time[:timezone] = time[:timezone].encode(Encoding::UTF_8)
end
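The same idea can be wrapped in a small method so it works on any zone string without mutating it. This is only a sketch modelled on the snippet above: zone_name_as_utf8 is a hypothetical name, and it bakes in the assumption from this post that a zone name tagged IBM437 really holds Windows-1252 bytes.

```ruby
# Hypothetical helper modelled on the snippet above.
# Assumption: a zone name tagged IBM437 actually contains Windows-1252 bytes.
def zone_name_as_utf8(zone)
  return zone if zone.nil? || zone.encoding == Encoding::UTF_8

  # Pick the encoding the bytes are believed to be in.
  source = zone.encoding == Encoding::IBM437 ? Encoding::WINDOWS_1252 : zone.encoding

  # dup first so the caller's string keeps its original tag.
  zone.dup.force_encoding(source).encode(Encoding::UTF_8)
end

puts zone_name_as_utf8(Time.now.zone)
```

Working on a copy means the original string can still be inspected afterwards, which is handy while debugging.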

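# Bytes of "Mitteleuropäische Zeit" as returned by Time.now.zone (the ä is byte 228)
bytes = [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 228, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]

# Try every encoding Ruby knows as the source and print what comes out as UTF-8
Encoding.list.each do |source|
  begin
    converted = bytes.pack("C*").force_encoding(source).encode("UTF-8")
    puts source
    puts converted
  rescue EncodingError
    # no usable conversion from this source encoding; skip it
  end
end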

Troubleshooting the environment

Running chcp.com with the German display language set prints "Aktive Codepage: 437." (with English it prints the equivalent "Active code page: 437."), which corresponds to Encoding::IBM437.

In the Ruby console, running Encoding.locale_charmap outputs "CP437".
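When troubleshooting, I find it useful to print all of the encodings Ruby derives from its environment side by side. A sketch (environment_encodings is a name I made up; the calls it wraps are Encoding.locale_charmap, Encoding.find, Encoding.default_external and Encoding.default_internal):

```ruby
# Hypothetical helper: collect the encodings Ruby picked up from the environment.
def environment_encodings
  {
    "locale charmap"   => Encoding.locale_charmap,
    "locale"           => Encoding.find("locale").name,
    "filesystem"       => Encoding.find("filesystem").name,
    "default external" => Encoding.default_external.name,
    # default_internal is nil unless it has been set explicitly
    "default internal" => Encoding.default_internal&.name || "(not set)",
  }
end

environment_encodings.each do |label, name|
  puts "#{label.ljust(16)}  #{name}"
end
```

The values differ from machine to machine, so a mismatch between the locale encoding and the default external encoding is the thing to look for.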

An opposite problem with %x

A similar problem arises with the output from %x and IO.popen, but in reverse. The output string is tagged as #<Encoding:UTF-8>, but its bytes are not UTF-8. If the output includes any of the characters in “ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ” (bytes 128..154 in code page 437; ASCII itself stops at 127), the string won’t match the same characters written as a UTF-8 Ruby string. A quick way to test this is the following Ruby script:

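# Create a file named "ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ" (bytes 128..154 in code page 437)
FILENAME = (128..154).to_a.pack('C*').force_encoding(Encoding::IBM437)
UTF_8_FILENAME = FILENAME.encode(Encoding::UTF_8)

Dir.mkdir('tmp') unless Dir.exist?('tmp')
File.open(File.join('tmp', UTF_8_FILENAME), 'wt') do |f|
  f.puts "test"
end

# Report the encoding Ruby assigned to the output, then list every encoding
# that, when forced onto the raw bytes, makes the file name show up after
# transcoding to UTF-8.
def check_encodings(output)
  puts "output encoding: #{output.encoding.inspect}"
  if output.include?(UTF_8_FILENAME)
    puts "output matches without forcing encoding"
  end

  puts "Output can be made to match by forcing the following encodings:"
  Encoding.list.each do |encoding|
    # dup so the caller's string keeps its original encoding tag
    if output.dup.force_encoding(encoding).encode(Encoding::UTF_8).include?(UTF_8_FILENAME)
      puts encoding
    end
  rescue
    # this encoding cannot be used to transcode the output; skip it
  end
end

command = RUBY_PLATFORM =~ /mingw/ ? 'dir tmp' : 'ls tmp'
puts
puts "%x|#{command}|"
puts "-" * (command.length + 4)
check_encodings(%x|#{command}|)
puts
puts "IO.popen(#{command}).read"
puts "-" * (command.length + 15)
check_encodings(IO.popen(command).read)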

The output of this is:

 C:\Ruby31\bin\ruby.exe .\directory_test.rb

%x|dir tmp|
-----------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865

IO.popen(dir tmp).read
----------------------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865

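The script uses .force_encoding to relabel the output with every encoding in Encoding.list, including UTF-8, but only IBM437, CP850, and IBM865 produce a match. That is consistent with those three code pages assigning the same characters to bytes 128..154.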
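Rather than relabelling the string after the fact, the encodings can also be passed to IO.popen so the output is transcoded as it is read. The sketch below is a possible workaround, not something taken from the bug reports: read_command_output is a hypothetical name, and the IBM437 default is an assumption that matches the console code page on my machine.

```ruby
# Hypothetical helper: run a command and return its output as UTF-8.
# Assumption: the child process writes in console_encoding (IBM437 here).
def read_command_output(command, console_encoding: Encoding::IBM437)
  IO.popen(command,
           external_encoding: console_encoding,  # how the raw bytes are interpreted
           internal_encoding: Encoding::UTF_8,   # what the returned string is transcoded to
           &:read)
end

# Only meaningful on Windows, where dir is available and the console uses a code page.
if RUBY_PLATFORM =~ /mingw|mswin/
  puts read_command_output("dir tmp")
end
```

If the console uses a different code page, pass it explicitly, for example console_encoding: Encoding::CP850.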

Ruby bugs opened

After working through the issues to confirm they weren’t weird configuration issues, I opened two bugs for the above problems.