On Windows, ordinary Ruby strings and time zone name strings aren’t encoded the same way. For US English text this rarely matters, because Latin-1 (ISO-8859-1), Windows-1252, and IBM437 all share the 7-bit ASCII range, which maps byte-for-byte to UTF-8. Once text moves past the 26 unaccented letters of the English alphabet, things get tricky. One such case arose with umlauts in time zone names.
Ruby strings vs Ruby timezone strings
All strings in Ruby are “UTF-8” by default, right? Well, not the time zone names. On Windows it’s:
irb(main):004:0> Time.now.zone.encoding
=> #<Encoding:IBM437>
irb(main):005:0> "Random string".encoding
=> #<Encoding:UTF-8>
Umlaut characters in similar character sets
Latin-1 (ISO-8859-1) and Windows-1252 encode the German characters of interest identically, one byte each. IBM 437 stores them at different byte values, and UTF-8 needs two bytes for each. In IBM 437, byte 0xE4 is not “ä” but the Greek capital sigma “Σ”:
Character | Latin-1/ISO-8859-1 | IBM 437 | Windows-1252 | UTF-8 |
Ä | 0xC4 | 0x8E | 0xC4 | 0xC3 0x84 |
Ö | 0xD6 | 0x99 | 0xD6 | 0xC3 0x96 |
Ü | 0xDC | 0x9A | 0xDC | 0xC3 0x9C |
ä | 0xE4 | 0x84 | 0xE4 | 0xC3 0xA4 |
ö | 0xF6 | 0x94 | 0xF6 | 0xC3 0xB6 |
ß | 0xDF | 0xE1 | 0xDF | 0xC3 0x9F |
ü | 0xFC | 0x81 | 0xFC | 0xC3 0xBC |
Differences in Reencoding
Ruby tags the time zone string as IBM 437 by default. Transcoding it from IBM 437 to UTF-8, whether that encoding is forced explicitly or left implicit, does not preserve the umlaut. Forcing ISO-8859-1 (Latin-1) or Windows-1252 first does. Compare the byte sequences below, produced by encoding “Mitteleuropäische Zeit” from each source encoding. The byte sequence [195, 164] is the correct UTF-8 encoding of “ä”. An implicit or explicit IBM437 source produces [206, 163] instead, which renders as “Σ”:
irb(main):034:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):035:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):036:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 195, 164, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):037:0> Time.now.zone.encode(Encoding::UTF_8).bytes
=> [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 206, 163, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
irb(main):045:0> Time.now.zone.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):046:0> Time.now.zone.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
=> "Mitteleuropäische Zeit"
irb(main):047:0> Time.now.zone.force_encoding(Encoding::IBM437).encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
# no encoding force (implicit IBM437)
irb(main):048:0> Time.now.zone.encode(Encoding::UTF_8)
=> "MitteleuropΣische Zeit"
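The whole difference comes down to how one byte is interpreted. The following is a minimal sketch, assuming the raw zone name carries byte 228 (0xE4) where the “ä” belongs, which is what the byte sequences above suggest. RAW_BYTE and decode_as are illustrative names, not part of Ruby or any library.

```ruby
# Decode the same raw byte under each candidate source encoding.
# RAW_BYTE and decode_as are illustrative names only.
RAW_BYTE = [0xE4].pack("C*") # binary string holding the single byte 228

def decode_as(raw, source)
  # dup so the caller's string keeps its original encoding tag
  raw.dup.force_encoding(source).encode(Encoding::UTF_8)
end

[Encoding::IBM437, Encoding::Windows_1252, Encoding::ISO_8859_1].each do |source|
  decoded = decode_as(RAW_BYTE, source)
  puts "#{source.name}: #{decoded} #{decoded.bytes.inspect}"
end
```

Only the IBM437 interpretation yields [206, 163]; the other two yield [195, 164], matching the irb session.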
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
https://en.wikipedia.org/wiki/Code_page_437
https://github.com/chef/ohai/pull/1781/files
if time[:timezone].encoding == Encoding::IBM437
  time[:timezone] = time[:timezone].force_encoding(Encoding::WINDOWS_1252)
  time[:timezone] = time[:timezone].encode(Encoding::UTF_8)
end
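The same check can be wrapped in a small standalone method for experimenting outside ohai. This is only a sketch: normalize_zone_name is a hypothetical name, and it assumes that a string tagged IBM437 really contains Windows-1252 bytes, which held for the system shown here but may not hold everywhere.

```ruby
# Hypothetical helper generalizing the check above; normalize_zone_name
# is an illustrative name, not part of ohai or Ruby.
def normalize_zone_name(name)
  return name if name.nil?

  if name.encoding == Encoding::IBM437
    # Assumption: the bytes are really Windows-1252 and only the tag is wrong.
    name = name.dup.force_encoding(Encoding::Windows_1252)
  end
  name.encode(Encoding::UTF_8)
end

# Simulate what Time.now.zone returned on the affected Windows system.
raw = "Mitteleurop\xE4ische Zeit".b.force_encoding(Encoding::IBM437)
puts normalize_zone_name(raw)
```

Strings that already carry another encoding tag, such as a plain "UTC", pass through the transcoding step unchanged.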
bytes = [77, 105, 116, 116, 101, 108, 101, 117, 114, 111, 112, 228, 105, 115, 99, 104, 101, 32, 90, 101, 105, 116]
Encoding.list.each do |source|
  begin
    puts source
    # Interpret the raw bytes as `source`, then transcode to UTF-8.
    puts bytes.pack("C*").force_encoding(source).encode("UTF-8")
  rescue EncodingError
    # Skip encodings that cannot convert these bytes to UTF-8.
    # puts source, "failed"
  end
end
Troubleshooting the environment
Running chcp.com with the German display language set outputs "Aktive Codepage: 437." (with English it’s the same: "Active code page: 437"), which corresponds to Encoding::IBM437.
In the Ruby console, running Encoding.locale_charmap outputs "CP437".
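To see everything Ruby derives from the environment in one place, a short report like the following can help. It is a sketch: encoding_report is an illustrative name, and the values it prints depend entirely on the machine it runs on.

```ruby
# Collect the encodings Ruby derives from the environment.
# encoding_report is an illustrative name, not a library method.
def encoding_report
  {
    locale_charmap: Encoding.locale_charmap,     # "CP437" in the case above
    locale: Encoding.find("locale"),
    default_external: Encoding.default_external,
    default_internal: Encoding.default_internal, # nil unless configured
    filesystem: Encoding.find("filesystem"),
    zone_name: Time.now.zone&.encoding           # tag on the time zone string
  }
end

encoding_report.each do |name, value|
  puts "#{name}: #{value.inspect}"
end
```

Comparing the zone_name entry with default_external shows at a glance whether time zone names are tagged differently from ordinary strings.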
An opposite problem with %x
A similar problem arises with the output from %x and IO.popen, but in reverse. The output encoding is reported as #<Encoding:UTF-8>, yet the bytes are not UTF-8. If the output includes any of the characters in “ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ”, which are bytes 128..154 in code page 437 (outside the ASCII range), the resulting string won’t match those characters. A quick way to test this is the following Ruby script:
# create a file named "ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ", which is bytes 128..154 in code page 437
FILENAME = (128..154).to_a.pack('C*').force_encoding(Encoding::IBM437)
UTF_8_FILENAME = FILENAME.encode(Encoding::UTF_8)

Dir.mkdir('tmp') unless Dir.exist?('tmp')
File.open(File.join('tmp', UTF_8_FILENAME), 'wt') do |f|
  f.puts "test"
end

def check_encodings(output)
  puts "output encoding: #{output.encoding.inspect}"
  if output.include?(UTF_8_FILENAME)
    puts "output matches without forcing encoding"
  end
  puts "Output can be made to match by forcing the following encodings:"
  Encoding.list.each do |encoding|
    # dup so every attempt starts from the original bytes and encoding tag
    if output.dup.force_encoding(encoding).encode(Encoding::UTF_8).include?(UTF_8_FILENAME)
      puts encoding
    end
  rescue StandardError
    # ignore encodings that cannot be converted to UTF-8
  end
end

command = RUBY_PLATFORM =~ /mingw/ ? 'dir tmp' : 'ls tmp'

puts
puts "%x|#{command}|"
puts "-" * (command.length + 4)
check_encodings(%x|#{command}|)

puts
puts "IO.popen(#{command}).read"
puts "-" * (command.length + 15)
check_encodings(IO.popen(command).read)
The output of this is:
C:\Ruby31\bin\ruby.exe .\directory_test.rb
%x|dir tmp|
-----------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865
IO.popen(dir tmp).read
----------------------
output encoding: #<Encoding:UTF-8>
Output can be made to match by forcing the following encodings:
IBM437
CP850
IBM865
PS C:\projects> dir
I’m using .force_encoding to manually set the encoding to each encoding available in Encoding.list, including UTF-8, but only IBM437, CP850, and IBM865 work.
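Given that result, one workaround is to retag the captured output with the console code page before transcoding it. The sketch below simulates the captured bytes rather than running a command, so it behaves the same on any platform. decode_console_output is a hypothetical helper, and its IBM437 default is an assumption based on the chcp output shown earlier.

```ruby
# Hypothetical helper: retag captured command output with the console code
# page and transcode it to UTF-8. The IBM437 default is an assumption taken
# from the chcp output shown earlier.
def decode_console_output(raw, code_page = Encoding::IBM437)
  raw.dup.force_encoding(code_page).encode(Encoding::UTF_8)
end

# Simulate bytes 128..154 arriving mislabeled as UTF-8, as %x reports them.
captured = (128..154).to_a.pack("C*").force_encoding(Encoding::UTF_8)

puts captured.valid_encoding?        # false: these bytes are not valid UTF-8
puts decode_console_output(captured) # ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ
```

Passing Encoding::CP850 or Encoding::IBM865 as the second argument covers the other two code pages that matched for these particular characters.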
Ruby bugs opened
After working through the issues to confirm they weren’t weird configuration issues, I opened two bugs for the above problems.