Ruby 1.9 Encodings: A Primer and the Solution for Rails

UPDATE: The DataObjects drivers, which are used in DataMapper, are now updated to honor default_internal. Let's keep this moving.

Since Ruby 1.9 announced support for encodings, there has been a flurry of activity to make existing libraries encoding aware, and a tornado of confusion as users of Ruby and Rails have tried to make sense of it.

In this post, I will lay out the most common problems people have had, and what we can do as a community to put these issues to bed in time for Ruby 1.9.2 final.

A Quick Tour

I'm going to simplify some of this, but the broad strokes are essentially correct.

Before we begin, many of you are probably wondering what exactly an "encoding" is. For me, getting a handle on this was an important part of understanding the possible solution space.

On disk, all Strings are stored as a sequence of bytes. An encoding simply specifies how to take those bytes and convert them into "codepoints". In some languages, such as English, a "codepoint" is exactly equivalent to "a character". In most other languages, there is not a one-to-one correspondence. For example, in German text a codepoint might specify that the preceding codepoint should get an ümlaut.
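To make the distinction concrete, here is a minimal sketch using only core String methods. The describe helper is a name I made up for illustration. It shows the same visible character stored as one precomposed codepoint and as a letter plus a separate combining codepoint, and what happens to the bytes when the encoding changes.

```ruby
# encoding: UTF-8

# Hypothetical helper: summarize how a String is stored.
def describe(string)
  "#{string.encoding.name}: bytes=#{string.bytes.inspect} codepoints=#{string.codepoints.inspect}"
end

# U+00FC is a precomposed "u with diaeresis": one codepoint, two UTF-8 bytes.
puts describe("\u00FC")                       # UTF-8: bytes=[195, 188] codepoints=[252]

# A plain "u" plus U+0308 COMBINING DIAERESIS: two codepoints, one visible character.
puts describe("u\u0308")                      # UTF-8: bytes=[117, 204, 136] codepoints=[117, 776]

# The same precomposed character in Latin-1 is a single byte.
puts describe("\u00FC".encode("ISO-8859-1"))  # ISO-8859-1: bytes=[252] codepoints=[252]
```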

The English characters represented by the seven bits of ASCII (characters 0 through 127 in "ASCII-7") have the same representation in many (but not all) encodings. This means that if you only use English characters, the on-disk representation of the characters will often be exactly the same regardless of the source encoding.

However, once you start to use other characters, the bytes on disk mean different things in different encodings. Have you ever seen a page on the Internet filled with something like "FÃ¼hrer"? That is the consequence of the bytes of "Führer" stored as UTF-8 being interpreted as Latin-1.

You can trivially see this problem using Ruby 1.9's encoding support by running this program:

# encoding: UTF-8

puts "hello ümlaut".force_encoding("ISO-8859-1").encode("UTF-8")

# Output
# hello Ã¼mlaut

First, we create a String ("hello ümlaut") in the UTF-8 encoding. Next, we tell Ruby that the String is actually Latin-1. It's not, so an attempt to read the characters will interpret the raw bytes of the "ü" as though they were Latin-1 bytes. We ask Ruby to give us that interpretation of the data in UTF-8 via encode and print it out.

We can see that while the bytes for "hello " and "mlaut" were identical in both UTF-8 and Latin-1, the bytes for "ü" in UTF-8 mean "Ã¼" in Latin-1.

Note that while force_encoding simply tags the String with a different encoding, encode converts the bytes of one encoding into the equivalent bytes of the second. As a result, while force_encoding should almost never be used unless you know for sure that the bytes actually represent the characters you want in the target encoding, encode is relatively safe to use to convert a String into the encoding you want.
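A quick way to see the difference is to compare the raw bytes before and after each call. This is only a sketch: relabel is a hypothetical helper that works on a copy so the original String keeps its tag.

```ruby
# encoding: UTF-8

# Hypothetical helper: re-tag a copy of the String, leaving its bytes untouched.
def relabel(string, encoding)
  string.dup.force_encoding(encoding)
end

original  = "hello \u00FCmlaut"                # "hello ümlaut", tagged UTF-8
relabeled = relabel(original, "ISO-8859-1")    # same bytes, different tag
converted = original.encode("ISO-8859-1")      # different bytes, same characters

puts relabeled.bytes == original.bytes         # true
puts converted.bytes == original.bytes         # false
puts converted.encode("UTF-8") == original     # true
puts relabeled.encode("UTF-8")                 # hello Ã¼mlaut
```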

You've probably also seen the reverse problem, where bytes encoded in Latin-1 ended up inside a page encoded in UTF-8.

# encoding: ISO-8859-1

puts "hello ümlaut".force_encoding("UTF-8")

# Output
# hello ?mlaut

Here, the sequence of bytes that represents an "ü" in Latin-1 could not be recognized in UTF-8, so it was replaced with a "?". Note that puts will always simply write out the bytes to your terminal, and the terminal's encoding will determine how they are interpreted. The examples in this post were all output to a terminal using the UTF-8 encoding.

As you can imagine, this presents quite the issue when concatenating two Strings of different encodings. Simply smashing together the raw bytes of the two Strings can result in output that is incomprehensible in either encoding. To make matters worse, it's not always possible to represent all of the characters in one encoding in another. For instance, the characters of the Emoji encoding cannot be represented in the ISO-8859-1 encoding (or even mapped in a standardized way onto UTF-8).
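The Emoji case is hard to reproduce portably, so this sketch uses Japanese text instead, which Latin-1 also cannot represent. try_encode is a hypothetical helper, not part of Ruby.

```ruby
# encoding: UTF-8

# Hypothetical helper: return the converted String, or nil when some
# character has no equivalent in the target encoding.
def try_encode(string, target)
  string.encode(target)
rescue Encoding::UndefinedConversionError
  nil
end

japanese = "\u65E5\u672C\u8A9E"   # three Japanese characters, tagged UTF-8

puts try_encode("caf\u00E9", "ISO-8859-1").nil?   # false
puts try_encode(japanese, "ISO-8859-1").nil?      # true

# Asking encode to substitute instead of raising loses the characters.
puts japanese.encode("ISO-8859-1", undef: :replace)   # ???
```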

As a result, when you attempt to concatenate two Strings of different encodings in Ruby 1.9, Ruby displays an error.

# encoding: UTF-8

puts "hello ümlaut".encode("ISO-8859-1") + "hello ümlaut"

# Output
# incompatible character encodings: ISO-8859-1 and UTF-8 (Encoding::CompatibilityError)

Because it's extremely tricky for Ruby to be sure that it can make a lossless conversion from one encoding to another (Ruby supports almost 100 different encodings), the Ruby core team has decided to raise an exception if two Strings in different encodings are concatenated together.

There is one exception to this rule. If the bytes in one of the two Strings are all 127 or lower (and therefore valid characters in ASCII-7), and both encodings are compatible with ASCII-7 (meaning that the bytes of ASCII-7 represent exactly the same characters in the other encoding), Ruby will make the conversion without complaining.

# encoding: UTF-8

puts "hello umlat".encode("ISO-8859-1") + "hello ümlaut"

# Output
# hello umlathello ümlaut

Since Ruby does not allow characters outside of the ASCII-7 range in source files without a declared encoding, this exception eliminates a large number of potential problems that Ruby's strict concatenation rules might have introduced.
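If you want to check the rule without triggering an exception, Ruby exposes it through String#ascii_only? and Encoding.compatible?. The joinable? wrapper below is just an illustrative name of mine.

```ruby
# encoding: UTF-8

# Hypothetical wrapper: Encoding.compatible? returns the encoding the
# combined String would have, or nil when the two cannot be joined.
def joinable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

latin1_ascii  = "hello umlat".encode("ISO-8859-1")
latin1_umlaut = "hello \u00FCmlaut".encode("ISO-8859-1")
utf8_umlaut   = "hello \u00FCmlaut"

puts latin1_ascii.ascii_only?                  # true
puts joinable?(latin1_ascii, utf8_umlaut)      # true
puts joinable?(latin1_umlaut, utf8_umlaut)     # false
puts((latin1_ascii + utf8_umlaut).encoding)    # UTF-8
```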

Binary Strings

By default, Strings with no encoding in Ruby are tagged with the ASCII-8BIT encoding, which is also available under the alias BINARY. Essentially, this is an encoding that simply means "raw bytes here".

In general, code in Rails applications should not encounter BINARY strings, except for Strings created in source files without encodings. However, since these Strings will virtually always fall under the ASCII-7 exception, Ruby programmers should never have to deal with incompatible encoding exceptions where one of the two encodings is ASCII-8BIT (i.e. BINARY).

That said, almost all of the encoding problems reported by users in the Rails bug tracker involved ASCII-8BIT Strings. How did this happen?

There are two reasons for this.

The first reason is that early on, database drivers generally didn't properly tag Strings they retrieved from the database with the proper encoding. Doing so involves a manual mapping from the database's encoding names to Ruby's encoding names. As a result, it was extremely common for database drivers to return BINARY Strings containing characters outside of the ASCII-7 range (because the original content was encoded in the database as UTF-8 or ISO-8859-1/Latin-1).

When attempting to concatenate that content onto another UTF-8 string (such as the buffer in an ERB template), Ruby would raise an incompatible encoding exception.

# encoding: UTF-8

puts "hello ümlaut" + "hello ümlaut".force_encoding("BINARY")

# Output
# incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

This is essentially identical to the scenario many people encountered. A UTF-8 String was presented to Ruby as a BINARY String, since the database driver didn't tag it. When attempting to concatenate it onto UTF-8, Ruby had no way to do so reliably, so it raised an exception.

One reason that many people didn't encounter this problem was that either the contents of the template or the text from the database were entirely in the ASCII-7 subset of their character set. As a result, Ruby would not complain. This is deceptive, because if they made a small change to their template, or if a user happened to enter non-ASCII-7 data (for instance, they got their first user named José), they would suddenly start seeing an incompatible encoding exception.

When people see this incompatible encoding exception, one common reaction is to call force_encoding("UTF-8") on the BINARY data. This will work great for Strings whose bytes actually are encoded in UTF-8. However, if people whose Strings were encoded in ISO-8859-5 (Russian) followed this instruction, they would end up with scrambled output.

Additionally, it's impossible to simply encode the data, since Ruby doesn't actually know the source encoding. In essence, a crucial piece of information has been lost at the database driver level.

Unfortunately, this means that well-meaning people who have solved their problem by force_encoding their Strings to UTF-8 (because the bytes actually did represent UTF-8 characters) become baffled when their solution doesn't work for someone working on a Russian website.
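Here is a rough simulation of both outcomes. Both helpers are hypothetical: untagged_bytes stands in for an old driver that returns raw bytes, and blindly_tag_utf8 is the commonly suggested "fix".

```ruby
# encoding: UTF-8

# Hypothetical stand-in for an old driver: the text's bytes in the encoding
# the database stores them in, returned tagged as BINARY.
def untagged_bytes(text, stored_as)
  text.encode(stored_as).force_encoding(Encoding::BINARY)
end

# The common "fix": re-tag a copy of the bytes as UTF-8 without checking.
def blindly_tag_utf8(binary)
  binary.dup.force_encoding("UTF-8")
end

russian = "\u041F\u0440\u0438\u0432\u0435\u0442"   # a Russian greeting, UTF-8

from_utf8_db     = blindly_tag_utf8(untagged_bytes(russian, "UTF-8"))
from_cyrillic_db = blindly_tag_utf8(untagged_bytes(russian, "ISO-8859-5"))

puts from_utf8_db == russian              # true: the guess happened to be right
puts from_cyrillic_db == russian          # false
puts from_cyrillic_db.valid_encoding?     # false: not even well-formed UTF-8

# Only someone who knows the real source encoding can recover the text.
recovered = untagged_bytes(russian, "ISO-8859-5").force_encoding("ISO-8859-5").encode("UTF-8")
puts recovered == russian                 # true
```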

Thankfully, this situation is now mostly solved. There are updates for all database drivers that map the encodings from the database to a Ruby encoding, which means that UTF-8 text from the database will be UTF-8 Strings in Ruby, and Latin-1 text from the database will be ISO-8859-1 Strings in Ruby.

Unfortunately, there is a second large source of BINARY Strings in Ruby. Specifically, data received from the web in the form of URL-encoded POST bodies often does not specify the content-type (and therefore the encoding) of the content sent from forms.

In many cases, browsers send POST bodies in the encoding of the original document, but not always. In addition, some browsers say that they're sending content as ISO-8859-1 but actually send it in Windows-1252. There is a long thread on the Rack tracker about this, but the bottom line is that it's extremely difficult to determine the encoding of a POST body sent from the web.

As a result, Rack handlers send the raw bytes through as BINARY (which is reasonable, since handlers shouldn't be in the business of trying to wade through this problem) and no middleware exists (yet) to properly tag the String with the correct encoding.

This means that if the stars align, the raw bytes are UTF-8, end up in a UTF-8 database, and end up coming back out again tagged as UTF-8. If the stars do not align, the text might actually be encoded in ISO-8859-1, get put into a UTF-8 database, and come out tagged as UTF-8 (and we know what happens when ISO-8859-1 data is mistakenly tagged as UTF-8).

In this case, because the ISO-8859-1 data is improperly tagged as UTF-8, Ruby happily concatenates it with other UTF-8 Strings, and hilarity ensues.
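A small sketch of the unlucky path, with mistagged_as_utf8 as a hypothetical stand-in for the whole form-to-database round trip. Ruby raises nothing here, and String#valid_encoding? is the only thing that reveals the damage.

```ruby
# encoding: UTF-8

# Hypothetical stand-in for the unlucky round trip: Latin-1 bytes that end
# up tagged as UTF-8 by the time they come back out of the database.
def mistagged_as_utf8(text)
  text.encode("ISO-8859-1").force_encoding("UTF-8")
end

name = mistagged_as_utf8("Jos\u00E9")   # the Latin-1 bytes of "José"
greeting = "Hello, " + name             # no exception: both sides claim UTF-8

puts greeting.encoding                  # UTF-8
puts greeting.valid_encoding?           # false
puts greeting.scrub("?")                # Hello, Jos?
```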

Because English characters have the same byte representation in all commonly used encodings, this problem is not as common as you might imagine. Unfortunately, this simply means that people who do encounter it are baffled and find it hard to get help. Additionally, this problem doesn't manifest itself as a hard error. It can go unnoticed and be dismissed as a minor annoyance if the number of non-ASCII-7 characters is low.

In order to properly solve this problem for Ruby 1.9, we need a very good heuristic for properly determining the encoding of web-sent POST bodies. There are some promising avenues that will get it right 99.9% of the time, and we need to package them up into a middleware that will tag Strings correctly.

Incompatible Encodings

If you've been paying attention, you've probably noticed that while the database drivers have solved one problem, they actually introduced another one.

Imagine that you're using a MySQL database encoded in ISO-8859-1 (or ISO-8859-5, popular for Russian applications, or any other non-UTF-8 encoding). Now that the String coming back from the database is properly tagged as ISO-8859-1, Ruby will refuse to concatenate it onto the ERB buffer (which is encoded in UTF-8). Even if we solved this problem for ERB, it could be trivially reintroduced in other parts of the application through regular concatenation (+, concat, or even String interpolation).
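To illustrate with plain interpolation rather than ERB, here is a sketch in which append_to_buffer is a hypothetical stand-in for whatever code builds the output. The correctly tagged Latin-1 value fails, and the manually encoded one goes through.

```ruby
# encoding: UTF-8

# Hypothetical stand-in for a template buffer: interpolate a value into
# UTF-8 text and report :incompatible instead of raising.
def append_to_buffer(buffer, value)
  "#{buffer}#{value}"
rescue Encoding::CompatibilityError
  :incompatible
end

buffer  = "Gr\u00FC\u00DFe, "                    # UTF-8 template text with non-ASCII characters
from_db = "J\u00FCrgen".encode("ISO-8859-1")     # correctly tagged Latin-1 value

p append_to_buffer(buffer, from_db)              # :incompatible
puts append_to_buffer(buffer, from_db.encode("UTF-8")) == "Gr\u00FC\u00DFe, J\u00FCrgen"   # true
```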

Again, this problem is somewhat mitigated due to the ASCII-7 subset exception, which means that as long as one of the two incompatible Strings uses only English characters, users won't see any problems. Again, because this "solution" means that the Ruby developer in question still may not understand encodings, this simply defers the problem to some uncertain point in the future when they either add a non-ASCII-7 character to their template or the user submits a non-ASCII-7 String.

The Solution

If you got this far, you're probably thinking "Holy shit this encoding stuff is crazy. I don't want to have to know any of this! I just want to write my web app!"

And you'd be correct.

Other languages, such as Java and Python, solve this problem by encoding every String that enters the language as UTF-8 (or UTF-16). Theoretically, it is possible to represent the characters of every encoding in UTF-8. By doing this, programmers only ever deal with one kind of String, and concatenation happens between UTF-8 Strings.

However, this solution does not work very well for the Japanese community. For a variety of complicated reasons, Japanese encodings, such as SHIFT-JIS, are not considered to losslessly encode into UTF-8. As a result, Ruby has a policy of not attempting to simply encode any inbound String into UTF-8.

This decision is debatable, but the fact is that if Ruby transparently transcoded all content into UTF-8, a large portion of the Ruby community would see invisible lossy changes to their content. That part of the community is willing to put up with incompatible encoding exceptions because properly handling the encodings they regularly deal with is a somewhat manual process.

On the other hand, many Rails applications work mostly with encodings that trivially encode to UTF-8 (such as UTF-8 itself, ASCII, and the ISO-8859-1 family). For this rather large part of the community, having to manually encode Strings to solve incompatible encoding problems feels like a burden that belongs on the machine but has been inappropriately shifted onto Rails application developers.

But there is a solution.

By default, Ruby should continue to support Strings of many different encodings, and raise exceptions liberally when a developer attempts to concatenate Strings of different encodings. This would satisfy those with encoding concerns that require manual resolution.

Additionally, you would be able to set a preferred encoding. This would inform drivers at the boundary (such as database drivers) that you would like them to convert any Strings that they tag with an encoding to your preferred encoding immediately. By default, Rails would set this to UTF-8, so Strings that you get back from the database or other external source would always be in UTF-8.

If a String at the boundary could not be converted (for instance, if you set ISO-8859-1 as the preferred encoding, this would happen a lot), you would get an exception as soon as that String entered the system.

In practice, almost all usage of this setting would be to specify UTF-8 as a preferred encoding. From your perspective, if you were dealing in UTF-8, ISO-8859-* and ASCII (most Western developers), you would never have to care about encodings.

Even better, Ruby already has a mechanism that is mostly designed for this purpose. In Ruby 1.9, setting Encoding.default_internal tells Ruby to encode all Strings crossing the barrier via its IO system into that preferred encoding. All we'd need, then, is for maintainers of database drivers to honor this convention as well.

It doesn't require any changes to Ruby itself, and places the burden squarely on the few people who already need to deal with encodings (because taking data from outside of Ruby, via C, always already requires a tagging step). I have spoken with Aaron Patterson, who has been working on the SQLite3 driver, and he feels that this change is simple enough for maintainers of drivers dealing with external Strings to make it a viable option. He has already patched SQLite3 to make it respect default_internal.
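To give a feel for how small the driver-side change is, here is a minimal sketch. string_from_driver is a hypothetical helper of my own, not code from SQLite3 or any other real driver. The only real API it relies on is Encoding.default_internal, which is nil unless the application sets it.

```ruby
# encoding: UTF-8

# Hypothetical driver-side helper, not code from any real driver: tag the raw
# bytes with the encoding the database reports, then transcode them to
# Encoding.default_internal when the application has set one.
def string_from_driver(raw_bytes, database_encoding)
  tagged = raw_bytes.dup.force_encoding(database_encoding)
  preferred = Encoding.default_internal
  preferred ? tagged.encode(preferred) : tagged
end

# Assumption: the application (or Rails on its behalf) opts in to UTF-8.
Encoding.default_internal = Encoding::UTF_8

wire_bytes = "Jos\u00E9".encode("ISO-8859-1").b   # "José" as raw Latin-1 bytes
name = string_from_driver(wire_bytes, "ISO-8859-1")

puts name.encoding                 # UTF-8
puts name == "Jos\u00E9"           # true
puts "Gr\u00FC\u00DFe, " + name    # concatenates without an exception
```

If the preferred encoding cannot represent a character, the encode call raises right at the boundary, which matches the fail-early behavior described above.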

However you feel about Ruby's solution to expose String encodings directly in the language, you should agree that since we're stuck with it for the foreseeable future, this solution shifts the burden of dealing with it from the unwashed masses (most of whom have no idea what an encoding is) to a few maintainers of C extensions and libraries that deal in binary data. Getting this right as soon as possible will substantially ease the transition from Ruby 1.8 to Ruby 1.9.

Postscript: What Happened in 1.8!?

When people first move to 1.9 and encounter these shenanigans, they often wonder why everything seemed so simple in Ruby 1.8, and yet seemed to work.

There are a few reasons for this.

First, keep in mind that in Ruby 1.8, Strings are simple sequences of bytes. Ruby String operations just concatenate those byte sequences together without any kind of check. This means that concatenating two UTF-8 Strings together will just work, since the combined byte sequence is still valid UTF-8. As long as the client for the Ruby code (such as the browser) is told that the bytes are encoded in UTF-8, all is well. Rails does this by setting the default charset for all documents to UTF-8.

Second, Ruby 1.8 has a "UTF-8" mode that makes its regular expression engine treat all Strings as UTF-8. In this mode (which is triggered by setting $KCODE = "UTF-8"), the regular expression engine correctly matches a complete UTF-8 character for /./, for instance. Rails sets this global by default, so if you were using Rails, regular expressions respect Unicode characters, not raw bytes.
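$KCODE has no effect in newer Rubies, so the 1.8 behavior can't be reproduced directly there. As a rough approximation only, this sketch contrasts a UTF-8 tagged String with a BINARY copy of the same bytes, which is close to how 1.8 saw every String without the UTF-8 mode. first_match is a hypothetical helper.

```ruby
# encoding: UTF-8

# Hypothetical helper: whatever /./ matches first in the given String.
def first_match(string)
  string[/./]
end

text = "\u00FCmlaut"                   # the first character occupies two UTF-8 bytes

as_characters = first_match(text)      # UTF-8 tagged: one whole character
as_bytes      = first_match(text.b)    # BINARY copy: a single raw byte

puts as_characters == "\u00FC"         # true
puts as_characters.bytesize            # 2
puts as_bytes.bytesize                 # 1
```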

Third, very little non-English content in the wild is actually encoded in ISO-8859-1. If you were expecting to deal with content that was not English, you would probably set your MySQL database to use a UTF-8 encoding. Since Rails sets UTF-8 as the charset of outbound documents, most browsers will in fact return UTF-8 encoded data.

Fourth, the problems caused when an ISO-8859-1 String is accidentally concatenated into a UTF-8 String are not as jarring as the errors produced by Ruby 1.9. Let's try a little experiment. First, open up a text editor, create a new file, and save it in the ISO-8859-1 encoding.

$KCODE = "UTF-8"

iso_8859_1 = "ümlaut"

# the byte representation of ümlaut in UTF-8
utf8 = "\xC3\xBCmlaut"

puts iso_8859_1
puts utf8

puts iso_8859_1 + utf8
puts utf8 + iso_8859_1

# Output
# ?mlaut
# ümlaut
# ?mlautümlaut
# ümlaut?mlaut

If you somehow get ISO-8859-1 encoded content that uses characters outside of the ASCII-7 range, Ruby doesn't puke. Instead, it simply replaces the unidentified character with a "?", which can easily go unnoticed in a mostly English site with a few "José"s thrown into the mix. It could also easily be dismissed as a "weird bug that we don't have time to figure out right now".

Finally, Rails itself provides a pure-Ruby UTF-8 library that mops up a lot of the remaining issues. Specifically, it provides an alternate String class that can properly handle operations like split, truncate, index, justify and other operations that need to operate on characters, not bytes. It then uses this library internally in helpers like truncate, transparently avoiding a whole other class of issue.

In short, if you're dealing mostly with English text, and you get unlucky enough to get ISO-8859-1 input from somewhere, the worst case is that you get a "?" instead of a "é". If you're dealing with a lot of non-English text, you're probably not using ISO-8859-1 sources. In either case, English (ASCII) text is compatible with UTF-8, and Rails provides solid enough pure-Ruby UTF-8 support to get you most of the rest of the way.

That said, anyone dealing with encodings other than UTF-8 and ISO-8859-1 (Japanese and Russian Rubyists, for example) was definitely not in a good place with Ruby 1.8.

Thanks

I want to personally thank Jay Freeman (aka saurik), who in addition to being a general badass, spent about 15 hours with me patiently explaining these issues and working through the Ruby 1.9 source to help me fully understand the tradeoffs available.