Encodings, Unabridged

Last week, I wrote somewhat extensively about the problem of encodings in Ruby 1.9.

For those who didn't read that post, let me start with a quick refresher.

What's an Encoding?

An encoding specifies how to take a list of characters (such as "hello") and persist them onto disk as a sequence of bytes. You're probably familiar with the ASCII encoding, which specifies how to store English characters in a single byte each (using the values 0-127 and leaving 128-255 unused).
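
For instance, here is a minimal sketch of that byte-level view in Ruby (using methods we'll meet properly later in this post):

"hello".bytes.to_a  # => [104, 101, 108, 108, 111]
                    # five characters, five single-byte ASCII values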

Another common encoding is ISO-8859-1 (or Latin-1), which keeps ASCII's assignments for the first 128 characters (0-127) and designates the numbers 128-255 for Latin characters (such as "é" or "ü").

Obviously, 256 characters aren't enough for all languages, so there are a number of ISO-8859-* encodings, each of which designates the numbers 128-255 for its own purposes (for instance, ISO-8859-5 uses that space for Russian characters).

Unfortunately, the raw bytes themselves do not carry an "encoding specifier" of any kind, and the exact same bytes can mean something in Western characters, Russian, Japanese, or any other language, depending on the encoding that was originally used to store the characters as bytes.
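
To make that concrete, here is a small Ruby sketch (using the 1.9 String API covered later in this post). The single byte 0xE9 is "é" if we treat it as ISO-8859-1, but the Cyrillic "щ" if we treat it as ISO-8859-5:

byte = "\xE9".force_encoding("BINARY")                 # one raw byte, no encoding

byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")  # => "é"
byte.dup.force_encoding("ISO-8859-5").encode("UTF-8")  # => "щ"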

As a general rule, protocols (such as HTTP) provide a mechanism for specifying the encoding. For instance, in HTTP, you can specify the encoding in the Content-Type header, like this: Content-Type: text/html; charset=UTF-8. However, this is not a requirement, so it is possible to receive some content over HTTP and not know its encoding.

This brings us to an important point: Strings have no inherent encoding. By default, Strings are just BINARY data. Since the data could be encoded using any number of different incompatible encodings, simply combining BINARY data from different sources could easily result in a corrupted String.

When you see a diamond with a question mark inside ("�") on the web, or gibberish characters (like a weird A with a 3/4 symbol), you're seeing the result of a mistaken attempt to combine data in different encodings into a single String.

What's Unicode?

Unicode is an effort to map every known character (to a point) to a number. Unicode does not define an encoding (how to represent those numbers in bytes). It simply provides a unique number for each known character.

Unicode tries to unify characters from different encodings that represent the same character. For instance, the A in ASCII, the A in ISO-8859-1, and the A in the Japanese encoding SHIFT-JIS all map to the same Unicode character.

Unicode also takes pains to ensure round-tripping between existing encodings and Unicode. Theoretically, this should mean that it's possible to take some data encoded using any known encoding, use Unicode tables to map the characters to Unicode numbers, and then use the reverse versions of those tables to map the Unicode numbers back into the original encoding.

Unfortunately, both of these characteristics cause some problems for Asian character sets. First, there have been some historical errors in the process of unification, the process by which the Unicode committee identifies which characters in the existing Chinese, Japanese, and Korean (CJK) character sets actually represent the same character.

In Japanese, personal names use slight variants of the non-personal-name version of the same character. This would be equivalent to the difference (in English) between "Cate" and "Kate". Many of these characters (sometimes called Gaiji) cannot be represented in Unicode at all.

Second, there are still hundreds of characters in some Japanese encodings (such as Microsoft's variant of SHIFT-JIS, called CP932 or Windows-31J) that simply do not round-trip through Unicode.

To make matters worse, Java and MySQL use a different mapping table than the standard Unicode one (making "This costs ¥5" come out in Unicode as "This costs \5"). The standard Unicode mapping tables handle this particular case correctly (though they cannot fully solve the round-tripping problem), but these quirks only serve to further raise doubts about Unicode in the minds of Japanese developers.

For a lot more information on these issues, check out the XML Japanese Profile document created by the W3C to explain how to deal with some of these problems in XML documents.

In the Western world, the encodings in common use do not have these problems. For instance, it is trivial to take a String encoded as ISO-8859-1, convert it into Unicode, and then convert it back into ISO-8859-1 when needed.
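
A minimal sketch of that round-trip, in Ruby 1.9 terms (the API is covered in detail below), assuming a UTF-8 source file:

latin1 = "café".encode("ISO-8859-1")  # transcode into Latin-1
utf8   = latin1.encode("UTF-8")       # into Unicode...
utf8.encode("ISO-8859-1") == latin1   # => true -- ...and back, losslessly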

This means that for most of the Western world, it is a good idea to use Unicode as the "one true character set" inside programming languages. This means that programmers can treat Strings as simple sequences of Unicode code points (several code points may add up to a single character, such as the ¨ code point, which can be applied to other code points to form characters like ü). In the Asian world, while this can sometimes be a good strategy, it is often significantly simpler to use the original encoding and handle merging Strings in different encodings together manually (when an appropriate decision about the tradeoffs around fidelity can be made).
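
To illustrate the code-point subtlety, here is a small sketch (assuming a UTF-8 source file): "ü" can be one precomposed code point or two combining ones, and Ruby treats the two Strings as distinct unless you normalize them:

single   = "\u00FC"       # "ü" as a single code point
combined = "u\u0308"      # "u" plus a combining diaeresis; also renders as "ü"

single.codepoints.to_a    # => [252]
combined.codepoints.to_a  # => [117, 776]
single == combined        # => false, without Unicode normalization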

Before I continue, I would note that the above is a vast simplification of the issues surrounding Unicode and Japanese. I believe it to be a fair characterization, but die-hard Unicode folks, and die-hard anti-Unicode folks would possibly disagree with some elements of it. If I have made any factual errors, please let me know.

A Digression: UTF-*

Until now, I have talked only about "Unicode", which simply maps characters to numbers (their code points). Because Unicode just assigns numbers, it can accommodate as many code points as it needs.

However, it is not an encoding. In other words, it does not specify how to store the numbers on disk. The most obvious solution would be to use a fixed number of bytes for each character. This is the solution that UTF-32 uses, specifying that each Unicode character be stored as four bytes (accommodating over 4 billion code points). While this has the advantage of being simple, it also uses huge amounts of memory and disk space compared to the original encodings (like ASCII, ISO-8859-1 and SHIFT-JIS) that it is replacing.

On the other side of the spectrum is UTF-8. UTF-8 uses a single byte for English characters, using the exact same mapping as ASCII. This means that a UTF-8 String containing only characters found in ASCII is byte-for-byte identical to the same String stored in ASCII.

It then uses bytes in the high range (128-255) as lead and continuation bytes for multibyte sequences. This means that Strings of Western characters use relatively few bytes (often comparable to the original encodings Unicode replaces), because those characters sit in the low area of the Unicode space, while Strings in Asian languages use more bytes than in their native encodings, because their characters have larger Unicode numbers.
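
A quick sketch of those sizes, assuming this code lives in a UTF-8 encoded file:

"a".bytesize   # => 1 -- ASCII characters take one byte in UTF-8
"é".bytesize   # => 2 -- Latin characters outside ASCII take two
"あ".bytesize  # => 3 -- most Japanese characters take three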

This is another reason some Asian developers resent Unicode; while it does not significantly increase the memory requirements for most Western documents, it does so for Asian documents.

For the curious, UTF-16 uses 16 bits for the most common characters (those in the BMP, or Basic Multilingual Plane) and 32 bits to represent characters from planes 1 through 16. This means that UTF-8 is most efficient for Strings containing mostly ASCII characters. UTF-8 and UTF-16 are approximately equivalent for Strings containing mostly characters outside ASCII but inside the BMP. For Strings containing mostly characters outside the BMP, UTF-8, UTF-16, and UTF-32 are approximately equivalent. Note that when I say "approximately equivalent", I don't mean that they're exactly the same, just that the differences are small for large Strings.

Of the Unicode encodings, only UTF-8 is compatible with ASCII. By this I mean that if a String is valid ASCII, it is also valid UTF-8. UTF-16 and UTF-32 encode ASCII characters using two and four bytes, respectively.
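
A sketch of that compatibility and of the size differences (using the BE variants of UTF-16 and UTF-32 to avoid byte-order marks):

ascii = "hello".force_encoding("US-ASCII")
ascii.bytes.to_a == "hello".bytes.to_a  # => true -- valid ASCII is valid UTF-8

"hello".encode("UTF-16BE").bytesize     # => 10 -- two bytes per character
"hello".encode("UTF-32BE").bytesize     # => 20 -- four bytes per character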

What Ruby 1.9 Does

Accepting that there are two very different ways of handling this problem, Ruby 1.9 has a String API that is somewhat different from that of most other languages, influenced mostly by the issues I described above in dealing with Japanese in Unicode.

First, Ruby does not mandate that all Strings be stored in a single internal encoding. Unfortunately, this is not possible to do reliably with common Japanese encodings (CP932, aka Windows-31J, has 300 characters that cannot round-trip through Unicode without corrupting data). It is possible that the Unicode committee will some day fully solve these problems to everyone's satisfaction, but that day has not yet come.

Instead, Ruby 1.9 stores Strings as the original sequence of bytes, but allows a String to be tagged with its encoding. It then provides a rich API for converting Strings from one encoding to another.

string = "hello"                     # by default, string is encoded as "ASCII"
string.force_encoding("ISO-8859-1")  # this simply retags the String as ISO-8859-1
                                     # this will work since ISO-8859-1
                                     # is a superset of ASCII.

string.encode("UTF-8")               # this will ask Ruby to convert the bytes in
                                     # the current encoding to bytes in
                                     # the target encoding, and retag it with the
                                     # new encoding
                                     #
                                     # this is usually a lossless conversion, but
                                     # can sometimes be lossy

A more advanced example:

# encoding: UTF-8

# first, tell Ruby that our editor saved the file using the UTF-8 encoding.
# TextMate does this by default. If you lie to Ruby, very strange things
# will happen

utf8 = "hellö"
iso_8859_1 = "hellö".encode("ISO-8859-1")

# Because we specified an encoding for this file, Strings in here default
# to UTF-8 rather than US-ASCII. Note that if you don't specify an encoding,
# characters outside of ASCII will be rejected by the parser.

utf8 << iso_8859_1

# The previous line raises an Encoding::CompatibilityError, because Ruby
# does not automatically transcode Strings from one encoding into another.
# In practice, this should rarely, if ever, happen in applications that
# can rely on Unicode; you'll see why shortly

utf8 << iso_8859_1.encode("UTF-8")

# This works fine, because you first made the two encodings the same
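
As an aside, Ruby 1.9 can also tell you up front whether two Strings can be combined. A minimal sketch, reusing the variables from the example above: Encoding.compatible? returns the encoding the result would get, or nil if concatenation would raise.

Encoding.compatible?(utf8, iso_8859_1)  # => nil -- concatenation would raise
Encoding.compatible?("hello", utf8)     # => #<Encoding:UTF-8> -- ASCII-only
                                        #    Strings combine with UTF-8 safely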

The problems people are really having

The problem of dealing with ISO-8859-1 encoded text and UTF-8 text in the same Ruby program is real, and we'll see soon how it is handled in Ruby. However, the problems people have actually been having are not of this variety.

If you examine virtually all of the bug reports involving incompatible encoding exceptions, you will find that one of the two encodings is ASCII-8BIT. In Ruby, ASCII-8BIT is the name of the BINARY encoding.

So what is happening is that a library somewhere in the stack is handing back raw bytes rather than properly encoded Strings. For a long time, the likely perpetrators were database drivers, which had not been updated to properly tag the data they were getting back from the database.

There are several other potential sources of binary data, which we will discuss in due course. However, it's important to note that a BINARY encoded String in Ruby 1.9 is the equivalent of a byte[] in Java. It is a type that cannot be reasonably concatenated onto an encoded String. In fact, it is best to think of BINARY encoded Strings as a different class with many methods in common.

In practice, as Ruby libraries continue to be updated, you should rarely ever see BINARY data inside of your application. If you do, it is because the library that handed it to you genuinely does not know the encoding, and if you want to combine it with non-BINARY String, you will need to convert it into an encoded String manually (using force_encoding).
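
For example, here is a hedged sketch of that manual conversion, with made-up Latin-1 bytes standing in for a library's raw result:

raw = "caf\xE9".force_encoding("ASCII-8BIT")  # raw bytes from some library

raw.force_encoding("ISO-8859-1")  # we know, out of band, that it's Latin-1
utf8 = raw.encode("UTF-8")        # => "café", now safe to combine with
                                  #    other UTF-8 Strings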

Why this is, in practice, a rare problem

The problem of incompatible encodings is likely to happen in Western applications only when combining ISO-8859-* data with Unicode data.

In practice, most sources of data, without any further work, are already encoded as UTF-8. For instance, the default Rails MySQL connection specifies a UTF-8 client encoding, so even an ISO-8859-1 database will return UTF-8 data.

Many other data sources, such as MongoDB, only support UTF-8 data internally, so their Ruby 1.9-compatible drivers already return UTF-8 encoded data.

Your text editor (TextMate) likely defaults to saving your templates as UTF-8, so the characters in the templates are already encoded in UTF-8.

This is why Ruby 1.8 had the illusion of working. With the exception of some (unfortunately somewhat common) edge-cases, most of your data is already encoded in UTF-8, so simply treating it as BINARY data, and smashing it all together (as Ruby 1.8 does) works fairly reliably.

The only reason why this came tumbling down in Ruby 1.9 is that drivers that should have returned Strings tagged with UTF-8 were returning Strings tagged with BINARY, which Ruby rightly refused to concatenate with UTF-8 Strings. In other words, the vast majority of encoding problems to date are the result of buggy Ruby libraries.

Those libraries, almost entirely, have now been updated. This means that if you use UTF-8 data sources, which you were likely doing by accident already, everything will continue to work as it did in Ruby 1.8.

Digression: force_encoding

When people encounter this problem for the first time, they are often instructed by otherwise well-meaning people to simply call force_encoding("UTF-8") on the offending String.

This will work reliably if the original data is stored in UTF-8, which is often the case for the person who made the original suggestion. However, it will mysteriously fail to work (resulting in "�" characters) if the original data is encoded in ISO-8859-1. This can cause major confusion, because some people swear up and down that it's working and others can clearly see that it's not.

Additionally, since ISO-8859-1 and UTF-8 are both compatible with ASCII, if the characters being force_encoded are all ASCII characters, everything will appear to work until a non-ASCII character is entered one day. This further complicates the community's efforts to identify and resolve these issues, especially for those not fluent in the general issues surrounding encodings.
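
A small sketch showing both failure modes (the Latin-1 bytes are made up for illustration):

latin1 = "caf\xE9".force_encoding("ISO-8859-1")    # "café" in Latin-1

latin1.dup.force_encoding("UTF-8").valid_encoding?
# => false -- retagging doesn't transcode; 0xE9 alone is invalid UTF-8,
#    which is where the "�" comes from

latin1.encode("UTF-8")                             # => "café" -- transcoding
                                                   #    is what was needed

ascii = "cafe".force_encoding("ISO-8859-1")
ascii.dup.force_encoding("UTF-8").valid_encoding?
# => true -- ASCII-only data hides the bug until a non-ASCII
#    character shows up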

I'd note that this particular issue (BINARY data entering the system that is actually ISO-8859-1) would cause similar problems in Java and Python, which would either silently assume Unicode or hand you a byte[], forcing you to decode it into something like UTF-8 yourself.

Where it doesn't work

Unfortunately, there are a few sources of data that are common in Rails applications that are not already encoded in UTF-8.

In order to identify these cases, we will need to identify the boundary between a Rails application and the outside world. Let's look at a common web request.

First, the user goes to a URL. That URL is probably encoded in ASCII, but can also contain Unicode characters. The encoding for this part of the request (the URI) is not provided by the browser, but it appears safe to assume that it's UTF-8 (which is a superset of ASCII). I have tested this in various versions of Firefox, Safari, Chrome, and Internet Explorer, and it seems reliable. I personally thank the Encoding gods for that.

Next, the request goes through the Rack stack and makes its way into the Rails application. If all has gone well, the Rails application will see the parameters and other bits of the URI exposed through the request object encoded as UTF-8. At the moment (and, after this post, probably for mere days more), Rack actually returns BINARY Strings for these elements.

At the moment, Ruby allows BINARY Strings that contain only ASCII characters to be concatenated with any ASCII-compatible encoding (such as UTF-8). I believe this is a mistake, because it will make scenarios such as the current state of Rack work in all tested cases, and then mysteriously cause errors when the user enters a UTF-8 character in the URI. I have already reported this issue and it should be fixed in Ruby. Thankfully, this issue only relates to libraries that are mistakenly returning BINARY data, so we can cut this off at the pass by fixing Rack to return UTF-8 data here.
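
Here is a sketch of the behavior in question (assuming a UTF-8 source file):

utf8   = "wörld"                             # tagged UTF-8
binary = "hello".force_encoding("BINARY")

utf8 + binary     # works today, because the BINARY String happens to be
                  # all ASCII

binary << "\xE9"  # now the BINARY String contains a high byte
utf8 + binary     # raises Encoding::CompatibilityError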

Next, that data will be used to make a request of the data store. Because we are likely using a UTF-8 encoded data-store, once the Rack issue is resolved, the request will go through without incident. If we were using an ISO-8859-1 data store (possible, but unlikely), this could pose issues. For instance, we could be looking up a story by a non-ASCII identifier that the database would not find because the request String is encoded in UTF-8.

Next, the data store returns the contents. Again, you are likely using a UTF-8 data store (things like CouchDB and MongoDB return Strings as UTF-8). Your template is likely encoded in UTF-8 (and Rails actually assumes that templates without a specified encoding are UTF-8), so the String from your database should merge with your template without incident.

However, there is another potential problem here. If your data source does not return UTF-8 data, Ruby will refuse to concatenate the Strings, giving you an incompatible encoding error (which will report UTF-8 as incompatible with, for instance, ISO-8859-1). In the encoding-related bug reports I've seen, though, it has only ever been BINARY data causing this problem; again, that's likely because your data source actually is UTF-8.

Next, you send the data back to the browser. Rails defaults to specifying a UTF-8 character set, so the browser should correctly interpret the String, if it got this far. Note that in Ruby 1.8, if you had received data as ISO-8859-1 and stored it in an ISO-8859-1 database, your users would now see "�", because the browser cannot identify a valid Unicode character for the bytes that came back from the database.

In Ruby 1.9, in this scenario (but not in the much more common scenario where the database returns content as UTF-8, which it usually does because Rails specifies a UTF-8 client encoding in the default database.yml), you would receive an error rather than send corrupted data to the client.

If your page included a form, we now have another potential avenue for problems. This is especially insidious because browsers allow the user to change the "document's character set", and users routinely fiddle with that setting to "fix" pages that are actually encoded in ISO-8859-1, but are specifying UTF-8 as the character set.

Unfortunately, while browsers generally use the document's character set for POSTed form data, this is neither reliable nor safe from users manually changing it. To add insult to injury, the browsers with the largest problems in this area do not send a Content-Type header with the correct charset to let the server know the character set of the POSTed data.

Newer standards specify an accept-charset attribute that page authors can add to forms to tell the browser which character set to send the POSTed data in, but again, the browsers with the largest issues here are also the ones that fail to implement accept-charset properly.

The most common scenario where you can see this issue is when the user pastes in content from Microsoft Word, and it makes it into the database and back out again as gibberish.

After a lot of research, I have discovered several hacks that, together, should completely solve this problem. I am still testing the solution, but I believe we will be able to solve it in Rails. By Rails 3.0 final, Rails applications should be able to reliably assume that POSTed form data comes in as UTF-8.

Moving that data to the server presents another potential encoding problem, but again, if we can rely on the database to be using UTF-8 as the client (or internal) encoding, and the solution for POSTed form data pans out, the data should smoothly get into the database as UTF-8.

But what if we still do have non-UTF-8 data?

Even with all of this, it is still possible that some non-BINARY data sneaks over the boundary and into our Rails application from a non-UTF-8 source.

For this scenario, Ruby 1.9 provides an option called Encoding.default_internal, which allows the user to specify a preferred encoding for Strings. Ruby itself and Ruby's standard libraries respect this option, so if a Ruby program opens, for instance, an IO stream encoded in ISO-8859-1, Ruby will hand the data to the program transcoded into the preferred encoding.
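
A minimal sketch of this option with Ruby's own IO (legacy.txt is a hypothetical file encoded in ISO-8859-1):

Encoding.default_internal = Encoding::UTF_8

contents = File.open("legacy.txt", "r:ISO-8859-1") { |f| f.read }
contents.encoding  # => #<Encoding:UTF-8> -- IO transcoded the Latin-1
                   #    bytes to the preferred encoding on the way in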

Libraries, such as database drivers, should also support this option. That way, even if the database is somehow set up to return UTF-8 Strings, the driver will transparently convert those Strings to the preferred encoding before handing them to the program.

Rails can take advantage of this by setting default_internal to UTF-8, which will ensure that Strings from non-UTF-8 sources still make their way into Rails encoded as UTF-8.

Since I started asking libraries to honor this option a week ago, do_sqlite3, do_mysql, do_postgres, Nokogiri, Psych (the new YAML parser in Ruby 1.9), sqlite3, and the MongoDB driver have all added support for it. The fix should be applied to the MySQL driver shortly, and I am still waiting on a response from the pg driver maintainer.

In short, I see no reason why all commonly used libraries shouldn't honor this setting by the time 1.9.2-final ships.

I'd also add that MongoDB and Nokogiri already return only UTF-8 data, so supporting this option was primarily a matter of correctness. If a driver already deals entirely in UTF-8, it will work transparently with Rails because Rails deals only in UTF-8.

That said, we plan to robustly support scenarios where UTF-8 cannot be used in this way (because encodings are in use that cannot be transcoded at the boundary without data loss), so proper support for default_internal will be essential in the long term.

TL;DR

The vast majority of encoding bugs to date have resulted from outdated drivers that returned BINARY data instead of Strings with proper encoding tags.

The pipeline that brings Strings in and out of Rails is reasonably well-understood, and if you simply use UTF-8 libraries for each part of that pipeline, Ruby 1.9 will work transparently.

If you accidentally use non-UTF-8 sources in the pipeline, Ruby 1.9 will throw an error, an improvement over the Ruby 1.8 behavior of simply sending corrupted data to the client.

For this scenario, Ruby 1.9 allows you to specify a preferred encoding, which instructs the non-UTF-8 source to convert Strings in other encodings to UTF-8.

By default, Rails will set this option to UTF-8, which means that you should not see ISO-8859-1 Strings in your Rails application.

By the time Ruby 1.9.2-final ships in a few months, this should be a reality, and your experience dealing with Strings in Ruby 1.9 should be superior to the 1.8 experience: things should generally just work, and libraries will have properly considered encoding issues. This means that serving misencoded data should be basically impossible.

TL;DR the TL;DR

When using Rails 3.0 with Ruby 1.9.2-final, you will generally not have to care about encodings.

Postscript

With all that said, there can be scenarios where you receive BINARY data from a source. This can happen in any language that handles encodings more transparently than Ruby, such as Java and Python.

This is because it is possible for a library to receive BINARY data and not have the necessary metadata to tag it with an encoding.

In this case, you will either need to determine the encoding yourself or treat the data as raw BINARY data, not a String. The reason this scenario is rare is that if there is a way for you to determine the encoding (such as from metadata provided with the bytes), the original library could have done the same.

If you get into a scenario where you know the encoding, but that information is not available to the machine, you will want to do something like this:

data = API.get("data")   # a hypothetical API that returns raw bytes

data.encoding            # => #<Encoding:ASCII-8BIT>
                         #    (ASCII-8BIT is Ruby's name for BINARY)

data.force_encoding("Shift_JIS").encode!

# This first tags the data with the encoding you know it to be
# (note that Ruby spells this encoding "Shift_JIS"), and then
# re-encodes it to the default_internal encoding, if one was
# specified