
Ruby 1.9 Encodings: A Primer and the Solution for Rails

UPDATE: The DataObjects drivers, which are used in DataMapper, are now updated to honor default_internal. Let’s keep this moving.

Since Ruby 1.9 announced support for encodings, there has been a flurry of activity to make existing libraries encoding aware, and a tornado of confusion as users of Ruby and Rails have tried to make sense of it.

In this post, I will lay out the most common problems people have had, and what we can do as a community to put these issues to bed in time for Ruby 1.9.2 final.

A Quick Tour

I’m going to simplify some of this, but the broad strokes are essentially correct.

Before we begin, many of you are probably wondering what exactly an “encoding” is. For me, getting a handle on this was an important part of understanding the possible solution space.

On disk, all Strings are stored as a sequence of bytes. An encoding simply specifies how to take those bytes and convert them into “codepoints”. In some languages, such as English, a “codepoint” is exactly equivalent to “a character”. In most other languages, there is no such one-to-one correspondence. For example, German text might use a combining codepoint that specifies that the preceding character should get an umlaut.
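To make the distinction concrete, here is a quick illustration in Ruby 1.9. The character “ü” occupies two bytes on disk in UTF-8, but it is a single codepoint:

# encoding: UTF-8
 
str = "ü"
 
p str.bytesize      # => 2 (two bytes on disk)
p str.bytes.to_a    # => [195, 188] (the UTF-8 byte sequence)
p str.length        # => 1 (a single codepoint/character in Ruby 1.9)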

The English characters represented by the first seven bits of ASCII (characters 0 through 127, “ASCII-7”) have the same representation in many (but not all) encodings. This means that if you only use English characters, the on-disk representation of the characters will often be exactly the same regardless of the source encoding.

However, once you start to use other characters, the bytes on disk mean different things in different encodings. Have you ever seen a page on the Internet filled with something like “FÃ¼hrer”? That is the consequence of the bytes of “Führer” stored as UTF-8 being interpreted as Latin-1.

You can trivially see this problem using Ruby 1.9’s encoding support by running this program:

# encoding: UTF-8
 
puts "hello ümlaut".force_encoding("ISO-8859-1").encode("UTF-8")
 
# Output
# hello Ã¼mlaut

First, we create a String (“hello ümlaut”) in the UTF-8 encoding. Next, we tell Ruby that the String is actually Latin-1. It’s not, so an attempt to read the characters will interpret the raw bytes of the “ü” as though they were Latin-1 bytes. We ask Ruby to give us that interpretation of the data in UTF-8 via encode and print it out.

We can see that while the bytes for “hello ” and “mlaut” were identical in both UTF-8 and Latin-1, the bytes for “ü” in UTF-8 mean “Ã¼” in Latin-1.

Note that while force_encoding simply tags the String with a different encoding, encode converts the bytes of one encoding into the equivalent bytes of the second. As a result, while force_encoding should almost never be used unless you know for sure that the bytes actually represent the characters you want in the target encoding, encode is relatively safe to use to convert a String into the encoding you want.
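You can see the difference by comparing the bytes before and after each call. force_encoding leaves the bytes alone and only changes the tag, while encode rewrites them (“ü” is the two bytes 0xC3 0xBC in UTF-8, but the single byte 0xFC in Latin-1):

# encoding: UTF-8
 
str = "ümlaut"
p str.bytes.to_a.first         # => 195 (0xC3, the first UTF-8 byte of "ü")
 
tagged = str.dup.force_encoding("ISO-8859-1")
p tagged.bytes.to_a.first      # => 195 (same bytes, different tag)
 
converted = str.encode("ISO-8859-1")
p converted.bytes.to_a.first   # => 252 (0xFC, "ü" re-encoded as Latin-1)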

You’ve probably also seen the reverse problem, where bytes encoded in Latin-1 ended up inside a page encoded in UTF-8.

# encoding: ISO-8859-1
 
puts "hello ümlaut".force_encoding("UTF-8")
 
# Output
# hello ?mlaut

Here, the sequence of bytes that represents an “ü” in Latin-1 is not valid UTF-8, so the terminal displays a “?” in its place. Note that puts will always simply write out the bytes to your terminal, and the terminal’s encoding will determine how they are interpreted. The examples in this post are all output to a terminal using the UTF-8 encoding.

As you can imagine, this presents quite the issue when concatenating two Strings of different encodings. Simply smashing together the raw bytes of the two Strings can result in output that is incomprehensible in either encoding. To make matters worse, it’s not always possible to represent all of the characters of one encoding in another. For instance, the characters of the Emoji encoding cannot be represented in the ISO-8859-1 encoding (or even in a standardized way in the UTF-8 encoding).

As a result, when you attempt to concatenate two Strings of different encodings in Ruby 1.9, Ruby raises an error.

# encoding: UTF-8
 
puts "hello ümlaut".encode("ISO-8859-1") + "hello ümlaut"
 
# Output
# incompatible character encodings: ISO-8859-1 and UTF-8 (Encoding::CompatibilityError)

Because it’s extremely tricky for Ruby to be sure that it can make a lossless conversion from one encoding to another (Ruby supports almost 100 different encodings), the Ruby core team has decided to raise an exception if two Strings in different encodings are concatenated together.

There is one exception to this rule. If the bytes in one of the two Strings are all 127 or below (and therefore valid characters in ASCII-7), and both encodings are compatible with ASCII-7 (meaning that the bytes of ASCII-7 represent exactly the same characters in the other encoding), Ruby will make the conversion without complaining.

# encoding: UTF-8
 
puts "hello umlat".encode("ISO-8859-1") + "hello ümlaut"
 
# Output
# hello umlathello ümlaut

Since Ruby does not allow characters outside of the ASCII-7 range in source files without a declared encoding, this exception eliminates a large number of potential problems that Ruby’s strict concatenation rules might have introduced.

Binary Strings

By default, Strings with no encoding in Ruby are tagged with the ASCII-8BIT encoding, which is an alias for BINARY. Essentially, this is an encoding that simply means “raw bytes here”.
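You can see this tag on a String assembled from raw bytes; pack returns a BINARY String, as does reading from an IO opened in binary mode:

# encoding: UTF-8
 
raw = [0xC3, 0xBC].pack("C*")
p raw.encoding   # => #<Encoding:ASCII-8BIT>
 
# the same is true of data read in binary mode:
# File.open("some.bin", "rb") { |f| f.read }.encoding   # => #<Encoding:ASCII-8BIT>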

In general, code in Rails applications should not encounter BINARY strings, except for Strings created in source files without encodings. However, since these Strings will virtually always fall under the ASCII-7 exception, Ruby programmers should never have to deal with incompatible encoding exceptions where one of the two encodings is ASCII-8BIT (i.e. BINARY).

That said, almost all of the encoding problems reported by users in the Rails bug tracker involved ASCII-8BIT Strings. How did this happen?

There are two reasons for this.

The first reason is that early on, database drivers generally didn’t tag the Strings they retrieved from the database with the proper encoding. Doing so involves a manual mapping from the database’s encoding names to Ruby’s encoding names. As a result, it was extremely common for database drivers to return BINARY Strings containing bytes outside of the ASCII-7 range (because the original content was stored in the database as UTF-8 or ISO-8859-1/Latin-1).

When attempting to concatenate that content onto another UTF-8 string (such as the buffer in an ERB template), Ruby would raise an incompatible encoding exception.

# encoding: UTF-8
 
puts "hello ümlaut" + "hello ümlaut".force_encoding("BINARY")
 
# Output
# incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

This is essentially identical to the scenario many people encountered. A UTF-8 String was presented to Ruby as a BINARY String, since the database driver didn’t tag it. When attempting to concatenate it onto UTF-8, Ruby had no way to do so reliably, so it raised an exception.

One reason that many people didn’t encounter this problem was that either the contents of the template or the text from the database were entirely in the ASCII-7 subset of their character set. As a result, Ruby would not complain. This is deceptive, because if they made a small change to their template, or if a user happened to enter non-ASCII-7 data (for instance, they got their first user named José), they would suddenly start seeing an incompatible encoding exception.

When people see this incompatible encoding exception, one common reaction is to call force_encoding("UTF-8") on the BINARY data. This will work great for Strings whose bytes actually are encoded in UTF-8. However, if people whose Strings were encoded in ISO-8859-5 (Russian) followed this instruction, they would end up with scrambled output.

Additionally, it’s impossible to simply encode the data, since Ruby doesn’t actually know the source encoding. In essence, a crucial piece of information has been lost at the database driver level.
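This is easy to verify: asking Ruby to encode a BINARY String containing high bytes raises immediately, because there is no source encoding to convert from:

# encoding: UTF-8
 
binary = "hello ümlaut".force_encoding("BINARY")
 
binary.encode("UTF-8")
 
# Output
# "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)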

Unfortunately, this means that well-meaning people who have solved their problem by force_encoding their Strings to UTF-8 (because the bytes actually did represent UTF-8 characters) become baffled when their solution doesn’t work for someone working on a Russian website.

Thankfully, this situation is now mostly solved. There are updates for all database drivers that map the encodings from the database to a Ruby encoding, which means that UTF-8 text from the database will be UTF-8 Strings in Ruby, and Latin-1 text from the database will be ISO-8859-1 Strings in Ruby.

Unfortunately, there is a second large source of BINARY Strings in Ruby. Specifically, URL-encoded POST bodies received from the web often do not specify the charset of the content sent from forms.

In many cases, browsers send POST bodies in the encoding of the original document, but not always. In addition, some browsers say that they’re sending content as ISO-8859-1 but actually send it as Windows-1252. There is a long thread on the Rack tracker about this, but the bottom line is that it’s extremely difficult to determine the encoding of a POST body sent from the web.

As a result, Rack handlers send the raw bytes through as BINARY (which is reasonable, since handlers shouldn’t be in the business of trying to wade through this problem) and no middleware exists (yet) to properly tag the String with the correct encoding.

This means that if the stars align, the raw bytes are UTF-8, end up in a UTF-8 database, and end up coming back out again tagged as UTF-8. If the stars do not align, the text might actually be encoded in ISO-8859-1, get put into a UTF-8 database, and come out tagged as UTF-8 (and we know what happens when ISO-8859-1 data is mistakenly tagged as UTF-8).

In this case, because the ISO-8859-1 data is improperly tagged as UTF-8, Ruby happily concatenates it with other UTF-8 Strings, and hilarity ensues.

Because English characters have the same byte representation in all commonly used encodings, this problem is not as common as you might imagine. Unfortunately, this simply means that people who do encounter it are baffled and find it hard to get help. Additionally, this problem doesn’t manifest itself as a hard error; it can go unnoticed and be dismissed as a minor annoyance if the number of non-ASCII-7 characters is low.

In order to properly solve this problem for Ruby 1.9, we need a very good heuristic for determining the encoding of web-sent POST bodies. There are some promising avenues that should get it right 99.9% of the time, and we need to package them up into a middleware that will tag Strings correctly.
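To give a sense of the shape such a middleware might take, here is a minimal sketch. The class name is hypothetical, and a naive valid_encoding? check stands in for the real heuristic, which would need to be considerably smarter:

require "stringio"
 
class TagRequestEncoding
  def initialize(app, fallback = "ISO-8859-1")
    @app, @fallback = app, fallback
  end
 
  def call(env)
    if env["rack.input"]
      body = env["rack.input"].read
      # guess UTF-8 first; fall back if the bytes are not valid UTF-8
      body.force_encoding("UTF-8")
      body.force_encoding(@fallback) unless body.valid_encoding?
      env["rack.input"] = StringIO.new(body)
    end
    @app.call(env)
  end
end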

Incompatible Encodings

If you’ve been paying attention, you’ve probably noticed that while the database drivers have solved one problem, they actually introduced another one.

Imagine that you’re using a MySQL database encoded in ISO-8859-1 (or ISO-8859-5, popular for Russian applications, or any other non-UTF-8 encoding). Now that the String coming back from the database is properly tagged as ISO-8859-1, Ruby will refuse to concatenate it onto the ERB buffer (which is encoded in UTF-8). Even if we solved this problem for ERB, it could be trivially reintroduced in other parts of the application through regular concatenation (+, concat, or even String interpolation).

Again, this problem is somewhat mitigated due to the ASCII-7 subset exception, which means that as long as one of the two incompatible Strings uses only English characters, users won’t see any problems. Again, because this “solution” means that the Ruby developer in question still may not understand encodings, this simply defers the problem to some uncertain point in the future when they either add a non-ASCII-7 character to their template or the user submits a non-ASCII-7 String.

The Solution

If you got this far, you’re probably thinking “Holy shit this encoding stuff is crazy. I don’t want to have to know any of this! I just want to write my web app!”

And you’d be correct.

Other languages, such as Java and Python, solve this problem by transcoding every String that enters the language into UTF-8 (or UTF-16). Theoretically, it is possible to represent the characters of every encoding in UTF-8. By doing this, programmers only ever deal with one kind of String, and concatenation always happens between UTF-8 Strings.

However, this solution does not work very well for the Japanese community. For a variety of complicated reasons, Japanese encodings, such as Shift-JIS, are not considered to encode losslessly into UTF-8. As a result, Ruby has a policy of not attempting to simply transcode every inbound String into UTF-8.

This decision is debatable, but the fact is that if Ruby transparently transcoded all content into UTF-8, a large portion of the Ruby community would see invisible lossy changes to their content. That part of the community is willing to put up with incompatible encoding exceptions because properly handling the encodings they regularly deal with is a somewhat manual process.

On the other hand, many Rails applications work mostly with encodings that trivially encode into UTF-8 (such as UTF-8 itself, ASCII, and the ISO-8859-1 family). For this rather large part of the community, having to manually encode Strings to solve incompatible encoding problems feels like a burden that belongs on the machine but has been inappropriately shifted onto Rails application developers.

But there is a solution.

By default, Ruby should continue to support Strings of many different encodings, and raise exceptions liberally when a developer attempts to concatenate Strings of different encodings. This would satisfy those with encoding concerns that require manual resolution.

Additionally, you would be able to set a preferred encoding. This would inform drivers at the boundary (such as database drivers) that you would like them to convert any Strings that they tag with an encoding to your preferred encoding immediately. By default, Rails would set this to UTF-8, so Strings that you get back from the database or other external source would always be in UTF-8.

If a String at the boundary could not be converted (for instance, if you set ISO-8859-1 as the preferred encoding, this would happen a lot), you would get an exception as soon as that String entered the system.

In practice, almost all usage of this setting would be to specify UTF-8 as a preferred encoding. From your perspective, if you were dealing in UTF-8, ISO-8859-* and ASCII (most Western developers), you would never have to care about encodings.

Even better, Ruby already has a mechanism designed for almost exactly this purpose. In Ruby 1.9, setting Encoding.default_internal tells Ruby to transcode all Strings that cross the IO boundary into that preferred encoding. All we’d need, then, is for maintainers of database drivers to honor this convention as well.
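For example, with default_internal set, reading from a file known to contain Latin-1 text (the filename here is just a stand-in) hands you UTF-8 Strings with no extra work:

# encoding: UTF-8
 
Encoding.default_internal = Encoding::UTF_8
 
# tell IO the file's external encoding; Ruby transcodes on read
File.open("latin1.txt", "r:ISO-8859-1") do |f|
  str = f.read
  p str.encoding   # => #<Encoding:UTF-8>
end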

It doesn’t require any changes to Ruby itself, and places the burden squarely on the few people who already need to deal with encodings (because taking data from outside of Ruby, via C, already requires a tagging step). I have spoken with Aaron Patterson, who has been working on the SQLite3 driver, and he feels that this change is simple enough for maintainers of drivers dealing with external Strings to make it a viable option. He has already patched SQLite3 to respect default_internal.
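The convention itself is small enough to sketch in a few lines. The method name and the db_encoding argument are hypothetical, standing in for whatever the driver already knows about the connection:

def tag_and_convert(raw_string, db_encoding)
  # first, tag the raw bytes with the encoding the database reports
  str = raw_string.force_encoding(db_encoding)
  # then honor the application's preferred encoding, if one is set
  str = str.encode(Encoding.default_internal) if Encoding.default_internal
  str
end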

However you feel about Ruby’s solution of exposing String encodings directly in the language, you should agree that since we’re stuck with it for the foreseeable future, this solution shifts the burden of dealing with it from the unwashed masses (most of whom have no idea what an encoding is) to a few maintainers of C extensions and libraries that deal in binary data. Getting this right as soon as possible will substantially ease the transition from Ruby 1.8 to Ruby 1.9.

Postscript: What Happened in 1.8!?

When people first move to 1.9 and encounter these shenanigans, they often wonder why everything seemed so simple in Ruby 1.8, and yet seemed to work.

There are a few reasons for this.

First, keep in mind that in Ruby 1.8, Strings are simple sequences of bytes. Ruby String operations just concatenate those byte sequences together without any kind of check. This means that concatenating two UTF-8 Strings together will just work, since the combined byte sequence is still valid UTF-8. As long as the client for the Ruby code (such as the browser) is told that the bytes are encoded in UTF-8, all is well. Rails does this by setting the default charset for all documents to UTF-8.

Second, Ruby 1.8 has a “UTF-8” mode that makes its regular expression engine treat all Strings as UTF-8. In this mode (which is triggered by setting $KCODE = "UTF-8"), the regular expression engine correctly matches a complete UTF-8 character for /./, for instance. Rails sets this global by default, so if you were using Rails, regular expressions respected Unicode characters, not raw bytes.
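You can see the difference this mode makes by counting the matches for /./ under Ruby 1.8 (“ümlaut” is seven bytes but six characters):

# Ruby 1.8
str = "ümlaut"
 
$KCODE = "NONE"
p str.scan(/./).length   # => 7 (the engine matches raw bytes)
 
$KCODE = "UTF-8"
p str.scan(/./).length   # => 6 (/./ now matches whole UTF-8 characters)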

Third, very little non-English content in the wild is actually encoded in ISO-8859-1. If you were expecting to deal with content that was not English, you would probably set your MySQL database to use a UTF-8 encoding. Since Rails sets UTF-8 as the charset of outbound documents, most browsers will in fact return UTF-8 encoded data.

Fourth, the problems caused when an ISO-8859-1 String is accidentally concatenated into a UTF-8 String are not as jarring as the errors produced by Ruby 1.9. Let’s try a little experiment. First, open up a text editor, create a new file, and save it in the ISO-8859-1 encoding.

$KCODE = "UTF-8"
 
iso_8859_1 = "ümlaut"
 
# the byte representation of "ümlaut" in UTF-8
utf8 = "\xC3\xBCmlaut"
 
puts iso_8859_1
puts utf8
 
puts iso_8859_1 + utf8
puts utf8 + iso_8859_1
 
# Output
# ?mlaut
# ümlaut
# ?mlautümlaut
# ümlaut?mlaut

If you somehow get ISO-8859-1 encoded content that uses characters outside of the ASCII-7 range, Ruby doesn’t puke. Instead, it simply replaces the unidentified character with a “?”, which can easily go unnoticed in a mostly English site with a few “José”s thrown into the mix. It could also easily be dismissed as a “weird bug that we don’t have time to figure out right now”.

Finally, Rails itself provides a pure-Ruby UTF-8 library that mops up a lot of the remaining issues. Specifically, it provides an alternate String class that can properly handle operations like split, truncate, index, justify and other operations that need to operate on characters, not bytes. It then uses this library internally in helpers like truncate, transparently avoiding a whole other class of issue.
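For instance, inside a Rails 2.3-era app (where ActiveSupport is already loaded), the mb_chars proxy does the right thing where plain Ruby 1.8 String operations would count bytes:

# Ruby 1.8 with ActiveSupport loaded
str = "ümlaut"
 
puts str.length            # => 7 (plain 1.8 Strings count bytes)
puts str.mb_chars.length   # => 6 (the multibyte proxy counts characters)
puts str.mb_chars.upcase   # => ÜMLAUT (plain upcase would leave the "ü" alone)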

In short, if you’re dealing mostly with English text, and you get unlucky enough to get ISO-8859-1 input from somewhere, the worst case is that you get a “?” instead of an “é”. If you’re dealing with a lot of non-English text, you’re probably not using ISO-8859-1 sources. In either case, English (ASCII) text is compatible with UTF-8, and Rails provides solid enough pure-Ruby UTF-8 support to get you most of the rest of the way.

That said, anyone dealing with encodings other than UTF-8 and ISO-8859-1 (Japanese and Russian Rubyists, for instance) was definitely not in a good place with Ruby 1.8.

Thanks

I want to personally thank Jay Freeman (aka saurik), who in addition to being a general badass, spent about 15 hours with me patiently explaining these issues and working through the Ruby 1.9 source to help fully understand the tradeoffs available.

30 Responses to “Ruby 1.9 Encodings: A Primer and the Solution for Rails”

Excellent article, thanks Yehuda! The database adapter encoding issue has prevented me from moving to 1.9.

Good explanation. Unfortunately, I’d already had to figure out way too much of this on my own from running Rails apps in 1.9.1 and hunting down these exact errors. >8-> (I knew in my case it was all browser issues, because I was using MongoMapper for the data layer and MongoDB is _only_ UTF-8 aware.)

One other observation from my experience: it isn’t just non-English that causes problems. I work for an academic society, and our largest Rails app thus far was a submissions system for our paper proposals. Comparative theology scholars don’t trust their Web browsers nearly as much as they trust Microsoft Word. It was more common than not for them to paste in their entire proposals from a Word document — which meant that several flavors of “smart” quotes, long dashes, ellipses characters, and other autocorrected Word-isms showed up all over the place, and created plenty of encoding issues without the users knowing they were entering anything odd. It also interfered with the Markdown formatting we already had in place, as did Word’s tendency to indent every paragraph by several spaces. I had to write some patch layers on top of RDiscount to fix the parts that were confusing it.

We also had plenty of genuine international or special character needs (e.g., a paper in the Yogācāra Studies Consultation entitled “The Ālaya-vijñāna Discussion in the Viniṣcayasaṃgrahaṇī of the Yogācarabhūmi,” and another titled “Tweet if U ♥ Jesus: Spiritual Authority, Identity and Community in the Digital Reformation”) and if it weren’t for the incredible flexibility of UTF-8 in Ruby 1.9, we’d have had a lot more problems than we did.

Yehuda,

I ran into a lot of those problems using Ruby 1.9.1, and I ended up switching back to Ruby 1.8 so I could deploy my app.

Since you mentioned José’s name a lot, you must have worked this out for the Portuguese (ISO-8859-1) language.

What should one do when mainly working on ISO-8859-1 apps, to avoid those exceptions? A # encoding: ISO-8859-1 comment in every Ruby source file that needs “é” characters? What about ERB?

[]’s

Yehuda,
Good article, I’ve been dealing with these issues a lot lately. Where can I find some of the heuristics that will let me detect what encoding data from forms is coming in as? You mentioned in the article that there is something that will get it right 99.9% of the time.

Thanks,
Ryan

Minor nitpick: despite being an ISO standard, ISO-8859-5 is largely unused in Russia. People here mostly use Windows-1251 or UTF-8 instead.

A nice summary of the issue Yehuda.

In theory I like the idea of Ruby supporting multiple encodings, but I’ve run into too many surprise exceptions to be comfortable with the current implementation.

A way does need to be found to make the common case (UTF-8 everywhere) slightly less bumpy. I like the idea of DB drivers respecting Encoding.default_internal; I think it would help the situation a lot.

Thanks for a nice explanation!

I was mostly aware of these things since we ported our Rails app to Ruby 1.9.1. It has been working without any encoding issues for over half a year. The content is mostly UTF-8 (database and templates), with some occasional conversion to ISO-8859-1.
All we had to do was use an encoding-aware MySQL driver (http://github.com/lsegal/mysql-ruby) and a couple of force_encoding patches to ActionView and ActionController.

> Instead, it simply replaces the unidentified character with a “?”
This is incorrect.
Ruby (both 1.8 and 1.9) doesn’t replace anything silently.
Your terminal emulator *displays* “?” instead of an invalid character.

“This will work great for Strings whose bytes actually are encoded in UTF-8.”

I guess you mean characters, bytes don’t need an encoding :)

I should have said “strings whose bytes represent UTF8 characters”

Thank you for the heads up. I always like knowing issues are coming before the bug happens.

With that in mind, is there a way to override Ruby’s ASCII-7 exception while running tests? I can envision adding characters that trigger the exception to my future tests, but it would be nice not to have to rewrite everything.

Interesting read. I was wondering why Ruby didn’t go the ‘just convert everything to UTF-8’ route, but the Japanese problem explains it. No biggie. Having Ruby respect encodings is a blessing for those of us who have to deal with anything but English.

People should be careful with assuming ISO-8859-1 for Latin text, though. In Europe, it will quite often actually be ISO-8859-15 since ISO-8859-1 lacks a Euro symbol and some other glyphs like fractions and daggers that you might encounter when processing external text. ISO-8859-15 corrects this. This also matches the Windows-1252 encoding a bit better (well, at least the Euro symbol is in the same place).

Here’s an interesting read for those new to text encodings: http://www.joelonsoftware.com/articles/Unicode.html

I highly recommend this article: “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” to understand unicode stuff. http://joelonsoftware.com/articles/Unicode.html

ISO-8859-5 was never popular in Russia. It was a pretty common error in the early days of Linux: it was selected as the default codepage for Russia because some Western developer thought it was our standard, and it was really annoying. I’ve heard Bulgarian and Serbian Unixes used it in the early days, but I’m not sure.

In the early days of the Russian segment of the Internet, the most popular encoding for webpages and email was KOI-8. It was almost universally replaced with Windows-1251 in the 2000s, and in the last 3-5 years UTF-8 has become more and more common, even on websites made in non-Unicode-aware languages such as PHP and, sadly, Ruby.

I remember being frustrated and very afraid of encoding problems when I switched from Perl 5.8 (which has, IMO, really good and solid Unicode support) to Ruby. It turned out that in most situations it wasn’t a problem at all, unless you needed regexps. And some of those problems could be avoided with ActiveSupport’s multibyte support and the Oniguruma gem for regexen.

Nowadays, the most common setup for Rails developers is to use UTF-8 for Russian both in templates and in the database, with some legacy support for cp-1251 in Cyrillic URLs entered on Windows machines, but not very often.

Today, some Russian developers still think it’s crazy to use Ruby 1.9, exactly because of the overcomplicated encoding support, but I hope the future isn’t that dark.

Thanks for the nice writeup!

Maybe someone should fork Ruby 1.9 and create one that *only* supports BINARY and UTF-8 or UTF-16.

To avoid confusion :)

-rp

I think it’s fairly common for English native-speaking uber hackers (note: uber is an import of the word über, which probably lost its ‘ü’ to encoding issues ;) ) to be unaware of encoding problems. Nice to see you fixed this.

Great article Yehuda. Encoding is the biggest change in Ruby 1.9 and it is always nice to see it explained.
The database solution is ok and I think it is easy to implement. But how about the POST problem? Is there some workaround that we can use today with Rails? Is there some way to specify the encoding of a form and get this information interpreted by Rails 3? I tried accept and accept-encoding, but none of them seemed to work…

Thank you so much for this article. I was very happy to read it; I faced all those problems when I tried to test my application using Ruby 1.9, and it took me a while to work around them. Our team would like to know if there are any hopes for Rails 3 to ease up those problems.
Thanks again

Yep! As I said in the article, Rails should handle the vast majority of these issues for you :)

In the real world things are much more complicated than what’s presented here, using Ruby 1.9.1, Rails 2.3.8 and various gems.

The only solution that I have found to work 100% is a mix of magic comments, String#force_encoding and setting Encoding.default_external.

Hi Yehuda,
can you explain how to use Hebrew characters with Ruby 1.9.2?
Thanks

The encoding stuff bores me to no end and is total crap anyway.

I want the default behaviour of Ruby 1.8.x

I don’t care about the change in 1.9

I am not japanese, I don’t even use UTF-8

Why am I forced to deal with encoding in Ruby 1.9 now???

Hi Yehuda,
I have been searching for this for many days now; thanks for such a great article.
I have come across another issue: my application has an admin side and a client side, both using the same database, and the database contains names like “Crème Fraîche”.
When I do a force encoding like r.name.force_encoding("ISO-8859-1").encode("UTF-8"),
the admin side displays the data properly, but the client side gives the error “incompatible character encodings: ISO-8859-1 and UTF-8”.

Before doing the force encoding, when I look at the names in the logs, they are proper on both the admin and the client side.

Am I missing something on the client side? Some other change might be required; please help, I have been looking at this problem for the last 5 days.

“For a variety of complicated reasons, Japanese encodings, such as Shift-JIS, are not considered to encode losslessly into UTF-8.”

What are the complicated reasons? I can’t find references to them. They appear to be the root cause of a major “configuration over convention” choice which runs counter to core Ruby philosophy.

I’d also love to see a reference for the part about Shift-JIS to UTF-8 being lossy, as I searched and couldn’t find anything (but perhaps that’s because I can’t speak Japanese ;)).

Thank you for this great article.

I think that Ruby should add an encoding detection method. Most Arabic websites use Windows-1256. Some site owners read about the benefits of UTF-8 and tried to switch to it… Can you guess what they did? They just changed the encoding in their html tag. It worked for websites with static content, but failed for the other websites, which used database content.

Some were lucky to revert back before new data got its way into the database (because they saw weird text all over the place when opening their websites); others were not. Consequently, support forums are now filled with help requests asking how to correct these kinds of errors.

In these situations, an encoding detection method will be AWESOME! Writing such functionality might be impossible, but I think there should be something which can help in these situations.

Just for the sanity of those visiting this blog article in hopes of dealing with MySQL ASCII-8BIT & UTF-8 issues: I found that the mysql2 gem actually solves all of the issues I had when using the default mysql gem for Ruby-MySQL interactions. Give that a try if you get odd UnknownConversion errors :)

Please consider clarifying, or giving a source that explains, the complicated reasons of UTF-8 not mapping perfectly to SHIFT-JIS that you mentioned.

Hi there! Thanks for the great article :)

I have a little question. I’m trying to deal with ASCII-8BIT (as it was detected) coming from a TCP socket with Russian symbols like \xD1, \xD0 etc., but I get scrambled output when calling force_encoding('UTF-8') on the BINARY data, exactly like you say. What can I do to get normal UTF-8 output when I send these symbols over telnet? I’m talking about TCPServer from the stdlib.

The reason the Japanese don’t like simply converting everything to Unicode is a process adopted by the Unicode consortium called Han Unification. Unicode encodes abstract “characters” rather than concrete “glyphs”. For example, the letter ‘a’ can be written as a circle with a line to the right, or it can have an extra curl at the top. These are two glyphs of the same character, so there is only one Unicode code point for both; the difference is in the font that renders it.

Chinese, Japanese, traditional Korean and to some extent Vietnamese all use Han characters (known as Chinese characters, Hanzi or Kanji). Often the same glyph is used in all four languages, but with small regional variants in the graphemes. So to correctly display a Japanese text encoded in Unicode, you need to use a font that’s specific for Japanese, or you might end up seeing Chinese variants for instance. These variations might seem very small to a western eye, but they are considered an important part of these nations’ culture and history.

The Japanese especially have always strongly opposed Han unification. It is, however, possible to do a lossless round-trip conversion from JIS to UTF-8 and back; in fact, Unicode has several code points that were added specifically to make these kinds of round-trip conversions lossless.
