Yehuda Katz is a member of the Ruby on Rails core team, and lead developer of the Merb project. He is a member of the jQuery Core Team, and a core contributor to DataMapper. He contributes to many open source projects, like Rubinius and Johnson, and works on some he created himself, like Thor.

@garybernhardt @steveklabnik I guess I have to ask… is cls a troll project?

Archive for August, 2010

A Tale of Abort Traps (or Always Question Your Assumptions)

For a few months now, the bundler team has been getting intermittent reports of segfaults in C extensions that happen when using bundler with rvm. A cursory investigation revealed that the issue was that the C extensions were compiled for the wrong version of Ruby.

For instance, we would get reports of segfaults in nokogiri when using Ruby 1.9 that resulted from the $GEM_HOME in Ruby 1.9 containing a .bundle file compiled for Ruby 1.8. We got a lot of really angry bug reports, and a lot of speculation that we were doing something wrong that could be obviously fixed.

I finally ran into the issue myself, on my own machine a couple days ago, and tracked it down, deep into the fires of Mount Doom. A word of warning: this story may shock you.

It Begins

I usually use rvm for my day-to-day work, but I tend to not use gemsets unless I want a guaranteed clean environment for debugging something. Bundler takes care of isolation, even in the face of a lot of different gems mixed together in a global gemset. Last week, I was deep in the process of debugging another issue, so I was making pretty heavy use of gemset (and rvm gemset empty).

I had switched to Ruby 1.9.2 (final, which had just come out), created a new gemset, and run bundle install on the project I was working on, which included thin in the Gemfile. I had run bundle install to install the gems in the project, and among other things, bundler installed thin “with native extensions”. I use bundler a lot, and I had run this exact command probably thousands of times.

This time, however, I got a segfault in Ruby 1.9, that pointed at the require call to the rubyeventmachine.bundle file.

Debugging

Now that I had the bug on a physical machine, I started debugging. Our working hypothesis was that bundler or Rubygems was somehow compiling gems against Ruby 1.8, even when on Ruby 1.9, but we couldn’t figure out exactly how it could be happening.

The first thing I did was run otool -L on the binary (rubyeventmachine.bundle):

$ otool -L rubyeventmachine.bundle 
rubyeventmachine.bundle:
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/libruby.1.dylib
  (compatibility version 1.8.0, current version 1.8.7)
  /usr/lib/libssl.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libcrypto.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.3)
  /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.0.1)
  /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.9.0)

So that’s weird. I had run bundle install from inside an rvm gemset, yet it was being compiled against the libruby that ships with OSX. I had always suspecting that the problem was a leak from another rvm-installed Ruby, so this was definitely a surprise.

Just out of curiosity, I ran which bundle:

$ which bundle
/usr/bin/bundle

Ok, now I knew something was rotten. I printed out the $PATH:

$ echo $PATH
/Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin:...
$ ls /Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin
asdf*             erubis*           rake2thor*        sc-build-number*
autospec*         prettify_json.rb* redcloth*         sc-docs*
bundle*           rackup*           ruby-prof*        sc-gen*
edit_json.rb*     rake*             sc-build*         sc-init*

In other words, a big fat WTF.

I asked around, and some people had the vague idea that there was a $PATH cache in Unix shells. Someone pointed me at this post about clearing the cache.

Sure enough, running hash -r fixed the output of which. I alerted Wayne of rvm to this problem, and he threw in a fix to rvm that cleared the cache when switching rvms. Momentarily, everything seemed fine.

Digging Further

I still didn’t exactly understand how this condition could happen in the first place. When I went digging, I discovered that shells almost uniformly clear the path cache when modifying the $PATH. Since rvm pushes its bin directory onto the $PATH when you switch to a different rvm, I couldn’t understand how exactly this problem was happening.

I read through Chapter 3 of the zsh guide, and it finally clicked:

The way commands are stored has other consequences. In particular, zsh won’t look for a new command if it already knows where to find one. If I put a new ls command in /usr/local/bin in the above example, zsh would continue to use /bin/ls (assuming it had already been found). To fix this, there is the command rehash, which actually empties the command hash table, so that finding commands starts again from scratch. Users of csh may remember having to type rehash quite a lot with new commands: it’s not so bad in zsh, because if no command was already hashed, or the existing one disappeared, zsh will automatically scan the path again; furthermore, zsh performs a rehash of its own accord if $path is altered. So adding a new duplicate command somewhere towards the head of $path is the main reason for needing rehash.

By using the hash command (which prints out all the entries in this cache), I was able to confirm that the same behavior exists in bash, but it seems that the which command (which I was using for testing the problem) implicitly rehashes in bash.

During this time, I also took a look at /usr/bin/bundle, which looks like this:

$ cat /usr/bin/ruby
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

As you can see, executable wrappers created by Rubygems hardcode the version of Ruby that was used to install them. This is one of the reasons that rvm keeps its own directory for installed executables.

A Working Hypothesis

It took me some time to figure all this out, during which time I was speaking to a bunch of friends and the guys in #zsh. I formed a working hypothesis. I figured that people were doing something like this:

  1. Install bundler onto their system, and use it there
  2. Need to work on a new project, so switch to rvm, and create a new gemset
  3. Run bundle install, resulting in an error
  4. Run gem install bundler, to install bundler
  5. Run bundle install, which works, but has a subtle bug

Let’s walk through each of these steps, and unpack exactly what happens.

Install bundler on their system

This results in an executable at /usr/bin/bundle that looks like this:

#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

There are two important things here. First, it hardcodes the shebang to the system Ruby location. Second, it uses Rubygems to look up the location of the executable shipped with bundler, which will respect GEM_HOME.

Switch to rvm, and create a new gemset

Switching to an rvm gemset does a few relevant things. First, it prepends ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin onto the $PATH. This effectively resets the shell’s built-in command cache. Second, it sets $GEM_HOME to ~/.rvm/gems/ruby-1.9.2-p0. The $GEM_HOME is where Ruby both looks for gems as well as where it installs gems.

Run bundle install, resulting in an error

Specifically,

$ bundle install
/Library/Ruby/Site/1.8/rubygems.rb:777:in `report_activate_error': 
Could not find RubyGem bundler (>= 0) (Gem::LoadError)
	from /Library/Ruby/Site/1.8/rubygems.rb:211:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:1056:in `gem'
	from /usr/bin/bundle:18

Most people don’t take such a close look at this error, interpreting it as the equivalent of command not found: bundle. What’s actually happening is a bit different. Since bundle is installed at /usr/bin/bundle, the shell finds it, and runs it. It uses the system Ruby (hardcoded in its shebang), and the $GEM_HOME set by rvm. Since I just created a brand new gemset, the bundler gem is not found in $GEM_HOME. As a result, the line gem 'bundler', version in the executable fails with the error I showed above.

However, because the shell found the executable, it ends up in the shell’s command cache. You can see the command cache by typing hash. In both bash and zsh, the command cache will include an entry for bundle pointing at /usr/bin/bundle. zsh is a more aggressive about populating the cache, so you’ll see the list of all commands in the system in the command cache (which, you’ll recall, was reset when you first switched into the gemset, because $PATH was altered).

Run gem install bundler, to install bundler

This will install the bundle executable to ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin, which is the first entry on the $PATH. It will also install bundler to the $GEM_HOME.

Run bundle install, which works, but has a subtle bug

Here’s where the fun happens. Since we didn’t modify the cache, or call hash -r, the next call to bundle still picks up the bundle in /usr/bin/bundle, which is hardcoded to use system Ruby. However, it will use the version of Bundler we just installed, since it was installed to $GEM_HOME, and Rubygems uses that to look for gems. In other words, even though we’re using system Ruby, the /usr/bin/bundle executable will use the rvm gemset’s Bundler gem we just installed.

Additionally, because $GEM_HOME points to the rvm gemset, bundler will install all its gems to the rvm gemset. Taken together, these factors make it almost completely irrelevant that the system Ruby was used to install gems. After all, who cares which Ruby installed the gems, as long as they end up in the right place for the rvm’s Ruby.

There are actually two problems. First, as we’ve seen, the version of Ruby that installs a gem also hardcodes itself into the shebang of the executables for the gem.

Try this experiment:

$ rvm use 1.9.2
$ rvm gemset create experiment
$ rvm gemset use experiment
$ /usr/bin/gem install rack
$ cat ~/.rvm/gems/ruby-1.9.2-p0@experiment/bin/rackup
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'rack' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'rack', version
load Gem.bin_path('rack', 'rackup', version)

As a result, any gems that bundler installed will put their executables in the gemset’s location, but still be hardcoded to use the system’s Ruby. This is not actually the problem we’re encountering here, but it could cause some weird results.

More concerningly, since our system Ruby installed the gems, it will also link any C extensions it compiles against its own copy of libruby. This shouldn’t be a major issue if you’re using a version of Ruby 1.8 in rvm, but it is a major issue if you’re using Ruby 1.9.

Now, we have a gem stored in your rvm’s Ruby that has a .bundle in it that is linked against the wrong Ruby. When running your application, Ruby 1.9 will try to load it, and kaboom: segfault.

Postscript

The crazy thing about this story is that it’s a lot of factors conspiring to cause problems. The combination of the shell command cache, $GEM_HOME making it look like SYSTEM Ruby was doing the right thing, and hardcoding the version of Ruby in the shebang line of installed gems made this segfault possible.

Thankfully, now that we figured the problem out, the latest version of rvm fixes it. The solution: wrap the gem command in a shell function that calls hash -r afterward.

Amazing.

Using >= Considered Harmful (or, What’s Wrong With >=)

TL;DR Use ~> instead.

Having spent far, far too much time with Rubygems dependencies, and the problems that arise with unusual combinations, I am ready to come right out and say it: you basically never, ever want to use a >= dependency in your gems.

When you specify a dependency for your gem, it should mean that you are fairly sure that the unmodified code in the released gem will continue to work with any future version of the dependency that matches the version you specified. So for instance, let’s take a look at the dependencies listed in the actionpack gem:

activemodel (= 3.0.0.rc, runtime)
activesupport (= 3.0.0.rc, runtime)
builder (~> 2.1.2, runtime)
erubis (~> 2.6.6, runtime)
i18n (~> 0.4.1, runtime)
rack (~> 1.2.1, runtime)
rack-mount (~> 0.6.9, runtime)
rack-test (~> 0.5.4, runtime)
tzinfo (~> 0.3.22, runtime)

Since we release the Rails gems as a unit, we declare hard dependencies on activemodel and activesupport. We declare soft dependencies on builder, erubis, i18n, rack, rack-mount, rack-test, and tzinfo.

You might not know what exactly the ~> version specifier means. Essentially, it decomposes into two specifiers. So ~> 2.1.2 means >= 2.1.2, < 2.2.0. In other words, it means “2.1.x, but not less than 2.1.2″. Specifying ~> 1.0, like many people do for Rack, means “any 1.x”.

You should make your dependencies as soft as the versioning scheme and release practices of your dependencies will allow. If you’re monkey-patching a library outside of its public API (not a very good practice for libraries), you should probably stick with an = dependency.

One thing for certain though: you cannot be sure that your gem works with every future version of your dependencies. Sanely versioned gems take the opportunity of a major release to break things, and until you have actually tested against the new versions, it’s madness to claim compatibility. One example: a number of gems have dependencies on activesupport >= 2.3. In a large number of cases, these gems do not work correctly with ActiveSuport 3.0, since we changed how components of ActiveSupport get loaded to make it easier to cherry-pick.

Now, instead of receiving a version conflict, users of these gems will get cryptic runtime error messages. Even worse, everything might appear to work, until some weird edge-case is exercised in production, and which your tests would have caught.

But What Happens When a New Version is Released?

One reason that people use the activesupport >= 2.3 is that, assuming Rails maintains backward-compatibility, their gem will continue to work in newer Rails environments without any difficulty. If everything happens to work, it saves you the time of running their unit tests against newer versions of dependencies and cutting a new release.

As I said before, this is a deadly practice. By specifying appropriate dependencies (based on your confidence in the underlying library’s versioning scheme), you will have a natural opportunity to run your test suite against the new versions, and release a new gem that you know actually works.

This does mean that you will likely want to release patch releases of old versions of your gem. For instance, if I have AuthMagic 1.0, which worked against Rails 2.3, and I release AuthMagic 2.0 once Rails 3.0 comes out, it makes sense to continue patching AuthMagic 1.0 for a little while, so your Rails 2.3 users aren’t left out in the cold.

Applications and Gemfiles

I should be clear that this versioning advice doesn’t necessarily apply to an application using Bundler. That’s because the Gemfile.lock, which you should check into version control, essentially converts all >= dependencies into hard dependencies. However, because you may want to run bundle update at some point in the future, which will update everything to the latest possible versions, you might want to use version specifiers in your Gemfile that seem likely to work into the future.

Threads (in Ruby): Enough Already

For a while now, the Ruby community has become enamored in the latest new hotness, evented programming and Node.js. It’s gone so far that I’ve heard a number of prominent Rubyists saying that JavaScript and Node.js are the only sane way to handle a number of concurrent users.

I should start by saying that I personally love writing evented JavaScript in the browser, and have been giving talks (for years) about using evented JavaScript to sanely organize client-side code. I think that for the browser environment, events are where it’s at. Further, I don’t have any major problem with Node.js or other ways of writing server-side evented code. For instance, if I needed to write a chat server, I would almost certainly write it using Node.js or EventMachine.

However, I’m pretty tired of hearing that threads (and especially Ruby threads) are completely useless, and if you don’t use evented code, you may as well be using a single process per concurrent user. To be fair, this has somewhat been the party line of the Rails team years ago, but Rails has been threadsafe since Rails 2.2, and Rails users have been taking advantage of it for some time.

Before I start, I should be clear that this post is talking about requests that spent a non-tiny amount of their time utilizing the CPU (normal web requests), even if they do spend a fair amount of time in blocking operations (disk IO, database). I am decidedly not talking about situations, like chat servers where requests sit idle for huge amounts of time with tiny amounts of intermittent CPU usage.

Threads and IO Blocking

I’ve heard a common misperception that Ruby inherently “blocks” when doing disk IO or making database queries. In reality, Ruby switches to another thread whenever it needs to block for IO. In other words, if a thread needs to wait, but isn’t using any CPU, Ruby’s built-in methods allow another waiting thread to use the CPU while the original thread waits.

If every one of your web requests uses the CPU for 30% of the time, and waits for IO for the rest of the time, you should be able to serve three requests in parallel, coming close to maxing out your CPU.

Here’s a couple of diagrams. The first shows how people imagine requests work in Ruby, even in threadsafe mode. The second is how an optimal Ruby environment will actually operate. This example is extremely simplified, showing only a few parts of the request, and assuming equal time spent in areas that are not necessarily equal.


Untitled.001.png


Untitled.002.png


I should be clear that Ruby 1.8 spends too much time context-switching between its green threads. However, if you’re not switching between threads extremely often, even Ruby 1.8′s overhead will amount to a small fraction of the total time needed to serve a request. A lot of the threading benchmarks you’ll see are testing pathological cases involve huge amounts of threads, not very similar to the profile of a web server.

(if you’re thinking that there are caveats to my “optimal Ruby environment”, keep reading)

“Threads are just HARD”

Another common gripe that pushes people to evented programming is that working with threads is just too hard. Working hard to avoid sharing state and using locks where necessary is just too tricky for the average web developer, the argument goes.

I agree with this argument in the general case. Web development, on the other hand, has an extremely clean concurrency primitive: the request. In a threadsafe Rails application, the framework manages threads and uses an environment hash (one per request) to store state. When you work inside a Rails controller, you’re working inside an object that is inherently unshared. When you instantiate a new instance of an ActiveRecord model inside the controller, it is rooted to that controller, and is therefore not used between live threads.

It is, of course, possible to use global state, but the vast majority of normal, day-to-day Rails programming (and for that matter, programming in any web framework in any language with a request model) is inherently threadsafe. This means that Ruby will transparently handle switching back and forth between active requests when you do something blocking (file, database, or memcache access, for instance), and you don’t need to personally manage the problems the arise when doing concurrent programming.

This is significantly less true about applications, like chat servers, that keep open a huge number of requests. In those cases, a lot of the application logic happens outside the individual request, so you need to personally manage shared state.

Historical Ruby Issues

What I’ve been talking about so far is how stock Ruby ought to operate. Unfortunately, a group of things have historically conspired to make Ruby’s concurrency story look much worse than it actually ought to be.

Most obviously, early versions of Rails were not threadsafe. As a result, all Rails users were operating with a mutex around the entire request, forcing Rails to behave like the first “Imagined” diagram above. Annoyingly, Mongrel, the most common Ruby web server for a few years, hardcoded this mutex into its Rails handler. As a result, if you spun up Rails in “threadsafe” mode a year ago using Mongrel, you would have gotten exactly zero concurrency. Also, even in threadsafe mode (when not using the built-in Rails support) Mongrel spins up a new thread for every request, not exactly optimal.

Second, the most common database driver, mysql is a very poorly behaved C extension. While built-in I/O (file or pipe access) correctly alerts Ruby to switch to another thread when it hits a blocking region, other C extensions don’t always do so. For safety, Ruby does not allow a context switch while in C code unless the C code explicitly tells the VM that it’s ok to do so.

All of the Data Objects drivers, which we built for DataMapper, correctly cause a context switch when entering a blocking area of their C code. The mysqlplus gem, released in March 2009, was designed to be a drop-in replacement for the mysql gem, but fix this problem. The new mysql2 gem, written by Brian Lopez, is a drop-in replacement for the old gem, also correctly handles encodings in Ruby 1.9, and is the new default MySQL driver in Rails.

Because Rails shipped with the (broken) mysql gem by default, even people running on working web servers (i.e. not mongrel) in threadsafe mode would have seen a large amount of their potential concurrency eaten away because their database driver wasn’t alerting Ruby that concurrent operation was possible. With mysql2 as the default, people should see real gains on threadsafe Rails applications.

A lot of people talk about the GIL (global interpreter lock) in Ruby 1.9 as a death knell for concurrency. For the uninitiated, the GIL disallows multiple CPU cores from running Ruby code simultaneously. That does mean that you’ll need one Ruby process (or thereabouts) per CPU core, but it also means that if your multithreaded code is running correctly, you should need only one process per CPU core. I’ve heard tales of six or more processes per core. Since it’s possible to fully utilize a CPU with a single process (even in Ruby 1.8), these applications could get a 4-6x improvement in RAM usage (depending on context-switching overhead) by switching to threadsafe mode and using modern drivers for blocking operations.

JRuby, Ruby 1.9 and Rubinius, and the Future

Finally, JRuby already runs without a global interpreter lock, allowing your code to run in true parallel, and to fully utilize all available CPUs with a single JRuby process. A future version of Rubinius will likely ship without a GIL (the work has already begun), also opening the door to utilizing all CPUs with a single Ruby process.

And all modern Ruby VMs that run Rails (Ruby 1.9′s YARV, Rubinius, and JRuby) use native threads, eliminating the annoying tax that you need to pay for using threads in Ruby 1.8. Again, though, since that tax is small relative to the time for your requests, you’d likely see a non-trivial improvement in latency in applications that spend time in the database layer.

To be honest, a big part of the reason for the poor practical concurrency story in Ruby has been that the Rails project didn’t take it seriously, which it difficult to get traction for efforts to fix a part of the problem (like the mysql driver).

We took concurrency very seriously in the Merb project, leading to the development of proper database drivers for DataMapper (Merb’s ORM), and a top-to-bottom understanding of parts of the stack that could run in parallel (even on Ruby 1.8), but which weren’t. Rails 3 doesn’t bring anything new to the threadsafety of Rails itself (Rails 2.3 was threadsafe too), but by making the mysql2 driver the default, we have eliminated a large barrier to Rails applications performing well in threadsafe mode without any additional research.

UPDATE: It’s worth pointing to Charlie Nutter’s 2008 threadsafety post, where he talked about how he expected threadsafe Rails would impact the landscape. Unfortunately, the blocking MySQL driver held back some of the promise of the improvement for the vast majority of Rails users.