A Tale of Abort Traps (or Always Question Your Assumptions)

For a few months now, the bundler team has been getting intermittent reports of segfaults in C extensions that happen when using bundler with rvm. A cursory investigation revealed that the issue was that the C extensions were compiled for the wrong version of Ruby.

For instance, we would get reports of segfaults in nokogiri when using Ruby 1.9 that resulted from the $GEM_HOME in Ruby 1.9 containing a .bundle file compiled for Ruby 1.8. We got a lot of really angry bug reports, and a lot of speculation that we were doing something wrong that could be obviously fixed.

I finally ran into the issue myself, on my own machine a couple days ago, and tracked it down, deep into the fires of Mount Doom. A word of warning: this story may shock you.

It Begins

I usually use rvm for my day-to-day work, but I tend to not use gemsets unless I want a guaranteed clean environment for debugging something. Bundler takes care of isolation, even in the face of a lot of different gems mixed together in a global gemset. Last week, I was deep in the process of debugging another issue, so I was making pretty heavy use of gemset (and rvm gemset empty).

I had switched to Ruby 1.9.2 (final, which had just come out), created a new gemset, and run bundle install on the project I was working on, which included thin in the Gemfile. I had run bundle install to install the gems in the project, and among other things, bundler installed thin "with native extensions". I use bundler a lot, and I had run this exact command probably thousands of times.

This time, however, I got a segfault in Ruby 1.9, that pointed at the require call to the rubyeventmachine.bundle file.

Debugging

Now that I had the bug on a physical machine, I started debugging. Our working hypothesis was that bundler or Rubygems was somehow compiling gems against Ruby 1.8, even when on Ruby 1.9, but we couldn't figure out exactly how it could be happening.

The first thing I did was run otool -L on the binary (rubyeventmachine.bundle):

$ otool -L rubyeventmachine.bundle 
rubyeventmachine.bundle:
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/libruby.1.dylib
  (compatibility version 1.8.0, current version 1.8.7)
  /usr/lib/libssl.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libcrypto.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.3)
  /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.0.1)
  /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.9.0)

So that's weird. I had run bundle install from inside an rvm gemset, yet it was being compiled against the libruby that ships with OSX. I had always suspecting that the problem was a leak from another rvm-installed Ruby, so this was definitely a surprise.

Just out of curiosity, I ran which bundle:

$ which bundle
/usr/bin/bundle

Ok, now I knew something was rotten. I printed out the $PATH:

$ echo $PATH
/Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin:...
$ ls /Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin
asdf*             erubis*           rake2thor*        sc-build-number*
autospec*         prettify_json.rb* redcloth*         sc-docs*
bundle*           rackup*           ruby-prof*        sc-gen*
edit_json.rb*     rake*             sc-build*         sc-init*

In other words, a big fat WTF.

I asked around, and some people had the vague idea that there was a $PATH cache in Unix shells. Someone pointed me at this post about clearing the cache.

Sure enough, running hash -r fixed the output of which. I alerted Wayne of rvm to this problem, and he threw in a fix to rvm that cleared the cache when switching rvms. Momentarily, everything seemed fine.

Digging Further

I still didn't exactly understand how this condition could happen in the first place. When I went digging, I discovered that shells almost uniformly clear the path cache when modifying the $PATH. Since rvm pushes its bin directory onto the $PATH when you switch to a different rvm, I couldn't understand how exactly this problem was happening.

I read through Chapter 3 of the zsh guide, and it finally clicked:

The way commands are stored has other consequences. In particular, zsh won't look for a new command if it already knows where to find one. If I put a new ls command in /usr/local/bin in the above example, zsh would continue to use /bin/ls (assuming it had already been found). To fix this, there is the command rehash, which actually empties the command hash table, so that finding commands starts again from scratch. Users of csh may remember having to type rehash quite a lot with new commands: it's not so bad in zsh, because if no command was already hashed, or the existing one disappeared, zsh will automatically scan the path again; furthermore, zsh performs a rehash of its own accord if $path is altered. So adding a new duplicate command somewhere towards the head of $path is the main reason for needing rehash.

By using the hash command (which prints out all the entries in this cache), I was able to confirm that the same behavior exists in bash, but it seems that the which command (which I was using for testing the problem) implicitly rehashes in bash.

During this time, I also took a look at /usr/bin/bundle, which looks like this:

$ cat /usr/bin/ruby
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end

gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

As you can see, executable wrappers created by Rubygems hardcode the version of Ruby that was used to install them. This is one of the reasons that rvm keeps its own directory for installed executables.

A Working Hypothesis

It took me some time to figure all this out, during which time I was speaking to a bunch of friends and the guys in #zsh. I formed a working hypothesis. I figured that people were doing something like this:

  1. Install bundler onto their system, and use it there
  2. Need to work on a new project, so switch to rvm, and create a new gemset
  3. Run bundle install, resulting in an error
  4. Run gem install bundler, to install bundler
  5. Run bundle install, which works, but has a subtle bug

Let's walk through each of these steps, and unpack exactly what happens.

Install bundler on their system

This results in an executable at /usr/bin/bundle that looks like this:

#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end

gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

There are two important things here. First, it hardcodes the shebang to the system Ruby location. Second, it uses Rubygems to look up the location of the executable shipped with bundler, which will respect GEM_HOME.

Switch to rvm, and create a new gemset

Switching to an rvm gemset does a few relevant things. First, it prepends ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin onto the $PATH. This effectively resets the shell's built-in command cache. Second, it sets $GEM_HOME to ~/.rvm/gems/ruby-1.9.2-p0. The $GEM_HOME is where Ruby both looks for gems as well as where it installs gems.

Run bundle install, resulting in an error

Specifically,

$ bundle install
/Library/Ruby/Site/1.8/rubygems.rb:777:in `report_activate_error': 
Could not find RubyGem bundler (>= 0) (Gem::LoadError)
	from /Library/Ruby/Site/1.8/rubygems.rb:211:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:1056:in `gem'
	from /usr/bin/bundle:18

Most people don't take such a close look at this error, interpreting it as the equivalent of command not found: bundle. What's actually happening is a bit different. Since bundle is installed at /usr/bin/bundle, the shell finds it, and runs it. It uses the system Ruby (hardcoded in its shebang), and the $GEM_HOME set by rvm. Since I just created a brand new gemset, the bundler gem is not found in $GEM_HOME. As a result, the line gem 'bundler', version in the executable fails with the error I showed above.

However, because the shell found the executable, it ends up in the shell's command cache. You can see the command cache by typing hash. In both bash and zsh, the command cache will include an entry for bundle pointing at /usr/bin/bundle. zsh is a more aggressive about populating the cache, so you'll see the list of all commands in the system in the command cache (which, you'll recall, was reset when you first switched into the gemset, because $PATH was altered).

Run gem install bundler, to install bundler

This will install the bundle executable to ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin, which is the first entry on the $PATH. It will also install bundler to the $GEM_HOME.

Run bundle install, which works, but has a subtle bug

Here's where the fun happens. Since we didn't modify the cache, or call hash -r, the next call to bundle still picks up the bundle in /usr/bin/bundle, which is hardcoded to use system Ruby. However, it will use the version of Bundler we just installed, since it was installed to $GEM_HOME, and Rubygems uses that to look for gems. In other words, even though we're using system Ruby, the /usr/bin/bundle executable will use the rvm gemset's Bundler gem we just installed.

Additionally, because $GEM_HOME points to the rvm gemset, bundler will install all its gems to the rvm gemset. Taken together, these factors make it almost completely irrelevant that the system Ruby was used to install gems. After all, who cares which Ruby installed the gems, as long as they end up in the right place for the rvm's Ruby.

There are actually two problems. First, as we've seen, the version of Ruby that installs a gem also hardcodes itself into the shebang of the executables for the gem.

Try this experiment:

$ rvm use 1.9.2
$ rvm gemset create experiment
$ rvm gemset use experiment
$ /usr/bin/gem install rack
$ cat ~/.rvm/gems/ruby-1.9.2-p0@experiment/bin/rackup
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'rack' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end

gem 'rack', version
load Gem.bin_path('rack', 'rackup', version)

As a result, any gems that bundler installed will put their executables in the gemset's location, but still be hardcoded to use the system's Ruby. This is not actually the problem we're encountering here, but it could cause some weird results.

More concerningly, since our system Ruby installed the gems, it will also link any C extensions it compiles against its own copy of libruby. This shouldn't be a major issue if you're using a version of Ruby 1.8 in rvm, but it is a major issue if you're using Ruby 1.9.

Now, we have a gem stored in your rvm's Ruby that has a .bundle in it that is linked against the wrong Ruby. When running your application, Ruby 1.9 will try to load it, and kaboom: segfault.

Postscript

The crazy thing about this story is that it's a lot of factors conspiring to cause problems. The combination of the shell command cache, $GEM_HOME making it look like SYSTEM Ruby was doing the right thing, and hardcoding the version of Ruby in the shebang line of installed gems made this segfault possible.

Thankfully, now that we figured the problem out, the latest version of rvm fixes it. The solution: wrap the gem command in a shell function that calls hash -r afterward.

Amazing.