Yehuda Katz is a member of the Ember.js, Ruby on Rails and jQuery Core Teams; he spends his daytime hours at the startup he founded, Tilde Inc.. Yehuda is co-author of best-selling jQuery in Action and Rails 3 in Action. He spends most of his time hacking on open source—his main projects, like Thor, Handlebars and Janus—or traveling the world doing evangelism work. He can be found on Twitter as @wycats and on Github.

A Tale of Abort Traps (or Always Question Your Assumptions)

For a few months now, the bundler team has been getting intermittent reports of segfaults in C extensions that happen when using bundler with rvm. A cursory investigation revealed that the issue was that the C extensions were compiled for the wrong version of Ruby.

For instance, we would get reports of segfaults in nokogiri when using Ruby 1.9 that resulted from the $GEM_HOME in Ruby 1.9 containing a .bundle file compiled for Ruby 1.8. We got a lot of really angry bug reports, and a lot of speculation that we were doing something wrong that could be obviously fixed.

I finally ran into the issue myself, on my own machine a couple days ago, and tracked it down, deep into the fires of Mount Doom. A word of warning: this story may shock you.

It Begins

I usually use rvm for my day-to-day work, but I tend to not use gemsets unless I want a guaranteed clean environment for debugging something. Bundler takes care of isolation, even in the face of a lot of different gems mixed together in a global gemset. Last week, I was deep in the process of debugging another issue, so I was making pretty heavy use of gemset (and rvm gemset empty).

I had switched to Ruby 1.9.2 (final, which had just come out), created a new gemset, and run bundle install on the project I was working on, which included thin in the Gemfile. I had run bundle install to install the gems in the project, and among other things, bundler installed thin “with native extensions”. I use bundler a lot, and I had run this exact command probably thousands of times.

This time, however, I got a segfault in Ruby 1.9, that pointed at the require call to the rubyeventmachine.bundle file.

Debugging

Now that I had the bug on a physical machine, I started debugging. Our working hypothesis was that bundler or Rubygems was somehow compiling gems against Ruby 1.8, even when on Ruby 1.9, but we couldn’t figure out exactly how it could be happening.

The first thing I did was run otool -L on the binary (rubyeventmachine.bundle):

$ otool -L rubyeventmachine.bundle 
rubyeventmachine.bundle:
  /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/libruby.1.dylib
  (compatibility version 1.8.0, current version 1.8.7)
  /usr/lib/libssl.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libcrypto.0.9.8.dylib (compatibility version 0.9.8, current version 0.9.8)
  /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.3)
  /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.0.1)
  /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.9.0)

So that’s weird. I had run bundle install from inside an rvm gemset, yet it was being compiled against the libruby that ships with OSX. I had always suspecting that the problem was a leak from another rvm-installed Ruby, so this was definitely a surprise.

Just out of curiosity, I ran which bundle:

$ which bundle
/usr/bin/bundle

Ok, now I knew something was rotten. I printed out the $PATH:

$ echo $PATH
/Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin:...
$ ls /Users/wycats/.rvm/gems/ruby-1.9.2-p0/bin
asdf*             erubis*           rake2thor*        sc-build-number*
autospec*         prettify_json.rb* redcloth*         sc-docs*
bundle*           rackup*           ruby-prof*        sc-gen*
edit_json.rb*     rake*             sc-build*         sc-init*

In other words, a big fat WTF.

I asked around, and some people had the vague idea that there was a $PATH cache in Unix shells. Someone pointed me at this post about clearing the cache.

Sure enough, running hash -r fixed the output of which. I alerted Wayne of rvm to this problem, and he threw in a fix to rvm that cleared the cache when switching rvms. Momentarily, everything seemed fine.

Digging Further

I still didn’t exactly understand how this condition could happen in the first place. When I went digging, I discovered that shells almost uniformly clear the path cache when modifying the $PATH. Since rvm pushes its bin directory onto the $PATH when you switch to a different rvm, I couldn’t understand how exactly this problem was happening.

I read through Chapter 3 of the zsh guide, and it finally clicked:

The way commands are stored has other consequences. In particular, zsh won’t look for a new command if it already knows where to find one. If I put a new ls command in /usr/local/bin in the above example, zsh would continue to use /bin/ls (assuming it had already been found). To fix this, there is the command rehash, which actually empties the command hash table, so that finding commands starts again from scratch. Users of csh may remember having to type rehash quite a lot with new commands: it’s not so bad in zsh, because if no command was already hashed, or the existing one disappeared, zsh will automatically scan the path again; furthermore, zsh performs a rehash of its own accord if $path is altered. So adding a new duplicate command somewhere towards the head of $path is the main reason for needing rehash.

By using the hash command (which prints out all the entries in this cache), I was able to confirm that the same behavior exists in bash, but it seems that the which command (which I was using for testing the problem) implicitly rehashes in bash.

During this time, I also took a look at /usr/bin/bundle, which looks like this:

$ cat /usr/bin/ruby
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

As you can see, executable wrappers created by Rubygems hardcode the version of Ruby that was used to install them. This is one of the reasons that rvm keeps its own directory for installed executables.

A Working Hypothesis

It took me some time to figure all this out, during which time I was speaking to a bunch of friends and the guys in #zsh. I formed a working hypothesis. I figured that people were doing something like this:

  1. Install bundler onto their system, and use it there
  2. Need to work on a new project, so switch to rvm, and create a new gemset
  3. Run bundle install, resulting in an error
  4. Run gem install bundler, to install bundler
  5. Run bundle install, which works, but has a subtle bug

Let’s walk through each of these steps, and unpack exactly what happens.

Install bundler on their system

This results in an executable at /usr/bin/bundle that looks like this:

#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'bundler' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'bundler', version
load Gem.bin_path('bundler', 'bundle', version)

There are two important things here. First, it hardcodes the shebang to the system Ruby location. Second, it uses Rubygems to look up the location of the executable shipped with bundler, which will respect GEM_HOME.

Switch to rvm, and create a new gemset

Switching to an rvm gemset does a few relevant things. First, it prepends ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin onto the $PATH. This effectively resets the shell’s built-in command cache. Second, it sets $GEM_HOME to ~/.rvm/gems/ruby-1.9.2-p0. The $GEM_HOME is where Ruby both looks for gems as well as where it installs gems.

Run bundle install, resulting in an error

Specifically,

$ bundle install
/Library/Ruby/Site/1.8/rubygems.rb:777:in `report_activate_error': 
Could not find RubyGem bundler (>= 0) (Gem::LoadError)
	from /Library/Ruby/Site/1.8/rubygems.rb:211:in `activate'
	from /Library/Ruby/Site/1.8/rubygems.rb:1056:in `gem'
	from /usr/bin/bundle:18

Most people don’t take such a close look at this error, interpreting it as the equivalent of command not found: bundle. What’s actually happening is a bit different. Since bundle is installed at /usr/bin/bundle, the shell finds it, and runs it. It uses the system Ruby (hardcoded in its shebang), and the $GEM_HOME set by rvm. Since I just created a brand new gemset, the bundler gem is not found in $GEM_HOME. As a result, the line gem 'bundler', version in the executable fails with the error I showed above.

However, because the shell found the executable, it ends up in the shell’s command cache. You can see the command cache by typing hash. In both bash and zsh, the command cache will include an entry for bundle pointing at /usr/bin/bundle. zsh is a more aggressive about populating the cache, so you’ll see the list of all commands in the system in the command cache (which, you’ll recall, was reset when you first switched into the gemset, because $PATH was altered).

Run gem install bundler, to install bundler

This will install the bundle executable to ~/.rvm/gems/ruby-1.9.2-p0@gemset/bin, which is the first entry on the $PATH. It will also install bundler to the $GEM_HOME.

Run bundle install, which works, but has a subtle bug

Here’s where the fun happens. Since we didn’t modify the cache, or call hash -r, the next call to bundle still picks up the bundle in /usr/bin/bundle, which is hardcoded to use system Ruby. However, it will use the version of Bundler we just installed, since it was installed to $GEM_HOME, and Rubygems uses that to look for gems. In other words, even though we’re using system Ruby, the /usr/bin/bundle executable will use the rvm gemset’s Bundler gem we just installed.

Additionally, because $GEM_HOME points to the rvm gemset, bundler will install all its gems to the rvm gemset. Taken together, these factors make it almost completely irrelevant that the system Ruby was used to install gems. After all, who cares which Ruby installed the gems, as long as they end up in the right place for the rvm’s Ruby.

There are actually two problems. First, as we’ve seen, the version of Ruby that installs a gem also hardcodes itself into the shebang of the executables for the gem.

Try this experiment:

$ rvm use 1.9.2
$ rvm gemset create experiment
$ rvm gemset use experiment
$ /usr/bin/gem install rack
$ cat ~/.rvm/gems/ruby-1.9.2-p0@experiment/bin/rackup
#!/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby
#
# This file was generated by RubyGems.
#
# The application 'rack' is installed as part of a gem, and
# this file is here to facilitate running it.
#
 
require 'rubygems'
 
version = ">= 0"
 
if ARGV.first =~ /^_(.*)_$/ and Gem::Version.correct? $1 then
  version = $1
  ARGV.shift
end
 
gem 'rack', version
load Gem.bin_path('rack', 'rackup', version)

As a result, any gems that bundler installed will put their executables in the gemset’s location, but still be hardcoded to use the system’s Ruby. This is not actually the problem we’re encountering here, but it could cause some weird results.

More concerningly, since our system Ruby installed the gems, it will also link any C extensions it compiles against its own copy of libruby. This shouldn’t be a major issue if you’re using a version of Ruby 1.8 in rvm, but it is a major issue if you’re using Ruby 1.9.

Now, we have a gem stored in your rvm’s Ruby that has a .bundle in it that is linked against the wrong Ruby. When running your application, Ruby 1.9 will try to load it, and kaboom: segfault.

Postscript

The crazy thing about this story is that it’s a lot of factors conspiring to cause problems. The combination of the shell command cache, $GEM_HOME making it look like SYSTEM Ruby was doing the right thing, and hardcoding the version of Ruby in the shebang line of installed gems made this segfault possible.

Thankfully, now that we figured the problem out, the latest version of rvm fixes it. The solution: wrap the gem command in a shell function that calls hash -r afterward.

Amazing.

23 Responses to “A Tale of Abort Traps (or Always Question Your Assumptions)”

That was indeed an epic tale. The one nice thing a bug like this says, is that the tools for managing ruby environment that are available are now advanced and powerful enough to trip across crazy bugs like this one.

Thankfully we have people smart enough to figure them out :D

Thankyou, Yehuda, for sorting out the issue and posting your steps here. I truly had no idea where to start debugging this when I reported it.

All the work you are doing is very much appreciated here.

Agreed, thank you very much for all your time.

That’s an inspiring bug-hunt tale, and very informative to boot. Thanks. I didn’t know about 90% of that.

(Btw, I think you meant ‘$ cat /usr/bin/bundle’ instead of ‘$ cat /usr/bin/ruby’ above.)

Cool, thanks for digging into this! I had gotten as far into my own debugging as your notes up to the “Working Hypothesis” section before I finally gave up and left a commented on the bug.

There’s actually another subtlety here that makes it even worse. When debugging, I suspected a path cache issue and made sure to clear it but to no avail.

What actually happened was that I
- installed 1.9.2-rc2 using rvm
- immediately ran `gem install bundler –pre` since I didn’t have a system bundler

This is more problematic since `which bundler` (assuming cleared cache) -> /Users/smorel/.rvm/…

i.e. it appears correctly installed but the `gem install` command was still using the system gem command and thus hardcoding the system ruby #! into the bundle wrapper.

In other words, same root cause but even harder to track down since after clearing the caches both
`which gem`
and `which bundle` were the correct rvm versions. The gem wrapper had the correct env ruby shebang but the bundle wrapper had the system hard coded shebang.

Hence the reason why
`gem uninstall $1; gem install $1` was a 100% workaround.

That was quite the bug hunt. Thank you for sharing the details of this one, it was very educative.

How is this cache an even remotely sensible idea? It saves what? A few lstat()s per exec(), all on directories that are invariably in system cache? It was probably stupid even in 1970s.

Well done and thanks for nailing this little rascal Yahuda.

Great analysis of the problem. I love when people don’t give up when encountering weird problem like this one and keep on investigating (to the death ;-). Ability of hunting down cryptic bugs is what differ great developers from average ones.

>> Thankfully we have people smart enough to figure them out :D

Couldn’t agree more. Thanks Yehuda

Nice work tracking this one down, Yehuda. I’ve run into that segfault before and it really had me chasing my tail for a bit.

Here’s a much easier bug to deal with – I think you might be missing the end tag for blockquote after the zsh manual quote =)

I think there’s a thing you have overlooked:

zsh:

% which which
which: shell built-in command

% type which
which is a shell builtin

% which type
type: shell built-in command

csh:

% which which
which: shell built-in command.

% type which
which is /usr/bin/which

% which type
/usr/bin/type

bash:

$ which which
/usr/bin/which

$ type which
which is hashed (/usr/bin/which)

$ type type
type is a shell builtin

TL;DR with bash, which is an external command and it doesn’t care about bash’s cache.

Well done. Very informative and educational. Definitely a great mix of debugging techniques used for that too. Good reminder to hunt for bugs/issues outside of just the code you’re working on.

In my opinion, hardcoding the Ruby executable path into the shebang-line is not a good practice. This nasty bug only gives another reason against it.

It is possible to install the wrappers with an “env” shebang line (“#!/usr/bin/env ruby”); this means that Ruby is run through the “env” command.

Running a program through “env” effectively reassigns the complete environment and then runs the program in that environment. This will also avoid any “rehash” issues.

On my machine, I’ve configured “gem install” to always use the “–env-shebang” option, which does ensure that sane wrappers are generated when installing gems. This can be done through “~/.gemrc”.

“Thankfully, now that we figured the problem out, the latest version of rvm fixes it. The solution: wrap the gem command in a shell function that calls hash -r afterward.”

Just to be perfectly clear: this is applying duct-tape to the problem.

Instead, have gem generate its wrappers with an env-shebang line and no such workarounds will be required. Gem binaries will then always use the ruby which comes first in the PATH when the command is invoked.

I’ve just double-checked in Bash.

sounds like yet another cause of hair loss. Thanks for digging it out!

@Sigi: “Instead, have gem generate its wrappers with an env-shebang line and no such workarounds will be required. Gem binaries will then always use the ruby which comes first in the PATH when the command is invoked.”

That’s not a solution, that will likely make things worse and more confusing. Consider a scenario where the user has multiple Ruby installations, say 1.8 and 1.9. 1.8 is in $PATH by default. User installs ‘foobar’ using the Ruby 1.9 gem command, then types ‘foobar’ on the command line. ‘foobar’ will attempt to use Ruby 1.8 because that’s in PATH, but the gem doesn’t exist in 1.8′s gem path and so it will fail with a cryptic error message.

It is for reasons like this that e.g. Phusion Passenger’s extensive system autodetection code relies on $PATH as little as possible for finding Ruby commands. What you find in $PATH may not be what the user actually wants.

@Tomasz – it was a big deal when many of the directories in the path were mounted on NFS or AFS. The shell was unusable in that case. I’m sure there are other reasons for it, as well, but that’s the one that sprang to mind on reading your comment.

Things were a lot different even 10 years ago, much less 40 years ago. A rule of thumb: it’s almost always a losing proposition to imply that Stephen Bourne (or Paul Falstad or Ken Greer, for that matter) ever, at any time, did something that wasn’t even “remotely sensible”. ;-)

@hongli seems to me that depending on the path to resolve the executable is better. If you switch the ruby, then change the path to reflect the fact.

If i copy/send my code to someone else and ruby is installed on some other directory – then i will have to change the path in each and every file. Does not seem DRY.

ruby has a load_path, the shell has a path.

@hongli: I vote for env-shebang too. It dinamically loads the command, which is just our problem. If you need to handle multiple environments the solution is simple: change the $PATH as you switch the environment.

Hey Yehuda, I found this post when I ran into “Could not find RubyGem bundler (>= 0) (Gem::LoadError)” myself today. Unfortunately, only the symptom is the same, while all the paths seemed to be fine.

I’m on Mac OS X 10.5, so ruby 1.8.7-p72 [i686-darwin9] was pre-installed. Then I installed rvm 1.1.2 system-wide(!) and ruby 1.8.7-p302. “sudo gem install bundler” and “sudo bundle install” work fine.

But then I installed ruby 1.9.2-p0 and switched to it, “sudo gem install bundler”, ran “sudo bundle install” and was prompted with “Could not find RubyGem bundler (>= 0) (Gem::LoadError)”. Tried it again with ruby 1.9.1-p378, but same error.

BUT it does not seem to be a problem with paths:

$ sudo bundle install
Password:
/usr/local/rvm/rubies/ruby-1.9.1-p378/lib/ruby/site_ruby/1.9.1/rubygems.rb:779:in `report_activate_error’: Could not find RubyGem bundler (>= 0) (Gem::LoadError)
from /usr/local/rvm/rubies/ruby-1.9.1-p378/lib/ruby/site_ruby/1.9.1/rubygems.rb:214:in `activate’
from /usr/local/rvm/rubies/ruby-1.9.1-p378/lib/ruby/site_ruby/1.9.1/rubygems.rb:1082:in `gem’
from :273:in `push_gem_version_on_load_path’
from :14:in `gem’
from /usr/local/rvm/gems/ruby-1.9.1-p378/bin/bundle:18:in `’

$ echo $PATH
/usr/local/rvm/gems/ruby-1.9.1-p378/bin:/usr/local/rvm/gems/ruby-1.9.1-p378@global/bin:/usr/local/rvm/rubies/ruby-1.9.1-p378/bin:/usr/local/rvm/bin:/usr/local/bin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11/bin:/usr/local/git/bin:/usr/local/mysql/bin

$ echo $GEM_PATH
/usr/local/rvm/gems/ruby-1.9.1-p378:/usr/local/rvm/gems/ruby-1.9.1-p378@global

$ which bundle
/usr/local/rvm/gems/ruby-1.9.1-p378/bin/bundle

$ cat /usr/local/rvm/gems/ruby-1.9.1-p378/bin/bundle
#!/usr/local/rvm/rubies/ruby-1.9.1-p378/bin/ruby

$ which gem
/usr/local/rvm/rubies/ruby-1.9.1-p378/bin/gem

$ cat /usr/local/rvm/rubies/ruby-1.9.1-p378/bin/gem
#!/usr/bin/env ruby
ENV['GEM_HOME']=ENV['GEM_HOME'] || ‘/usr/local/rvm/gems/ruby-1.9.1-p378′
ENV['GEM_PATH']=ENV['GEM_PATH'] || ‘/usr/local/rvm/gems/ruby-1.9.1-p378:/usr/local/rvm/gems/ruby-1.9.1-p378@global’
ENV['PATH']=’/usr/local/rvm/gems/ruby-1.9.1-p378/bin:/usr/local/rvm/gems/ruby-1.9.1-p378@global/bin:/usr/local/rvm/rubies/ruby-1.9.1-p378/bin:’ + ENV['PATH']

This looked suspicious to me, but:
$ /usr/bin/env ruby -v
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-darwin9.8.0]

Then I created a new gemset, as you did in your post and ta-da! “sudo bundle install” worked, kind of. Unfortunately, “rake test” did not.

$ rvm gemset create experiment
‘experiment’ gemset created (/usr/local/rvm/gems/ruby-1.9.1-p378@experiment).

$ rvm gemset use experiment
Now using gemset ‘experiment’

$ sudo bundle install
Using rake (0.8.7)
Using abstract (1.0.0)
Using activesupport (3.0.3)
Using builder (2.1.2)
Using i18n (0.4.2)

Note that it is not installing the gems like I would have expected, as there should not be any gems installed for ruby 1.9.1. Anyway, I tried:

$ rake test
(in /Users/janbromberger/Documents/Projects/zvg2)
rake aborted!
uninitialized constant Bundler
/Users/janbromberger/Documents/Projects/zvg2/Rakefile:4:in `require’
(See full trace by running task with –trace)

$ which rake
/usr/local/rvm/gems/ruby-1.9.1-p378@global/bin/rake

See that the global gemset is used here?

$ echo $PATH
/usr/local/rvm/gems/ruby-1.9.1-p378@experiment/bin:/usr/local/rvm/gems/ruby-1.9.1-p378@global/bin:/usr/local/rvm/rubies/ruby-1.9.1-p378/bin:/usr/local/rvm/bin:/usr/local/bin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11/bin:/usr/local/git/bin:/usr/local/mysql/bin

$ echo $GEM_HOME
/usr/local/rvm/gems/ruby-1.9.1-p378@experiment
$ echo $GEM_PATH
/usr/local/rvm/gems/ruby-1.9.1-p378@experiment:/usr/local/rvm/gems/ruby-1.9.1-p378@global

Furthermore
$ ls -l “/usr/local/rvm/gems/ruby-1.9.1-p378@experiment”
is empty. No bin or gems directory, nothing.

Also I noticed, that bundler wasn’t even installed to @global, but to /usr/local/rvm/gems/ruby-1.9.1-p378/gems/

Then I finally dropped the sudo and did “gem install bundler” after having switched back to gemset global. To my surprise (maybe my surprise can be explained by the fact, that I don’t know what a gemset is), this installed bundler to the gemset. I was used to use sudo when installing gems, but apparently this is different with rvm?

$ bundle install
Fetching source index for http://rubygems.org/
Using rake (0.8.7)
Installing abstract (1.0.0)
Installing activesupport (3.0.3)
Installing builder (2.1.2)

(Note that Rails 3 does want ruby 1.9.2, not 1.9.1)

Hope this is an interesting find and not just me doing stuff wrong and then bitching that it does not work. :-)

I had this same set of symptoms running homebrew from inside a bundler environment. In my case, I had a ruby configuration-management script that had called Bundler.setup from its own Gemfile, and then launched a subshell to do “brew update”, but you could replicate it (in that directory) with:

bundle exec brew update

Homebrew’s shebang is, and should be, hardcoded to /usr/bin/ruby, so here the fix isn’t to change the hardcoded shebang. For some reason, the fix involves upgrading the *system* rubygems (at /usr/bin/gem) to a version past 1.8.5. I have 1.8.6, and never had the problem; my co-developers were at 1.8.5, did “/usr/bin/gem update –system” to get to 1.8.15, and their problem went away.

I ran into the same issue using rbenv (https://github.com/sstephenson/rbenv) and not RVM. I was able to fix it there with something like this: bundle exec /usr/bin/env ruby-local-exec MYFILE. (where MYFILE was a ruby executable with a “#!/usr/bin/ruby” shebang.

Leave a Reply

Archives

Categories

Meta