Yehuda Katz is a member of the Ember.js, Ruby on Rails and jQuery Core Teams; his 9-to-5 home is at the startup he founded, Tilde Inc.. There he works on Skylight, the smart profiler for Rails, and does Ember.js consulting. He is best known for his open source work, which also includes Thor and Handlebars. He travels the world doing open source evangelism and web standards work.

The Craziest F***ing Bug I’ve Ever Seen

This afternoon, I was telling a friend about one of my exploits tracking down a pretty crazy heisenbug, and he said he thought other people would be interested in hearing about it. So let me tell you about it.

Before you continue, if you’re not interested in relatively arcane technical details, feel free to skip this post. It’s here mainly because a friend said he thought people would be interested in it.

Our Story Begins

Our story begins with a small change in the way Ruby 1.9 handles implicit coercion. The most common case of implicit coercion is in Array#flatten. The basic logic is that Ruby checks to see if an element of the Array can be itself coerced into an Array, which is then does before flattening.

There are two steps to the process:

  1. Check to see if the element can be coerced into an Array
  2. If it can, call to_ary on the element, and repeat the process recursively

In Ruby 1.8, the process is essentially the following:

if obj.respond_to?(:to_ary)
  obj.__send__(:to_ary)
else
  obj
end

In Ruby 1.9, it was changed to:

begin
  obj.__send__(:to_ary)
rescue NoMethodError
  obj
end

Of course, the internal code was implemented in C, but you get the idea. This change subtly effects objects that implement method_missing:

class MyObject
  def method_missing(meth, *)
    if meth == :testing
      puts "TESTING"
    else
      puts "Calling Super"
      super
    end
  end
end

In Ruby 1.8, #flatten will first call #respond_to? which returns false and therefore doesn’t ever trigger the #method_missing. In Ruby 1.9, #method_missing will be triggered blindly, and because the call to super should raise a NoMethodError, we will essentially get the same result.

There are some subtle differences here, but for the vast majority of cases, the behavior is identical or close enough to not matter.

A Weird Quirk

When the above change landed in Ruby 1.9.2′s head, Rails started experiencing weird, intermittent behavior. As you might know, Rails overrides method_missing on NilClass to provide a guess about what object you were expecting instead of nil. This feature is called “whiny nils” and can be enabled or disabled in Rails applications.

def method_missing(method, *args, &block)
  if klass = METHOD_CLASS_MAP[method]
    raise_nil_warning_for klass, method, caller
  else
    super
  end
end

But sometimes, when nil was inside an Array, and the Array was flattened, the flatten failed with a NameError: undefined local variable or method `to_ary' for nil:NilClass, which resulted in a failure in the flatten entirely.

The bug appeared intermittently, apparently unrelated to the code that was calling it. We worked around it by catching the NameError and reraising a NoMethodError, but the existence of the bug was rather baffling.

The Bug Reappears

When working on bundler, Carl and I saw the bug again. This time, we didn’t want to let it go. In this case, we had four virtually identical tests with four identical stack traces leading to NoMethodError or NameError seemingly at random. Something didn’t add up.

After hunting bugs for a few hours (having dug deeply into Ruby’s source), I brought in Evan Phoenix (of Rubinius). At first, Evan was baffled as well, but he correctly pointed out that the key to understanding what was going on was the difference between NameError and NoMethodError in Ruby.

x = Object.new
x.no_method #=> NoMethodError
 
class Testing
  def vcall_no_method
    no_method #=> NameError
  end
 
  def call_no_method
    self.no_method #=> NoMethodError
  end
end

In essence, when Ruby sees a method call with implicit self (which might be a typo’ed local variable), it raises a NameError, rather than a NoMethodError.

However, it still didn’t add up, because the call to to_ary was internal, and shouldn’t have been randomly interpreted as a vcall (a call to a method with implicit self) or a normal call.

Tracking it Down

Evan dug deeply into the Ruby source, and found how Ruby was determining whether the call was a vcall or a regular call.

static inline VALUE
method_missing(VALUE obj, ID id, int argc, const VALUE *argv, int call_status)
{
    VALUE *nargv, result, argv_ary = 0;
    rb_thread_t *th = GET_THREAD();
 
    th->method_missing_reason = call_status;
    th->passed_block = 0;
    // ...
}

Before calling the method_missing method itself, Ruby sets a thread-local variable called call_status that reflects whether or not the original call was a vcall or a normal call.

Upon further examination, Evan discovered that the call to method_missing in the case of type coercion did not change the thread-local method_missing_reason.

As a result, Ruby raised a NoMethodError or NameError based upon whether the last call to method_missing was a vcall or regular call.

The fix was a simple one-line patch, which allowed us to remove the workaround in ActiveSupport, but it was by far the craziest f***ing bug I’ve ever seen

18 Responses to “The Craziest F***ing Bug I’ve Ever Seen”

Well worth reading. Take your friend’s advice in the future and blog away on such topics.

Agreed. This is very interesting reading and I’m not a Ruby expert by any stretch of the imagination. Everyone loves a sleuthing story :)

Very interesting just to see how you guys thought through the problem. This is one of those thing I would generally work around and forget. Gives me motivation to dig a bit deeper next time.

:-( I don’t see any tests with the patch :-( Something to get the RubySpec guys/gals onto testing!

Yes, we certainly need a RubySpec that covers this.

I ran into this bug two years ago while working on RSpec. I couldn’t immediately find a place where to post a bug report, and unfortunately it then fell off my mind. Stupid me.

Here’s the RSpec patch from two years ago with my explanation:
http://rubyforge.org/tracker/?func=detail&atid=3151&aid=14208&group_id=797

Thanks for sharing!

An equally frustrating heisenbug stymied a Rails team I was working with a few years ago. It could not be predictably reproduced. Some members of the development and QA team could reproduce it (sometimes) and others could not, even when following the exact same steps.

After much head-scratching we finally figured it out. Our target browser was IE6 (it was an intranet application). Some people opened IE6 through the Windows start menu and others through double-clicking the IE6 icon. One way would start a new iexplore.exe process and the other way would simply share an existing iexplore.exe process. When the iexplore.exe process was shared, cookies were shared, which meant the Rails session was shared, which caused the bug!

Haha great stuff! You should send this to Colbert and see if he’ll do a segment on the show about it. :)

Crazy F***ing bug. Got to love how it all came down to a one liner in the end after all the epic poking around.

You shoulda poked me on this one…I have fixed numerous similar issues in JRuby over the years, and they had similar patches.

To be honest, the fact that a vcall produces a different exception type from an fcall or call feels like a bad design to me for two reasons: removing “self” from a no-arg call against self should not cause it to raise a different exception (and really, all missing-method exceptions should be the same regardless of how they’re produced), and it requires tracking that nonsense all the way through to the built-in method_missing, which makes the call logic more complicated to implement for no good reason. The latter reason is clearly evidenced by the fact that a minute bug in the code can lead to drastically different behavior that’s really tricky to hunt down. If vcall always produced the same error fcall/call produce, these bugs would never happen.

My favorite heisenbug revolved around the Digital Research PL/I-86 compiler for DOS. Since all variable arguments were passed by am argument pointer which in turn pointed at a static block of pointer to the actual arguments, which in turn pointed at the addresses of the argument values (e.g. pass-by-reference)… if you passed a constant ‘Y’ (single character) to a method expecting a CHAR(1) it could CHANGE the value of the _one and only_ constant known to the linker… thus code like upper-casing a character might change the actual constant somewhere else.

Fun days.

Great job tracking this down, but I have a question: why was the change in 1.9 made in the first place? What if to_ary is in fact implemented but a bug in its implementation raises a NoMethodError? Wouldn’t Array#flatten then fail silently?

I’m assuming there’s some legitimate reason for the change – I’d like to understand it :)

Doesn’t this all just show that the 1.9 implementation is flat wrong? I laughed when I saw this:

begin
obj.__send__(:to_ary)
rescue NoMethodError
obj
end

There is NO check that the NoMethodError is coming from the object in question, nor that the method that raised the error is :to_ary.

If you are going to intercept method_missing, you also need to hit resond_to?, methods, and possible class.instance_methods. Anything less is an error, and if 1.9 is trying to cover for it, it is just going to make things worse down the road.

I think using exceptions for flow control is asking for trouble. Save exceptions for exceptional things.

Ask if it quacks like a duck, don’t just shoot it and see if it looked like a duck after you killed it…

With Ruby 1.9, any object that implements method missing via metaprogramming needs to special-case to_ary. That’s a pretty standard idiom that’s now broken, isn’t it?

Here’s one real-world example, I’m sure there’s more: http://github.com/bronson/gitrb/commit/12fd0490bb9b0f6f4b7ba7932ff626fe364cd607

So, I’m with the other commenters: why was this change made? Oh well, I guess it’s a little late now!

I just got this bug as well. Even with 1.9.2-head, I’m having problems with it. Thanks for the post, should help me solve it.

I just stumbled on this bug using Redis. A list with a Redis object in it dies if you try to flatten it, because the :to_ary method is passed on to Redis which thinks it’s a Redis command!

I really think this was a bad idea! So now i have to do what? Add a to_ary method that returns self? Or add a to_ary method that raises a NoMethodError :) lol

Thanks for this post – saved my day.

We had a problem in a 3rd party library not working with Ruby 1.9.3, when it should be supported by 1.9.1. We got a mysterious error that to_ary method is not supported. It turned out that the problem was a class that defined method_missing that provided a more verbose error message – but it raised RuntimeError instead of NoMethodError. The change in Ruby calling triggered method_missing which raised an error that was not caugth. Chaning the error type fixed the problem.

Leave a Reply

Archives

Categories

Meta