The Craziest F***ing Bug I've Ever Seen

This afternoon, I was telling a friend about one of my exploits tracking down a pretty crazy heisenbug, and he said he thought other people would be interested in hearing about it. So let me tell you about it.

Before you continue, if you're not interested in relatively arcane technical details, feel free to skip this post. It's here mainly because a friend said he thought people would be interested in it.

Our Story Begins

Our story begins with a small change in the way Ruby 1.9 handles implicit coercion. The most common case of implicit coercion is in Array#flatten. The basic logic is that Ruby checks to see if an element of the Array can be itself coerced into an Array, which is then does before flattening.

There are two steps to the process:

Check to see if the element can be coerced into an Array
If it can, call to_ary on the element, and repeat the process recursively

In Ruby 1.8, the process is essentially the following:

if obj.respond_to?(:to_ary)
  obj.__send__(:to_ary)
else
  obj
end

In Ruby 1.9, it was changed to:

begin
  obj.__send__(:to_ary)
rescue NoMethodError
  obj
end

Of course, the internal code was implemented in C, but you get the idea. This change subtly effects objects that implement method_missing:

class MyObject
  def method_missing(meth, *)
    if meth == :testing
      puts "TESTING"
    else
      puts "Calling Super"
      super
    end
  end
end

In Ruby 1.8, #flatten will first call #respond_to? which returns false and therefore doesn't ever trigger the #method_missing. In Ruby 1.9, #method_missing will be triggered blindly, and because the call to super should raise a NoMethodError, we will essentially get the same result.

There are some subtle differences here, but for the vast majority of cases, the behavior is identical or close enough to not matter.

A Weird Quirk

When the above change landed in Ruby 1.9.2's head, Rails started experiencing weird, intermittent behavior. As you might know, Rails overrides method_missing on NilClass to provide a guess about what object you were expecting instead of nil. This feature is called "whiny nils" and can be enabled or disabled in Rails applications.

def method_missing(method, *args, &block)
  if klass = METHOD_CLASS_MAP[method]
    raise_nil_warning_for klass, method, caller
  else
    super
  end
end

But sometimes, when nil was inside an Array, and the Array was flattened, the flatten failed with a NameError: undefined local variable or method `to_ary' for nil:NilClass, which resulted in a failure in the flatten entirely.

The bug appeared intermittently, apparently unrelated to the code that was calling it. We worked around it by catching the NameError and reraising a NoMethodError, but the existence of the bug was rather baffling.

The Bug Reappears

When working on bundler, Carl and I saw the bug again. This time, we didn't want to let it go. In this case, we had four virtually identical tests with four identical stack traces leading to NoMethodError or NameError seemingly at random. Something didn't add up.

After hunting bugs for a few hours (having dug deeply into Ruby's source), I brought in Evan Phoenix (of Rubinius). At first, Evan was baffled as well, but he correctly pointed out that the key to understanding what was going on was the difference between NameError and NoMethodError in Ruby.

x = Object.new
x.no_method #=> NoMethodError

class Testing
  def vcall_no_method
    no_method #=> NameError
  end

  def call_no_method
    self.no_method #=> NoMethodError
  end
end

In essence, when Ruby sees a method call with implicit self (which might be a typo'ed local variable), it raises a NameError, rather than a NoMethodError.

However, it still didn't add up, because the call to to_ary was internal, and shouldn't have been randomly interpreted as a vcall (a call to a method with implicit self) or a normal call.

Tracking it Down

Evan dug deeply into the Ruby source, and found how Ruby was determining whether the call was a vcall or a regular call.

static inline VALUE
method_missing(VALUE obj, ID id, int argc, const VALUE *argv, int call_status)
{
    VALUE *nargv, result, argv_ary = 0;
    rb_thread_t *th = GET_THREAD();

    th->method_missing_reason = call_status;
    th->passed_block = 0;
    // ...
}

Before calling the method_missing method itself, Ruby sets a thread-local variable called call_status that reflects whether or not the original call was a vcall or a normal call.

Upon further examination, Evan discovered that the call to method_missing in the case of type coercion did not change the thread-local method_missing_reason.

As a result, Ruby raised a NoMethodError or NameError based upon whether the last call to method_missing was a vcall or regular call.

The fix was a simple one-line patch, which allowed us to remove the workaround in ActiveSupport, but it was by far the craziest f***ing bug I've ever seen.

The Craziest F***ing Bug I've Ever Seen

Our Story Begins

A Weird Quirk

The Bug Reappears

Tracking it Down

Enjoy this Post?