The Craziest F***ing Bug I've Ever Seen
This afternoon, I was telling a friend about one of my exploits tracking down a pretty crazy heisenbug, and he said he thought other people would be interested in hearing about it. So let me tell you about it.
Before you continue, if you're not interested in relatively arcane technical details, feel free to skip this post. It's here mainly because a friend said he thought people would be interested in it.
Our Story Begins
Our story begins with a small change in the way Ruby 1.9 handles implicit coercion. The most common case of implicit coercion is in Array#flatten
. The basic logic is that Ruby checks to see if an element of the Array can be itself coerced into an Array, which is then does before flattening.
There are two steps to the process:
- Check to see if the element can be coerced into an Array
- If it can, call to_ary on the element, and repeat the process recursively
In Ruby 1.8, the process is essentially the following:
if obj.respond_to?(:to_ary)
obj.__send__(:to_ary)
else
obj
end
In Ruby 1.9, it was changed to:
begin
obj.__send__(:to_ary)
rescue NoMethodError
obj
end
Of course, the internal code was implemented in C, but you get the idea. This change subtly effects objects that implement method_missing
:
class MyObject
def method_missing(meth, *)
if meth == :testing
puts "TESTING"
else
puts "Calling Super"
super
end
end
end
In Ruby 1.8, #flatten
will first call #respond_to?
which returns false and therefore doesn't ever trigger the #method_missing
. In Ruby 1.9, #method_missing
will be triggered blindly, and because the call to super
should raise a NoMethodError
, we will essentially get the same result.
There are some subtle differences here, but for the vast majority of cases, the behavior is identical or close enough to not matter.
A Weird Quirk
When the above change landed in Ruby 1.9.2's head, Rails started experiencing weird, intermittent behavior. As you might know, Rails overrides method_missing
on NilClass to provide a guess about what object you were expecting instead of nil. This feature is called "whiny nils" and can be enabled or disabled in Rails applications.
def method_missing(method, *args, &block)
if klass = METHOD_CLASS_MAP[method]
raise_nil_warning_for klass, method, caller
else
super
end
end
But sometimes, when nil was inside an Array, and the Array was flattened, the flatten failed with a NameError: undefined local variable or method `to_ary' for nil:NilClass
, which resulted in a failure in the flatten entirely.
The bug appeared intermittently, apparently unrelated to the code that was calling it. We worked around it by catching the NameError and reraising a NoMethodError, but the existence of the bug was rather baffling.
The Bug Reappears
When working on bundler, Carl and I saw the bug again. This time, we didn't want to let it go. In this case, we had four virtually identical tests with four identical stack traces leading to NoMethodError or NameError seemingly at random. Something didn't add up.
After hunting bugs for a few hours (having dug deeply into Ruby's source), I brought in Evan Phoenix (of Rubinius). At first, Evan was baffled as well, but he correctly pointed out that the key to understanding what was going on was the difference between NameError and NoMethodError in Ruby.
x = Object.new
x.no_method #=> NoMethodError
class Testing
def vcall_no_method
no_method #=> NameError
end
def call_no_method
self.no_method #=> NoMethodError
end
end
In essence, when Ruby sees a method call with implicit self (which might be a typo'ed local variable), it raises a NameError, rather than a NoMethodError.
However, it still didn't add up, because the call to to_ary
was internal, and shouldn't have been randomly interpreted as a vcall
(a call to a method with implicit self) or a normal call.
Tracking it Down
Evan dug deeply into the Ruby source, and found how Ruby was determining whether the call was a vcall
or a regular call.
static inline VALUE
method_missing(VALUE obj, ID id, int argc, const VALUE *argv, int call_status)
{
VALUE *nargv, result, argv_ary = 0;
rb_thread_t *th = GET_THREAD();
th->method_missing_reason = call_status;
th->passed_block = 0;
// ...
}
Before calling the method_missing
method itself, Ruby sets a thread-local variable called call_status
that reflects whether or not the original call was a vcall
or a normal call.
Upon further examination, Evan discovered that the call to method_missing
in the case of type coercion did not change the thread-local method_missing_reason
.
As a result, Ruby raised a NoMethodError or NameError based upon whether the last call to method_missing
was a vcall
or regular call.
The fix was a simple one-line patch, which allowed us to remove the workaround in ActiveSupport, but it was by far the craziest f***ing bug I've ever seen.