Play with GC: mark your treasure

May 21, 2008 – 9:48 pm

bug.pngGarbage collecting is amazingly useful. It is a must-have of any modern language. You never need to concern about when to free the allocated memory again. Just allocate, those objects not used will be collected automatically at a some time.

Yes, it’s true. But, wait, it’s not true! I still remember Bjarne Stroustrup had said this:

Complexity will go somewhere: if not the language then the application code.

There will be someone that will be frustrated at dealing with all those garbages. Sometimes that person is just you. So knowing how garbage collector works is still necessary. Though I was always interested in garbage collecting algorithms, I only realized this after I spent a whole after debugging a frustrating heisenbug.

The bug was with my newly implemented rmmseg-cpp. It will segfault on some input. Generally the problem is:

  1. When tested in pure-C++ environment, it runs without problem (at least for all my existing test data). But when tested with Ruby wrapper, it will fail on some test data.
  2. It is not random, it will fail on some data and fail at the same place all the time. However, minor change in the input data will have significant impact to the result — the bug may or may not occur, and even though it occurred again, it would be at a totally different place.

The 1st problem makes it hard to debug with gdb because the extension is dynamically loaded with a shared library. The 2nd problem makes it impossible to reduce the buggy data to a manageable size. What’s worse with typical heisenbug is that minor change (like adding a printf) may make the bug disappeared. My situation is much better than that, I can still use printf to diagnose — though the position of the bug changed, it is still there.

Following the error message and the dumped core file, I find it must have something to do with GC. Specifically, my data has been collected as garbage when I want to use it. This will definitely not be true for any garbage collecting algorithm.

The best case is to write code in a GC-enabled language, you never worry about the memory. A worse situation is to write code in C/C++ where memory is managed manually. The worst case is to write an extension in C/C++ for a GC-enabled language, in other words, to deal with a garbage collector in a language where there shouldn’t be a garbage collector.

There are various garbage collecting algorithms. Python use reference-counting plus special code to break cycle-referencing. Rubinius use some modern Generational GC algorithm if I remember correctly. And MRI used a plain old mark-and-sweep scheme: from some root objects, recursively mark all objects referenced by those roots. All un-marked object is then regards as garbage and collected. Pretty simple.

After following the long output of printf, I finally find the problem. I have such a struct:

    struct Token
    {
        VALUE text;
        VALUE start;
        VALUE end;
    };

where text is a Ruby String object. It has been dead before the Token die. That’s weird, because Ruby provided a hook to call when it is marking your custom-defined struct:

    static void tk_mark(Token *t)
    {
        // start and end are Fixnums, no need to mark
        rb_gc_mark(t->text);
    }

So if the Token object is marked, the String object should also be marked. After reading the code again and again, I found that the String is in fact collected before it gets referenced by the Token object:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
    static VALUE tk_create(const char* base, const rmmseg::Token &t)
    {
        Token *tk = ALLOC(Token);
        int start = t.text-base;
 
        tk->text = rb_str_new(t.text, t.length);
 
        tk->start = INT2FIX(start);
        tk->end = INT2FIX(start + t.length);
        return Data_Wrap_Struct(cToken,
                                (RUBY_DATA_FUNC)tk_mark,
                                (RUBY_DATA_FUNC)tk_free,
                                tk);
    }

It was created on line 6, and collected before line 10. Guile has things like scm_remember_upto_here_1 to cope with this kind of problem. Ruby GC is smart to try to identify pointers from local stack thus there’s generally no need to call things like remember_up_to_here explicitly. But it seems it failed to recognize the pointer inside the Token struct.

I re-write line 6 like this:

        VALUE text = rb_str_new(t.text, t.length);
        tk->text = text;

then the bug is gone. But I immediately realized that this might be optimized away by the smart compiler. Adding the -O2 and tried again — segfaulted again. How to avoid the optimization? While not getting any idea, I wrote this brute-force solution:

        tk->text = rb_str_new(t.text, t.length);
        rb_gc_mark(tk->text);

Mark it immediately after creation. It seems good. The bug is gone. But there’s a chance that those marked String objects will not be collected until a second run of GC. And what’s worse is that another bug is introduced. This is even weirder than before — this time it’s some internal node object (used to hold executable code of Ruby) is being used after being collected.

Maybe the garbage collector is totally confused by my brute-force marking. So, to end this long story, here’s the final solution that solves every problem:

        volatile VALUE text = rb_str_new(t.text, t.length);
        tk->text = text;

It seems this is my first time to use the volatile keyword in any non-trivial code after I learned C/C++. :)

  1. 9 Responses to “Play with GC: mark your treasure”

  2. GC不会是老式C编译的吧

    By chris on Sep 4, 2008

  3. @chris,
    什么意思?

    By pluskid on Sep 4, 2008

  4. 感觉是编译器的问题么

    By chris on Sep 4, 2008

  5. @chris,
    这是正常的优化行为吧?

    By pluskid on Sep 4, 2008

  6. 镐头书Duck Typing有个大量构造字符串引发GC的例子是不是和这儿相同?应该GC自身实现吧,你怎么会想到加上volatile,不是编译器特殊优化么?

    By chris on Sep 4, 2008

  7. @chris,
    没看过镐头书,不过编译器把局部变量优化掉当属正常的优化行为,而 Ruby GC 也确实要通过局部变量(或许还有 register )来猜测引用的对象呢。

    By pluskid on Sep 5, 2008

  8. 为什么

    VALUE text = rb_str_new(t.text, t.length);
    tk->text = text;

    可以解决问题,tk->text 重载了等号吗?

    By quark on Sep 10, 2011

  9. @quark,

    这样还不能完全解决,必须要加上 volatile ,防止编译器优化掉。这里的问题是代码执行到一半,token 对象(是 Ruby 对象,通过 Data_Wrap_Struct 来构造的那个,不是 Token 那个 struct )还没有构造好,text 就被 GC 了。

    Ruby GC 在运行的时候同时会扫描 C 的 stack ,如果你有一个 VALUE text 在 stack 上,它就不会被 GC 掉,但是对于你自定义的 struct Token ,Ruby 的 GC 并不认识,所以即使你的 struct Token 里有个 field 指向那个 VALUE text ,它还是无法避免被 GC 掉,所以需要显式地搞一个变量在那里。

    和重载不重载等号没有关系(实际是,这里没有重载等号),关键是前面一句。

    By pluskid on Sep 10, 2011

  1. 1 Trackback(s)

  2. Sep 10, 2011: 同义反复 « Free Mind

Post a Comment