rmmseg-cpp: rmmseg in C++

May 21, 2008 – 4:09 pm

RMMSeg is an implementation of the MMSEG Chinese word segmentation algorithm. It features full integration with Ferret. The original version is written in pure Ruby and includes two algorithms:

  • Complex Algorithm: Maximum matching with three-word chunk filtering. The accuracy is good, but the performance is very bad: it is slow, consumes a lot of memory, and even seems to leak memory.
  • Simple Algorithm: Plain maximum matching. The performance is acceptable and the accuracy is not too bad, but it is definitely not as good as that of the Complex Algorithm.
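
To make the difference between the two algorithms concrete, here is a minimal Python sketch. It is not the rmmseg or rmmseg-cpp code: the function names, the toy dictionary, and the `max_word_len` parameter are all made up for illustration. The complex variant ranks three-word chunks by total length, then average word length, then smallest variance of word lengths, and leaves out the last MMSEG rule (which needs character frequency data).

```python
def simple_segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


def candidate_words(text, start, dictionary, max_word_len):
    """The single character at `start`, plus every dictionary word starting there."""
    found = [text[start:start + 1]]
    for length in range(2, min(max_word_len, len(text) - start) + 1):
        piece = text[start:start + length]
        if piece in dictionary:
            found.append(piece)
    return found


def chunks_from(text, start, dictionary, max_word_len, depth=3):
    """Enumerate chunks of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidate_words(text, start, dictionary, max_word_len):
        next_start = start + len(word)
        for rest in chunks_from(text, next_start, dictionary, max_word_len, depth - 1):
            result.append([word] + rest)
    return result


def chunk_score(chunk):
    """Rank by total length, then average word length, then smallest variance."""
    lengths = [len(word) for word in chunk]
    total = sum(lengths)
    mean = total / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return (total, mean, -variance)


def complex_segment(text, dictionary, max_word_len=4):
    """Pick the best three-word chunk, emit only its first word, then repeat."""
    words, i = [], 0
    while i < len(text):
        best = max(chunks_from(text, i, dictionary, max_word_len), key=chunk_score)
        words.append(best[0])
        i += len(best[0])
    return words


# Toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(simple_segment("研究生命起源", toy_dict))   # ['研究生', '命', '起源']
print(complex_segment("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

The simple version greedily grabs 研究生 and is then stuck with a stray 命. The complex version looks three words ahead, sees that 研究 / 生命 / 起源 covers the same text with more even word lengths, and picks that instead. The lookahead is also why it does much more work per character.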

I tried various ways to improve the performance and made some progress, but the result was still not good enough for real production use. There was also a strange leak. I tried tools like ruby-prof and BleakHouse; they all showed that I was not leaking, yet the memory usage was definitely growing (linearly). I have to admit that Ruby (MRI, currently) is very slow.

Then yesterday, while reading Beautiful Code and finding some beautiful C code in it, I got enthusiastic. I came back after supper and started to implement RMMSeg in C++: rmmseg-cpp.

Now I have something to show off:


With a simple Ruby wrapper, the interface of rmmseg-cpp is almost identical to that of the original rmmseg. According to my simple test, it now runs roughly 40 times faster (though I had expected more) while consuming only 10% of the memory it used before.

However, I also have to admit that coding in C++ is more dangerous than coding in Ruby (or similar languages). I ran into several segmentation faults while writing rmmseg-cpp, and I’m still not sure whether it is really bug-free (I used many tricks to make it faster and more compact. But in fact, no software is really bug-free :p ). Another drawback is that rmmseg-cpp is harder to extend and customize than rmmseg, because such things are not very convenient to do in C++.

One Response to “rmmseg-cpp: rmmseg in C++”

  1. You are so great!

    By Jiang Jian on May 29, 2008
