rmmseg-cpp: rmmseg in C++
May 21, 2008 – 4:09 pm
RMMSeg is an implementation of the MMSEG Chinese word segmentation algorithm. It features full integration with Ferret. The original version is written in pure Ruby and includes two algorithms:
- Complex Algorithm: Maximum matching with three-word chunk filtering. The accuracy is good, but the performance is very bad: it is very slow, consumes a lot of memory, and even seems to leak memory.
- Simple Algorithm: Simple maximum matching. The performance is relatively acceptable and the accuracy is not too bad, but it is definitely not as good as that of the Complex Algorithm.
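To make the difference between the two algorithms concrete, here is a minimal C++ sketch. It is not the rmmseg-cpp source: the names `Dict`, `kMaxWordLen`, `candidates`, `next_lengths`, `segment_simple`, `best_first_word` and `segment_complex` are all made up for illustration, and the dictionary is a toy one. The complex variant only applies the first three MMSEG filtering rules (greatest total chunk length, greatest average word length, smallest variance of word lengths); the fourth rule needs character frequency data and is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <tuple>
#include <unordered_set>
#include <vector>

// Hypothetical toy dictionary type: a set of words stored as UTF-32 strings.
using Dict = std::unordered_set<std::u32string>;

// Assumed cap on word length, in characters (not taken from rmmseg-cpp).
constexpr std::size_t kMaxWordLen = 4;

// Lengths (ascending) of all words that can start at pos. A single character
// is always a candidate, so segmentation can never get stuck.
// Precondition: pos < text.size().
std::vector<std::size_t> candidates(const std::u32string& text, std::size_t pos,
                                    const Dict& dict) {
    std::vector<std::size_t> lens{1};
    std::size_t limit = std::min(kMaxWordLen, text.size() - pos);
    for (std::size_t len = 2; len <= limit; ++len)
        if (dict.count(text.substr(pos, len))) lens.push_back(len);
    return lens;
}

// Like candidates(), but yields a single zero-length "no word" entry once the
// text is exhausted, so chunks near the end may have fewer than three words.
std::vector<std::size_t> next_lengths(const std::u32string& text, std::size_t pos,
                                      const Dict& dict) {
    if (pos >= text.size()) return {0};
    return candidates(text, pos, dict);
}

// Simple algorithm: always take the longest dictionary match at each position.
std::vector<std::u32string> segment_simple(const std::u32string& text,
                                           const Dict& dict) {
    std::vector<std::u32string> words;
    for (std::size_t pos = 0; pos < text.size();) {
        std::size_t len = candidates(text, pos, dict).back();  // longest match
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}

// Complex algorithm, one step: enumerate every chunk of up to three words
// starting at pos and return the length of the first word of the best chunk.
std::size_t best_first_word(const std::u32string& text, std::size_t pos,
                            const Dict& dict) {
    // Larger tuple wins: (total length, -word count, -sum of squared lengths).
    // For equal total length, fewer words means a larger average word length;
    // for equal total and count, a smaller sum of squares means less variance.
    using Score = std::tuple<long, long, long>;
    Score best{0, 0, 0};
    std::size_t best_len = 1;
    for (std::size_t a : candidates(text, pos, dict))
        for (std::size_t b : next_lengths(text, pos + a, dict))
            for (std::size_t c : next_lengths(text, pos + a + b, dict)) {
                long total = static_cast<long>(a + b + c);
                long count = 1 + (b > 0) + (c > 0);
                long sumsq = static_cast<long>(a * a + b * b + c * c);
                Score s{total, -count, -sumsq};
                if (s > best) {
                    best = s;
                    best_len = a;
                }
            }
    return best_len;
}

// Complex algorithm: commit only the first word of the best chunk, then repeat.
std::vector<std::u32string> segment_complex(const std::u32string& text,
                                            const Dict& dict) {
    std::vector<std::u32string> words;
    for (std::size_t pos = 0; pos < text.size();) {
        std::size_t len = best_first_word(text, pos, dict);
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}
```

With a toy dictionary containing 研究, 研究生, 生命 and 起源, the input 研究生命起源 comes out of `segment_simple` as 研究生 / 命 / 起源, because the greedy match grabs 研究生 first. `segment_complex` returns 研究 / 生命 / 起源, since the chunk with word lengths 2, 2, 2 has less variance than the one with 3, 1, 2. The triple nested loop also hints at why the Complex Algorithm costs so much more per step.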
I tried various ways to improve the performance and made some progress, but the result was still not good enough for real production use. There was also a strange leak: I tried tools like ruby-prof and BleakHouse, and they all showed that I was not leaking, yet the memory usage was definitely growing (linearly). I have to admit that Ruby (MRI, currently) is very slow.
Then yesterday, while reading Beautiful Code and coming across some beautiful C code, I got enthusiastic. I got back after supper and started to implement RMMSeg in C++: rmmseg-cpp.
Now I have something to show off:

With a simple Ruby wrapper, the interface of rmmseg-cpp is almost identical to that of the original rmmseg. According to my simple test, it now runs roughly 40 times faster (though I had expected more) while consuming only 10% of the memory it used before.
However, I also have to admit that coding in C++ is more dangerous than coding in Ruby (or similar languages). I ran into several segmentation faults while writing rmmseg-cpp, and I’m still not sure it is really bug-free (I used many tricks to make it faster and more compact; then again, no software is really bug-free). Another drawback is that rmmseg-cpp is less extensible and customizable than rmmseg, because it is not very convenient to do such things in C++.