[ANN] RMMSeg 0.0.1 Released

February 1, 2008 – 5:49 pm

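RMMSeg is a Ruby implementation of the MMSEG Chinese word segmentation algorithm. It can run as a standalone program, and it also integrates easily with Ferret.

Early this morning, after finishing the Ferret integration, I released version 0.0.1. You can download it from RubyForge or install it directly with RubyGems: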

$ sudo gem install rmmseg

The announcement posted on RubyForge is quoted below:


rmmseg version 0.0.1 has been released!

RMMSeg is an implementation of the MMSEG Chinese word segmentation
algorithm. It is based on two variants of the maximum matching
algorithm. Two algorithms are available:

* simple algorithm that uses only forward maximum matching.
* complex algorithm that uses three-word chunk maximum matching and 3
additional rules to resolve ambiguities.
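
To make the simple algorithm concrete, here is a minimal sketch of forward maximum matching. It is not RMMSeg's own code: the toy dictionary, the constant names and the `simple_segment` method are all made up for illustration, and it assumes a Ruby version whose strings are UTF-8 aware (2.0 or later).

```ruby
require 'set'

# Hypothetical toy dictionary; RMMSeg ships its own word lists.
DICTIONARY = Set.new(%w[中文 分词 比较 困难 研究 研究生 生命 起源])

# Longest dictionary word, measured in characters.
MAX_WORD_LENGTH = DICTIONARY.map(&:length).max

# Greedy forward maximum matching: at each position take the longest
# dictionary word that starts there, or a single character if none matches.
def simple_segment(text, dictionary = DICTIONARY, max_len = MAX_WORD_LENGTH)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.length
    longest = [max_len, chars.length - pos].min
    len = longest.downto(2).find { |n| dictionary.include?(chars[pos, n].join) } || 1
    words << chars[pos, len].join
    pos += len
  end
  words
end

puts simple_segment("中文分词比较困难").join(" / ")  # 中文 / 分词 / 比较 / 困难
puts simple_segment("研究生命起源").join(" / ")      # 研究生 / 命 / 起源
```

The second sentence shows why greedy matching alone is not enough: it grabs 研究生 first and leaves 命 stranded, which is the kind of ambiguity the complex algorithm is meant to resolve.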

For more information about the algorithm, please refer to the
following essays:

* http://technology.chtsai.org/mmseg/
* http://pluskid.lifegoo.com/?p=261
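
The complex algorithm can be sketched in the same spirit. The code below is an illustration under my own assumptions, not RMMSeg's implementation: the dictionary and the method names (`candidates`, `chunks`, `score`, `complex_segment`) are hypothetical. It enumerates chunks of up to three words and prefers the largest total length, then the largest average word length, then the smallest variance of word lengths. The last MMSEG rule, which relies on single-character word frequencies, is left out because it needs frequency data.

```ruby
require 'set'

# Hypothetical toy dictionary, for illustration only.
WORDS = Set.new(%w[研究 研究生 生命 起源])
MAX_LEN = WORDS.map(&:length).max

# Candidate words starting at pos: the single character itself as a
# fallback, plus every dictionary word that begins there.
def candidates(chars, pos)
  limit = [MAX_LEN, chars.length - pos].min
  found = (2..limit).map { |n| chars[pos, n].join }.select { |w| WORDS.include?(w) }
  [chars[pos]] + found
end

# Enumerate every chunk of up to three consecutive words starting at pos.
def chunks(chars, pos, depth = 3)
  return [[]] if depth.zero? || pos >= chars.length
  candidates(chars, pos).flat_map do |word|
    chunks(chars, pos + word.length, depth - 1).map { |rest| [word] + rest }
  end
end

# Score a chunk so that comparing scores applies the rules in order:
# total length, then average word length, then negated variance.
def score(chunk)
  lengths = chunk.map(&:length)
  total = lengths.sum
  mean = total.to_f / lengths.size
  variance = lengths.sum { |l| (l - mean)**2 } / lengths.size
  [total, mean, -variance]
end

# Pick the best chunk at each position and keep only its first word.
def complex_segment(text)
  chars = text.chars
  words = []
  pos = 0
  while pos < chars.length
    best = chunks(chars, pos).max_by { |c| score(c) }
    words << best.first
    pos += best.first.length
  end
  words
end

puts complex_segment("研究生命起源").join(" / ")  # 研究 / 生命 / 起源
```

Here 研究生 / 命 / 起源 and 研究 / 生命 / 起源 tie on total length and on average word length, so the variance rule decides in favour of the evenly sized words.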

Changes:

### 0.0.1 / 2008-01-31

* Analyser integration with Ferret.
* rdoc added
* Lazily init the +Word+ objects inside the +Dictionary+.
* Handle English punctuation correctly.
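
As a rough illustration of what lazy initialization of word objects can look like, here is a small sketch. The `Word` struct, the `LazyDictionary` class and the "word frequency" line format are hypothetical stand-ins of mine, not RMMSeg's actual classes or dictionary format.

```ruby
# Hypothetical stand-in for a word object; not RMMSeg's actual class.
Word = Struct.new(:text, :frequency)

class LazyDictionary
  # Number of Word objects created so far.
  attr_reader :built

  def initialize(lines)
    # At load time keep only the raw strings, which is cheap.
    @raw = {}
    lines.each do |line|
      text, freq = line.split
      @raw[text] = freq
    end
    @words = {}
    @built = 0
  end

  def has_word?(text)
    @raw.key?(text)
  end

  # Build the Word object on first lookup and cache it afterwards.
  def word(text)
    return nil unless @raw.key?(text)
    @words[text] ||= begin
      @built += 1
      Word.new(text, @raw[text].to_i)
    end
  end
end

dict = LazyDictionary.new(["中文 100", "分词 50"])
puts dict.built                   # 0, nothing built at load time
puts dict.word("中文").frequency  # 100
puts dict.built                   # 1, only the word that was looked up
```

The idea is that a large dictionary loads quickly and objects are only created for words that actually show up in the text being segmented.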

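Usage is described on the RMMSeg home page. It can be used as a standalone program (rmmseg) or integrated with Ferret. I will paste the Ferret integration example here once more; the code follows: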

#!/usr/bin/env ruby
require 'rubygems'
require 'rmmseg'
require 'rmmseg/ferret'
 
# Build an in-memory Ferret index that tokenizes text with RMMSeg.
analyzer = RMMSeg::Ferret::Analyzer.new
$index = Ferret::Index::Index.new(:analyzer => analyzer)
 
$index << {
  :title => "分词",
  :content => "中文分词比较困难,不像英文那样,直接在空格和标点符号的地方断开就可以了。"
}
$index << {
  :title => "RMMSeg",
  :content => "RMMSeg 我近日做的一个 Ruby 中文分词实现,下一步是和 Ferret 进行集成。"
}
$index << {
  :title => "Ruby 1.9",
  :content => "Ruby 1.9.0 已经发布了,1.9 的一个重大改进就是对 Unicode 的支持。"
}
$index << {
  :title => "Ferret",
  :content => <<END
Ferret is a high-performance, full-featured text search engine library
written for Ruby. It is inspired by Apache Lucene Java project. With
the introduction of Ferret, Ruby users now have one of the fastest and
most flexible search libraries available. And it is surprisingly easy
to use.
END
}
 
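# Search the content field for key as a phrase and print each hit,
# highlighting the matched terms with ANSI color codes.
def highlight_search(key)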
  $index.search_each(%Q!content:"#{key}"!) do |id, score|
    puts "*** Document \"#{$index[id][:title]}\" found with a score of #{score}"
    puts "-"*40
    highlights = $index.highlight("content:#{key}", id,
                                  :field => :content,
                                  :pre_tag => "\033[36m",
                                  :post_tag => "\033[m")
    puts "#{highlights}"
    puts ""
  end
end
 
ARGV.each { |key|
  puts "\033[33mSearching for #{key}...\033[m"
  puts ""
  highlight_search(key)
}
 
# Local Variables:
# coding: utf-8
# End:

Here is a screenshot of the output:

rmmseg.png

Cool, right? :D
