
Questions about Lucene (2): stemming and lemmatization

Question:

I tried out the stemming and lemmatization mentioned in the article:

Reducing a word to its root form, e.g. "cars" to "car"; this operation is called stemming.

Converting a word to its root form, e.g. "drove" to "drive"; this operation is called lemmatization.

The experiment did not work.

The code is as follows:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class TestNorms {

    public void createIndex() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
        Document doc = new Document();
        field.setValue("Hello students was drive");
        doc.add(field);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }

    public void search() throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexReader reader = IndexReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(new TermQuery(new Term("desc", "drove")), 10);
        System.out.println(docs.totalHits);  // 0 hits: the index contains "drive", not "drove"
    }

    public static void main(String[] args) throws IOException {
        TestNorms test = new TestNorms();
        test.createIndex();
        test.search();
    }
}

Neither the singular/plural difference nor the change in word form is matched at all.

Could this be caused by the analyzer?

Answer:

It is indeed the analyzer's fault: StandardAnalyzer performs neither stemming nor lemmatization, so it cannot match across singular/plural forms or changes in word form.
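You can see this directly by printing the tokens an analyzer produces. Below is a minimal sketch of my own (not part of the original code) using the Lucene 3.0 TokenStream API: StandardAnalyzer lowercases and removes stop words such as "was", but it leaves "students" and "drive" untouched, so a query for "drove" or "student" finds nothing.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Tokenize the same text that the question indexes into the "desc" field.
        TokenStream ts = analyzer.tokenStream("desc", new StringReader("Hello students was drive"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            // hello, students, drive: lowercased, stop word dropped, no stemming
            System.out.println(term.term());
        }
        ts.close();
    }
}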

The article describes the basic principles of full-text search. Understanding them helps you understand Lucene better, but it does not mean Lucene follows exactly this basic process.

(1) About stemming

A well-known stemming algorithm is the Porter Stemming Algorithm. Its home page is http://tartarus.org/~martin/PorterStemmer/, and its definition can be found at http://tartarus.org/~martin/PorterStemmer/def.txt.

You can run simple tests through the following page: Porter's Stemming Algorithm Online [http://facweb.cs.depaul.edu/mobasher/classes/csc575/porter.html]

cars –> car

driving –> drive

tokenization –> token

However,

drove –> drove

As you can see, stemming reduces words to their roots by applying rules; it cannot recognize changes of word form such as "drove" to "drive".

The latest Lucene 3.0 already ships the class PorterStemFilter, which implements this algorithm. Unfortunately there is no matching Analyzer, but that is not a problem; we can easily write one ourselves (the Porter filter expects lower-cased input, which is why it is fed from a LowerCaseTokenizer):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

public class PorterStemAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // lowercase first, then apply the Porter stemming rules
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}

Use this analyzer in your program, and singular/plural forms and regular changes of word form will be recognized.

public void createIndex() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
    IndexWriter writer = new IndexWriter(d, new PorterStemAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    Field field = new Field("desc", "", Field.Store.YES, Field.Index.ANALYZED);
    Document doc = new Document();
    field.setValue("Hello students was driving cars professionally");
    doc.add(field);
    writer.addDocument(doc);
    writer.optimize();
    writer.close();
}

public void search() throws IOException {
    Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
    IndexReader reader = IndexReader.open(d);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs docs = searcher.search(new TermQuery(new Term("desc", "car")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "drive")), 10);
    System.out.println(docs.totalHits);
    docs = searcher.search(new TermQuery(new Term("desc", "profession")), 10);
    System.out.println(docs.totalHits);
}
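To check what the stemmer actually wrote into the index, a small sketch like the following (my own addition, assuming the same index directory as above) lists the indexed terms of the "desc" field using Lucene 3.0's TermEnum; the stemmed forms, not the original words, should appear.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class ListTerms {
    public static void main(String[] args) throws IOException {
        Directory d = new SimpleFSDirectory(new File("d:/falconTest/lucene3/norms"));
        IndexReader reader = IndexReader.open(d);
        TermEnum terms = reader.terms();          // enumerates all terms in the index
        while (terms.next()) {
            Term t = terms.term();
            if ("desc".equals(t.field())) {       // only show terms of the "desc" field
                System.out.println(t.text());
            }
        }
        terms.close();
        reader.close();
    }
}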

(2) About lemmatization

Lemmatization, on the other hand, generally relies on a dictionary; only then can "drove" be mapped to "drive".
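To make the contrast with stemming concrete, here is a toy sketch of the dictionary idea written as a Lucene 3.0 TokenFilter (purely illustrative, my own code, not a real lemmatizer or part of any library): each token is looked up in a map from inflected forms to lemmas, which is how "drove" can become "drive" even though no suffix rule would get there. A real lemmatizer replaces the hard-coded map with a full morphological dictionary.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Hypothetical class name; only the two dictionary entries below are "known" to it.
public class SimpleLemmaFilter extends TokenFilter {
    private static final Map<String, String> LEMMAS = new HashMap<String, String>();
    static {
        LEMMAS.put("drove", "drive");  // dictionary entry: inflected form -> lemma
        LEMMAS.put("was", "be");
    }

    private final TermAttribute termAttr;

    public SimpleLemmaFilter(TokenStream input) {
        super(input);
        this.termAttr = addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String lemma = LEMMAS.get(termAttr.term());
        if (lemma != null) {
            termAttr.setTermBuffer(lemma);  // replace the token text with its lemma
        }
        return true;
    }
}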

A search on the web turned up the European languages lemmatizer [http://lemmatizer.org/]. It is developed in C++ for Linux; if you are interested you can give it a try.

First download, compile and install it following the instructions on the site:

libMAFSA is the core of the lemmatizer. All other libraries depend on it. Download the last version from the following page, unpack it and compile:

# tar xzf libMAFSA-0.2.tar.gz

# cd libMAFSA-0.2/

# cmake .

# make

# sudo make install

After this you should install libturglem. You can download it at the same place.

# tar xzf libturglem-0.2.tar.gz

# cd libturglem-0.2

# cmake .

# make

# sudo make install

Next you should install english dictionaries with some additional features to work with.

# tar xzf turglem-english-0.2.tar.gz

# cd turglem-english-0.2

# cmake .

# make

# sudo make install

After installation:

/usr/local/include/turglem contains the header files, used to compile your own code.

/usr/local/share/turglem/english contains the dictionary files; in lemmas.xml you can see the mapping from "drove" to "drive" and from "was" to "be".

/usr/local/lib contains libMAFSA.a, libturglem.a, libturglem-english.a and libtxml.a, the static libraries used to build applications.

The turglem-english-0.2 directory contains an example test program, test_utf8.cpp:

#include <cstdio>       // printf, fgets, feof
#include <cstring>      // strchr
#include <string>
#include <sys/types.h>  // u_int32_t
// plus the turglem lemmatizer headers installed under /usr/local/include/turglem

int main(int argc, char **argv)
{
    char in_s_buf[1024];
    char *nl_ptr;
    tl::lemmatizer lem;

    if (argc != 4)
    {
        printf("Usage: %s words.dic predict.dic flexias.bin\n", argv[0]);
        return -1;
    }

    lem.load_lemmatizer(argv[1], argv[3], argv[2]);

    while (!feof(stdin))
    {
        fgets(in_s_buf, 1024, stdin);
        nl_ptr = strchr(in_s_buf, '\n');
        if (nl_ptr) *nl_ptr = 0;
        nl_ptr = strchr(in_s_buf, '\r');
        if (nl_ptr) *nl_ptr = 0;

        if (in_s_buf[0])
        {
            printf("processing %s\n", in_s_buf);
            tl::lem_result pars;
            size_t pcnt = lem.lemmatize(in_s_buf, pars);
            printf("%d\n", pcnt);

            for (size_t i = 0; i < pcnt; i++)
            {
                std::string s;
                u_int32_t src_form = lem.get_src_form(pars, i);
                s = lem.get_text(pars, i, 0);
                printf("PARADIGM %d: normal form '%s'\n", (unsigned int)i, s.c_str());
                printf("\tpart of speech:%d\n", lem.get_part_of_speech(pars, (unsigned int)i, src_form));
            }
        }
    }
    return 0;
}

Compile this file and link it against the static libraries. Pay attention to the link order, otherwise you may get errors.

g++ -g -o output test_utf8.cpp -L/usr/local/lib/ -lturglem-english -lturglem -lMAFSA -ltxml

Run the compiled program:

./output /usr/local/share/turglem/english/dict_english.auto \
         /usr/local/share/turglem/english/prediction_english.auto \
         /usr/local/share/turglem/english/paradigms_english.bin

Testing it out: although I do not yet fully understand its internals, the effect of lemmatization is clearly visible (the lines "drove" and "was" below are the input typed in):

drove

processing drove

3

PARADIGM 0: normal form 'DROVE'

part of speech:0

PARADIGM 1: normal form 'DROVE'

part of speech:2

PARADIGM 2: normal form 'DRIVE'

part of speech:2

was

processing was

3

PARADIGM 0: normal form 'BE'

part of speech:3

PARADIGM 1: normal form 'BE'

part of speech:3

PARADIGM 2: normal form 'BE'

part of speech:3
