Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
826 views
in Technique[技术] by (71.8m points)

algorithm - How does lucene index documents?

I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?

On the above link, it says Lucene uses this algorithm for indexing:

  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=8

for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}

How does this algorithm provide optimized indexing?

Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).

So once the inverted (term) index - which is based on a Skip-List data structure - is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly, only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologick.

Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea for how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linear). And he points to the paper that describes the particular FST algorithm Lucene uses, too.

For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, not so much make concurrent updates of the index "more amenable" (because you can decide to not re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either are "done right.")


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...