
solr - Understanding Lucene leading wildcard performance

Lucene does not by default allow leading wildcards in search terms, but this can be enabled with:

QueryParser#setAllowLeadingWildcard(true)
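
In context, that looks something like this (a minimal Java sketch; the "name" field and StandardAnalyzer are my assumptions, not part of the original question):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class LeadingWildcardExample {
    public static Query buildQuery(String input) throws ParseException {
        QueryParser parser = new QueryParser("name", new StandardAnalyzer());
        // Off by default; without this, parsing "*ing" throws a ParseException.
        parser.setAllowLeadingWildcard(true);
        return parser.parse(input);
    }
}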

I understand that a leading wildcard prevents Lucene from using the term index to narrow the search: a query with a leading wildcard must enumerate every term in the field to find matches.

How do I demonstrate the performance of a leading wildcard query? When is it OK to use setAllowLeadingWildcard(true)?

I have built a test index with 10 million documents in the form:

{ name: random_3_word_phrase }

The index is 360M on disk.
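
For reference, an index of that shape can be built with a few lines of Lucene (a sketch assuming Lucene 8.x-era APIs; the phrase generator below is my own illustration, not the original test code):

import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BuildTestIndex {
    private static final Random RND = new Random();

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("test-index"));
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < 10_000_000; i++) {
                Document doc = new Document();
                // A single analyzed field, as in the test corpus described above.
                doc.add(new TextField("name", randomPhrase(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }

    // Illustrative stand-in for whatever generates the random 3-word phrases.
    static String randomPhrase() {
        return randomWord() + " " + randomWord() + " " + randomWord();
    }

    static String randomWord() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5; i++) sb.append((char) ('a' + RND.nextInt(26)));
        return sb.toString();
    }
}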

My test queries perform well and I have been unable to actually demonstrate a performance problem. For example, querying for name:*ing produces over 1.1 million documents in less than 1 second. Querying name:*ing* produces over 1.5 million documents in the same time.
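
One way to put numbers on such a query is to time it directly (again a sketch; "test-index" is a placeholder path):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

public class TimeWildcard {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("test-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            long start = System.nanoTime();
            // WildcardQuery itself accepts leading wildcards; the restriction
            // being toggled above lives in QueryParser, not in the query class.
            int hits = searcher.count(new WildcardQuery(new Term("name", "*ing")));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(hits + " hits in " + elapsedMs + " ms");
        }
    }
}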

What is going on here? Why isn't this slow? Is 10,000,000 documents not enough? Do the documents need to contain more than just a single field?



1 Reply


It depends on how much memory you have, and how much of the term index is in memory.

A 360MB total index could be searched quite quickly on any old computer. A 360GB index would take a bit longer... ;)

As an example, I fired up an old 2GB index, and searched for "*e".

On a box with 8GB of RAM, it returned 500K hits in under 5 seconds. On a box with only 1GB of RAM, the same search took about 20 seconds.

To illustrate further, here's some generic C# code that does a "*E*"-style substring search over 10 million random 3-word phrases.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

class WildcardDemo
{
    // Substring to look for; equivalent to a *E* wildcard match.
    static readonly string substring = "E";

    private static readonly Random random = new Random((int)DateTime.Now.Ticks); // thanks to McAden

    // Builds a random string of uppercase letters A-Z.
    private static string RandomString(int size)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < size; i++)
        {
            builder.Append((char)random.Next('A', 'Z' + 1));
        }
        return builder.ToString();
    }

    static void FindSubStringInPhrases()
    {
        // Build 10 million random 3-word phrases in memory.
        List<string> index = new List<string>();
        for (int i = 0; i < 10000000; i++)
        {
            index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
        }

        // Time the full linear scan over the list.
        Stopwatch sw = Stopwatch.StartNew();
        var matches = index.FindAll(SubstringPredicate);
        sw.Stop();
        Console.WriteLine("{0} hits in {1} ms", matches.Count, sw.ElapsedMilliseconds);
    }

    static bool SubstringPredicate(string item)
    {
        return item.Contains(substring);
    }

    static void Main()
    {
        FindSubStringInPhrases();
    }
}

After all 10 million phrases have been loaded into the list, it still only takes about a second for "var matches = index.FindAll(SubstringPredicate);" to return over 4 million hits.

The point is that memory is fast. Once things no longer fit in memory and you have to start swapping to disk, that is when you will see performance hits.

