
python - Longest Prefix Matches for URLs

I need information about any standard Python package that can be used for "longest prefix match" on URLs. I have gone through two packages, http://packages.python.org/PyTrie/#pytrie.StringTrie and http://pypi.python.org/pypi/trie/0.1.1, but they don't seem to be useful for the longest-prefix-match task on URLs.

For example, if my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.

Now if I search for 'http://www.google.com/doc', it should return 2, and a search for 'http://www.face' should return 3.
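
A naive reference implementation of the behavior I want (a linear scan with startswith, not scalable, shown only to pin down the semantics):

```python
urls = {
    'http://www.google.com/mail': 1,
    'http://www.google.com/document': 2,
    'http://www.facebook.com': 3,
}

def lookup(query):
    # Return the value of a stored URL that extends `query`, if any.
    matches = [v for url, v in urls.items() if url.startswith(query)]
    return matches[0] if matches else None

assert lookup('http://www.google.com/doc') == 2
assert lookup('http://www.face') == 3
```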

I wanted to confirm whether there is any standard Python package that can help me do this, or whether I should implement a trie for prefix matching.

I am not looking for a regular-expression kind of solution, since that does not scale as the number of URLs increases.

Thanks a lot.



1 Reply


Performance comparison

suffix_tree vs. pytrie vs. trie vs. datrie vs. startswith functions

Setup

The recorded time is the minimum time among 3 repetitions of 1000 searches. Trie construction time is included and spread across all searches. The search is performed on collections of hostnames containing from 1 to 1000000 items.

Three types of search string are used (a sketch of the timing harness follows the list):

  • non_existent_key - there is no match for the string in the collection
  • rare_key - matches around 20 items in a million
  • frequent_key - the number of occurrences is comparable to the collection size
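
A minimal sketch of this timing methodology (`build` and `search` are hypothetical stand-ins, not the actual benchmark code):

```python
import timeit

def bench(build, search, urls, key, nsearches=1000, repeat=3):
    """Minimum time per search over `repeat` runs. The trie (or other
    index) is built inside the timed region, so construction cost is
    spread across all `nsearches` searches, as described above."""
    def run():
        index = build(urls)
        for _ in range(nsearches):
            search(index, key)
    return min(timeit.repeat(run, number=1, repeat=repeat)) / nsearches
```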

Results

Maximum memory consumption for a million URLs:
| function    | memory, GiB | ratio |
|-------------+-------------+-------|
| suffix_tree |       0.853 |   1.0 |
| pytrie      |       3.383 |   4.0 |
| trie        |       3.803 |   4.5 |
| datrie      |       0.194 |   0.2 |
| startswith  |       0.069 |   0.1 |

(The ratio column is relative to suffix_tree.)
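
A minimal sketch of one way to read off peak memory like this on Linux (my illustration; the original numbers may have been collected differently):

```python
import resource

def peak_memory_gib():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS),
    # so divide by 2**20 KiB to get GiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 2.0**20
```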

To reproduce the results, run the trie benchmark code.
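
For reference, the startswith variant amounts to a plain linear scan of this shape (my reconstruction, not the actual benchmark code):

```python
def startswith_search(urls, query):
    # O(N) per search: collect stored URLs that extend `query` and
    # return the longest match; there is no index to build.
    matches = [url for url in urls if url.startswith(query)]
    return max(matches, key=len) if matches else None
```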

  • rare_key / non_existent_key case

    If the number of URLs is less than 10000, datrie is the fastest; for N > 10000, suffix_tree is faster; startswith is significantly slower on average.

[figure: rare_key timings]

  • axes:

    • the vertical (time) scale is ~1 second (2**20 microseconds)
    • the horizontal axis shows the total number of URLs in each case: N = 1, 10, 100, 1000, 10000, 100000, and 1000000 (a million)

[figure: nonexistent_key timings]

  • frequent_key case

    Up to N = 100000, datrie is the fastest (for a million URLs, the time is dominated by trie construction).

    Most of the time is spent finding the longest match among the found matches, so all functions behave similarly, as expected.

[figure: frequent_key timings]

startswith - time performance is independent of the type of key.

trie and pytrie behave similarly to each other.

Performance without trie construction time

  • datrie - the fastest, with decent memory consumption (a usage sketch follows below)

  • startswith is at even more of a disadvantage here, because the other approaches are no longer penalized by the time it takes to build a trie.

  • datrie, pytrie, trie - almost O(1) (constant time) for rare/non_existent keys

[figures: rare_key and nonexistent_key timings, without trie build time]

[figure: frequent_key timings, without trie build time]
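
Putting it together, a minimal sketch of the asker's task using datrie, the overall winner here (assuming `pip install datrie`; the alphabet passed to datrie.Trie must cover every character that can appear in the keys):

```python
import string
import datrie

# Build once; datrie keys must be unicode strings over the alphabet.
trie = datrie.Trie(string.printable)
trie[u'http://www.google.com/mail'] = 1
trie[u'http://www.google.com/document'] = 2
trie[u'http://www.facebook.com'] = 3

# All stored (URL, value) pairs whose URL starts with the query:
print(trie.items(u'http://www.google.com/doc'))
# -> [(u'http://www.google.com/document', 2)]
print(trie.items(u'http://www.face'))
# -> [(u'http://www.facebook.com', 3)]
```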

Fitting (approximating) polynomials of known functions, for comparison (same log/log scale as in the figures). On a log/log plot, a function proportional to N**k appears as a straight line of slope k, so the slope of each fitted line identifies the growth rate:

| Fitting polynomial           | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |
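
A minimal sketch of how such slopes can be read off numerically (my illustration using numpy.polyfit; the fits above may have been produced differently):

```python
import numpy as np

# Synthetic timings that grow like sqrt(N): on a log/log scale they
# form a straight line whose slope is the exponent, 0.5.
N = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
t = 3.0 * np.sqrt(N)
slope, intercept = np.polyfit(np.log2(N), np.log2(t), 1)
print(round(slope, 2))  # -> 0.5, matching the sqrt(N) row above
```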
