Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
852 views
in Technique[技术] by (71.8m points)

algorithm - Trie (Prefix Tree) in Python

I don't know if this is the place to ask about algorithms. But let's see if I get any answers ... :)

If anything is unclear I'm very happy to clarify things.

I just implemented a Trie in python. However, one bit seemed to be more complicated than it ought to (as someone who loves simplicity). Perhaps someone has had a similar problem?

My aim was to minimize the number of nodes by storing the largest common prefix of a sub-trie in its root. For example, if we had the words stackoverflow, stackbase and stackbased, then the tree would look something like this:

              [s]tack
[o]verflow ______/ \_______ [b]ase
                                  \___ [d]

Note that one can still think of the edges having one character (the first one of the child node).

Find-query is simple to implement. Insertion is not hard, but somewhat more complex than I want.. :(

My idea was to insert the keys one after the other (starting from an empty trie), by first searching for the to-be-inserted key k (Find(k)), and then rearranging/splitting the nodes locally at the place where the find-procedure stops. There turn out to be 4 cases: (Let k be the key we want to insert, and k' be the key of the node, where the search ended)

  1. k is identical to k'
  2. k is a "proper" prefix of k'
  3. k' is a "proper" prefix of k
  4. k and k' share some common prefix, but none of the cases (1), (2) or (3) occur.

It seems that each of the cases are unique and thus imply different modifications of the Trie. BUT: is it really that complex? Am I missing something? Is there a better approach?

Thanks :)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

At a glance, it sounds like you've implemented a Patricia Trie. This approach also is called path compression in some of the literature. There should be copies of that paper that aren't behind the ACM paywall, which will include an insertion algorithm.

There's also another compression method you may want to look at: level compression. The idea behind path compression is to replace strings of single child nodes with a single super node that has a "skip" count. The idea behind level compression is to replace full or nearly full subtrees with a super node with a "degree" count that says how many digits of the key the node decodes. There's also a 3rd approach called width compression, but I'm afraid my memory fails me and I couldn't find a description of it with quick googling.

Level compression can shorten the average path considerably, but insertion and removal algorithms get quite complicated as they need to manage the trie nodes as similarly to dynamic arrays. For the right data sets, level compressed trees can be fast. From what I remember, they're the 2nd fastest approach for storing IP routing tables, the fastest is some sort of hash trie.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...