Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
599 views
in Technique[技术] by (71.8m points)

performance - Python: count occurrences in a list using dict comprehension/generator

I want to write some tests to analyse the efficiency of different operations in python, namely a comparison of dictionary comprehensions and dict generators.

To test this out, I thought I would try a simple example: count the number of words in a list using dictionaries.

Now I know that you can do this using collections.Counter (as per an answer here: How can I count the occurrences of a list item in Python?), but my objective was to test performance an memory.

One "long-hand" way is to do it in a basic loop.

from pprint import pprint

# Read in some text to create example data
with open('text.txt') as f:
    words = f.read().split()

dict1 = {}
for w in words:
    if not dict1.get(w):
        dict1[w] = 1
    else:
        dict1[w] += 1
pprint(dict1)

The result:

{'a': 62,
 'aback': 1,
 'able': 1,
 'abolished': 2,
 'about': 6,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...

Then I got a bit stuck trying to do the same in a dictionary comprehension:

dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }

I got an error:

NameError: global name 'dict2' is not defined

I tried defining the dict up front:

dict2 = {}
dict2  = { w: 1 if not dict2.get(w) else dict2.get(w) + 1
            for w in words }
pprint(dict2)

But of course the counts are all set to 1:

{'a': 1,
 'aback': 1,
 'able': 1,
 'abolished': 1,
 'about': 1,
 'accept': 1,
 'accepted': 1,
 'accord': 1,
 'according': 1,
 'across': 1,
 ...

I had a similar problem with dict comprehension:

dict3 = dict( (w, 1 if not dict2.get(w) else dict2.get(w) + 1)
                for w in words)

So my question is: how can I use a dictionary comprehension/generator most efficiently to count the number of occurrences in a list?

Update: @Rawing suggested an alternative approach {word:words.count(word) for word in set(words)} but that would circumvent the mechanism I am trying to test.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You cannot do this efficiently(at least in terms of memory) using a dict-comprehension, because then you'll have to keep track of current count in another dictionary i.e more memory consumption. Here's how you can do it using a dict-comprehension(not recommended at all :-)):

>>> words = list('asdsadDASDFASCSAASAS')
>>> dct = {}
>>> {w: 1 if w not in dct and not dct.update({w: 1})
                  else dct[w] + 1
                  if not dct.update({w: dct[w] + 1}) else 1 for w in words}
>>> dct
{'a': 2, 'A': 5, 's': 2, 'd': 2, 'F': 1, 'C': 1, 'S': 5, 'D': 2}

Another way will be to sort the words list first then group them using itertools.groupby and then count the length of each group. Here the dict-comprehension can be converted to a generator if you want, but yes this will require reading all words in memory first:

from itertools import groupby
words.sort()
dct = {k: sum(1 for _ in g) for k, g in groupby(words)}

Note that the fastest one of the lot is collections.defaultdict:

d = defaultdict(int)
for w in words: d[w] += 1 

Timing comparisons:

>>> from string import ascii_letters, digits
>>> %timeit words = list(ascii_letters+digits)*10**4; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
10 loops, best of 3: 131 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; Counter(words)
10 loops, best of 3: 169 ms per loop
>>> %timeit words = list(ascii_letters+digits)*10**4; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}
1 loops, best of 3: 315 ms per loop
>>> %%timeit
... words = list(ascii_letters+digits)*10**4
... d = defaultdict(int)
... for w in words: d[w] += 1
... 
10 loops, best of 3: 57.1 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**4
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
10 loops, best of 3: 108 ms per loop

#Increase input size 

>>> %timeit words = list(ascii_letters+digits)*10**5; words.sort(); {k: sum(1 for _ in g) for k, g in groupby(words)}
1 loops, best of 3: 1.44 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; Counter(words)
1 loops, best of 3: 1.7 s per loop
>>> %timeit words = list(ascii_letters+digits)*10**5; dct = {}; {w: 1 if w not in dct and not dct.update({w: 1}) else dct[w] + 1 if not dct.update({w: dct[w] + 1}) else 1 for w in words}

1 loops, best of 3: 3.19 s per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = defaultdict(int)
for w in words: d[w] += 1
... 
1 loops, best of 3: 571 ms per loop
>>> %%timeit
words = list(ascii_letters+digits)*10**5
d = {}
for w in words: d[w] = d.get(w, 0) + 1
... 
1 loops, best of 3: 1.1 s per loop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...