Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
106 views
in Technique[技术] by (71.8m points)

python - Determining duplicate values in an array

Suppose I have an array

a = np.array([1, 2, 1, 3, 3, 3, 0])

How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.

I've come up with a few methods that appear to work:

Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

Set operations

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]

This one is cute but probably illegal (as a isn't actually unique):

np.setxor1d(a, np.unique(a), assume_unique=True)

Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

Sorting

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

Pandas

s = pd.Series(a)
s[s.duplicated()]

Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).


Conclusions

Testing with a 10 million size data set (on a 2.8GHz Xeon):

a = np.random.randint(10**7, size=10**7)

The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.

I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:

u, c = np.unique(a, return_counts=True)
dup = u[c > 1]

This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.

It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...