python - Most efficient way to sum huge 2D NumPy array, grouped by ID column?

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:06:54+0000

you can use bincount():

import numpy as np

ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]

print np.bincount(ids, weights=data)

the output is [ 0. 50. 21. 18.], which means the sum of id==0 is 0, the sum of id==1 is 50.