Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
645 views
in Technique[技术] by (71.8m points)

python - Match rows of two 2D arrays and get a row indices map using numpy

Suppose you have two 2D arrays A and B, and you want to check, where a row of A is contained in B. How do you do this most efficiently using numpy?

E.g.

a = np.array([[1,2,3],
              [4,5,6],
              [9,10,11]])

b = np.array([[4,5,6],
              [4,3,2],
              [1,2,3],
              [4,8,9]])
map = [[0,2], [1,0]]  # row 0 of a is at row index 2 of array B

I know how to check if a row of A is in B using in1d (test for membership in a 2d numpy array), but this does not yield the indices map.

The purpose of this map is to (finally) merge the two arrays together based on some columns.
Of course one could do this row by row, but this gets very inefficient, since my arrays have the shape (50 Mio., 20).

An alternative would be to use the pandas merge function, but I'd like to do this using numpy only.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Approach #1

Here's one based on views. Makes use of np.argwhere (docs) to return the indices of an element that meet a condition, in this case, membership. -

def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

def argwhere_nd(a,b):
    A,B = view1D(a,b)
    return np.argwhere(A[:,None] == B)

Approach #2

Here's another that would be O(n) and hence much better on performance, especially on large arrays -

def argwhere_nd_searchsorted(a,b):
    A,B = view1D(a,b)
    sidxB = B.argsort()
    mask = np.isin(A,B)
    cm = A[mask]
    idx0 = np.flatnonzero(mask)
    idx1 = sidxB[np.searchsorted(B,cm, sorter=sidxB)]
    return idx0, idx1 # idx0 : indices in A, idx1 : indices in B

Approach #3

Another O(n) one using argsort() -

def argwhere_nd_argsort(a,b):
    A,B = view1D(a,b)
    c = np.r_[A,B]
    idx = np.argsort(c,kind='mergesort')
    cs = c[idx]
    m0 = cs[:-1] == cs[1:]
    return idx[:-1][m0],idx[1:][m0]-len(A)

Sample runs with same inputs as earlier -

In [650]: argwhere_nd_searchsorted(a,b)
Out[650]: (array([0, 1]), array([2, 0]))

In [651]: argwhere_nd_argsort(a,b)
Out[651]: (array([0, 1]), array([2, 0]))

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...