Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
287 views
in Technique[技术] by (71.8m points)

python - Matrix multiplication with iterator dependency - NumPy

Sometime back this question (now deleted but 10K+ rep users can still view it) was posted. It looked interesting to me and I learnt something new there while trying to solve it and I thought that's worth sharing. I would like to post those ideas/solutions and would love to see people post other possible ways to solve it. I am posting the gist of the question next.

So, we have two NumPy ndarrays a and b of shapes :

a : (m,n,N)
b : (n,m,N)

Let's assume we are dealing with cases where m,n & N are comparable.

The problem is to solve the following multiplication and summation with focus on performance :

def all_loopy(a,b):
    P,Q,N = a.shape
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            for k in range(P):
                for n in range(Q):
                    d[i] += a[k,n,i] * b[n,k,j]
    return d
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I learnt few things along the way trying to find vectorized and faster ways to solve it.

1) First off, there is a dependency of iterators at "for j in range(i)". From my previous experience, especially with trying to solve such problems on MATLAB, it appeared that such dependency could be taken care of with a lower triangular matrix, so np.tril should work there. Thus, a fully vectorized solution and not so memory efficient solution (as it creates an intermediate (N,N) shaped array before finally reducing to (N,) shaped array) would be -

def fully_vectorized(a,b):
    return np.tril(np.einsum('ijk,jil->kl',a,b),-1).sum(1)

2) Next trick/idea was to keep one loop for the iterator i in for i in range(N), but insert that dependency with indexing and using np.einsum to perform all those multiplications and summations. The advantage would be memory-efficiency. The implementation would look like this -

def einsum_oneloop(a,b):
    d = np.zeros(N)
    for i in range(N):
        d[i] = np.einsum('ij,jik->',a[:,:,i],b[:,:,np.arange(i)])
    return d

There are two more obvious ways to solve it. So, if we start working from the original all_loopy solution, one could keep the outer two loops and use np.einsum or np.tensordot to perform those operations and thus remove the inner two loops, like so -

def tensordot_twoloop(a,b):
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.tensordot(a[:,:,i],b[:,:,j], axes=([1,0],[0,1]))        
    return d

def einsum_twoloop(a,b):
    d = np.zeros(N)
    for i in range(N):
        for j in range(i):
            d[i] += np.einsum('ij,ji->',a[:,:,i],b[:,:,j])
    return d

Runtime test

Let's compare all the five approaches posted thus far to solve the problem, including the one posted in the question.

Case #1 :

In [26]: # Input arrays with random elements
    ...: m,n,N = 20,20,20
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...: 

In [27]: %timeit all_loopy(a,b)
    ...: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...: 
10 loops, best of 3: 79.6 ms per loop
100 loops, best of 3: 4.97 ms per loop
1000 loops, best of 3: 1.66 ms per loop
1000 loops, best of 3: 585 μs per loop
1000 loops, best of 3: 684 μs per loop

Case #2 :

In [28]: # Input arrays with random elements
    ...: m,n,N = 50,50,50
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...: 

In [29]: %timeit all_loopy(a,b)
    ...: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...: 
1 loops, best of 3: 3.1 s per loop
10 loops, best of 3: 54.1 ms per loop
10 loops, best of 3: 26.2 ms per loop
10 loops, best of 3: 27 ms per loop
10 loops, best of 3: 23.3 ms per loop

Case #3 (Leaving out all_loopy for being very slow) :

In [30]: # Input arrays with random elements
    ...: m,n,N = 100,100,100
    ...: a = np.random.rand(m,n,N)
    ...: b = np.random.rand(n,m,N)
    ...: 

In [31]: %timeit tensordot_twoloop(a,b)
    ...: %timeit einsum_twoloop(a,b)
    ...: %timeit einsum_oneloop(a,b)
    ...: %timeit fully_vectorized(a,b)
    ...: 
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 744 ms per loop
1 loops, best of 3: 568 ms per loop
1 loops, best of 3: 866 ms per loop

Going by the numbers, einsum_oneloop looks pretty good to me, whereas fully_vectorized could be used when dealing with small to decent sized arrays!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...