Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - how to apply a generic function over numpy rows?

Before you flag this as a duplicate, let me explain: I read this page and many others, and I still haven't found a solution to my problem.

This is the problem I'm having: given two 2D arrays, I want to apply a function F over the two arrays. F takes as input two 1D arrays.

import numpy as np
a = np.arange(15).reshape([3,5])
b = np.arange(30, step=2).reshape([3,5])

# what is the 'numpy' equivalent of the following?
np.array([np.dot(x,y) for x,y in zip(a,b)])

Please note that np.dot is just for demonstration. The real question here is any generic function F that works over two sets of 1D arrays.

  • vectorizing either fails outright with an error or it applies the function element-by-element, instead of array-by-array (or row-by-row)
  • np.apply_along_axis applies the function iteratively; for example, using the variables defined above, it does F(a[0], b[0]) and combines this with F(a[0], b[1]) and F(a[0], b[2]). This is not what I'm looking for. Ideally, I would want it to stop at just F(a[0], b[0])
  • index slicing / advanced slicing doesn't do what I would like either. For one, if I do something like np.dot(a[np.arange(3)], b[np.arange(3)]) this throws a ValueError saying that shapes (3,5) and (3,5) are not aligned. I don't know how to fix this.

I tried to solve this in any way I could, but the only solution I've come up with that works is using list comprehension. But I'm worried about the cost to performance as a result of using list comprehension. I would like to achieve the same effect using a numpy operation, if possible. How do I do this?



1 Reply


This type of question has been beat to death on SO, but I'll try to illustrate the issues with your framework:

In [1]: a = np.arange(15).reshape([3,5])
   ...: b = np.arange(30, step=2).reshape([3,5])
   ...: 
In [2]: def f(x,y):
   ...:     return np.dot(x,y)

zipped comprehension

The list comprehension approach applies f to the 3 rows of a and b. That is, it iterates on the 2 arrays as though they were lists. At each call, your function gets two 1d arrays. dot can accept other shapes, but for the moment we'll pretend it only works with a pair of 1d arrays:

In [3]: np.array([f(x,y) for x,y in zip(a,b)])
Out[3]: array([  60,  510, 1460])
In [4]: np.dot(a[0],b[0])
Out[4]: 60

vectorize/frompyfunc

np.vectorize iterates over the inputs (with broadcasting, which can be handy) and gives the function scalar values. I'll illustrate with frompyfunc, which returns an object dtype array (and is used by vectorize):

In [5]: vf = np.frompyfunc(f, 2,1)
In [6]: vf(a,b)
Out[6]: 
array([[0, 2, 8, 18, 32],
       [50, 72, 98, 128, 162],
       [200, 242, 288, 338, 392]], dtype=object)

So the result is a (3,5) array; incidentally, summing across columns gets the desired result:

In [9]: vf(a,b).sum(axis=1)
Out[9]: array([60, 510, 1460], dtype=object)

np.vectorize does not make any speed promises.
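That said, np.vectorize does accept a signature parameter that changes its semantics from scalar-by-scalar to row-by-row: with signature='(n),(n)->()' it passes whole 1d arrays to the function and loops over the leading dimension for you. It is still a Python-level loop, so it's a convenience, not a speedup. A minimal sketch:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
b = np.arange(30, step=2).reshape(3, 5)

def f(x, y):
    return np.dot(x, y)

# signature='(n),(n)->()' tells vectorize that each call consumes two
# length-n rows and produces a scalar; it then iterates over the
# leading (row) dimension, broadcasting if the leading shapes differ.
vf = np.vectorize(f, signature='(n),(n)->()')
result = vf(a, b)
# result -> array([  60,  510, 1460])
```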

apply_along_axis

I don't know how you tried to use apply_along_axis. It only takes one array. After a lot of setup it ends up doing (for a 2d array like a):

for i in range(3):
    idx = (i, slice(None))
    outarr[idx] = asanyarray(func1d(arr[idx], *args, **kwargs))

For 3d and larger it makes iteration over the 'other' axes simpler; for 2d it is overkill. In any case it does not speed up the calculations. It is still iteration.

(apply_along_axis takes arr and *args. It iterates on arr, but passes *args whole.)
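To make that "passes *args whole" behavior concrete, here is a sketch where the second array rides along unchanged, so the function sees each row of a paired with all of b at once:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
b = np.arange(30, step=2).reshape(3, 5)

# b.T is passed whole as an extra arg; each call is np.dot(a[i], b.T),
# so out[i, j] = a[i] . b[j] -- every row of a against every row of b,
# not the row-paired result the question asks for.
out = np.apply_along_axis(np.dot, 1, a, b.T)
# out.shape == (3, 3); only the diagonal is the row-paired result
paired = np.diag(out)   # array([  60,  510, 1460])
```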

indexing

np.dot(a[np.arange(3)], b[np.arange(3)])

is the same as

np.dot(a, b)

dot is the matrix product: (3,5) works with (5,3) to produce (3,3). It handles 1d as a special case (see the docs): (3,) with (3,) produces (3,).
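For the dot example specifically, there is a direct array expression: multiply elementwise and sum along each row. This is not generic, but it shows what rewriting F in whole-array terms looks like:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
b = np.arange(30, step=2).reshape(3, 5)

# row-wise dot product with no Python-level loop:
# elementwise multiply, then sum each row
c = (a * b).sum(axis=1)
# c -> array([  60,  510, 1460])
```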

iteration

For a truly generic f(x,y), your only alternative to the zipped list comprehension is an index loop like this:

In [18]: c = np.zeros((a.shape[0]))
In [19]: for i in range(a.shape[0]):
    ...:    c[i] = f(a[i,:], b[i,:])
In [20]: c
Out[20]: array([   60.,   510.,  1460.])

Speed will be similar. (That action can be moved to compiled code with cython, but I don't think you are ready to dive in that deep.)

As noted in a comment, if the arrays are (N,M), and N is small compared to M, this iteration is not costly. That is, a few loops over a big task are ok. They may even be faster if they simplify large array memory management.
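A rough sketch of that point (the shapes here are illustrative, not from the question): with few rows and many columns, each call to f does most of its work in compiled code, so the Python-level loop overhead is a small fraction of the total. Only correctness is asserted, since actual timings vary by machine:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 100_000))   # small N, large M
b = rng.random((4, 100_000))

def f(x, y):
    return np.dot(x, y)

# zipped comprehension
comp = np.array([f(x, y) for x, y in zip(a, b)])

# preallocated index loop
loop = np.zeros(a.shape[0])
for i in range(a.shape[0]):
    loop[i] = f(a[i, :], b[i, :])

# both give the same row-wise results; each f call is a 100_000-element
# compiled dot, so the 4-iteration Python loop costs almost nothing
assert np.allclose(comp, loop)
```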

best

The ideal solution is to rewrite the generic function so it works with 2d arrays, using numpy's compiled functions.

In the matrix multiplication case, einsum has implemented a generalized form of 'sum-of-products' in compiled code:

In [22]: np.einsum('ij,ij->i',a,b)
Out[22]: array([  60,  510, 1460])

matmul also generalizes the product, but works best with 3d arrays:

In [25]: a[:,None,:]@b[:,:,None]    # needs reshape
Out[25]: 
array([[[  60]],

       [[ 510]],

       [[1460]]])
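The trailing singleton dimensions from that matmul can be flattened away to get the same 1d result as einsum; a small usage sketch:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)
b = np.arange(30, step=2).reshape(3, 5)

# (3,1,5) @ (3,5,1) -> (3,1,1); ravel drops the singleton axes
c = (a[:, None, :] @ b[:, :, None]).ravel()
# c -> array([  60,  510, 1460])
```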
