
python - Vectorize finding closest value in an array for each element in another array

Input

known_array : numpy array; consisting of scalar values only; shape: (m, 1)

test_array : numpy array; consisting of scalar values only; shape: (n, 1)

Output

indices : numpy array; shape: (n, 1); for each value in test_array, the index of the closest value in known_array

residual : numpy array; shape: (n, 1); for each value in test_array, the difference from the closest value in known_array

Example

In [15]: import numpy as np

In [16]: import random

In [17]: known_array = np.array([random.randint(-30,30) for i in range(5)])

In [18]: known_array
Out[18]: array([-24, -18, -13, -30,  29])

In [19]: test_array = np.array([random.randint(-10,10) for i in range(10)])

In [20]: test_array
Out[20]: array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

Sample Implementation (Not fully vectorized)

def find_nearest(known_array, value):
    # Index of the known value closest to `value`.
    idx = (np.abs(known_array - value)).argmin()
    # Signed residual: -diff == value - known_array[idx].
    diff = known_array[idx] - value
    return [idx, -diff]

In [22]: indices = np.zeros(len(test_array))

In [23]: residual = np.zeros(len(test_array))

In [24]: for i in range(len(test_array)):
   ....:     [indices[i], residual[i]] =  find_nearest(known_array, test_array[i])
   ....:     

In [25]: indices
Out[25]: array([ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.])

In [26]: residual
Out[26]: array([  7.,  17.,   7.,  17.,  21.,   9.,  21.,   7.,  15.,  21.])
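
For the benchmarks in the Updates section below, this loop can be wrapped into a single function. A minimal sketch, assuming the name find_nearest_non_vectorized used there refers to exactly this wrapper:

import numpy as np

def find_nearest_non_vectorized(known_array, test_array):
    # Call find_nearest once per test value and collect the results.
    indices = np.zeros(len(test_array), dtype=int)
    residual = np.zeros(len(test_array))
    for i in range(len(test_array)):
        indices[i], residual[i] = find_nearest(known_array, test_array[i])
    return indices, residual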

What is the best way to speed up this task? Cython is an option, but I would always prefer to remove the for loop and keep the code in pure NumPy.


NB: The following Stack Overflow questions were consulted:

  1. Python/Numpy - Quickly Find the Index in an Array Closest to Some Value
  2. Find the index of numerically closest value
  3. Find nearest value in numpy array
  4. Finding the nearest value and return the index of array in Python
  5. finding nearest items across two lists/arrays in Python

Updates

I did some small benchmarks for comparing the non-vectorized and vectorized solution (accepted answer).
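
The function find_nearest_vectorized comes from the accepted answer and is not reproduced in this update; the sketch below is one plausible broadcasting-based version (the same idea as the first snippet in the reply below) and is what these benchmark calls assume:

import numpy as np

def find_nearest_vectorized(known_array, test_array):
    # Broadcast to an (m, n) matrix where entry (i, j) is test_array[j] - known_array[i].
    differences = test_array[np.newaxis, :] - known_array[:, np.newaxis]
    # For each test value (column), index of the known value with the smallest absolute difference.
    indices = np.abs(differences).argmin(axis=0)
    # Signed residual: test value minus its nearest known value.
    residual = differences[indices, np.arange(len(test_array))]
    return indices, residual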

In [48]: [indices1, residual1] = find_nearest_vectorized(known_array, test_array)

In [53]: [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)

In [54]: indices1==indices2
Out[54]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],   dtype=bool)

In [55]: residual1==residual2
Out[55]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [56]: %timeit [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
10000 loops, best of 3: 173 μs per loop

In [57]: %timeit [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
100000 loops, best of 3: 16.8 μs per loop

About a 10-fold speedup!

Clarification

known_array is not sorted.

I ran the benchmarks as given in the answer by @cyborg below.

Case 1: If known_array were sorted

known_array = np.arange(0,1000)
test_array = np.random.randint(0, 100, 10000)
print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name))
    assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))

Speedups:
diffs is x0.4 faster than base.
searchsorted1 is x81.3 faster than base.
searchsorted2 is x107.6 faster than base.

First, for large arrays the diffs method is actually slower; it also eats up a lot of RAM, and my system hung when I ran it on the actual data.
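
Note that base, diffs, searchsorted1, searchsorted2 and time_f are defined in @cyborg's answer and are not reproduced here. For anyone who wants to run the harness stand-alone, here is a rough sketch of what I assume base and time_f look like (the other three correspond to the approaches in the reply below):

import time
import numpy as np

def base(known_array, test_array):
    # Reference implementation: one argmin over known_array per test value.
    return np.array([np.abs(known_array - v).argmin() for v in test_array])

def time_f(func_name, number=100):
    # Average seconds per call of func_name(known_array, test_array),
    # looking the function up by name as the harness above does.
    func = eval(func_name)
    start = time.time()
    for _ in range(number):
        func(known_array, test_array)
    return (time.time() - start) / number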

Case 2: When known_array is not sorted (which represents the actual scenario)

known_array = np.random.randint(0,100,100)
test_array = np.random.randint(0, 100, 100)

Speedups:
diffs is x8.9 faster than base.
AssertionError                            Traceback (most recent call last)
<ipython-input-26-3170078c217a> in <module>()
      5 for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
      6     print func_name + ' is x%.1f faster than base.' % (base_time /  time_f(func_name))
----> 7     assert np.allclose(base(known_array, test_array),  eval(func_name+'(known_array, test_array)'))

AssertionError: 


searchsorted1 is x14.8 faster than base.

I must also note that the approach needs to be memory efficient; otherwise my 8 GB of RAM is not sufficient, whereas for the base case it easily is.
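
To make that concrete, the broadcasting (diffs) approach materialises an m x n float64 matrix, so its footprint is roughly 8*m*n bytes. A back-of-the-envelope check with illustrative sizes (not the actual data):

m, n = 10**5, 10**5              # illustrative sizes for known_array and test_array
bytes_needed = 8 * m * n         # float64 differences matrix
print(bytes_needed / 1e9, 'GB')  # 80.0 GB -- far more than 8 GB of RAM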


1 Reply


If the array is large, you should use searchsorted. For comparison, first the broadcasting approach:

import numpy as np
np.random.seed(0)
known_array = np.random.rand(1000)
test_array = np.random.rand(400)

%%time
# (1000, 400) matrix: entry (i, j) is test_array[j] - known_array[i].
differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
# For each test value (column), index of the closest known value.
indices = np.abs(differences).argmin(axis=0)
# Pick out the matching signed difference for each test value.
residual = np.diagonal(differences[indices,])

output:

CPU times: user 11 ms, sys: 15 ms, total: 26 ms
Wall time: 26.4 ms

searchsorted version:

%%time

# Sort known_array once so it can be binary-searched.
index_sorted = np.argsort(known_array)
known_array_sorted = known_array[index_sorted]

# Nearest neighbours on either side of each test value (clipped to valid indices).
idx1 = np.clip(np.searchsorted(known_array_sorted, test_array), 0, len(known_array_sorted)-1)
idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted)-1)

diff1 = known_array_sorted[idx1] - test_array
diff2 = test_array - known_array_sorted[idx2]

# Pick the closer side and map back to positions in the original (unsorted) array.
indices2 = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
residual2 = test_array - known_array[indices2]

output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 311 μs

We can check that the results are the same:

assert np.all(residual == residual2)
assert np.all(indices == indices2)
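
For reuse, the searchsorted logic can be packaged into one function; a sketch under the same assumptions as above (the function name is just illustrative):

import numpy as np

def find_nearest_searchsorted(known_array, test_array):
    # Sort once, then binary-search each test value against the sorted copy.
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]

    # Candidate neighbours on either side of each test value.
    idx1 = np.clip(np.searchsorted(known_array_sorted, test_array), 0, len(known_array_sorted) - 1)
    idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted) - 1)

    diff1 = known_array_sorted[idx1] - test_array
    diff2 = test_array - known_array_sorted[idx2]

    # Pick whichever neighbour is closer, then map back to indices in the original array.
    indices = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
    residual = test_array - known_array[indices]
    return indices, residual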
