Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - numpy: efficient, large dot products

I am trying to perform a large linear-algebra computation to transform a generic covariance matrix KK_l_obs (shape (NL, NL)) into a map of covariance matrices in a reduced space, Kmap_PC (shape (q, q, X, Y)).

Information about how to construct Kmap_PC for each spatial location is held in other arrays a_map, I0, and k_l_th. The first two have shape (X, Y), and the third has shape (nl, nl). The transformation between the observed and reduced space is handled by eigenvectors E (shape (q, nl)). Note that NL > nl.

A spatial element of Kmap_PC, at spatial index (x, y), is calculated as:

Kmap_PC[..., x, y] = E.dot(
    KK_l_obs[I0[x, y]: I0[x, y] + nl,
             I0[x, y]: I0[x, y] + nl] / a_map[x, y] +
    k_l_th).dot(E.T)

The expression inside the first dot product could in principle be computed for all spatial locations at once using np.einsum, but that would take up hundreds of GB of memory. What I am doing now is looping through the spatial indices of Kmap_PC, which is pretty slow. I could also distribute the calculation using MPI (which could probably give a speedup of 3-4x, since I have 16 cores available).

I'm wondering:

(a) if I can do the computation more efficiently--perhaps explicitly breaking it down into groups of spatial elements; and

(b) if I can improve the memory overhead for those calculations.

Code snippet

import numpy as np
np.random.seed(1)

X = 10
Y = 10
NL = 3000
nl = 1000
q = 7

a_map = 5. * np.random.rand(X, Y)
E = np.random.randn(q, nl)

# construct constant component
m1_ = .05 * np.random.rand(nl, nl)
k_l_th = m1_.dot(m1_)

# construct variable component
m2_ = np.random.rand(NL, NL)
KK_l_obs = m2_.dot(m2_.T)

# where to start in big cov
I0 = np.random.randint(0, NL - nl, (X, Y))

# the slow way
def looping():
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))

    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] / a_map[si[0], si[1]] + k_l_th).dot(E.T)

    return K_PC

def veccalc():
    nl_ = np.arange(nl)[..., None, None]
    I, J = np.meshgrid(nl_, nl_)

    K_s = KK_l_obs[I0[..., None, None] + J, I0[..., None, None] + I]
    K_s = K_s / a_map[..., None, None] + k_l_th[None, None, ...]
    print(K_s.nbytes)

    K_PC = E @ K_s @ E.T
    K_PC = np.moveaxis(K_PC, [0, 1], [-2, -1])

    return K_PC
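
For reference, the size that veccalc prints via K_s.nbytes can be predicted without allocating the array, since K_s has shape (X, Y, nl, nl). The following is a minimal sketch, assuming float64 data and the sample sizes from the snippet above; the helper name k_s_nbytes is hypothetical.

```python
import numpy as np

def k_s_nbytes(X, Y, nl, dtype=np.float64):
    """Bytes needed for a dense array of shape (X, Y, nl, nl)."""
    return X * Y * nl * nl * np.dtype(dtype).itemsize

# Sample sizes from the snippet: X = Y = 10, nl = 1000
sample_bytes = k_s_nbytes(10, 10, 1000)
print(sample_bytes / 1e6, "MB")
```

The footprint grows linearly with the number of spatial locations X * Y, so a larger map quickly exceeds available memory.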

1 Reply


Tweak #1

One simple performance tweak that is often overlooked in NumPy is replacing division with multiplication by a reciprocal. The difference is not noticeable for scalar-by-scalar divisions or for divisions between equal-shaped arrays. It becomes worthwhile when broadcasting is involved, i.e. for divisions between arrays of different shapes or between an array and a scalar. In those cases, multiplying by the precomputed reciprocal can give a noticeable boost. Thus, for the stated problem, we would precompute the reciprocal of a_map and use it for multiplications in place of divisions.

So, at the start do :

r_a_map = 1.0/a_map

Then, within the loop over spatial indices, use it as :

KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] * r_a_map[si[0], si[1]]
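
As a small illustration of the substitution above, the sketch below checks that dividing a block by one entry of a_map and multiplying it by the precomputed reciprocal agree to floating-point tolerance. The toy sizes, the random generator, and the 0.1 offset are assumptions made for this example only, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a_map and one (nl, nl) block of KK_l_obs
a_map = 5.0 * rng.random((4, 4)) + 0.1   # offset keeps the toy divisors away from 0
block = rng.random((50, 50))

# Computed once, outside any loop
r_a_map = 1.0 / a_map

by_division = block / a_map[1, 2]
by_reciprocal = block * r_a_map[1, 2]

same = np.allclose(by_division, by_reciprocal)
print(same)
```

The two results can differ in the last bits because of rounding, which is why the comparison uses np.allclose rather than exact equality.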

Tweak #2

We could use the distributive property of multiplication over addition there :

A*(B + C) = A*B + A*C

Thus, k_l_th, which is added in every iteration but stays constant, could be taken out of the loop and added once after the loop finishes. Its contribution to each spatial element is E.dot(k_l_th).dot(E.T), so we would add this to K_PC, broadcast across all spatial locations.
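
The identity behind this tweak can be checked numerically on a small case. The sketch below uses toy sizes and random stand-ins (A for one block of KK_l_obs, r for one reciprocal entry), which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
q, nl = 3, 20                      # toy sizes, much smaller than in the question

E = rng.standard_normal((q, nl))
A = rng.random((nl, nl))           # stand-in for one block of KK_l_obs
k_l_th = rng.random((nl, nl))      # constant term
r = 0.37                           # stand-in for one entry of r_a_map

# Constant term added inside the projection, as in the original loop
inside = E.dot(A * r + k_l_th).dot(E.T)

# Constant term projected separately and added afterwards
hoisted = E.dot(A * r).dot(E.T) + E.dot(k_l_th).dot(E.T)

equal = np.allclose(inside, hoisted)
print(equal)
```

Because E.dot(k_l_th).dot(E.T) does not depend on the spatial index, it only needs to be computed once.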


Finalizing and benchmarking

Using tweak #1 and tweak #2, we would end up with a modified approach, like so -

def original_mod_app():
    # Tweak #1: precompute reciprocals once, multiply inside the loop
    r_a_map = 1.0/a_map
    K_PC = np.empty((q, q, X, Y))
    inds = np.ndindex((X, Y))
    for si in inds:
        I0_ = I0[si[0], si[1]]
        K_PC[..., si[0], si[1]] = E.dot(
            KK_l_obs[I0_ : I0_ + nl, I0_ : I0_ + nl] *
            r_a_map[si[0], si[1]]).dot(E.T)
    # Tweak #2: add the constant term once, broadcast over (X, Y)
    return K_PC + E.dot(k_l_th).dot(E.T)[:,:,None,None]

Runtime test with the same sample setup as used in the question (original_app here is the question's looping function) -

In [458]: %timeit original_app()
1 loops, best of 3: 1.4 s per loop

In [459]: %timeit original_mod_app()
1 loops, best of 3: 677 ms per loop

In [460]: np.allclose(original_app(), original_mod_app())
Out[460]: True

So, we are getting a speedup of roughly 2x there (1.4 s down to 677 ms).
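
A possible further step along the same lines, not part of the benchmark above: since the reciprocal is a scalar for each spatial location, it can also be applied after the projection, so that only the small (q, q) result is scaled instead of the (nl, nl) block. The sketch below is self-contained and uses small toy sizes (an assumption, not the question's sizes) to compare this variant against the original loop; the function name scaled_after_projection is hypothetical, and whether it is actually faster would need to be timed on the real data.

```python
import numpy as np

rng = np.random.default_rng(1)
X, Y, NL, nl, q = 3, 4, 60, 20, 5          # toy sizes for a quick check

a_map = 5.0 * rng.random((X, Y)) + 0.1     # offset keeps divisors away from 0
E = rng.standard_normal((q, nl))
k_l_th = rng.random((nl, nl))
m2_ = rng.random((NL, NL))
KK_l_obs = m2_.dot(m2_.T)
I0 = rng.integers(0, NL - nl, (X, Y))

def looping():
    # Reference: the loop from the question
    K_PC = np.empty((q, q, X, Y))
    for x, y in np.ndindex(X, Y):
        i0 = I0[x, y]
        K_PC[..., x, y] = E.dot(
            KK_l_obs[i0:i0 + nl, i0:i0 + nl] / a_map[x, y] + k_l_th).dot(E.T)
    return K_PC

def scaled_after_projection():
    r_a_map = 1.0 / a_map
    K_PC = np.empty((q, q, X, Y))
    for x, y in np.ndindex(X, Y):
        i0 = I0[x, y]
        # Project the unscaled block, then scale the (q, q) result
        K_PC[..., x, y] = E.dot(
            KK_l_obs[i0:i0 + nl, i0:i0 + nl]).dot(E.T) * r_a_map[x, y]
    # Constant term added once, broadcast over (X, Y)
    return K_PC + E.dot(k_l_th).dot(E.T)[:, :, None, None]

match = np.allclose(looping(), scaled_after_projection())
print(match)
```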

