Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
579 views
in Technique[技术] by (71.8m points)

matlab - Time series aggregation efficiency

I commonly need to summarize a time series with irregular timing with a given aggregation function (i.e., sum, average, etc.). However, the current solution that I have seems inefficient and slow.

Take the aggregation function:

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
      aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end

end

Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.

Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:

a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.

Using the profiler, we can find out that the grpIdx line takes about 1/4 of the execution time (.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).

Compare this with the period-indifferent case:

tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.

Is there a more efficient way to aggregate this data?


Timing Results

Taking each response and putting it in a separate function, here are the timing results I get with timeit with Matlab 2015b on Windows 7 with an Intel i7:

    original | 1.32451
      felix1 | 0.35446
      felix2 | 0.16432
    divakar1 | 0.41905
    divakar2 | 0.30509
    divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977

Clarification on groupIndex

An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:

a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

Thus a "year-week" period are uniquely identified by a row of groupIndex. This is effectively handled through calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Method #1

You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix-multiplication and thus have a completely vectorized approach, like so -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array

For collapseFn as @mean, you need to do a bit more work, as shown here -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array

Method #2

In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest this to give appreciable speedup over the original loopy code. Here's the implementation -

n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end

Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn would be applied, so that 1 is essential there.


Bonus

By its name groupIndex seems like would hold integer values, which could be abused to have a more efficient M creation by considering each row of groupIndex as an indexing tuple and thus converting each row of groupIndex into a scalar and finally get a 1D array version of groupIndex. This must be more efficient as the datasize would be 0(n) now. This M could be fed to all the approaches listed in this post. So, we would have M like so -

dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'

m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...