Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - How can I keep track of the total transaction amount sent from each account over the last 6 months?

This is my transaction data:

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1055597     9866    9736        2010-12-31  278.9
1053519     9868    8644        2010-12-31  242.8
1052790     9869    8399        2010-12-31  372.2

For each distinct account in the from column, I want to keep track of the total amount it sent over the previous 6 months, evaluated at the date of each of its transactions.

To make this concrete, I will consider only account 5370 here, with the following data:

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
18529       5370    9356        2005-05-31  24.4
13742       5370    5605        2005-08-05  7618.0
9913        5370    8567        2005-09-12  21971.0
2557        5370    5636        2005-11-12  2921.0
18669       5370    8933        2005-11-30  169.2
35900       5370    8483        2006-01-31  71.5
51341       5370    7626        2006-04-11  4214.0
83324       5370    9676        2006-08-31  261.1
100277      5370    9105        2006-10-31  182.0
103444      5370    9772        2006-11-08  16927.0

The very first transaction 5370 made was on 2005-05-31, and there is no record before that, so this is the starting date for 5370 (each distinct account has its own starting date, given by the date of its first transaction). Thus, the total amount sent by 5370 in the last 6 months at that time was just 24.4. The second transaction was made on 2005-08-05. At that time, the total amount sent by 5370 in the last 6 months was 24.4 + 7618.0 = 7642.4. So, the output should be as follows:

id          from    to          date        amount     total_trx_amount_sent_in_last_6month_by_from
<int>       <fctr>  <fctr>      <date>      <dbl>      <dbl>
18529       5370    9356        2005-05-31  24.4       24.4 
13742       5370    5605        2005-08-05  7618.0     (24.4+7618.0)=7642.4
9913        5370    8567        2005-09-12  21971.0    (24.4+7618.0+21971.0)=29613.4
2557        5370    5636        2005-11-12  2921.0     (24.4+7618.0+21971.0+2921.0)=32534.4
18669       5370    8933        2005-11-30  169.2      (7618.0+21971.0+2921.0+169.2)=32679.2
35900       5370    8483        2006-01-31  71.5       (7618.0+21971.0+2921.0+169.2+71.5)=32750.7
51341       5370    7626        2006-04-11  4214.0     (2921.0+169.2+71.5+4214.0)=7375.7
83324       5370    9676        2006-08-31  261.1      (4214.0+261.1)=4475.1
100277      5370    9105        2006-10-31  182.0      (261.1+182.0)=443.1
103444      5370    9772        2006-11-08  16927.0    (261.1+182.0+16927.0)=17370.1

For the calculations, I subtracted 180 days (approx. 6 months) from the transaction date on each line. That is how I chose which amounts should be summed up.
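The 180-day rule can be checked directly on the 5370 rows. Below is a minimal base R sketch, assuming both ends of the window are inclusive (the question does not say how boundary dates are treated); the names `trx`, `window_days` and `total_6m` are chosen here for illustration only.

```r
# Transactions for account 5370, copied from the table above.
trx <- data.frame(
  id     = c(18529, 13742, 9913, 2557, 18669, 35900, 51341, 83324, 100277, 103444),
  from   = 5370,
  date   = as.Date(c("2005-05-31", "2005-08-05", "2005-09-12", "2005-11-12",
                     "2005-11-30", "2006-01-31", "2006-04-11", "2006-08-31",
                     "2006-10-31", "2006-11-08")),
  amount = c(24.4, 7618.0, 21971.0, 2921.0, 169.2, 71.5, 4214.0, 261.1, 182.0, 16927.0)
)

# Assumption: the window is [date - 180 days, date], inclusive at both ends.
window_days <- 180

# For each row, sum the amounts whose date falls inside that row's window.
trx$total_6m <- vapply(seq_len(nrow(trx)), function(i) {
  in_window <- trx$date >= trx$date[i] - window_days & trx$date <= trx$date[i]
  sum(trx$amount[in_window])
}, numeric(1))

print(trx)
```

With a single account this is just a loop over rows; for the full data the same computation would have to be repeated per value of from.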

So, how can I achieve this for the whole data, considering all the distinct accounts?

PS: My data has 1 million rows, so the solution should also run fast on a large dataset.


1 Reply


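One way using dplyr could be:

library(dplyr)

# For every transaction, sum the sender's amounts dated within the
# 180 days up to and including that transaction's date.
df %>%
  group_by(from) %>%
  mutate(total_trx = purrr::map_dbl(date, 
                     ~sum(amount[between(date, .x - 180, .x)])))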

#      id  from    to date        amount total_trx
#    <int> <int> <int> <date>       <dbl>     <dbl>
# 1  18529  5370  9356 2005-05-31    24.4      24.4
# 2  13742  5370  5605 2005-08-05  7618      7642. 
# 3   9913  5370  8567 2005-09-12 21971     29613. 
# 4   2557  5370  5636 2005-11-12  2921     32534. 
# 5  18669  5370  8933 2005-11-30   169.    32679. 
# 6  35900  5370  8483 2006-01-31    71.5   32751. 
# 7  51341  5370  7626 2006-04-11  4214      7376. 
# 8  83324  5370  9676 2006-08-31   261.     4475. 
# 9 100277  5370  9105 2006-10-31   182       443. 
#10 103444  5370  9772 2006-11-08 16927     17370. 

If your data is huge, you can use the same approach in data.table, which might be more efficient.

library(data.table)

# Same window sum per sender; := adds the column to df by reference.
setDT(df)[, total_trx := sapply(date, function(x) 
                         sum(amount[between(date, x - 180, x)])), from]
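Both snippets above rescan all of an account's rows for every transaction, so an account with many transactions costs quadratic time. A possible alternative, not part of the original answer, is to sort each account by date, take a cumulative sum, and look up the window boundaries with base R's findInterval(). The sketch below assumes the same inclusive 180-day window; the helper name `rolling_sum_180` and the data frame `trx` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: windowed sum over [date - window_days, date] for ONE account.
rolling_sum_180 <- function(date, amount, window_days = 180) {
  ord <- order(date)                       # work on rows sorted by date
  d   <- as.numeric(date[ord])             # dates as days since 1970-01-01
  cs  <- c(0, cumsum(amount[ord]))         # cs[k + 1] = sum of the first k amounts
  hi  <- findInterval(d, d)                # count of dates <= current date
  lo  <- findInterval(d - window_days, d,  # count of dates strictly before
                      left.open = TRUE)    # the start of the window
  out <- numeric(length(d))
  out[ord] <- cs[hi + 1] - cs[lo + 1]      # restore the original row order
  out
}

# Account 5370 from the question, used as a check.
trx <- data.frame(
  from   = 5370,
  date   = as.Date(c("2005-05-31", "2005-08-05", "2005-09-12", "2005-11-12",
                     "2005-11-30", "2006-01-31", "2006-04-11", "2006-08-31",
                     "2006-10-31", "2006-11-08")),
  amount = c(24.4, 7618.0, 21971.0, 2921.0, 169.2, 71.5, 4214.0, 261.1, 182.0, 16927.0)
)

# Apply the helper separately to the rows of each sending account.
trx$total_trx <- NA_real_
for (rows in split(seq_len(nrow(trx)), trx$from)) {
  trx$total_trx[rows] <- rolling_sum_180(trx$date[rows], trx$amount[rows])
}

print(trx)
```

Because the result is a difference of cumulative sums, it can differ from the direct sum by floating-point rounding. With data.table loaded, the same helper could presumably be applied per group as setDT(df)[, total_trx := rolling_sum_180(date, amount), by = from].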


...