Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
525 views
in Technique[技术] by (71.8m points)

data.table - R- How can I update column values only pertaining to rows where column contains beginning of a certain number?

I have a dataset that looks as follows:

ex <- data.frame("id" =c(rep(1234,6)),
                 "case" = c(45,45,45,92,92,93),
                          "cpt" = c(20600,25263,8456,2345,9023,"C1245"),
                         "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                                  "2019-03-01", "2019-02-18", "2019-03-12"),
                         "window"=c(10,20,NA,45,NA,NA),
                         "pay"=c(520,140,2200,230,600,700))

I essentially have two questions: How can I update the cpt column with the cpt value corresponding to the row with the highest pay grouped by case only focusing on any row containing a cpt value beginning with number 2 AND following this, how I can update the window column to replace any value that isn't the maximum with NA (also only focusing on rows where cpt value starts with 2)

I would like the final output to look like so:

ex1 <- data.frame("id" =c(rep(1234,6)),
                 "case" = c(45,45,45,92,92,93),
                 "cpt" = c(20600,20600,20600,2345,2345,"C1245"),
                 "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                          "2019-03-01", "2019-02-18", "2019-03-12"),
                 "window"=c(10,NA,NA,45,NA,NA),
                 "pay"=c(520,140,2200,230,600,700))

I have two codes

ex[, cpt:=cpt[which.max(pay)], by=case]

ex %>% group_by(case) %>% mutate(window=replace(window, (pay < max(pay)), NA)) %>%
  setDT()

but both of these codes apply to the overall max value and will produce

ex2 <- data.frame("id" =c(rep(1234,6)),
                  "case" = c(45,45,45,92,92,93),
                  "cpt" = c(8456,8456,8456,9023,9023,"C1245"),
                  "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                           "2019-03-01", "2019-02-18", "2019-03-12"),
                  "window"=c(NA,NA,NA,NA,NA,NA),
                  "pay"=c(520,140,2200,230,600,700))

Is there any way I could alter the codes I have to get the output I'm looking for? Or any other code that might be more efficient? Thanks in advance!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

We create a logical vector (i1) based on the first character of 'cpt' (substr - as the OP is interested in those elements that start with "2"), grouped by 'id', 'case', if there are any 'i1', then use that to subset the 'pay' get the index of the 'max' element with which.max and use that to subset the subset of 'cpt' (cpt[i1]) or else return the original column 'cpt', then we replace the 'window' where the 'cpt' is not the same as the duplicate column 'cpt1' (original column) to NA and ungroup

library(dplyr)
ex %>% 
  mutate(cpt1 = cpt) %>%
  group_by(id, case) %>% 
  mutate(i1 = substr(cpt, 1, 1) ==  "2", 
     cpt = if(any(i1)) cpt[i1][which.max(pay[i1])] else cpt,
     window = replace(window, cpt != cpt1, NA),
           i1 = NULL, cpt1 = NULL) %>%
    ungroup

-output

# A tibble: 6 x 6
#     id  case cpt   date       window   pay
#  <dbl> <dbl> <chr> <chr>       <dbl> <dbl>
#1  1234    45 20600 2019-05-21     10   520
#2  1234    45 20600 2019-08-22     NA   140
#3  1234    45 20600 2019-04-12     NA  2200
#4  1234    92 2345  2019-03-01     45   230
#5  1234    92 2345  2019-02-18     NA   600
#6  1234    93 C1245 2019-03-12     NA   700

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...