Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
761 views
in Technique[技术] by (71.8m points)

r - Spread vs dcast

I have a table like this,

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

I want it to have like this,

> head(reshape2::dcast(dt2, 
         Weight + Height + Fitted + interval ~ limit, 
         fun.aggregate = mean))
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

But using tidyr::spread, How can I do that?

I was using,

> tidyr::spread(dt2, limit, value)

But getting the error,

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)

Random 10 Rows::

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Let's say you were starting with data that looked like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

Notice the duplication in rows 5 and 6 of the grouping columns (1 to 5). This is essentially what "tidyr" is telling you. The first row and fifth are duplicates, as are the second and sixth.

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the data form. Here, I've done that with summarise for the "value" column.

If you stopped the execution at the summarise step, you would find that our original 6-row dataset had "shrunk" to 4 rows. Now, spread would work as expected.

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>% 
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")   

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...