Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


How to rowwise select random elements in a tibble via dplyr in R?

I have DNA data (alleles) for, say, 3 people, with each row representing a SNP. To produce shareable test data, I would like to randomly sample the values within each row into a new tibble, giving fake DNA data that doesn't represent a real person.

For example, my initial tibble, data, could look like this:

person_1,   person_2,   person_3
AA,         AG,         GG   (i.e. data from person_1, person_2, person_3)
AC,         CC,         AC   (i.e. data from person_1, person_2, person_3)
..          ..          ..

I would like the result to be like this:

random_1,   random_2,   random_3
GG,         AA,         AG   (i.e. randomly assigned from person_3, person_1, person_2)
CC,         AC,         AC   (i.e. randomly assigned from person_2, person_3, person_1)
...

I'm already able to do this with the following code:

data %>% 
  split(f = 1:nrow(.)) %>% 
  purrr::map_dfr(~ .x[,sample(1:ncol(.x),ncol(.x))] %>% 
                   rename( setNames(object = names(.),
                                    nm = paste0("test_",sprintf("%02d", 1:length(.))))
                   )
  )

However, my tibble has more than 700,000 rows, which makes the code above extremely slow. I have tried to do the operation with mutate(), rowwise() and across() from the dplyr package, but without success.

Any suggestions for other approaches that are faster?



1 Reply


We can use pmap (from purrr) with sample.

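library(dplyr)
library(purrr)
library(stringr)
df1 %>%
    pmap_dfr(~ {
        x <- sample(c(...))
        # rename before binding: pmap_dfr() matches columns by name, which
        # would otherwise undo the independent per-row shuffle
        set_names(x, str_c('random_', seq_along(x)))
    })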

-output

# A tibble: 2 x 3
#  random_1 random_2 random_3
#  <chr>    <chr>    <chr>   
#1 AG       AA       GG      
#2 CC       AC       AC    

Another option is to reshape to 'long' format, group by row, shuffle each group with slice_sample, and then reshape back to 'wide'.

library(tidyr)
df1 %>%
   mutate(rn = row_number()) %>% 
   pivot_longer(cols = -rn) %>% 
   group_by(rn) %>% 
   slice_sample(prop = 1) %>% 
   mutate(name = str_c('random_', row_number())) %>% 
   ungroup %>% 
   pivot_wider(names_from = name, values_from = value)
# A tibble: 2 x 4
#     rn random_1 random_2 random_3
#  <int> <chr>    <chr>    <chr>   
#1     1 AG       GG       AA      
#2     2 CC       AC       AC   

rowwise is also an option, but it would be less efficient given that the data has around 700,000 rows.

df1 %>% 
   rowwise %>%
   transmute(col1 = list(sample(c_across(everything())))) %>%
   unnest_wider(c(col1), names_repair =  ~ str_c('random_', seq_along(.)))
# A tibble: 2 x 3
#  random_1 random_2 random_3
#  <chr>    <chr>    <chr>   
#1 AG       AA       GG      
#2 CC       AC       AC      

In base R, this can be done with apply:

out <- as.data.frame(t(apply(df1, 1, sample)))
names(out) <- paste0('random_', seq_along(out))
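
For a table of this size, one further option is to avoid per-row function calls altogether. The sketch below is not from the answer above: it assumes all columns share one type (character here), and shuffle_rows is a hypothetical helper name. It draws one random key per cell and sorts the cells by row and then by key, which permutes each row independently.

```r
# Hypothetical helper (not part of the original answer):
# shuffle the values within each row without calling a function per row.
shuffle_rows <- function(df, prefix = "random_") {
  m <- as.matrix(df)                      # assumes all columns have the same type
  n <- nrow(m)
  k <- ncol(m)
  key <- runif(n * k)                     # one random sort key per cell
  row_id <- rep(seq_len(n), times = k)    # row of each cell (column-major order)
  ord <- order(row_id, key)               # cells grouped by row, random order within a row
  out <- matrix(m[ord], nrow = n, ncol = k, byrow = TRUE)
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, seq_len(k))
  out
}

# Same example data as in the answer
df1 <- data.frame(person_1 = c("AA", "AC"),
                  person_2 = c("AG", "CC"),
                  person_3 = c("GG", "AC"),
                  stringsAsFactors = FALSE)

set.seed(42)  # only to make the example reproducible
shuffled <- shuffle_rows(df1)
shuffled
```

This makes a single order() call over all cells instead of one sample() call per row; whether it is actually faster on the real data would need to be measured.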

data

df1 <- structure(list(person_1 = c("AA", "AC"), person_2 = c("AG", "CC"
), person_3 = c("GG", "AC")), class = "data.frame", row.names = c(NA, 
-2L))
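
When trying any of these approaches on df1, it can help to confirm that the result only reorders values within each row. A minimal sketch of such a check, where same_row_contents is a hypothetical helper name and the base R apply version is used to produce the shuffled table:

```r
# Example data, repeated so the block runs on its own
df1 <- data.frame(person_1 = c("AA", "AC"),
                  person_2 = c("AG", "CC"),
                  person_3 = c("GG", "AC"),
                  stringsAsFactors = FALSE)

# Shuffle within rows using the base R approach
out <- as.data.frame(t(apply(df1, 1, sample)))
names(out) <- paste0("random_", seq_along(out))

# Hypothetical helper: TRUE if every row of `shuffled` contains exactly the
# same values as the matching row of `original`, ignoring their order.
# Assumes at least two columns and no missing values (sort() drops NA).
same_row_contents <- function(original, shuffled) {
  a <- t(apply(as.matrix(original), 1, sort))
  b <- t(apply(as.matrix(shuffled), 1, sort))
  identical(unname(a), unname(b))
}

same_row_contents(df1, out)
```

The check only confirms that no value moved between rows or was lost; it does not test whether each row received its own independent permutation.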
