I have some DNA data (alleles) for, say, 3 people, with each row representing a SNP. To get some shareable test data, I would like to randomly shuffle the values within each row into a new tibble, producing fake DNA data that doesn't represent a real person.
For example, my initial tibble, data, could look like this:
person_1, person_2, person_3
AA, AG, GG (i.e. data from person_1 person_2 person_3)
AC, CC, AC (i.e. data from person_1 person_2 person_3)
.. .. ..
I would like the result to be like this:
random_1, random_2, random_3
GG, AA, AG (i.e. randomly assigned to person_3, person_1, person_2)
CC, AC, AC (i.e. randomly assigned to person_2, person_3, person_1)
...
I'm already able to do this with the following code:
data %>%
  split(f = 1:nrow(.)) %>%
  purrr::map_dfr(~ .x[, sample(1:ncol(.x), ncol(.x))] %>%
    rename(setNames(object = names(.),
                    nm = paste0("test_", sprintf("%02d", 1:length(.))))
    )
  )
However, my challenge is that my tibble has more than 700,000 rows, which makes the code above extremely slow. I have tried to do the operation via mutate(), rowwise() and across() from the dplyr package, but I have been unsuccessful.
Any suggestions for other approaches that are faster?
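One direction that might be faster, given as a minimal sketch rather than a benchmarked solution: convert the tibble to a matrix and shuffle all rows in one vectorised step by ordering on the row index plus a random key, instead of splitting into one tibble per row. The function name shuffle_rows, the toy data, and the random_ column prefix are hypothetical, and the sketch assumes every column has the same type (character here).

```r
library(tibble)

# Toy stand-in for the real data: two SNPs (rows), three people (columns).
data <- tibble(
  person_1 = c("AA", "AC"),
  person_2 = c("AG", "CC"),
  person_3 = c("GG", "AC")
)

# Hypothetical helper: independently permute the values within each row.
shuffle_rows <- function(df, prefix = "random_") {
  m <- as.matrix(df)

  # Sort positions by row index first, then by a random key. Within each
  # row the random key decides the order, giving a random permutation.
  ord <- order(row(m), runif(length(m)))

  # m[ord] lists row 1's shuffled values, then row 2's, and so on,
  # so the matrix has to be refilled row by row.
  out <- matrix(m[ord], nrow = nrow(m), byrow = TRUE)
  colnames(out) <- paste0(prefix, seq_len(ncol(m)))

  as_tibble(out)
}

shuffled <- shuffle_rows(data)
```

Each row of shuffled should contain the same genotypes as the matching row of data, only in a random column order. Because the work is a single call to order() on about n_rows * n_cols values, it avoids the per-row overhead of split() and map_dfr(), though the actual speed-up on 700,000 rows has not been measured here.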