Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

substring - How do you select multiple values for grep across multiple columns in R?

This is my first question, so sorry if I do this wrong, and sorry for it being so long...

I have a table of genomes from an entire genus that I would like to compare at a smaller level, such as within one or more species. My table contains 3 columns: p1, p2, and percent identity. Each row is a comparison between two genomes.

p1 and p2 each contain genome names. Within a row, whichever name has the lower leading number is placed in p1 and the one with the higher number goes in p2. The genome names are in the format 1_1_1, so p1 may be 1_1_1 and p2 may be 2_1_1200, but in the next row p1 could be 2_1_1200 if p2 is 3_1_23. The third column is the percent identity between them, which I don't think is relevant here.

Multiple genomes belong to the same species, but they are not in any kind of order. For example, 42, 54, 210, and 694 are the same species. I would like to find only the rows where both p1 and p2 contain these numbers, so 42 to 54, 54 to 210, etc, but not 1 to 42. This species only has 4 genomes, but some have as many as 582 to compare.

So far: They are bacterial genomes, so the genes are not in the same order, and the third digit corresponds to the gene position, so I've been using "^42" to call 42_1_622, for example. I don't want 642_1, so I anchored the 42 to the beginning. All middle digits are 1.

subset_species_1 <- rbind(x[grep("^42_", x$p1), ], 
            x[grep("^42_", x$p2), ], 
            x[grep("^54_", x$p1), ], 
            x[grep("^54_", x$p2), ],
            x[grep("^210_", x$p1), ],
            x[grep("^210_", x$p2), ],
            x[grep("^694_", x$p1), ],
            x[grep("^694_", x$p2), ])

This is obviously tedious, and it gives me all of the rows with any of these genomes in either column, not only rows with these genomes in both columns.

In addition, each table only represents one gene, and ideally I'd like to use the same subsets for every table, of which there are thousands.

Thank you in advance, I need all the help I can get!

Edited to add: I'm doing this in R/Rstudio

question from:https://stackoverflow.com/questions/65922772/how-do-you-select-multiple-values-for-grep-across-multiple-columns-in-r

1 Reply

How about something like this? Rather than using a regex to match the start of each name, why not split off the digits before the first underscore and check whether they are in a pre-defined vector of values? That's what I've done below, with find_vals holding the values I'm looking for.

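library(glue)
library(dplyr)
library(stringr)

# Simulate a table in the same layout as the question's data
set.seed(402943)
dat <- tibble(
  p1 = glue("{sample(1:250, 250, replace=TRUE)}_1_{sample(1:250, 250, replace=TRUE)}"), 
  p2 = glue("{sample(1:250, 250, replace=TRUE)}_1_{sample(1:250, 250, replace=TRUE)}"),
  p = runif(250, 0,1)
)

# Genome numbers to keep; for the species in the question this would be
# c("42", "54", "210", "694")
find_vals <- as.character(42:100)

# Pull out the digits before the first underscore in each column, then keep
# only the rows where both columns are in find_vals
dat %>% mutate(p11 = str_split(p1, "_", simplify=TRUE)[,1], 
              p21 = str_split(p2, "_", simplify=TRUE)[,1]) %>% 
  filter(p11 %in% find_vals & p21 %in% find_vals)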
# A tibble: 16 x 5
#   p1       p2            p p11   p21  
#   <glue>   <glue>    <dbl> <chr> <chr>
# 1 54_1_222 93_1_180 0.626  54    93   
# 2 61_1_47  48_1_47  0.639  61    48   
# 3 74_1_89  99_1_42  0.556  74    99   
# 4 54_1_71  87_1_144 0.287  54    87   
# 5 54_1_10  71_1_140 0.216  54    71   
# 6 57_1_242 79_1_107 0.238  57    79   
# 7 70_1_185 71_1_55  0.538  70    71   
# 8 48_1_140 80_1_139 0.0752 48    80   
# 9 72_1_105 62_1_56  0.213  72    62   
# 10 70_1_241 64_1_220 0.857  70    64   
# 11 57_1_213 97_1_47  0.432  57    97   
# 12 55_1_56  45_1_249 0.907  55    45   
# 13 55_1_9   44_1_156 0.633  55    44   
# 14 59_1_153 96_1_228 0.154  59    96   
# 15 61_1_97  99_1_189 0.556  61    99   
# 16 83_1_56  86_1_85  0.787  83    86   
# 
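
For the species in the question (42, 54, 210, 694), the same idea can be sketched in base R and wrapped in a function so one species definition is reused across every gene table. This is only a sketch under assumptions: subset_species, the toy table x, and the list called tables are made-up names for illustration, and it assumes each table has character columns named p1 and p2 as described in the question.

```r
# Hypothetical helper: keep rows where BOTH p1 and p2 belong to the species.
# `genomes` holds the leading genome numbers, e.g. c(42, 54, 210, 694).
subset_species <- function(x, genomes) {
  genomes <- as.character(genomes)
  # Drop everything from the first underscore onward: "42_1_622" -> "42"
  g1 <- sub("_.*$", "", x$p1)
  g2 <- sub("_.*$", "", x$p2)
  x[g1 %in% genomes & g2 %in% genomes, , drop = FALSE]
}

# Toy table in the question's layout (values are made up for illustration)
x <- data.frame(
  p1 = c("42_1_622", "1_1_5", "54_1_10", "42_1_7", "210_1_3"),
  p2 = c("54_1_30", "42_1_622", "210_1_8", "642_1_9", "694_1_12"),
  percent_identity = c(98.1, 75.0, 97.4, 80.2, 99.0),
  stringsAsFactors = FALSE
)

species_1 <- c(42, 54, 210, 694)
subset_species(x, species_1)   # keeps rows 1, 3 and 5

# Reuse the same species definition for every gene table, assuming the
# tables are held in a list (here just two copies of the toy table)
tables <- list(gene_a = x, gene_b = x)
subsets <- lapply(tables, subset_species, genomes = species_1)
```

Because the comparison is on the whole leading number rather than a pattern, 642_1_9 is not mistaken for genome 42, and a species with hundreds of genomes only needs a longer genomes vector.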

