Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


apply - Overlap across dataframes in R

I am trying to check the overlap between one main file and several other files (overlap_files in the code below).

Main file:

chr1    8014812 8014812
chr1    22371954    22371954
chr1    35328666    35328666

Example of overlap_files:

chr1    8014812 8014812
chr1    22371954    22371954

My code looks like this:

# Load variants
a1 <- read.table("main.txt", header = FALSE)

# Begin looping over the files to compare against
overlap <- lapply(overlap_files, function(x) {
  # Load file "x", skipping empty files
  t <- if (file.size(x) != 0) {
    read.table(x, header = FALSE)
  }
  # Overlap: flag each row of a1 that also appears in t
  apply(a1, 1, function(x)
    ifelse(any(x[1] == t$V1 & x[2] == t$V2 & x[3] == t$V3), '1', '0'))
})

Although the first two rows exist in both files, the output marks the first variant as 0 (it should be 1), the second as 1 (correct) and the third as 0 (correct). This seems to be caused by the difference in length (i.e. 8014812 has 7 digits, while the other two numbers have 8 digits). Is there a way of fixing this? Thank you.



1 Reply


From your example, I am not entirely sure what the separators in your files are (tabs?).

Either way, I would propose the following approach:

  1. Read in files as data frames (one per file)
  2. Use one of the dplyr join functions, e.g. dplyr::inner_join or dplyr::semi_join, which return the rows that match (you can match on multiple columns with the by argument)
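
The two steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the data frames are built in memory as stand-ins for main.txt and one of the overlap files, the default column names V1, V2, V3 from read.table are used, and the helper name flag_overlap is made up for this example. It uses dplyr::left_join rather than inner_join so that every row of the main file is kept and gets a 1/0 flag, as in the question. A join compares the columns with their original types, which avoids the likely cause of the mismatch in the question: apply() first converts the data frame to a character matrix, and numeric columns are then padded with spaces to a common width, so " 8014812" no longer equals 8014812.

```r
library(dplyr)

# Stand-ins for read.table("main.txt") and one overlap file (assumed data)
main <- data.frame(
  V1 = c("chr1", "chr1", "chr1"),
  V2 = c(8014812, 22371954, 35328666),
  V3 = c(8014812, 22371954, 35328666)
)
other <- data.frame(
  V1 = c("chr1", "chr1"),
  V2 = c(8014812, 22371954),
  V3 = c(8014812, 22371954)
)

# Hypothetical helper: add a column `overlap` that is 1 where the row of
# `main` also occurs in `other`, and 0 otherwise
flag_overlap <- function(main, other) {
  # Unique keys of `other`, so the join cannot duplicate rows of `main`
  keys <- mutate(distinct(other, V1, V2, V3), overlap = 1L)
  res <- left_join(main, keys, by = c("V1", "V2", "V3"))
  # Rows without a match come back as NA; turn them into 0
  res$overlap[is.na(res$overlap)] <- 0L
  res
}

res <- flag_overlap(main, other)
print(res)
```

To cover several files, the same helper could be called inside lapply(overlap_files, ...) after reading each file with read.table, skipping empty files as in the question.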
