Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
323 views
in Technique[技术] by (71.8m points)

bioinformatics - Finding overlap in ranges with R

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom   start   stop
 5       100     105
 1       200     250
 9       275     300

rangesB:

chrom    start    stop
  1       200      265
  5       99       106
  9       275      290

for each row in rangesA:

matches <- which((rangesB[,'chrom']  == rangesA[row,'chrom']) &&
                 (rangesB[,'start'] <= rangesA[row, 'start']) &&
                 (rangesB[,'stop'] >= rangesA[row, 'stop']))

I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Use the IRanges/GenomicRanges packages from Bioconductor, which is made for dealing with these exact problems (and scales massively)

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

There are a few appropriate containers for ranges on different chromosomes, one is RangesList

library(IRanges)
rangesA <- split(IRanges(rangesA$start, rangesA$stop), rangesA$chrom)
rangesB <- split(IRanges(rangesB$start, rangesB$stop), rangesB$chrom)
#which rangesB wholly contain at least one rangesA?
ov <- countOverlaps(rangesB, rangesA, type="within")>0

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...