Coverage by intersecting smaller genomic interval data over larger genomic intervals using R -
i want intersect 2 genomic intervals in r. , want coverage stats of smaller interval on larger interval.
the larger interval data data frame ....
chr start end name val strand chr7 145444998 146102295 ccds5889.1 0 + chr7 146102406 146167735 ccds5889.1 0 + chr7 146167929 146371931 ccds5889.1 0 +
the smaller interval more 2 million rows .
chr start end name val strand phylop chr7 145444386 145444387 ccds5889.1 0 + 0.684764 chr7 145444387 145444388 ccds5889.1 0 + 0.684764 chr7 145444388 145444389 ccds5889.1 0 + 0.684764 chr7 145444389 145444390 ccds5889.1 0 + 0.684764
the interval data in 2nd (from) , 3rd (to) columns in both data frame.
the situation similar to
large interval: [-----] [-----] [--------------] [-------------------] small interval: ||| |||| ||||||||||| |||||||| |||| || ||||||||| || ||||||||
- i want know how of each of larger intervals covered smaller intervals.
- also, associate intersecting $phylop values each of large intervals later access plotting.
the bioconductor iranges , genomicranges packages have findoverlaps
, countoverlaps
, coverage
, other interval-based functions designed these operations. you'd use granges
function represent each of subject
("larger intervals") , query
("smaller intervals") objects above. see installation instructions , vignettes, e.g., browsevignettes("genomicranges")
in little more detail, here's data
sdf <- read.table(textconnection( "chr start end name val strand chr7 145444998 146102295 ccds5889.1 0 + chr7 146102406 146167735 ccds5889.1 0 + chr7 146167929 146371931 ccds5889.1 0 +"), header=true) qdf <- read.table(textconnection( "chr start end name val strand phylop chr7 145444386 145444387 ccds5889.1 0 + 0.684764 chr7 145444387 145444388 ccds5889.1 0 + 0.684764 chr7 145444388 145444389 ccds5889.1 0 + 0.684764 chr7 145444389 145444390 ccds5889.1 0 + 0.684764"), header=true)
and here convert these granges
, find intersection between query , subject
library(genomicranges) subj <- with(sdf, granges(chr, iranges(start, end), strand, val=val)) query <- with(qdf, granges(chr, iranges(start, end), strand, val=val, phylop=phylop, names=name)) intersect(query, subj)
the answer isn't exciting here
> intersect(query, subj) granges 0 ranges , 0 elementmetadata values seqnames ranges strand | seqlengths chr7 na
to little more useful, here's query tiles across total region
tiles <- successiveiranges(rep(100, 950), 900, 145444998) query <- granges("chr7", tiles, "+")
we find intersecting ranges, find subject overlapped each intersecting range, , summarize width of overlapping ranges
int <- intersect(query, subj) tapply(int, subjecthits(findoverlaps(int, subj)), function(x) sum(width(x)))
leading to
1 2 3 65800 6500 20400
Comments
Post a Comment