Interval Filtering -- How does it actually work?
This is a general question about the inner workings of interval filtering (i.e. the --intervals / -L parameter with a supplied intervals file). Since this functionality is provided by a good number of GATK tools, I was trying to find documentation or explanations of how the filtering actually proceeds.
I'm wondering, in particular, whether it checks every read (independently of mate information) and filters out the read if it doesn't overlap any provided interval (thus leaving behind singleton reads if one partner in a read pair doesn't overlap with an RoI). What constitutes an overlap -- a single shared position between a read and an interval? Also, do secondary and/or supplementary reads get any special treatment, or are they treated the same way as primary alignments?
-
Hi Alijah O'Connor, have you seen our documentation on intervals and how GATK uses those lists? https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
There is also an option in our tools to change the combining behavior with --interval-set-rule. You can see an explanation in the HaplotypeCaller documentation.
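To illustrate what "combining behavior" means, here is a rough Python sketch of how two interval lists on the same contig could be combined by union versus intersection. This is only a conceptual model: the function names `union` and `intersection` and the (start, end) tuple representation are made up for this example, and it assumes closed coordinates and that only overlapping intervals are merged. It is not GATK's actual code.

```python
# Conceptual sketch only: closed (start, end) intervals on a single contig.
# Function names and data layout are hypothetical, not GATK internals.

def union(a, b):
    """Combine two interval lists, merging intervals that overlap."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intersection(a, b):
    """Keep only the stretches covered by both interval lists."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s <= e:
                out.append((s, e))
    return sorted(out)

list_a = [(100, 200), (500, 600)]
list_b = [(150, 250)]

print(union(list_a, list_b))         # [(100, 250), (500, 600)]
print(intersection(list_a, list_b))  # [(150, 200)]
```

With a union-style rule the tool would look at everything covered by either list; with an intersection-style rule it would look only at the part both lists share.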
-
For sure, but I'm more trying to get at the exact operations that are going on under the hood. The documentation doesn't really help with the questions I posed in the original post.
An additional related question: if only part of a read overlaps an interval, does the entire read get filtered out, or just the part of the read that doesn't overlap?
But I'm most interested in the questions from the original post.
-
Alijah O'Connor, here are the answers to these questions:
- Yes, every read is checked. Each mate is checked independently, so one mate can be filtered out because it does not overlap an interval while the other is kept.
- Yes, an overlap can be just one base. The start and end positions of each read are checked, and a read is kept if it overlaps the interval, even if only at the interval's end position.
- Secondary/supplementary reads are not treated differently with respect to intervals. However, other GATK read filters might remove those reads. You can find more information about those filters in the documentation for the specific tool you are running.
- The whole read is kept if part of it overlaps with an interval. Keep in mind, though, that this is different from how HaplotypeCaller defines regions. With HaplotypeCaller and regions, a read will get clipped if it extends past a region boundary. The rest of the read may be present in the next region, though.
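The behavior described in the points above can be sketched roughly in Python as follows. This is a simplified illustration, not the actual GATK code: the `Read` and `Interval` classes and the `overlaps` and `filter_reads` functions are hypothetical names, and the sketch assumes 1-based, closed (inclusive) coordinates.

```python
from dataclasses import dataclass

# Hypothetical, simplified types for illustration; coordinates are assumed
# to be 1-based and inclusive on both ends.
@dataclass(frozen=True)
class Interval:
    contig: str
    start: int
    end: int

@dataclass(frozen=True)
class Read:
    name: str
    contig: str
    start: int
    end: int
    is_secondary: bool = False
    is_supplementary: bool = False

def overlaps(read, interval):
    """True if the read shares at least one base with the interval."""
    return (read.contig == interval.contig
            and read.start <= interval.end
            and read.end >= interval.start)

def filter_reads(reads, intervals):
    """Keep each read, whole and unclipped, if it overlaps any interval.

    Every read is tested on its own: mate information and the
    secondary/supplementary flags play no part in the decision.
    """
    return [r for r in reads if any(overlaps(r, iv) for iv in intervals)]

intervals = [Interval("chr1", 100, 200)]
reads = [
    Read("pair1/1", "chr1", 51, 100),    # touches the interval at base 100 only
    Read("pair1/2", "chr1", 300, 349),   # mate lies outside the interval
    Read("pair2/1", "chr1", 200, 249, is_supplementary=True),  # touches at base 200
    Read("pair3/1", "chr2", 100, 200),   # same coordinates, different contig
]

kept = filter_reads(reads, intervals)
print([r.name for r in kept])  # ['pair1/1', 'pair2/1']
```

In this toy example pair1/1 is kept on a single-base overlap while its mate pair1/2 is dropped, so pair1/1 ends up as a singleton. The supplementary read pair2/1 is kept on the same rule as any other read, and kept reads retain their full original coordinates.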
Hope this helps!
-
Thank you! Yes, this is very helpful in understanding exactly what behavior to expect when I'm using this parameter.