Why is the coverage cut abruptly for a region in the bamout file of Mutect2 in comparison to the region in input files?
Dear GATK Team,
In the (Notebook) Intro to using Mutect2 for somatic data documentation, there are two IGV images in the section 'REVIEW CALLS WITH IGV', and the TP53 locus at chr17:7,666,402-7,689,550 is being assessed.
Regarding the IGV image focusing on the above locus in tumor.bam and normal.bam (Mutect2 input files in the example) and 2_tumor_normal_m2.bam (Mutect2 bamout file in the example):
Why is coverage seen as a normal distribution in tumor.bam and normal.bam for the region with the somatic call, with reads spanning this coverage, but the coverage has been cut abruptly at either end of the region in 2_tumor_normal_m2.bam?
Thank you for your time and help.
-
ISmolicz The bamout shows the reads after they are trimmed to fit a local assembly region. The abrupt cutoff is simply the boundary of this assembly region.
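A toy illustration of the effect (not GATK code; the read coordinates and the region are made up): clipping every read to one fixed interval leaves the depth inside the interval unchanged and drops it to zero immediately outside.

```python
def coverage(reads, start, end):
    """Per-base depth over [start, end) for reads given as half-open (start, end) intervals."""
    depth = [0] * (end - start)
    for r_start, r_end in reads:
        for pos in range(max(r_start, start), min(r_end, end)):
            depth[pos - start] += 1
    return depth


def clip_to_region(reads, region_start, region_end):
    """Trim each read to the region; drop reads that fall entirely outside it."""
    clipped = []
    for r_start, r_end in reads:
        s, e = max(r_start, region_start), min(r_end, region_end)
        if s < e:
            clipped.append((s, e))
    return clipped


# Hypothetical staggered reads, as in an input BAM.
reads = [(0, 10), (3, 13), (6, 16), (9, 19), (12, 22)]
region = (8, 14)  # hypothetical assembly region

raw_depth = coverage(reads, 0, 22)
bamout_depth = coverage(clip_to_region(reads, *region), 0, 22)

print("input :", raw_depth)
print("bamout:", bamout_depth)
```

The input depth ramps up and down gradually because the reads start and end at different positions; the clipped depth has the same values inside the region and a hard edge at each boundary.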
-
Thank you for your reply David Benjamin.
Please could you advise on the following:
1. Is the assembly region referred to in your explanation synonymous with ActiveRegion?
2. What are the parameters that define the trimming and therefore the boundary of the assembly region? Is it those described in the ActiveRegion determination (HaplotypeCaller and Mutect2) documentation?
3. Why are the reads trimmed to fit a local assembly region?
Thank you for your time and help.
-
1. Essentially yes, although inside the GATK code there are small amounts of trimming between the two.
2. The arguments are:
- --min-assembly-region-size
- --max-assembly-region-size
- --assembly-region-padding
- --padding-around-indels
- --padding-around-snps
- --padding-around-strs
It is rarely a good idea to change these from their defaults.
3. It saves some CPU and it's convenient to have a de Bruijn graph with a single source and single sink vertex.
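To make "a single source and single sink vertex" concrete, here is a minimal toy sketch. It is not the GATK assembler, which does considerably more; the sequences, the k-mer size and the function names are all invented for illustration. When every sequence spans the same region, the graph has one source and one sink, and each source-to-sink path spells a candidate haplotype over that same span. A sequence that begins partway through the region adds a second source.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """De Bruijn graph: vertices are k-mers, edges join k-mers adjacent in a sequence."""
    edges = defaultdict(set)
    vertices = set()
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        vertices.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
    return vertices, edges


def sources_and_sinks(vertices, edges):
    """Vertices with no incoming edge (sources) and no outgoing edge (sinks)."""
    has_incoming = {b for targets in edges.values() for b in targets}
    sources = sorted(v for v in vertices if v not in has_incoming)
    sinks = sorted(v for v in vertices if not edges.get(v))
    return sources, sinks


def haplotypes(edges, source, sink):
    """Spell out every source-to-sink path (assumes the graph is acyclic)."""
    results = []
    stack = [(source, source)]
    while stack:
        vertex, spelled = stack.pop()
        if vertex == sink:
            results.append(spelled)
            continue
        for nxt in edges.get(vertex, ()):
            stack.append((nxt, spelled + nxt[-1]))
    return sorted(results)


ref = "ACGTTGCAAGCT"  # made-up reference over the region
alt = "ACGTTGTAAGCT"  # same span with one SNP

# Both sequences span the whole region: one source, one sink.
vertices, edges = build_graph([ref, alt], k=4)
full_sources, full_sinks = sources_and_sinks(vertices, edges)
paths = haplotypes(edges, full_sources[0], full_sinks[0])

# The alt sequence starts partway through: a second source appears.
p_vertices, p_edges = build_graph([ref, alt[5:]], k=4)
partial_sources, partial_sinks = sources_and_sinks(p_vertices, p_edges)

print(full_sources, full_sinks, paths)
print(partial_sources, partial_sinks)
```

In the second graph the path leaving the extra source spells a shorter sequence than the reference path, so the two no longer cover the same stretch of the genome.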
-
Hi David Benjamin,
Thank you for answering my queries.
Would it be possible to expand on the answer to Q3 above a little further:
1. How does trimming lead to having a Linked De Bruijn graph as you have described, and why is this convenient for this type of analysis?
2. Would trimming not lead to expected and/or unexpected variants being missed? Therefore, is it not worth having higher CPU?
Thank you for your help.
-
1. Without a single source and sink you need some way of reconciling different paths in the graph (i.e. local haplotypes) in order to make read realignment likelihoods mathematically comparable. Any scheme of reconciling paths would basically be tantamount to forcing a single source and sink.
2. That's the point of padding -- the GATK only trims a safe distance away from any apparent variation.
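A rough sketch of the padding idea, under loud assumptions: this is not GATK's trimming logic, the padding values are hypothetical rather than GATK defaults, and padding is assumed to apply symmetrically around each apparent variant. The point is only that the kept span is the variation plus a margin, never narrower than the variation itself.

```python
def trimmed_span(region, variants, padding):
    """Smallest interval covering every variant plus its padding, kept inside region.

    region   -- half-open (start, end) interval
    variants -- list of (position, kind), where kind is a key of padding
    padding  -- dict mapping kind to bases of padding on each side
    Returns None when there is no apparent variation to keep.
    """
    if not variants:
        return None
    start = min(pos - padding[kind] for pos, kind in variants)
    end = max(pos + 1 + padding[kind] for pos, kind in variants)
    return max(start, region[0]), min(end, region[1])


padding = {"snp": 20, "indel": 75}  # hypothetical values, not GATK defaults
region = (1000, 1300)               # hypothetical assembly region
variants = [(1100, "snp"), (1150, "indel")]

span = trimmed_span(region, variants, padding)
print(span)
```

Every read base within the padding distance of an apparent variant stays inside the span, so only sequence well away from the variation is discarded.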