Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Why is the coverage cut abruptly for a region in the bamout file of Mutect2 in comparison to the region in input files?

5 comments

  • David Benjamin

    ISmolicz, the bamout shows the reads after they are trimmed to fit a local assembly region. The abrupt cutoff is simply the boundary of this assembly region.
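
To make the cutoff concrete, here is a minimal sketch, not GATK code. It treats reads as plain half-open intervals and clips them to a hypothetical assembly region; the coordinates and helper names (`clip_read`, `coverage`) are made up for illustration, and real reads carry CIGARs and base qualities that this ignores.

```python
from collections import Counter

def clip_read(start, end, region_start, region_end):
    """Clip a half-open read interval [start, end) to the region; None if no overlap."""
    s, e = max(start, region_start), min(end, region_end)
    return (s, e) if s < e else None

def coverage(intervals):
    """Per-base depth for a list of half-open intervals."""
    depth = Counter()
    for s, e in intervals:
        for pos in range(s, e):
            depth[pos] += 1
    return depth

# Hypothetical input reads and a hypothetical assembly region [100, 200).
reads = [(60, 160), (90, 190), (150, 250), (180, 280)]
region = (100, 200)

clipped = [c for c in (clip_read(*r, *region) for r in reads) if c is not None]
before, after = coverage(reads), coverage(clipped)

# Depth just outside and just inside each region boundary.
for pos in (99, 100, 199, 200):
    print(pos, "input depth:", before[pos], "clipped depth:", after[pos])
```

Inside the region the depth is unchanged, while one base outside it drops straight to zero. That is the kind of step a bamout shows at an assembly region boundary.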

  • ISmolicz

    Thank you for your reply, David Benjamin.

    Please could you advise on the following:

    1. Is the assembly region referred to in your explanation synonymous with ActiveRegion?

    2. What are the parameters that define the trimming and therefore the boundary of the assembly region? Is it those described in the ActiveRegion determination (HaplotypeCaller and Mutect2) documentation?

    3. Why are the reads trimmed to fit a local assembly region?

    Thank you for your time and help.

  • David Benjamin

    1. Essentially yes, although inside the GATK code there are small amounts of trimming between the two.

    2.  The arguments are:

    • --min-assembly-region-size
    • --max-assembly-region-size
    • --assembly-region-padding
    • --padding-around-indels
    • --padding-around-snps
    • --padding-around-strs

    It is rarely a good idea to change these from their defaults.

    3.  It saves some CPU and it's convenient to have a de Bruijn graph with a single source and single sink vertex.
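
One way to picture the single source and sink is the toy sketch below. It is an illustration under stated assumptions, not the GATK assembler. The sequences, the k-mer size, the window, and all helper names are hypothetical. It assumes the window's first and last k-mers come from the reference, and that untrimmed reads may overhang the window and differ from each other out there.

```python
K = 4
REF = "TACGGATCCGTA"           # hypothetical reference over the window
ALT = "TACGGCTCCGTA"           # same window with one SNP in the middle
WINDOW = (0, len(REF))         # half-open window in reference coordinates
# (start, sequence); the last two reads overhang the right edge and disagree there.
READS = [(0, REF), (0, ALT), (7, "CCGTAGGC"), (7, "CCGTAGTC")]

def clip(start, seq, win_start, win_end):
    """Keep only the part of the read that lies inside the window."""
    lo, hi = max(start, win_start), min(start + len(seq), win_end)
    return seq[lo - start:hi - start]

def de_bruijn(sequences, k):
    """Map each k-mer to the set of k-mers that directly follow it."""
    graph = {}
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph

def sources_and_sinks(graph):
    """Nodes with no incoming edge, and nodes with no outgoing edge."""
    has_incoming = {b for succ in graph.values() for b in succ}
    sources = sorted(n for n in graph if n not in has_incoming)
    sinks = sorted(n for n in graph if not graph[n])
    return sources, sinks

def spell_paths(graph, node, sink):
    """Spell out every path from node to sink (acyclic graphs only)."""
    if node == sink:
        return [node]
    return [node[0] + rest
            for nxt in sorted(graph[node])
            for rest in spell_paths(graph, nxt, sink)]

untrimmed_graph = de_bruijn([seq for _, seq in READS], K)
trimmed_graph = de_bruijn([clip(pos, seq, *WINDOW) for pos, seq in READS], K)

print("untrimmed:", sources_and_sinks(untrimmed_graph))
print("trimmed:  ", sources_and_sinks(trimmed_graph))
print("haplotypes:", spell_paths(trimmed_graph, REF[:K], REF[-K:]))
```

In this toy case the untrimmed graph ends in two different sinks, because the overhanging reads never rejoin. After clipping, every path runs from the same first k-mer to the same last k-mer, so each path is a candidate haplotype covering the same span.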

  • ISmolicz

    Hi David Benjamin,

    Thank you for answering my queries.

    Would it be possible to expand on the answer to Q3 above a little further:

    1. How does trimming lead to a de Bruijn graph with a single source and sink, as you have described, and why is this convenient for this type of analysis?

    2. Would trimming not lead to expected and/or unexpected variants being missed? If so, would the extra CPU cost of not trimming be worth it?

    Thank you for your help.
  • David Benjamin

    1.  Without a single source and sink you need some way of reconciling different paths in the graph (i.e. local haplotypes) in order to make read realignment likelihoods mathematically comparable.  Any scheme of reconciling paths would basically be tantamount to forcing a single source and sink.

    2.  That's the point of padding -- the GATK only trims a safe distance away from any apparent variation.
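
The padding idea can be sketched as follows. This is a simplified model and not the actual GATK trimming logic. The function `trimmed_span`, the event tuples, and the padding values are hypothetical; the padding keys merely echo the `--padding-around-snps` and `--padding-around-indels` arguments listed earlier in the thread, and the numbers are illustrative only.

```python
def trimmed_span(region, events, padding):
    """Smallest span covering every event plus its type-specific padding,
    clamped to the enclosing region.

    region:  (start, end), half-open
    events:  list of (start, end, kind) for apparent variation
    padding: dict mapping kind -> number of bases to keep on each side
    With no events, this sketch returns the region unchanged.
    """
    if not events:
        return region
    lo = min(start - padding[kind] for start, _, kind in events)
    hi = max(end + padding[kind] for _, end, kind in events)
    return max(lo, region[0]), min(hi, region[1])

# Illustrative values only; check the tool documentation for the real defaults.
padding = {"snp": 20, "indel": 75}
region = (1000, 1300)
events = [(1120, 1121, "snp"), (1180, 1183, "indel")]

print(trimmed_span(region, events, padding))
print(trimmed_span(region, [(1010, 1011, "snp")], padding))
```

In this model every event keeps at least its padding on both sides unless the region itself ends sooner, so trimming only removes sequence that lies away from apparent variation.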
