Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Loss of data after HaplotypeCaller

5 comments

  • SkyWarrior

    The bamout file is not your final BAM file to keep. It contains only the haplotypes called by HC. It is intended only for debugging and display purposes, not for keeping as a final BAM.
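
    For reference, you would typically request the bamout alongside the VCF with something like the sketch below (file and reference names here are just placeholders):

        # Emit the reassembled-haplotype BAM for inspection only
        gatk HaplotypeCaller \
            -R reference.fasta \
            -I sample.bam \
            -O sample.vcf.gz \
            -bamout sample.bamout.bam

    If you want to see how HC viewed the region, load the bamout next to your original BAM in a viewer such as IGV; just don't carry it forward in the pipeline.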

  • Bhanu Gandham

    Thank you for your input, SkyWarrior.

    mbesskd5, in addition to that, also take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called

  • mbesskd5

    Thank you both for your prompt response, I really appreciate it!

    I did check the bamout against my initial BAM, and the SNPs appear at the same positions, although they are not covered by all reads in the bamout. However, they are covered by all reads in my initial file. I'm guessing that has to do with the reassembly, as suggested here? https://gatk.broadinstitute.org/hc/en-us/articles/360035891111-Expected-variant-at-a-specific-site-was-not-called

    Also, I read in the link that:

    "Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think."

    What is the unfiltered depth? Am I correct in understanding that although I may see DP=22 in my VCF, the actual coverage of the SNP may be only half of those reads, because low-quality bases are ignored? In that case, doesn't this mean the SNP is not actually a SNP but an error, or in my case DNA damage? Is there a way to filter on depth and keep only SNPs supported by 100% of the reads?
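
    Would comparing the raw depth at the site with the depth after quality filters tell me what the caller actually saw? I was thinking of something roughly like this (the region and thresholds below are just placeholders):

        # Unfiltered depth at the site (roughly what DP in the VCF reports)
        samtools depth -a -r chr1:123456-123456 sample.bam

        # Depth after dropping low-quality bases and reads (closer to what the caller sees)
        samtools depth -a -q 20 -Q 20 -r chr1:123456-123456 sample.bam

    If the filtered depth is much lower than DP, I suppose that would explain the discrepancy I'm seeing.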

    Thanks a lot! :)

  • SkyWarrior

    You may trim out bad reads, but that is not the ultimate solution. All variant callers apply base quality and mapping quality filters that can be adjusted by the user; however, users are often not aware of what is going on inside the whole BAM file, so adjusting these filters is a risky task. Canonically, BQ and MQ above 20 are usually safe to include in your variant calling. Unless you work with very low-coverage samples, there is really nothing to worry about in the result. If you are working with low-coverage samples all the time, however, you may need to readjust your parameters and compare your call strength with each parameter set until you arrive at an optimal solution.
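
    If you do want to experiment, both cutoffs can be raised explicitly on the HaplotypeCaller command line, along the lines of this sketch (the values and file names are only an illustration; check the tool docs for the defaults in your GATK version):

        # Tighten the base-quality and mapping-quality cutoffs used during calling
        gatk HaplotypeCaller \
            -R reference.fasta \
            -I sample.bam \
            -O sample.vcf.gz \
            --min-base-quality-score 20 \
            --minimum-mapping-quality 20

    Keep in mind that raising these too far on low-coverage data will simply remove calls, so compare the resulting call sets before settling on values.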

  • mbesskd5

    Hi SkyWarrior

    You are a star, thank you. I trim bad reads as the first step of my pipeline using AdapterRemoval, and I usually keep quality thresholds at 25-30, never below 20, following this rule throughout the pipeline precisely because I'm working with low-coverage data (ancient DNA). For every dataset I've had, I used GATK and got SNP calls that are not covered by all reads but are above Q25.
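
    In case it helps, my trimming step looks roughly like this (file names are placeholders, and the exact options depend on your AdapterRemoval version):

        # Adapter removal plus quality trimming at Q25, collapsing overlapping read pairs
        AdapterRemoval --file1 sample_R1.fastq.gz --file2 sample_R2.fastq.gz \
            --trimqualities --minquality 25 --trimns \
            --collapse --basename sample_trimmed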

