Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Bamout looks much worse than original BAM, false positive calls?

0

15 comments

  • Avatar
    Daniela Robles

    Hi, it's me again! Geraldine pointed out to me on Twitter that it would be better if we showed the alignments in IGV, and so here it is: On the top, the original pre-GATK HC BAM file, the bottom one is the bamout. Thanks to Paty Basurto, PhD student in the lab, for the image :)

    Thanks very much!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Daniela Robles, we have some documentation on the HaplotypeCaller Algorithm to give more background and could provide insight if this might be from realignment.  Have you done any filtering of these variants? I wonder if they would pass filtering steps.

    0
    Comment actions Permalink
  • Avatar
    Daniela Robles

    Hi Genevieve,

    Thanks so so much for your kind answer! We have done VQSR and these variants pass the filters. However, it all looks a bit strange as our false positives seem to be much more than our true positives (which we expected anyway from the large numbers of variants being called in these germline exomes, in excess of 100,000), and the Ti/Tv ratio looks bad:

    This is the VQSR commands are:

    using gatk4.1.8.1
     
     
    gatk MakeSitesOnlyVcf \-I 94H.vcf.gz \-O 94H_sitesonly.vcf
     
    gatk --java-options “-Xmx3g -Xms3g” VariantRecalibrator \
     -V 94H_sitesonly.vcf \
     --trust-all-polymorphic \
     -tranche 100.0 -tranche 99.95 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.1 -tranche 98.9 -tranche 98.7 -tranche 98.5 -tranche 98.3 \
     -an QD \
     -an MQRankSum \
     -an ReadPosRankSum \
     -an FS \
     -an MQ \
     -an SOR \
     -mode SNP \
     --max-gaussians 4 \
     -resource:hapmap,known=false,training=true,truth=true,prior=15 resources/b37/hapmap_3.3.b37.vcf \
     -resource:omni,known=false,training=true,truth=true,prior=12 resources/b37/1000G_omni2.5.b37.vcf \
     -resource:1000G,known=false,training=true,truth=false,prior=10 resources/b37/1000G_phase1.snps.high_confidence.b37.vcf \
     -resource:dbsnp,known=true,training=false,truth=false,prior=7 resources/b37/dbsnp_138.b37.vcf \
     -O 94H_snps.recal \
     --tranches-file 94H_snps.tranches \
     --rscript-file plot.script.R
     
    gatk --java-options “-Xmx24g -Xms24g” VariantRecalibrator \
    -V 94H_sitesonly.vcf \
     --trust-all-polymorphic -tranche 98.3 -tranche 98.5 -tranche 98.7 -tranche 98.9 -tranche 99.1 -tranche 99.3 -tranche 99.4 -tranche 99.5 -tranche 99.6 -tranche 99.7 -tranche 99.9 -tranche 100 -tranche 97.9 -tranche 97.5 \
     -an ReadPosRankSum \
     -an MQRankSum \
     -an QD \
     -an SOR \
     -mode INDEL \
     --max-gaussians 4 \
     -resource:mills,known=false,training=true,truth=true,prior=12 resources/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
     -resource:dbsnp,known=true,training=false,truth=false,prior=2 resources/b37/dbsnp_138.b37.vcf \
     -O 94H_indels.recal \
     --tranches-file 94H_indels.tranches
     
     
    gatk --java-options “-Xmx5g -Xms5g” \
    ApplyVQSR \
    -V 94H_indel.recalibrated.vcf \
    --recal-file 94H_snps.recal \
    --tranches-file 94H_snps.tranches \
    --truth-sensitivity-filter-level 98.5 \
    --create-output-variant-index true \
    -mode SNP \
    -O 94H.recalibrated.vcf
     
    gatk --java-options “-Xmx5g -Xms5g” \
    ApplyVQSR \
     -V 94H.vcf.gz \
     --recal-file 94H_indels.recal \
     --tranches-file 94H_indels.tranches \
     --truth-sensitivity-filter-level 99.7 \
     --create-output-variant-index true \
     -mode INDEL \
     -O 94H_indel.recalibrated.vcf
     
     
     
     
    Resources were copied from the gatk resource bundle b37 directory using the ftp connection.
     
    Any help would be greatly appreciated. Thanks so much :)
     
    Best wishes,
    Daniela
     
     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Daniela Robles How many samples are did you use for VQSR?

    0
    Comment actions Permalink
  • Avatar
    Daniela Robles

    Hi again Genevieve Brandt (she/her) :) this was only one sample. Could that be the problem? Is there a way to filter only one?

    Thanks so much!

    Daniela

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Daniela Robles, VQSR needs a lot of data to succeed, we recommend at least 30 exomes. Here is our documentation about VQSR, and you will see under #6 Problems, that it is greedy for data. 

    Here is an article explaining how to filter with VQSR or Hard Filtering. Another option you could look into is our CNN tools

    0
    Comment actions Permalink
  • Avatar
    Daniela Robles

    Thanks so much for your answer Genevieve Brandt (she/her). I have seen the resources you put there and they are very informative, thanks. We had chosen this design as we are looking for very rare disease-associated variants that will not be in these other exomes. If we follow this joint calling methodology with, say the 1000G exomes, will we not lose the sensitivity to detect these?

    Thanks so much for answering all our queries! I really appreciate it.

    Best wishes,
    Daniela

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    No problem Daniela Roblesjoint calling should make those sites more distinguishable because the VQSR algorithm will more accurately determine true positives and false positives. Do you have more exomes that followed the same experimental design? If so, I would try that pipeline!

    I'll also get more information from my team regarding this.

    0
    Comment actions Permalink
  • Avatar
    Daniela Robles

    Thanks so so much Genevieve Brandt (she/her)! So what we will do is take some other exomes we had, and re-align them the same way as our test case, and have them all go through the pipeline. Hopefully the ti/tv will look better and we won't lose many sites! I will report back on the results if there are any other people that come across this in the future :)

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Daniela Robles, I have confirmed with my team that we recommend joint calling when looking for rare disease variants. You can also try hard filtering with your single rare disease sample: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering

    Thank you for bringing up this discussion to help out other users!

    0
    Comment actions Permalink
  • Avatar
    Daniela Robles

    Thanks so much Genevieve Brandt (she/her), you are really kind! We are in the middle of following your advice and will report here on the results when we've done it. Thanks a lot again, and hope you have a great weekend :)

    0
    Comment actions Permalink
  • Avatar
    Nick Dietz

    When inspecting the bamout file, I notice that the query name (column #1) has a mix the of the original read names and reads whose names are HC01, HC02, etc. Can anyone tell me the distinction between these reads? Is the "HC" prefix used to denote reads that have undergone re-alignment by HaplotypeCaller?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Nick Dietz

    You are right that reads with HC prefix are the haplotypes generated by HaplotypeCaller itself. Those reads can be visualized and even grouped and colored based on the HC tag present in them so that you may distinguish between different haplotypes that HaplotypeCaller detected in the actual alignment. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Nick Dietz

    Thank you, Gökalp Çelik. Does that mean that HaplotypeCaller is ONLY using the reads prefixed with "HC" to call variants? Or are the other reads (the ones with the original names) being used as well?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    HaplotypeCaller uses all the informative reads  within the active region of the original bam file to perform its de-bruijin graph construction and haplotyping function to get variants. Those HC tagged reads are the result of HaplotypeCaller's work in  that active region. Some haplotypes may not be as informative as others but they are not pushed out to bamout by default. There is also a parameter to force this behavior so that you may see all haplotypes generated by HaplotypeCaller regardless of being informative or non-informative. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk