Bamout looks much worse than original BAM, false positive calls?
Hi all,
I hope you're well :)
I have a question about GATK 4.1.5.0. A student in my lab is trying to call variants on an individual BAM file, but when i look into some of the specific variants called by GATK HaplotypeCaller I couldn't see them in the original BAM (which looks fine at the evaluated region). However, the bamout shows them, along with what appears to be a bad realignment of the region. Can you please advise if I am looking at this the wrong way?
Region: chr1:28447875-28447892
Screenshot of region before GATK HC:
Screenshot of region after GATK HC:
Entries in the resulting VCF (before VQSR):
1 28447875 . GAGGC G 151.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.430e-01;DP=22;ExcessHet=3.0103;FS=5.378;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=6.89;ReadPosRankSum=0.792;SOR=2.160 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:17,5:22:99:0|1:28447855_GGACGGGAGGACGGA_G:159,0,699:28447855
1 28447880 . G GT 169.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.819e+00;DP=16;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=10.60;ReadPosRankSum=0.587;SOR=1.445 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:11,5:16:99:0|1:28447855_GGACGGGAGGACGGA_G:177,0,447:28447855
1 28447883 . C A 175.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.783e+00;DP=14;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.55;ReadPosRankSum=0.992;SOR=1.022 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:9,5:14:99:0|1:28447855_GGACGGGAGGACGGA_G:183,0,363:28447855
1 28447888 . C G 181.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.280e-01;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=15.14;ReadPosRankSum=2.12;SOR=0.446 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:7,5:12:99:0|1:28447855_GGACGGGAGGACGGA_G:189,0,279:28447855
1 28447892 . C T 181.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.067e+00;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=15.14;ReadPosRankSum=1.06;SOR=0.446 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:7,5:12:99:0|1:28447855_GGACGGGAGGACGGA_G:189,0,279:28447855
On top of how the alignment looks, these variants are not present in gnomAD (except for one in one allele) which makes me think they are artifacts of this realignment.
Exact commands used:
gatk --java-options “-Xmx4g” HaplotypeCaller
-I 94H.recal.bam
-R RefG/hs37d5.fa/
--base-quality-score-threshold 20
-O 94H.g.vcf.gz -ERC GVCF
-G StandardAnnotation
-G AS_StandardAnnotation
-G StandardHCAnnotation
-bamout 94H.bamout.bam
gatk --java-options “-Xmx4g” GenotypeGVCFs
-V 94H.g.vcf.gz
-R RefG/hs37d5.fa/
-O 94H.vcf.gz
The file we are looking at here is 94H.vcf.gz.
Can you please advise if we are doing this correctly or how to correct this issue? Thanks very much for your input and help trying to understand this!
Bets wishes,
Daniela
-
Hi, it's me again! Geraldine pointed out to me on Twitter that it would be better if we showed the alignments in IGV, and so here it is: On the top, the original pre-GATK HC BAM file, the bottom one is the bamout. Thanks to Paty Basurto, PhD student in the lab, for the image :)
Thanks very much!
-
Hi Daniela Robles, we have some documentation on the HaplotypeCaller Algorithm to give more background and could provide insight if this might be from realignment. Have you done any filtering of these variants? I wonder if they would pass filtering steps.
-
Hi Genevieve,
Thanks so so much for your kind answer! We have done VQSR and these variants pass the filters. However, it all looks a bit strange as our false positives seem to be much more than our true positives (which we expected anyway from the large numbers of variants being called in these germline exomes, in excess of 100,000), and the Ti/Tv ratio looks bad:
This is the VQSR commands are:
using gatk4.1.8.1gatk MakeSitesOnlyVcf \-I 94H.vcf.gz \-O 94H_sitesonly.vcfgatk --java-options “-Xmx3g -Xms3g” VariantRecalibrator \
-V 94H_sitesonly.vcf \
--trust-all-polymorphic \
-tranche 100.0 -tranche 99.95 -tranche 99.8 -tranche 99.7 -tranche 99.5 -tranche 99.4 -tranche 99.3 -tranche 99.1 -tranche 98.9 -tranche 98.7 -tranche 98.5 -tranche 98.3 \
-an QD \
-an MQRankSum \
-an ReadPosRankSum \
-an FS \
-an MQ \
-an SOR \
-mode SNP \
--max-gaussians 4 \
-resource:hapmap,known=false,training=true,truth=true,prior=15 resources/b37/hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=true,prior=12 resources/b37/1000G_omni2.5.b37.vcf \
-resource:1000G,known=false,training=true,truth=false,prior=10 resources/b37/1000G_phase1.snps.high_confidence.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=7 resources/b37/dbsnp_138.b37.vcf \
-O 94H_snps.recal \
--tranches-file 94H_snps.tranches \
--rscript-file plot.script.Rgatk --java-options “-Xmx24g -Xms24g” VariantRecalibrator \
-V 94H_sitesonly.vcf \
--trust-all-polymorphic -tranche 98.3 -tranche 98.5 -tranche 98.7 -tranche 98.9 -tranche 99.1 -tranche 99.3 -tranche 99.4 -tranche 99.5 -tranche 99.6 -tranche 99.7 -tranche 99.9 -tranche 100 -tranche 97.9 -tranche 97.5 \
-an ReadPosRankSum \
-an MQRankSum \
-an QD \
-an SOR \
-mode INDEL \
--max-gaussians 4 \
-resource:mills,known=false,training=true,truth=true,prior=12 resources/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=2 resources/b37/dbsnp_138.b37.vcf \
-O 94H_indels.recal \
--tranches-file 94H_indels.tranchesgatk --java-options “-Xmx5g -Xms5g” \
ApplyVQSR \
-V 94H_indel.recalibrated.vcf \
--recal-file 94H_snps.recal \
--tranches-file 94H_snps.tranches \
--truth-sensitivity-filter-level 98.5 \
--create-output-variant-index true \
-mode SNP \
-O 94H.recalibrated.vcfgatk --java-options “-Xmx5g -Xms5g” \
ApplyVQSR \
-V 94H.vcf.gz \
--recal-file 94H_indels.recal \
--tranches-file 94H_indels.tranches \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode INDEL \
-O 94H_indel.recalibrated.vcfResources were copied from the gatk resource bundle b37 directory using the ftp connection.Any help would be greatly appreciated. Thanks so much :)Best wishes,Daniela -
Daniela Robles How many samples are did you use for VQSR?
-
Hi again Genevieve Brandt (she/her) :) this was only one sample. Could that be the problem? Is there a way to filter only one?
Thanks so much!
Daniela
-
Hi Daniela Robles, VQSR needs a lot of data to succeed, we recommend at least 30 exomes. Here is our documentation about VQSR, and you will see under #6 Problems, that it is greedy for data.
Here is an article explaining how to filter with VQSR or Hard Filtering. Another option you could look into is our CNN tools.
-
Thanks so much for your answer Genevieve Brandt (she/her). I have seen the resources you put there and they are very informative, thanks. We had chosen this design as we are looking for very rare disease-associated variants that will not be in these other exomes. If we follow this joint calling methodology with, say the 1000G exomes, will we not lose the sensitivity to detect these?
Thanks so much for answering all our queries! I really appreciate it.
Best wishes,
Daniela -
No problem Daniela Robles, joint calling should make those sites more distinguishable because the VQSR algorithm will more accurately determine true positives and false positives. Do you have more exomes that followed the same experimental design? If so, I would try that pipeline!
I'll also get more information from my team regarding this.
-
Thanks so so much Genevieve Brandt (she/her)! So what we will do is take some other exomes we had, and re-align them the same way as our test case, and have them all go through the pipeline. Hopefully the ti/tv will look better and we won't lose many sites! I will report back on the results if there are any other people that come across this in the future :)
-
Hi Daniela Robles, I have confirmed with my team that we recommend joint calling when looking for rare disease variants. You can also try hard filtering with your single rare disease sample: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering
Thank you for bringing up this discussion to help out other users!
-
Thanks so much Genevieve Brandt (she/her), you are really kind! We are in the middle of following your advice and will report here on the results when we've done it. Thanks a lot again, and hope you have a great weekend :)
-
When inspecting the bamout file, I notice that the query name (column #1) has a mix the of the original read names and reads whose names are HC01, HC02, etc. Can anyone tell me the distinction between these reads? Is the "HC" prefix used to denote reads that have undergone re-alignment by HaplotypeCaller?
-
Hi Nick Dietz
You are right that reads with HC prefix are the haplotypes generated by HaplotypeCaller itself. Those reads can be visualized and even grouped and colored based on the HC tag present in them so that you may distinguish between different haplotypes that HaplotypeCaller detected in the actual alignment.
Regards.
-
Thank you, Gökalp Çelik. Does that mean that HaplotypeCaller is ONLY using the reads prefixed with "HC" to call variants? Or are the other reads (the ones with the original names) being used as well?
-
HaplotypeCaller uses all the informative reads within the active region of the original bam file to perform its de-bruijin graph construction and haplotyping function to get variants. Those HC tagged reads are the result of HaplotypeCaller's work in that active region. Some haplotypes may not be as informative as others but they are not pushed out to bamout by default. There is also a parameter to force this behavior so that you may see all haplotypes generated by HaplotypeCaller regardless of being informative or non-informative.
Please sign in to leave a comment.
15 comments