how to get rid of haplotypecaller variants caused by weird artificial haplotypes
HI,
I am using GATK 4.1.2 I have panel sequencing data for about 700 samples, which i have aligned (BWA), duplicate-marked, base-quality recalibrated, etc. and called gvcf with haplotype caller individually with the following command:
$GATK --java-options "-Xmx28G" HaplotypeCaller -R $REF -I !{NAME}.dm.recalibrated.bam -O !{NAME}.gvcf.gz --dbsnp $DBSNP -ERC GVCF -L $EXOME -ip 100 --bamout !{NAME}.haplotypecaller.bam
Then, i have combined the gvcfs and performed genotyping, followed by hard filtering using default parameters. For this i have used:
gatk-4.1.2.0/gatk CombineGVCFs \
-R /users/UB/UBkrabionet/Resources/hg19/bwa7/hg19.fasta \
--variant /users/UB/UBkrabionet/data/ictus/Predict/IMIM_5750/IMIM_5750.gvcf.gz \ (700 times)
--annotation AS_FisherStrand \
--annotation GenotypeSummaries \
--annotation Coverage \
--annotation AlleleFraction \
--intervals /users/UB/UBkrabionet/Resources/capturekits/agilent/v6/nochr_v6.regions.extended.bed -ip 100 \
-O /users/UB/UBkrabionet/data/ictus/combined/test.combined.g.vcf.gz
.
gatk-4.1.2.0/gatk --java-options "-Xmx30G" GenotypeGVCFs \
-R /users/UB/UBkrabionet/Resources/hg19/bwa7/hg19.fasta \
-V /users/UB/UBkrabionet/data/ictus/combined/all_combined.g.vcf.gz \
--annotation AS_FisherStrand \
--annotation GenotypeSummaries \
--annotation Coverage \
--annotation AlleleFraction \
--heterozygosity 0.0001 \
--indel-heterozygosity 0.00000125 \
--standard-min-confidence-threshold-for-calling 30 \
--intervals /users/UB/UBkrabionet/Resources/capturekits/agilent/v6/nochr_v6.regions.extended.bed -ip 100 \
-O /users/UB/UBkrabionet/data/ictus/combined/standcall30/all_combined_genotyped_heterozygosity_standcall30.vcf.gz
##CalculateGenotypePosteriors standcall
/users/UB/UBkrabionet/Tools/GATK/GATK4.1/gatk-4.1.2.0/gatk --java-options "-Xmx48G" CalculateGenotypePosteriors \
-V all_combined_genotyped_heterozygosity_standcall30.vcf.gz \
-O all_combined_genotyped_heterozygosity_standcall30_posteriors.vcf.gz \
-supporting /users/UB/UBkrabionet/Resources/1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-supporting /users/UB/UBkrabionet/Resources/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
##SelectVariants standcall
/users/UB/UBkrabionet/Tools/GATK/GATK4.1/gatk-4.1.2.0/gatk SelectVariants \
-V all_combined_genotyped_heterozygosity_standcall30_posteriors.vcf.gz \
-select-type SNP \
-O all_combined_genotyped_heterozygosity_standcall30_posteriors_snps.vcf.gz
/users/UB/UBkrabionet/Tools/GATK/GATK4.1/gatk-4.1.2.0/gatk SelectVariants \
-V all_combined_genotyped_heterozygosity_standcall30_posteriors.vcf.gz \
-select-type INDEL \
-O all_combined_genotyped_heterozygosity_standcall30_posteriors_indels.vcf.gz
#Hard Filter snps standcall
/users/UB/UBkrabionet/Tools/GATK/GATK4.1/gatk-4.1.2.0/gatk VariantFiltration \
-V all_combined_genotyped_heterozygosity_standcall30_posteriors_snps.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40" \
-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
-O all_combined_genotyped_heterozygosity_standcall30_posteriors_snps_filtered.vcf.gz
#Hard Filter indels standcall
/users/UB/UBkrabionet/Tools/GATK/GATK4.1/gatk-4.1.2.0/gatk VariantFiltration \
-V all_combined_genotyped_heterozygosity_standcall30_posteriors_indels.vcf.gz \
-filter "QD < 2.0" --filter-name "QD2" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "FS > 200.0" --filter-name "FS200" \
-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
-O all_combined_genotyped_heterozygosity_standcall30_posteriors_indels_filtered.vcf.gz
#MergeVCFs standcall
java -jar /users/UB/UBkrabionet/Tools/Picard/picard.jar MergeVcfs \
I=all_combined_genotyped_heterozygosity_standcall30_posteriors_snps_filtered.vcf.gz \
I=all_combined_genotyped_heterozygosity_standcall30_posteriors_indels_filtered.vcf.gz \
O=all_combined_genotyped_heterozygosity_standcall30_posteriors_snps_indels_filtered.vcf.gz
On my final vcf file, in several samples i see many variants with a low AD (usually below 0.15). Looking at some of them in detail, they consist of many (phased) variants, including large insertions or deletions and some SNPs. I looked at the bamout from haplotype caller, compared to the recalibrated, duplicate marked bam for the same individual, and found that haplotype caller was generating an artificial haplotype supported by a few reads that included multiple (5, 8, 20 and 50 or even 100bp) insertions in the same read. Given that this happens in many different regions in my gene panel, always in a low % of reads, i am assuming this is noise, maybe from some issues during sample prep. Is there a way i can get rid of this?
these variants pass default filter values, and I cannot use VQSR as it is a quite small gene panel. I tried filtering by the AD field in format, and by combining genotype and AD, but no satisfactory result.
I have also tried raising the QD cutoff (to 2.5), and currently I am trying to see if i should use different parameters in the gvcf variant calling step, but any help would be appreciated,
Thanks,
kelly
-
Take a look at the hard filtering section of this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering
-
Hi Bhanu,
we have gone through these recommendations several times, we have also increased some of the thresholds to try to remove these variants, but, when considering all samples, even if they represent only 0.15 of reads in just one sample, these variant positions pass all these filters. When we try to filter by the format field by sample, we get an awfully formatted vcf, and in addition, this yields basically no variants pass (neither these, nor any other).
We have then gone back to the generategvcf step, where we have modifled the -min-pruning, but I am worried that this might remove real variants:
$GATK --java-options "-Xmx28G" HaplotypeCaller -R $REF -I !{READ} -O !{NAME}.gvcf.gz --dbsnp $DBSNP -ERC GVCF -L $EXOME -ip 100 --min-pruning 10 --bamout !{NAME}.haplotypecaller.bam
This seems to get rid of several of the weird calls that we observed, but not all, and I still am not convinced this might be the better option.
-
I discussed with our dev team and we realize this is something to fix in our code. The reason we see this issue is because phased variants are assigned higher likelihoods since they are more different from the reference. Thank you for bringing this up. We will work on fixing this in the future. At this time I cannot guarantee how long it will take us to fix this issue.
A workaround in the meantime would be to use picard filtervcf with argument --MIN_AB 0.2 to filter variants with lower than 0.2 or 0.15 allele balance.
-
Hi Bhanu,
I am working with Kelly on this sequencing data and despite it has been a long time since this post, we are still having the same problem.
We decided to call variants with HaplotypeCaller (GATK v4.1.6) and modified the -min-pruning:
$GATK --java-options "-Xmx28G" HaplotypeCaller -R $REF -I !{READ} -O !{NAME}.gvcf.gz --dbsnp $DBSNP -ERC GVCF -L $EXOME -ip 100 --min-pruning 10 --bamout !{NAME}.haplotypecaller.bam
At first, this filter allowed to eliminate a lot of false large insertions or deletions. However, some unreal variants remained in our final file.
We have proved HaplotypeCaller v4.2.0 to try to solve this issue, but there is no change in comparison with v4.1.6. We do not know how to avoid these variants and we would appreciate any ideas.
Thanks,
-
I am another member of the GATK team and can follow up on this for Bhanu.
After your HaplotypeCaller command, are you running any filtering steps and are those filtering steps eliminating these insertions and deletions? Could you send your false positive and sensitivity rate after filtering so that we can compare with what we would expect?
You should be able to get rid of these insertions and deletions in post-HaplotypeCaller steps. HaplotypeCaller is meant to be very sensitive to allow for haplotype discovery, and our downstream filtering steps determine if the found haplotypes are real or not. Again, we recommend allele balance filtering.
You can also look into our filtering method CNNScoreVariants, which could work well for your data and is pre-trained.
Hope this helps,
Genevieve
Please sign in to leave a comment.
5 comments