Haplotypecaller FORMAT:DP is affected by the interval in WGS
I am using Haplotypecaller without BQSR/VQSR for a non-model organism.
I am interested in the depth over a single basepair coordinate, since it shows an anomaly with what I see in the browser (don't sigh yet -- I have read the articles about why this might be and have tried to understand/rule out those reasons. I'm not claiming that I have fully ruled those out, though I haven't been able to rule them in, either).
CP022325.1:69348
I have discovered that by providing different intervals, all of which cover a given base pair coordinate, that the FORMAT:DP is reported inconsistently.
I have read the article on how AD and DP are calculated, and I also found the following forum post:
https://gatk.broadinstitute.org/hc/en-us/community/posts/4406519538587-HaplotypeCaller-inconsistency-with-different-interval-lists
from 3 years ago which notes similar behavior. Based on the response to that post, I read the linked article "When Haplotypecaller and mutect2 do not call an expected variant". That lead me to examine downsampling as a potential cause -- it is not. I both adjusted the downsampling thresholds on haplotypecaller, and additionally used DepthOfCoverage (command below. I did make the attempt to make the filters applied by DepthOfCoverage match the default filters on Haplotypecaller, which is what I am using as I examine why the intervals affect DP).
REQUIRED for all errors and issues:
a) GATK version used:
The Genome Analysis Toolkit (GATK) v4.3.0.0
HTSJDK Version: 3.0.1
Picard Version: 2.27.5
b) Exact command used:
With two samples, I have used both CombineGVCFs without intervals and GenomicsDBImport with intervals. I did get the same results with those tools. I am choosing to use CombineGVCFs here in order to isolate the intervals argument to Haplotypecaller. To examine different sets of intervals, I am using `-L $INTERVAL` in Haplotypecaller, and then running through the same chain. I have also examined the gVCF file itself from haplotypecaller.
```
gatk HaplotypeCaller \
-I $INPUT_BAM \
-R $REFERENCE_GENOME \
-O $OUTPUT_VCF \
-ERC GVCF \
--sample-ploidy 1
# Run CombineGVCFs to merge the GVCF files
gatk CombineGVCFs \
-R $REFERENCE_GENOME \
-V $INPUT_GVCF_1 \
-V $INPUT_GVCF_2 \
-O $OUTPUT_COMBINED_GVCF
gatk GenotypeGVCFs \
-R $REFERENCE_GENOME \
-V gendb://$OUTPUT_COMBINED_GVCF \
-O $OUTPUT_VCF
```
To examine the files with `DepthOfCoverage`:
```
gatk DepthOfCoverage \
-R $REFERENCE_GENOME \
-O $OUTPUT \
-I $INPUT_ALN \
-L $INTERVAL \
--read-filter GoodCigarReadFilter \
--read-filter NonZeroReferenceLengthAlignmentReadFilter \
--read-filter PassesVendorQualityCheckReadFilter \
--read-filter MappingQualityAvailableReadFilter \
--read-filter MappingQualityReadFilter
```
The result of DepthOfCoverage over strictly a single basepair:
```
Locus,Total_Depth,Average_Depth_sample,Depth_for_DA10_DA10
CP022325.1:69348,17,17.00,17
```
Note that the depth is 17 for sample DA10
The result of Haplotypecaller --> GenotypeGVCF run with chromosome length intervals. Note that FORMAT:DP for sample DA10 is 4
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DA10_DA10 TDY1754_TDY1754 CP022325.1 69348 . T A 2920.63 . AC=1;AF=0.5;AN=2;DP=85;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=31.95;SOR=0.824 GT:AD:DP:GQ:PL 0:4,0:4:99:0,100 1:0,79:79:99:2931,0
The result of Haplotypecaller --> GenotypeGVCF with a single basepair interval. Note that FORMAT:DP for sample DA10 is now 16
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DA10_DA10 TDY1754_TDY1754 CP022325.1 69348 . T A 2920.63 . AC=1;AF=0.5;AN=2;DP=97;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=25.36;SOR=0.824 GT:AD:DP:GQ:PL 0:16,0:16:99:0,622 1:0,79:79:99:2931,0
I am, for my application, interested in the AD / DP counts. Possibly I should not use haplotypecaller for this -- maybe DepthOfCoverage would be better?
But, more generally, all else held constant, I am confused as to why FORMAT:DP would be different based on the interval provided.
-
Looking at your results, it looks like only one of those samples gets affected by this change but the other sample is showing a robust result. This could be due to a spanning deletion or a secondary mutation overlapping or neighboring the same interval for DA10.
You mentioned that you have looked at our documents on why variant calling might get affected by certain factors in HaplotypeCaller/Mutect2 but let me rephrase.
Both tools utilize a local reassembly and realignment to capture variants from sequencing data therefore 2 processes are using whatever reference data and read data provided to them to generate reassembled haplotypes. Certain conditions might affect the outcome of this reassembly such as providing too little reference genome information as intervals. Reassembly uses the read data provided within the interval you provide (and plus some due to some internal calculation for intervals). The more reads provided to the reassembly engine the better the resolution of haplotypes occur especially in hard to sequence regions with high GC/repeat content. Sometimes this results in loss of certain variants that you observe by eye in IGV and this is all due to similarities observed in close regions to the actual variant site that you are expecting. It is not necessarily a bad behavior by this tool since even aligners are not 100% sure about where reads should go when Q score is around 30 and less and region to map is hard to read by short reads. Some aligners prefer certain arrangements due to insert size estimations however unless you exactly know your insert sizes it is only an estimate.
Coming back to your question about depths, if you are absolutely sure about using this metric to call certain features in your application the best method would be to use a pileup with very strict parameters to eliminate duplicate, overlapping reads and badly mapped pairs as well as base and mapping qualities below certain thresholds. On the other hand HaplotypeCaller with its latest enhancements on variant calling supports using pileups for variant calls where local reassembly fails to provide better resolution for active regions. You may activate this behavior using the parameter below. Of course we recommend using the latest GATK 4.6.0.0.
--use-pdhmm <Boolean> Partially Determined HMM, an alternative to the regular assembly haplotypes where we
instead construct artificial haplotypes out of the union of the assembly and pileup
alleles. Default value: false. Possible values: {true, false}I hope this helps.
Regards.
-
I wrote a reply, hit submit, and it isn't appearing. Not sure if there is a wait time, so apologies if this is a repeat in part.
I'm trying v4.6.0 now. However,
```
gatk HaplotypeCaller \
-I $INPUT_BAM \
-R $REFERENCE_GENOME \
-O $OUTPUT_VCF \
-ERC GVCF \
--use-pdhmm true
```produces the following error:
```java.lang.IllegalArgumentException: refHaplotype cannot be null
```
which may be expected -- not sure if that is intended to be used with GVCF anyway. Without the GVCF output, the command does execute.I am trying `--pileup-detection` now -- it isn't quite clear to me yet what the difference between this, `--use-pdhmm` and the default mode is from the docs, though -- does there happen to be an article about this?
Back to my region of interest, though: if this were an active region in the sample I am examining, then I could use the --bam-out to see the local re-alignment. Unfortunately, the discrepancy is happening over a region that is entirely REF with MAPQ scores of 60. Do you have any suggestions as to how I can better understand why these reads are being so affected by the interval?
This is what the reads from 69200 to 69600 (from the left hand side of that image to 200 bp beyond the right hand side) look like
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN raw total sequences: 85 # excluding supplementary and secondary reads
SN filtered sequences: 0
SN sequences: 85
SN is sorted: 1
SN 1st fragments: 41
SN last fragments: 44
SN reads mapped: 85
SN reads mapped and paired: 85 # paired-end technology bit set + both mates mapped
SN reads unmapped: 0error rate: 7.843137e-04 # mismatches / bases mapped (cigar)
SN reads properly paired: 84 # proper-pair bit set
SN reads paired: 85 # paired-end technology bit set
SN reads duplicated: 1 # PCR or optical duplicate bit set
SN reads MQ0: 0 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 0
SN supplementary alignments: 0
SN total length: 6375 # ignores clipping
SN total first fragment length: 3075 # ignores clipping
SN total last fragment length: 3300 # ignores clipping
SN bases mapped: 6375 # ignores clipping
SN bases mapped (cigar): 6375 # more accurate
SN bases trimmed: 0
SN bases duplicated: 75
SN mismatches: 5 # from NM fields
SN error rate: 7.843137e-04 # mismatches / bases mapped (cigar)
SN average length: 75
SN average first fragment length: 75
SN average last fragment length: 75
SN maximum length: 75
SN maximum first fragment length: 75
SN maximum last fragment length: 75
SN average quality: 34.0
SN insert size average: 296.8
SN insert size standard deviation: 65.9
SN inward oriented pairs: 26
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
SN percentage of properly paired reads (%): 98.8
MAPQ 60 84
which I read as very high quality alignments (all are MAPQ 60) with only 1 marked as a duplicate. The error rate is less than 0.1%, which is reasonable. -
It is really hard to tell anything from this image. Can you disable View as Pairs option as well as show us the actual variant location that we can assess? You can also share a snippet bam file if you wish us to diagnose the issue. Instructions on how to send a bam snippet can be found in the article below.
https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report
-
The variant location is the BP in identified by the vertical bars, but yes, the browser isn't helpful here.
I've attached a subset of the data. On this data, I ran the following commands to produce a result without specifying an interval, and then again with the interval. I am using GATK 4.6.0 now.
```gatk HaplotypeCaller \
-I $INPUT \
-R $REFERENCE_GENOME \
-O $OUTPUT_VCF \
-ERC GVCF \
--sample-ploidy 1gatk CombineGVCFs \
-R $REFERENCE_GENOME \
-V $INPUT_GVCF_1 \
-V $INPUT_GVCF_2 \
-O $OUTPUT_COMBINED_GVCFgatk GenotypeGVCFs \
-R $REFERENCE_GENOME \
-V $OUTPUT_COMBINED_GVCF \
-O $OUTPUT_VCF```
which yields this
```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DA10_DA10
CP022325.1 69348 . T A 2920.63 . AC=0;AF=0.5;AN=1;DP=85;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.56;SOR=0.824 GT:AD:DP:GQ:PL 0:4,0:4:99:0,119
```
And then I adjusted the interval over which I am calling in haplotypecaller. The other two commands are the same:
```
gatk HaplotypeCaller \
-I $INPUT \
-R $REFERENCE_GENOME \
-O $OUTPUT_VCF \
-ERC GVCF \
--sample-ploidy 1 \
-L "CP022325.1:69330-69370"
```
and this is the output over the specific base pair in question (which is variant in TDY1754)
```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DA10_DA10
CP022325.1 69348 . T A 2920.63 . AC=0;AF=0.5;AN=1;DP=97;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=25.36;SOR=0.824 GT:AD:DP:GQ:PL 0:16,0:16:99:0,585
```I uploaded the subset data -- two bams and their indicies for samples DA10 and TDY1754. They're in:
`cmateusiak_20240805.tar.gz`The reference genome can be downloaded from fungidb:
-
This looks like a bug. I will escalate this issue to a github issue. I will post the issue link here once added.
Regards.
You can check the issue from the link below.
-
I'm not sure that this is helpful, but from looking at the gVCF files, it seems like there are multiple intervals, all with depth information, which span the base pair in question. When the interval is widened, a wider interval, the first in the gVCF file, is used for the depth in the genotyped cohort VCF. The larger spanning interval has lower depth than the narrow interval, which is used when the interval is narrowed.
-
This is probably a bug in the reference confidence emission algorithm when ploidy is set to 1. Our team will look into that. If genotype calls is not important for you but just using DPs then for the time being you may set ploidy to 2 to capture proper DP values no matter what interval you use.
Please sign in to leave a comment.
7 comments