VCF allelic depth (AD) supports the reference or alternate?
I used GATK 4.2.5 HaplotypeCaller to generate vcfs for my samples. I called them as haploids and and am looking for clarification regarding allelic depth. The format is as follows:
GT:AD:DP:GQ:PL 1:5,14:19:99:405,0
I could not find this answer elsewhere, but I am confused about the allelic depth order. For all heterozygous calls, the first allelic depth is less than the second (e.g. 5,14). Does this mean that 14 reads support the alternate allele here because the genotype was 1? For example, if it was "0:5,14:19:99:405,0", would this mean 14 reads support the reference?
I see this in the vcf header, but the second number is always larger.
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
Maybe this is because I called it as a haploid? Thanks in advance for help with interpretation.
-
The main identifier is in the GL/PL field where the likelihood of the called genotype is expressed in -10*log scale. Looking at your FORMAT field the likelihood of that site being called as REF is 1e-405 and ALT is 1e-0.
AD field represents reads found supporting either the reference allele or the alternate allele at each site. Though it is correlateable this tuple is not the sole source of your calls.
The other example your provided would violate the VCF entry since your PL value for ALT call is still the largest one yet the site is called REF, which is unlikely.
I hope this helps.
Regards.
-
Hi SkyWarrior, Thank you for a quick response.
I understand the likelihood portion, but am still curious about which read depth corresponds to each genotype. I made up the second example but am still not understanding the AD field. I assume the ploidy I called the vcf with would impact the PL field and am not confident the organism is a haploid.
Does the second allelic depth reported always correspond to the genotype? For example, would the following be correct:0:5,14 = 14 reads supporting the reference allele
1:5,14 = 14 reads support the alternate allele
I hope this is clear, thank you!
-
Oh I see. AD tuples or let's say arrays are designed to support the alelles in their given order. Allele number 0 is always the reference allele. Anything above index 0 is considered an alternate allele. Therefore AD[0] which in your case is 5 corresponds to the reference allele and AD[1], which is 14 , is for the first alternate allele observed in that locus.
I hope this helps.
-
SkyWarrior Assuming the allelic depth is always reported in the order of reference [0], alternate [1]: I find it odd that regardless of the genotype called, the first number is always lower than the second. Are you sure the order doesn't change depending on the genotype? I plotted the depth of the first allelic depth and second allelic depth:
-
Can you provide us conflicting VCF entries from your results that could have caused this confusion?
-
Hello SkyWarrior, sorry for the delayed response. There is not any conflicting data in my vcf, but rather the absence of entries I expected to see.
I searched my vcf (276 whole-genomes) using the following line, but found no instances that met this criteria. I also visually inspected parts of the vcf.
```grep -E '^(0|1):([1-9][0-9]*),([0-9]+)$' file.vcf
```
I would have expected to see outputs that might look like this "0:25,3" as part of the vcf, especially considering I see the opposite such as "1:3,31". I do not understand why the ratio of reference (AD[0]) to alternate (AD[1]) is either 0 or greater than 0.5 (regardless of genotype). In other words, I do not find any heterozygous sites with the reference genotype called, but plenty of heterozygous calls with the alternate allele genotype.
Perhaps there is someone with a good understanding of the code that can explain the order of allelic depth reported in haploid vcfs. This situation might be odd considering I did not originally expect heterozygous calls, but I am hoping to work with the vcfs I already have.I am not sure if there is a good biological (or otherwise) explanation for this, but if you can think of anything please let me know. Thank you!
-
Have you performed a joint genotyping on your samples? Did you run HaplotypeCaller in -ERC GVCF mode? If the answer to those questions is "no" then it is totally normal that you do not observe any 0:25,3 entries in your VCF files as HaplotypeCaller will not emit HOM_REF sites in its default form. In order to get those HOM_REF sites you need to run with -ERC GVCF or -ERC BP_RESOLUTION along with outputting all sites during GenotypeGVCFs step.
My question prior was that whether you have observed any variants in your VCF file that shows 0:5,14 in a sample.
I hope this helps.
-
Hi Gökalp Çelik
Here are the commands I used to generate the vcf:
```
gatk-4.2.5.0/gatk --java-options "-Xmx100g" HaplotypeCaller -R ref.fa -ERC GVCF -I /Plates23/A_B/${base}.bam -O /${base}_aln_HCgatk.g.vcf.gz --sample-ploidy 1
find /Plates23 -type f -name "*.vcf.gz" > /haploid/Names_Plates123_VCF.list
gatk-4.2.5.0/gatk --java-options "-Xmx50g" CombineGVCFs -R ref.fa --variant /haploid/Names_Plates123_VCF.list -O /haploid/BcinereaP123.g.vcf
gatk-4.2.5.0/gatk --java-options "-Xmx50g" GenotypeGVCFs -R ref.fa -V /haploid/BcinereaP123.g.vcf -O /haploid/BcinereaP123_genotype.vcf
```
Do you notice anything wrong with these lines?
Yes, I have variants that show "0:5,14". These are the variants I used to generate the graphs above. The top graph is ref:alt read depth ratios (5 ref, 14 alt for example), so you can see that there was a lack of het_ref found (which would be observed in the other half of the graph).
Thank you for all the help!
*** Edit to add:
I tried running this briefly on the first chromosome and still don't see het_ref sites:
gatk-4.2.5.0/gatk --java-options "-Xmx50g" GenotypeGVCFs -all-sites TRUE -R ref.fa -V /haploid/BcinereaP123.g.vcf -O /haploid/BcinereaP123_genotype_allsites.vcf -
Hi again.
Have you checked those sites in IGV to confirm that the opposite representation of alleles is the case for those sites?
I can think of possible reasons why this is occurring. Depending on the type of data generated for these calls HaplotypeCaller's behavior might be counterintuitive such as amplicon data. If the ref allele is supported by high quality unbiased bases but alt is supported only by low quality and highly biased bases then HaplotypeCaller will call the site as ref as it's qual will suppress the call for the alt. PL values will actually give clues about that. For all of those sites where you have a REF call with high AD[1] values can you check the status of PL to see if it supports ALT or REF (value of 0 will support which allele is called).
If it is still an issue we may ask you to provide us a snippet data that we can reproduce the calls here in our hands.
I hope this helps.
-
Hi Gökalp Çelik,
Here is a small snippet from igv showing read depth, but I can tell that there are heterozygous reference and heterozygous alternate variants in the bam files. I think this is what you are asking?
My vcf does not contain any sites where there is a REF call with high AD[1] values (or any AD[1] values above 0, for that matter). There are no heterozygous REF calls at all.
For example, here is a line in IGV that looks like it would be called as a heterozygous reference:
However, this is the line in the vcf:
1 1408393 . A . 4412.51 PASS AN=1;BaseQRankSum=1.14;DP=29;FS=0.612;MQ=46.6;MQRankSum=-0.99;QD=23.22;ReadPosRankSum=0.697;SOR=0.775 GT:AD:DP:GQ:PL 0:29:29:99:0Here is another example:
1 1408458 . T . 1004.14 PASS AN=0;BaseQRankSum=0;DP=36;FS=0;MQ=46.11;MQRankSum=0.674;QD=26.15;ReadPosRankSum=0.674;SOR=0.707 GT:AD:DP:GQ:PL .:36:36:0:0
Do you know what would explain this? I hope I am explaining it all well enough. I would be happy to share data if that is helpful. Thank you.
-
There is really no definition of heterozygous REF calls in gatk therefore you won't observe anything like that. However looking at the bam files that you provided either your sequencing library is not from a single colony and you may have a mixture of 2 different strains or there is a n issue about the reference genome of the organism that you are working with such that paralogs of the same gene are getting mapped onto a single locus. In this case HaplotypeCaller will do it's best to generate a consistent haplotype however when there are more than one possible haplotype with equal weights HaplotypeCaller may return a NO_CALL at the locus due to tie, which I can see one in the last example.
Those variant contexts that you provided are from a single sample VCF therefore you do not observe a tuple of AD[0] and AD[1]. On the other hand when these samples are combined and genotyped together you will observe more than one AD[0], AD[1].... depending on the number of alternate alleles present in the loci. In the joint version each AD will correspond to the allele order which AD[0] is always the reference allele. Others will follow depending on the order that they are listed in the variant context. This order is generated by HTSJDK/HTSLIB VariantContextBuilder objects during the output step.
I am still confused about what the real problem here is since it all looks normal from the way AD tags are set.
I hope this helps.
-
Hi again Gökalp Çelik
I am trying to see if this is the result of heterokaryotic behavior (2 genetically distinct nuclei in the same cell) in a haploid organism. It is a long story, but other evidence suggests this is not cross-sample contamination and the nuclei may be very similar overall. I didn't want to call it as a diploid, but maybe that is the way to go if this method will miss some of the heteroallelic sites?
I know this has been a long discussion and I am thankful for all of your responses! -
Absolutely. Heterokaryotic behavior will produce calls that is no different than a normal diploid organism. İn that sense calls made with haploid parameter will most certainly be not compatible with what you expect.
Please sign in to leave a comment.
13 comments