Remove the `*` ALT Allele Reporting From a g.vcf Before (or After) Genotyping
I have a large scale dataset of WXS .g.vcfs , I do not have the original FASTA files. I would like to genotype these .g.vcfs into vcfs. These vcfs will be used in downstream applications. The issue is that the g.vcfs were created with GATK HaplotypeCaller without the "--disable-spanning-event-genotyping" flag set to true. This means that ```*``` SNPs are introduced so that there is a reported "SNP" at a specific site that an upstream INDEL is spanning (INDEL overlapping a SNP) . This is an issue because I am not interested in ```*``` genotypes - my understanding is that this information is already stored in the upstream deletion - and the ```*``` are not standard IUPAC base annotation and thus triggers errors in many downstream applications.
What is the correct way to get rid of the reporting to these ```*``` ALT alleles?
I know this isnt GATK but I tried using ```bcftools merge
`` I noted the `-m **` flag, so I tried this first on the VCF after genotyping:
./bcftools merge **-m none,\*\*** --force-single Genotyped_Sample.vcf.gz -Oz -o merged_Genotyped_Sample.vcf.gz
and then on the original, pre-genotyped g.vcf:
./bcftools merge **-m none,\*\*** --force-single Sample.g.vcf.gz -Oz -o merged_Sample.g.vcf.gz
Neither successfully remove the ```*``` ALT alleles.....
ALSO! Is this reporting ( ```*```) compatible with trio sequencing that outputs phased genotypes? I am thinking it does not because I am getting some ```*``` SNP alleles being phased to an allele without an upstream deletion.
Here is this example; In the trio VCF - so 2 parents and the patient - I find:
#CHROM POS REF ALT
chr1 154590147 CCG C
chr1 154590148 CG C
chr1 154590149 G *
chr1 154590149 G C
and then, when I just extract the patient genotypes using bcftools query
:
#CHROM POS REF ALT GT
chr1 154590148 CG C 0|1
chr1 154590149 G * 1|0
chr1 154590149 G C 0|1
-
We do not recommend modifying GVCF directly as it may cause even further issues. The best method would be to remove them after genotyping them by using some bash awk grep combination or you can use JEXL with our SelectVariants tool to remove those symbolics. Here is an example JEXL that you can use (After splitting multiallelics to biallelics of course).
-select '!vc.getAlternateAllele(0).toString().equals("*")'
I hope this helps.
-
Gökalp Çelik Ok, I just want to confirm that it is ok that, once youve split multiallelics to biallelics, it is ok to just `grep` or `SelectVariants` out any variants that have `*` as the alt allele?
I just want to confirm, is it just the case that this annotation system is not compatible with trio-based phasing/phasing? I was initially thinking this was the case until I saw instances of the `*` allele being reported on the opposite allele of the upstream INDEL:
#CHROM POS REF ALT GT chr1 154590148 CG C 0|1 chr1 154590149 G * 1|0 chr1 154590149 G C 0|1
Thanks so much!
-
Hi again.
Phase-by-transmission may require those symbolics to be present as place holder for spanning deletions or those sites with spanning deletions should be present as COMPLEX Multiallelics such as
chr1 154590148 CG CC,C 1|2
This way you may be able to produce a proper phased INDEL and SNP in a spanning region. Otherwise information may get lost if they are represented purely as split biallelics.
-
Ok, so you are suggesting that these symbolics might be necessary for phasing as they assist in the proper phasing of INDELs and SNPs that overlap, but after phasing they can be simply removed because they are no longer useful or informative?
-
Looking at the above sample you provided that INDEL seems to be phased incorrectly if that symbolic is not present. You will be observing both INDEL and SNP on the same allele. So removing the symbolic may not be the best option you have here.
-
I thought I wrote out an answer and submitted but I dont see it :(((
I was clarifying that for the example shown, I did not remove the `*` alleles from the g.vcf before the genotyping + phasing steps. The result shown was taking the final genotyped+phased VCF and pulling the genotypes for the child in the trio (using bcftools query) in that particular regions.Here is an example with more information straight from the genotyped + phased trio VCF, where at no step did I remove the `*` alleles. In the first 2 sample columns is the mother and father, and the last column is the child:
chr2 213147679 . CAA C PASS 0|0:89,7:96:52:21:21:0|1:213147679_CAA_C:0,21,2979:0,52,3045:213147679 0|1:73,9:82:62:21:21:0|1:213147679_CAA_C:93,0,2513:62,0,2569:213147679 0|1:76,13:90:99:21:21:0|1:213147679_CAA_C:229,0,2565:198,0,2621:213147679 chr2 213147681 rs6738070 A C PASS 1|1:0,89:96:52:1|1:213147679_CAA_C:3857,364,0:3881,373,0:213147679 0|1:1,72:82:91:1|0:213147679_CAA_C:3874,422,122:3867,400,91:213147679 0|1:0,76:90:99:1|0:213147679_CAA_C:3746,559,229:3739,537,198:213147679 chr2 213147681 rs6738070 A * PASS 0|0:0,7:96:52:1|1:213147679_CAA_C:3857,2862,2979:3881,2905,3045:213147679 1|0:1,9:82:91:1|0:213147679_CAA_C:3874,2444,2486:3867,2456,2521:213147679 1|0:0,13:90:99:1|0:213147679_CAA_C:3746,2516,2565:3739,2528,2600:213147679
As one can see, the SNP with the `*` allele has been phased to the other allele. How can this be? Is this an error with phasing? Given your information (thank you), it seems realistic that the `*` alleles assists with phasing but this final output seems weird.
My original g.vcfs are formatted as multiallelic with with both *, <nonref> , example of one line here:
chr1 35325351 . C *,T,CTTT,CTTTTTTTTT,<NON_REF>
-
Hi again.
Let me clarify one additional point here. If you are solely relying on HaplotypeCaller's phasing it is not a primary goal of this tool but it is implemented in a way that can complement the tools actual function. The actual phased genotype is kept in FORMAT/PGT tag and GT tag may lose its phasing information or may show wrong phasing as a result of the joint genotyping step.
Here is an example from your samples
GT:AD:DP:GQ:PGT:PID 0|1:0,76:90:99:1|0:213147679_CAA_C
This is a known behavior and in fact it will probably stay that way since changing the behavior requires extensive calculation of all phased genotypes and alleles present in the region. When there is a multiallelic region with more than 3 4 alleles present number of calculations required to keep the order of alleles get significantly higher therefore instead PGT tag is left to do its job. Below was an experimental PR generated by another user for solving this issue however it gets complicated as the number of alleles increase within the locus and requires extensive book keeping for each sample and allele and ploidy level therefore it gets almost impossible to track edge cases that may result in errors.
https://github.com/broadinstitute/gatk/pull/8570
Long story short you can remove those locus with * as ALT and rely on PGT tag to see how the final phasing occur. For a more proper phasing by transmission we used to have a tool in GATK3 named PhaseByTransmission. It is unsupported currently but you may give it a spin to see if it works for you. Otherwise there are other tools such as beagle which can perform phase by transmission
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3217888/
I hope this helps.
-
Yes, that is extremely useful and insightful. I guess my only followup question is: why does the GT tag even report phased genotypes if the "true" phasing information can be lost/jumbled, with the more correct phasing being present in the PGT tag?
-
It is a part of the VCF spec that if a variant is tagged as phased, GT tag must have "|" . If not variant cannot be selected as phased by HTSLIB/HTSJDK by using isPhased() method. Due to that if a variant is phased VariantContextBuilder/GenotypeBuilder object automatically places "|" instead of "/" regardless of the alleles phasing. So this is still the intended behavior.
Please sign in to leave a comment.
9 comments