Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Remove the `*` ALT Allele Reporting From a g.vcf Before (or After) Genotyping

1

9 comments

  • Avatar
    Gökalp Çelik

    Hi Jonathan Klonowski

    We do not recommend modifying GVCF directly as it may cause even further issues. The best method would be to remove them after genotyping them by using some bash awk grep combination or you can use JEXL with our SelectVariants tool to remove those symbolics. Here is an example JEXL that you can use (After splitting multiallelics to biallelics of course).

    -select '!vc.getAlternateAllele(0).toString().equals("*")'

    I hope this helps.

    0
    Comment actions Permalink
  • Avatar
    Jonathan Klonowski

    Gökalp Çelik Ok, I just want to confirm that it is ok that, once youve split multiallelics to biallelics, it is ok to just `grep` or `SelectVariants` out any variants that have `*` as the alt allele?

    I just want to confirm, is it just the case that this annotation system is not compatible with trio-based phasing/phasing? I was initially thinking this was the case until I saw instances of the `*` allele being reported on the opposite allele of the upstream INDEL:

    #CHROM  POS         REF   ALT     GT
    chr1     154590148   CG  C      0|1
    chr1     154590149   G   *      1|0
    chr1     154590149   G   C      0|1

    Thanks so much!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    Phase-by-transmission may require those symbolics to be present as place holder for spanning deletions or those sites with spanning deletions should be present as COMPLEX Multiallelics such as 

    chr1     154590148 CG CC,C 1|2

    This way you may be able to produce a proper phased INDEL and SNP in a spanning region. Otherwise information may get lost if they are represented purely as split biallelics. 

    0
    Comment actions Permalink
  • Avatar
    Jonathan Klonowski

    Ok, so you are suggesting that these symbolics might be necessary for phasing as they assist in the proper phasing of INDELs and SNPs that overlap, but after phasing they can be simply removed because they are no longer useful or informative?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Looking at the above sample you provided that INDEL seems to be phased incorrectly if that symbolic is not present. You will be observing both INDEL and SNP on the same allele. So removing the symbolic may not be the best option you have here. 

    0
    Comment actions Permalink
  • Avatar
    Jonathan Klonowski

    I thought I wrote out an answer and submitted but I dont see it :(((

    I was clarifying that for the example shown, I did not remove the `*` alleles from the g.vcf before the genotyping + phasing steps. The result shown was taking the final genotyped+phased VCF and pulling the genotypes for the child in the trio (using bcftools query) in that particular regions.

    Here is an example with more information straight from the genotyped + phased trio VCF, where at no step did I remove the `*` alleles. In the first 2 sample columns is the mother and father, and the last column is the child:

    chr2    213147679   .   CAA C   PASS    0|0:89,7:96:52:21:21:0|1:213147679_CAA_C:0,21,2979:0,52,3045:213147679  0|1:73,9:82:62:21:21:0|1:213147679_CAA_C:93,0,2513:62,0,2569:213147679  0|1:76,13:90:99:21:21:0|1:213147679_CAA_C:229,0,2565:198,0,2621:213147679
    
    chr2    213147681   rs6738070   A   C   PASS    1|1:0,89:96:52:1|1:213147679_CAA_C:3857,364,0:3881,373,0:213147679  0|1:1,72:82:91:1|0:213147679_CAA_C:3874,422,122:3867,400,91:213147679   0|1:0,76:90:99:1|0:213147679_CAA_C:3746,559,229:3739,537,198:213147679
    
    chr2    213147681   rs6738070   A   *   PASS    0|0:0,7:96:52:1|1:213147679_CAA_C:3857,2862,2979:3881,2905,3045:213147679   1|0:1,9:82:91:1|0:213147679_CAA_C:3874,2444,2486:3867,2456,2521:213147679   1|0:0,13:90:99:1|0:213147679_CAA_C:3746,2516,2565:3739,2528,2600:213147679
    

    As one can see, the SNP with the `*` allele has been phased to the other allele. How can this be? Is this an error with phasing? Given your information (thank you), it seems realistic that the `*` alleles assists with phasing but this final output seems weird.

    My original g.vcfs are formatted as multiallelic with with both *, <nonref> , example of one line here:

    chr1    35325351        .       C       *,T,CTTT,CTTTTTTTTT,<NON_REF>
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again. 

    Let me clarify one additional point here. If you are solely relying on HaplotypeCaller's phasing it is not a primary goal of this tool but it is implemented in a way that can complement the tools actual function. The actual phased genotype is kept in FORMAT/PGT tag and GT tag may lose its phasing information or may show wrong phasing as a result of the joint genotyping step.

    Here is an example from your samples

    GT:AD:DP:GQ:PGT:PID     0|1:0,76:90:99:1|0:213147679_CAA_C

    This is a known behavior and in fact it will probably stay that way since changing the behavior requires extensive calculation of all phased genotypes and alleles present in the region. When there is a multiallelic region with more than 3 4 alleles present number of calculations required to keep the order of alleles get significantly higher therefore instead PGT tag is left to do its job. Below was an experimental PR generated by another user for solving this issue however it gets complicated as the number of alleles increase within the locus and requires extensive book keeping for each sample and allele and ploidy level therefore it gets almost impossible to track edge cases that may result in errors. 

    https://github.com/broadinstitute/gatk/pull/8570 

    Long story short you can remove those locus with * as ALT and rely on PGT tag to see how the final phasing occur. For a more proper phasing by transmission we used to have a tool in GATK3 named PhaseByTransmission. It is unsupported currently but you may give it a spin to see if it works for you. Otherwise there are other tools such as beagle which can perform phase by transmission 

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3217888/ 

    I hope this helps. 

    1
    Comment actions Permalink
  • Avatar
    Jonathan Klonowski

    Yes, that is extremely useful and insightful. I guess my only followup question is: why does the GT tag even report phased genotypes if the "true" phasing information can be lost/jumbled, with the more correct phasing being present in the PGT tag?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    It is a part of the VCF spec that if a variant is tagged as phased, GT tag must have "|" . If not variant cannot be selected as phased by HTSLIB/HTSJDK by using isPhased() method. Due to that if a variant is phased VariantContextBuilder/GenotypeBuilder object automatically places "|" instead of "/" regardless of the alleles phasing. So this is still the intended behavior. 

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk