Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GATK not replacing variants in reference sequence

0

9 comments

  • Avatar
    Gökalp Çelik

    Hi Ryan Scalsky

    Can you post the variants you are interested in to be included in the reference as they appear in the VCF file?

     

    0
    Comment actions Permalink
  • Avatar
    Ryan Scalsky

    This run was for sample: SPT43376

    This is the respective section of the VCF after filtering to only contain heterozygous variants. 

    ##bcftools_viewCommand=view -i 'GT="0/1" || GT="1/0"' -Oz -o temp_SPT.vcf.gz SampleFiltered_Pf3D7_07_v3_PF3D7_0703900_SPT43376.recode_pass_only.recode.vcf; Date=Mon Mar 25 15:00:17 2024
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SPT43376
    Pf3D7_07_v3     163428  .       TTATTTA TTATTTATATTTA,TTATTTATATTTATATTTA,TTATTTATATTTATATTTATATTTA,T,TTATTTATATTTATATTTATATTTATATTTA,TTATTTATTTTTATATTTA       2.42157e+07     PASS
        .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:12,198,0,0,0,0,0:210:99:.:.:5989,0,187,6029,791,6819,6029,791,6819,6819,6029,791,6819,6819,6819,6029,791,6819,6819,6819,6819,6029,791,6819,6819,6819,6819,6819:.
    Pf3D7_07_v3     164456  .       ATTATTATTATTATTATTATTATT        *,TTTATTATTATTATTATTATTATT,GTTATTATTATTATTATTATTATT     1.76884e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS
           0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
    Pf3D7_07_v3     164457  .       T       *,C,TTAC        1.88546e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
    Pf3D7_07_v3     164458  .       TATTATTATTATTA  *,TATCATTATTATTATTA,CATTATTATTATTA,TATTATTACTATCATTATTATTATTA,TACTATCATTATTATTATTA      1.60591e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314,1299,1320,2314,2314,2314,1299,1320,2314,2314,2314,2314:.
    Pf3D7_07_v3     164459  .       ATTATT  *,GTTATT        1.33498e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
    Pf3D7_07_v3     164460  .       TTATTA  *,CTATTA,TTACTATTA      1.49859e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314:.
    Pf3D7_07_v3     164461  .       TATTATTA        *,CATTATTA,TACTATCATTATTA,TATTACTATCATTATTA,TATTATTACTATCATTATTA,TATTATTATTACTATCATTATTA        1.52404e+06     PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0,0,0,0,0:69:99:.:.:2215,0,2140,1299,1320,2314,1299,1320,2314,2314,1299,1320,2314,2314,2314,1299,1320,2314,2314,2314,2314,1299,1320,2314,2314,2314,2314,2314:.
    Pf3D7_07_v3     164462  .       ATTAT   *,GTTAT 770480  PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
    Pf3D7_07_v3     164463  .       T       *,C     825886  PASS    .       GT:AD:DP:GQ:PGT:PID:PL:PS       0/1:4,0,0:69:99:.:.:2215,0,2140,1299,1320,2314:.
    (END)

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Ryan Scalsky

    Your input VCF seems to contain multiple alleles and especially complex overlapping indels and symbolics ("*") which altogether will suppress any nucleotide substitution. 

    Here is the document for the tool

    https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker

    Here are points to consider when using this tool.

    • If there are multiple variants that start at a site, it chooses one of them randomly.
    • When there are overlapping indels (but with different start positions) only the first will be chosen.
    • This tool works only for SNPs and for simple indels (but not for things like complex substitutions).

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Ryan Scalsky

    This is helpful but it doesn't explain why the SNPs, for example the last variant located at 164463  , are not being substituted in. I've ready the points you've listed in the documents for the tool but it isn't clear what defines a "complex substitution". Would we expect the tool to abort all function if any complex substitution is present or simply to ignore that variant?

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    The symbolics are probably reason why there is no substitution. Can you remove all those symbolics and just keep biallelic SNP and INDELS to see which ones get substituted when you run the tool with those? 

    0
    Comment actions Permalink
  • Avatar
    Ryan Scalsky

    Gökalp, 

    This makes sense. I imagine the tool cannot contribute missing or unknow variants, I see in this example all variants contain a symbolic. I will remove these then try again. Thank you!

    Best, 

    Ryan 

    0
    Comment actions Permalink
  • Avatar
    Ryan Scalsky

    I reran the command with symbolics removed in one VCF and in the other I reran it with symbolics removed and only heterozygous sites included in the VCF. Both appear to have worked in that the sequences are not the same as the reference but they should be identical. For some reason they're not. 

    It appears that the symbolics were causing the script to abort. In reviewing the VCF with symbolics removed but not restricted to only variants that deviate from reference there appear to be insertions and deletions added to the sequence that aren't even in the VCF. 

    0
    Comment actions Permalink
  • Avatar
    Ryan Scalsky

    Adding additional information that may be relevant:

    1. I reran this with a different sample so there was something to compare. When running GATK FastaAlternateReferenceMaker on the single sample vcf's containing only GT = 0/1 or 1/0 and no symbolics the output fasta sequence is the exact same, confirmed by multi alignment and by examining the .dict files but the M5 is different between both. 

    2. When comparing the output from the VCF with GT = 0/1 or 1/0 and no symoblics to the reference the variants that have been modified do not reflect the contents of the input VCF. 

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    I think the contents are still reflecting what is inside the VCF however 2 rules are in effect. If there are multiple variants at a site a random one will be picked. Secondly INDELs are simplified by leftaligning and trimming before being applied to the reference fasta. 

    I am not be best to decide what would be the most optimal solution however I would be extra careful to integrate INDELs into a reference genome since all coordinates shift and annotations become incompatible however I would definitely only keep biallelic SNPs. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk