Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

SelectVariants by genotype

8 comments

  • registered_user

    Looks like the "--remove-unused-alternates" option does not work. I am still getting rows like this in the split sample files:

    chr2 70143941 . G T . PASS AC=1;AF=0.500;AN=2;AS_FilterStatus=SITE;AS_SB_TABLE=555,611|4,6;DP=114;ECNT=1;GERMQ=93;MBQ=32,31;MFRL=309,363;MMQ=60,60;MPOS=31;NALOD=1.70;NLOD=14.70;POPAF=6.00;ROQ=33;TLOD=8.19 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:114,0:9.123e-03:114:62,0:51,0:55,59,0,0

    I'm not sure why GT is "0/1" and AF is non-zero when AD is "114,0"? This location in this sample clearly does not have the mutation...!

    I need to be able to separate the calls in the multisample VCF file into calls that are only present in a certain sample or a subset of samples. How should I do this?

  • registered_user

    My next idea was to do further filtering based on the AF in the sample data which is "9.123e-03" in the above example row (chr2 70143941)...

    Looks like SelectVariants adds

    "AC=1;AF=0.500;AN=2" 

    to the INFO field for every single row, and because of this my filter

    --selectExpressions "AF > 0.05"

    matches every row...? This "AF=0.500" is not present in the original multisample file INFO field and I cannot imagine why "AF=0.500" is added to every row when selecting the variants. How do I apply a filter to the actual AF value in the sample data?

  • Bhanu Gandham

    You could try:

    gatk SelectVariants -V input.vcf -R reference.fasta -sn Sample_01 -O sample.vcf

    You may use the -sn flag several times to select several samples, or point it to a file containing one sample name per line (in GATK4 the file name must end in .args).

  • registered_user

    The "-sn" flag is just the short version of "--sample-name".

  • Bhanu Gandham

    Hi registered_user,

    Use --exclude-non-variants and --remove-unused-alternates with SelectVariants to drop rows for variants that are not present in the selected sample from the resulting VCF.

  • registered_user

    Nope... this still leaves in variants like:

    GT:AD:AF:DP:F1R2:F2R1:SB 0/1:52,0:0.019:52:26,0:26,0:35,17,0,0

    I think there might be a bug somewhere in Mutect that sets the GT field erroneously to "0/1".

  • David Benjamin

    registered_user Mutect2's multi-sample mode does not attempt to separately genotype each tumor sample since, by assumption, they are from the same tumor.  Even for apparently obvious cases such as yours, once we say that a variant does in fact exist it becomes very hard to say that it is completely absent in a sample, since it might simply exist at a low allele fraction.

    Consider, for example, using Mutect2 in multisample mode to see if a certain mutation has been suppressed by chemotherapy.  If no reads out of 100 in the post-treatment sample show the mutation, it is very possible that the mutation still exists at an allele fraction near 1%.  We do not wish to render judgment on this without being much more confident.

    I believe that we can do better and that the optimal approach is to cluster not just allele fractions within individual samples but trajectories of allele fractions from sample to sample.  If you have thousands of mutations with no supporting reads in the post-treatment sample you can conclude that the allele fraction of the subclone containing them has dropped to far below 1%.  Until then, however, multisample mode is far too primitive to call samples individually.

  • registered_user

    Thanks for the explanation. For cancer heterogeneity analysis, this is the exact opposite of the functionality that is required. I think SelectVariants should have some option to remove undetected variants (based on AD) when selecting samples. The --selectExpressions argument seems ineffective for this purpose. Also, it would be nice to be able to set a predetermined order for the VCF sample columns selected with -sn, such as the order in which the samples are specified on the command line.
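
The thread asks for a filter on the per-sample AD and AF values rather than on the INFO field. One way to do that outside GATK is a small post-processing script. The following is only a minimal Python sketch, not a GATK feature: the function names (`parse_sample_fields`, `alt_supported`, `prob_zero_alt_reads`) and the thresholds are hypothetical, it assumes a single-sample record whose FORMAT column contains AD and AF with numeric values, and it uses a shortened copy of the chr2 record quoted above. The last helper reproduces the arithmetic behind the "no reads out of 100" remark, under the simplifying assumption that reads are independent draws.

```python
from typing import Dict


def parse_sample_fields(vcf_line: str, sample_index: int = 0) -> Dict[str, str]:
    """Map FORMAT keys to values for one sample column of a VCF record."""
    line = vcf_line.rstrip("\n")
    # Real VCF records are tab separated; the record pasted in the thread
    # uses spaces, so fall back to whitespace splitting for that case.
    columns = line.split("\t") if "\t" in line else line.split()
    format_keys = columns[8].split(":")
    sample_values = columns[9 + sample_index].split(":")
    return dict(zip(format_keys, sample_values))


def alt_supported(fields: Dict[str, str],
                  min_alt_reads: int = 1,
                  min_af: float = 0.0) -> bool:
    """Return True if the sample has enough ALT reads and a high enough AF.

    Assumes AD is 'ref,alt[,alt...]' and AF is one value per ALT allele.
    The default thresholds are illustrative, not recommendations.
    """
    depths = [int(x) for x in fields["AD"].split(",")]
    alt_reads = sum(depths[1:])
    allele_fractions = [float(x) for x in fields["AF"].split(",")]
    return alt_reads >= min_alt_reads and max(allele_fractions) >= min_af


def prob_zero_alt_reads(allele_fraction: float, depth: int) -> float:
    """Chance of seeing no ALT reads at this depth if reads are independent."""
    return (1.0 - allele_fraction) ** depth


# Shortened version of the chr2 record from the thread (INFO trimmed to DP).
EXAMPLE = "chr2 70143941 . G T . PASS DP=114 GT:AD:AF:DP 0/1:114,0:9.123e-03:114"

if __name__ == "__main__":
    sample = parse_sample_fields(EXAMPLE)
    print(sample["GT"], sample["AD"], alt_supported(sample))
    print(round(prob_zero_alt_reads(0.01, 100), 3))
```

With the defaults, the chr2 record is reported as unsupported because its ALT read count is 0, even though GT is 0/1. The second value printed is about 0.366: a mutation at a 1% allele fraction would show no supporting reads in roughly a third of samples sequenced to a depth of 100, which is why Mutect2 declines to call it absent.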

