SelectVariants by genotype
I am trying to separate samples to individual VCF files from a multisample VCF file.
What is the correct way to remove rows that are not present in the sample in the resulting VCF with SelectVariants?
I have found out that by using "--sample-name" I can separate the columns (individual samples), which is fine, but the output still contains all rows present in the original multisample file. With "--remove-unused-alternates" I can set the ALT col values to "." and then do my own grep filtering. Surely there is a better way?
I suspect "--drop-genotype-annotation" might work, but I have not been able to figure out what the argument string should look like.
-
Looks like the "--remove-unused-alternates" method does not work. I am still getting rows like this in the split sample files:
chr2 70143941 . G T . PASS AC=1;AF=0.500;AN=2;AS_FilterStatus=SITE;AS_SB_TABLE=555,611|4,6;DP=114;ECNT=1;GERMQ=93;MBQ=32,31;MFRL=309,363;MMQ=60,60;MPOS=31;NALOD=1.70;NLOD=14.70;POPAF=6.00;ROQ=33;TLOD=8.19 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:114,0:9.123e-03:114:62,0:51,0:55,59,0,0
I'm not sure why GT is "0/1" and AF is non-zero when AD is "114,0"? This location in this sample clearly does not have the mutation...!
I need to split the calls in the multisample VCF file into calls that are present only in a certain sample or a subset of samples. How should I do this?
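One possible workaround is to post-filter the per-sample VCF on the sample's own AD field rather than on GT. The sketch below is not a SelectVariants feature. It assumes a tab-separated VCF, a single-sample file by default, and that "at least one ALT-supporting read" is an acceptable definition of "present". The function names and the second record are made up for illustration.

```python
def alt_read_counts(vcf_line, sample_index=0):
    """Return the ALT read counts from one sample's AD field (REF count dropped)."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")                    # FORMAT column, e.g. GT:AD:AF:DP
    values = fields[9 + sample_index].split(":")   # that sample's values
    ad = dict(zip(keys, values)).get("AD", "")
    # AD is "ref,alt1,alt2,..."; a missing value (".") is treated as zero reads.
    return [int(x) if x.isdigit() else 0 for x in ad.split(",")[1:]]


def keep_supported(lines, sample_index=0, min_alt_reads=1):
    """Yield header lines plus records where the sample has ALT-supporting reads."""
    for line in lines:
        if line.startswith("#") or any(
            n >= min_alt_reads for n in alt_read_counts(line, sample_index)
        ):
            yield line


header = "##fileformat=VCFv4.2\n"
# Shortened version of the chr2:70143941 row quoted above (AD = 114,0).
record_absent = "\t".join(
    ["chr2", "70143941", ".", "G", "T", ".", "PASS", "DP=114",
     "GT:AD:AF:DP", "0/1:114,0:9.123e-03:114"]) + "\n"
# Hypothetical record with 12 ALT-supporting reads, for contrast.
record_present = "\t".join(
    ["chr2", "70150000", ".", "A", "C", ".", "PASS", "DP=62",
     "GT:AD:AF:DP", "0/1:50,12:0.200:62"]) + "\n"

kept = list(keep_supported([header, record_absent, record_present]))
```

With these inputs, kept holds the header and the hypothetical record only. Records without an AD field are dropped, and a higher min_alt_reads can be passed if a single read is too weak as evidence.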
-
My next idea was to do further filtering based on the AF in the sample data which is "9.123e-03" in the above example row (chr2 70143941)...
Looks like SelectVariants adds "AC=1;AF=0.500;AN=2" to the INFO field of every single row, and because of this my filter --selectExpressions "AF > 0.05" matches every row...? This "AF=0.500" is not present in the INFO field of the original multisample file, and I cannot imagine why "AF=0.500" is added to every row when selecting the variants. How do I apply a filter to the actual AF value in the sample data?
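A sketch of what seems to be going on, assuming the INFO-level AC/AN/AF are recalculated from the genotypes of the samples that remain after selection: with a single sample whose GT is 0/1, that gives AC=1, AN=2, AF=0.5 on every row. The allele fraction estimated from the reads lives in the sample column under the FORMAT key AF, which is a different value. Both function names are made up for this example, and the site-level helper only handles the biallelic case.

```python
def site_stats_from_genotypes(genotypes):
    """Compute (AC, AN, AF) from GT strings; all non-reference alleles are pooled."""
    alleles = []
    for gt in genotypes:
        # Accept phased ("0|1") and unphased ("0/1") genotypes; skip no-calls (".").
        alleles += [a for a in gt.replace("|", "/").split("/") if a != "."]
    an = len(alleles)
    ac = sum(1 for a in alleles if a != "0")
    return ac, an, (ac / an if an else 0.0)


def sample_af(vcf_line, sample_index=0):
    """Return the per-sample allele fraction(s) stored under FORMAT key AF."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    return [float(x) for x in dict(zip(keys, values))["AF"].split(",")]


# Shortened version of the chr2:70143941 row quoted earlier in the thread.
row = "\t".join(
    ["chr2", "70143941", ".", "G", "T", ".", "PASS", "AC=1;AF=0.500;AN=2",
     "GT:AD:AF:DP", "0/1:114,0:9.123e-03:114"])

info_stats = site_stats_from_genotypes(["0/1"])      # the INFO-style numbers
passes = any(af > 0.05 for af in sample_af(row))     # filter on the sample's AF
```

Here info_stats reproduces the AC=1, AN=2, AF=0.5 pattern, while passes is False because the sample-level AF is about 0.009.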
-
You could try:
gatk SelectVariants -V input.vcf -R reference.fasta -sn Sample_01 -O sample.vcf
You may use the -sn flag several times to select several samples, or point it at a file containing one sample name per line (in GATK4 the file name needs the .args extension).
-
The "-sn" flag is just the short version of "--sample-name".
-
Use --exclude-non-variants and --remove-unused-alternates to drop rows for variants that are not present in the selected sample from the VCF that SelectVariants produces.
-
Nope... this still leaves in variants like:
GT:AD:AF:DP:F1R2:F2R1:SB 0/1:52,0:0.019:52:26,0:26,0:35,17,0,0
I think there might be a bug somewhere in Mutect that sets the GT field erroneously to "0/1".
-
registered_user Mutect2's multi-sample mode does not attempt to separately genotype each tumor sample since, by assumption, they are from the same tumor. Even for apparently obvious cases such as yours, once we say that a variant does in fact exist it becomes very hard to say that it is completely absent in a sample, since it might simply exist at a low allele fraction.
Consider, for example, using Mutect2 in multisample mode to see if a certain mutation has been suppressed by chemotherapy. If no reads out of 100 in the post-treatment sample show the mutation, it is very possible that the mutation still exists at an allele fraction near 1%. We do not wish to render judgment on this without being much more confident.
I believe that we can do better and that the optimal approach is to cluster not just allele fractions within individual samples but trajectories of allele fractions from sample to sample. If you have thousands of mutations with no supporting reads in the post-treatment sample you can conclude that the allele fraction of the subclone containing them has dropped to far below 1%. Until then, however, multisample mode is far too primitive to call samples individually.
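To put rough numbers on the argument above, here is a small sketch. It assumes each read independently carries the variant with probability equal to the allele fraction and ignores sequencing error. The pooled case additionally assumes 1000 mutations that share one subclone and one allele fraction, each covered by 100 reads. The function name is made up for the example.

```python
def prob_no_alt_reads(allele_fraction, depth):
    """Probability that none of `depth` independent reads carries the variant."""
    return (1.0 - allele_fraction) ** depth


# One site, 100 reads, true allele fraction 1%: roughly 0.37.
single_site = prob_no_alt_reads(0.01, 100)

# 1000 such sites pooled (100,000 reads): effectively zero.
pooled = prob_no_alt_reads(0.01, 100 * 1000)
```

So a single site with 0 of 100 supporting reads is weak evidence of absence, whereas the same pattern across a thousand sites is not plausibly explained by a 1% allele fraction.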
-
Thanks for the explanation. For cancer heterogeneity analysis, this is the exact opposite of the functionality that is required. I think SelectVariants should have an option to remove undetected variants (based on AD) when selecting samples; --selectExpressions seems ineffective for this purpose. Also, it would be nice to be able to fix the order of the sample columns in the output VCF, for example to the order in which the samples are given with -sn on the command line.
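Until something like that exists, the column order can be fixed after the fact. A minimal sketch, assuming a tab-separated VCF whose #CHROM line precedes all records; reorder_samples and the name Sample_02 are made up for the example, and this is a post-processing step outside GATK rather than anything SelectVariants does.

```python
def reorder_samples(lines, sample_order):
    """Rewrite VCF lines so the sample columns follow `sample_order`."""
    picks = None
    for line in lines:
        if line.startswith("##"):          # meta-information lines pass through
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            names = fields[9:]             # sample names start at column 10
            picks = [9 + names.index(s) for s in sample_order]
        if picks is None:
            raise ValueError("record seen before the #CHROM header line")
        yield "\t".join(fields[:9] + [fields[i] for i in picks]) + "\n"


vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample_01\tSample_02\n",
    "chr2\t70143941\t.\tG\tT\t.\tPASS\t.\tGT:AD\t0/1:114,0\t0/1:50,12\n",
]
reordered = list(reorder_samples(vcf_lines, ["Sample_02", "Sample_01"]))
```

Samples left out of sample_order are dropped from the output, so the same helper can also select a subset in a chosen order.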