Split and trim multiallelic sites, whitout missing genotype info
I have a cohort of 200 whole-exomes for which I've created a table of variants with their frequencies.
I want to analyse new exomes and filter their frequent variants based on my cohort. But the multiallelic sites is a hurdle ahead. In a new sample, I have this variant:
CHROM POS REF ALT QUAL INFO FORMAT NEWSAMPLE
chr3 3145502 CAAA C,CA 737.02 AC=1,1;AF=0.500,0.500;AN=2;DP=35;ExcessHet=3.0103;FS=0.000;MLEAC=1,1;MLEAF=0.500,0.500;MQ=60.00;QD=29.06;SOR=1.528 GT:AD:DP:GQ:PL 1/2:0,10,9:27:99:754,188,144,288,0,214
And in my cohort, at this chromosomal location, I have:
CHROM POS REF ALT HET HOM-VAR NCALLED NSAMPLES AF
chr3 3145502 CAAAAA C 96 1 102 227 0.005
chr3 3145502 CAAAAA CA 96 1 102 227 0.054
chr3 3145502 CAAAAA CAA 96 1 102 227 0.328
chr3 3145502 CAAAAA CAAA 96 1 102 227 0.417
chr3 3145502 CAAAAA CAAAA 96 1 102 227 0.142
So, the frequencies of the two variants called in my new sample are 0.3 and 0.4. It is obvious that these deletions are repetitive and could not be related to a rare disorder. But, my pipeline cannot match the called variants with their counterparts in the cohort, because of two reasons:
- the ALT column in the sample's VCF has comma separated values,
- such deletions are not in their minimal representation in my cohort.
I used command below to split multiallelic sites to biallelics, and trim them:
gatk-4.1.7.0/gatk LeftAlignAndTrimVariants -V newsample.vcf.gz -R GRCh38.fasta --split-multi-allelics -O newsample_split.vcf.gz
After splitting, the two called variants are converted to:
CHROM POS REF ALT QUAL INFO FORMAT NEWSAMPLE
chr3 3145502 CAAA C 737.02 . GT:AD:DP ./.:0,10:27
chr3 3145502 CAA C 737.02 . GT:AD:DP ./.:0,9:27
Indeed, it has split the two variants, but the problem is I have missed variants information and even the sample's genotype!
then I used BCFtools:
bcftools norm -m - -f GRCh38.fasta -o newsample_split2.vcf.gz -O z newsample.vcf.gz
which resulted in:
CHROM POS REF ALT QUAL INFO FORMAT NEWSAMPLE
chr3 3145502 CAAA C 737.02 AC=1;AF=0.5;AN=2;DP=35;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.06;SOR=1.528 GT:AD:DP:GQ:PL 1/0:0,10:27:99:754,188,144
chr3 3145502 CAA C 737.02 AC=1;AF=0.5;AN=2;DP=35;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;QD=29.06;SOR=1.528 GT:AD:DP:GQ:PL 0/1:0,9:27:99:754,288,214
Seems that BCFtools splits comma separated values, and prints variants INFO as it was in the primary VCF. But still the genotype info is not correct, and when I split my cohort VCF to recreate the variants table, I get wrong HET/HOM-VAR numbers.
How should I deal with this challenge? You may have other solutions to filter out such frequent INDEL variants. Also, I wondering if there is something else I am not aware of that in this manner.
Thanks in advance.
-
Hi ,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.
For context, check out our support policy.
Please sign in to leave a comment.
1 comment