GenotypeGvcfs has formatting issues in both v4.1.6.0 as v4.1.7.0
Hi,
I'm using v4.1.6.0 of GenotypeGvcfs to make a vcf, out of whole genome data from 19 samples (following your recommendations). When I run ValidateVariants to check the output of GenotypeGvcfs I get a error message, which states that one or more of the ALT allele are actually not in the samples provided. A previous user already found a similar error in ValidateVariants (https://gatk.broadinstitute.org/hc/en-us/community/posts/360061452132-GATK4-RNAseq-short-variant-discovery-SNPs-Indels-), but then for Haplotypecaller, and you have opened a bugreport to add a feature to ValidateVariants: https://github.com/broadinstitute/gatk/issues/6553
However, it would be nice if you could actually investigate the formatting error. Unfortunately my formatting error isn't the same as reported in the other post. I have 105 error in which the 1st alternative allele is a spanning deletion and the 2nd (and 3rd) is either an indel or snp. It's true that the 2nd and 3rd allele is actually not found in my samples. I even have 7 occurances in which the 1st allele (spanning deletion) has allele frequency 1.00.
my code is the following for GenotypeGVCFs:
java -Xms32G -Xmx32G -jar ${gatk4} GenotypeGVCFs -R ${ref} -V ${pipeline}/${name}_v4.1.6.0.g.vcf.gz -O ${vcf}/${name}_v4.1.6.0.vcf.gz -L ${pipeline}/${name}_intervals.list 2> ${log}/${name}_v4.1.6.0_genotype.log
for ValidateVariants:
java -Xms10G -Xmx10G -jar ${gatk4} ValidateVariants -R ${ref} -V ${name}_v4.1.6.0.vcf.gz -L ${pipeline}/${name}_intervals.list --warn-on-errors 2> ${log}/${name}_v4.1.6.0_genotype_valivar.log
the warning in ValidateVariants and the site look like this:
14:12:15.126 WARN ValidateVariants - ***** Input 1st_v4.1.6.0.vcf.gz fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position chr_1:1088200 are not observed at all in the sample genotypes *****
chr_1 1088200 . T *,TAAAAAAAAAAAA 64.39 . AC=8,0;AF=0.667,0.00;AN=12;DP=118;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.4286;MLEAC=7,7;MLEAF=0.583,0.583;MQ=58.73;QD=32.19;SOR=2.303 GT:AD:DP:GQ:PL ./.:9,0,0:9:.:0,0,0,0,0,0 0/0:9,0,0:9:0:0,0,113,0,113,113 ./.:10,0,0:10:.:0,0,0,0,0,0 ./.:5,0,0:5:.:0,0,0,0,0,0 1/1:0,0,1:1:0:225,15,0,15,0,0 ./.:0,0,0:0:.:0,0,0,0,0,0 ./.:12,0,0:12:.:0,0,0,0,0,0 ./.:8,0,0:8:.:0,0,0,0,0,0 0/0:3,0,0:3:0:0,0,43,0,43,43 ./.:7,0,0:7:.:0,0,0,0,0,0 ./.:1,0,0:1:.:0,0,0,0,0,0 ./.:0,0,0:0:.:0,0,0,0,0,0 ./.:3,0,0:3:.:0,0,0,0,0,0 ./.:7,0,0:7:.:0,0,0,0,0,0 1/1:0,0,0:0:0:45,3,0,3,0,0 ./.:0,0,0 1/1:0,0,1:1:0:45,3,0,3,0,0 1/1:0,0,0:0:0:267,18,0,18,0,0 ./.:9,0,0:9:.:0,0,0,0,0,0
The exactly the same happens when I run GenotypeGVCFs in --include-non-variant-sites and when I run GenotypeGVCFS and ValidateVariants in v4.1.7.0.
In principle, these sites just take up space in the vcf, as the correct behaviour of GenotypeGVCFS should result in the removal of ALT allele 2 and 3, which leads to the site not being called, as the spanning deletion is covered when it's called on the first base. I'm hoping that they disappear when I filter, but if not it's for me still a manageable amount to remove by hand even though that's not ideal. I just want to alert you to the weird behaviour, that I'm experiencing.
I hope you can do something with my post and thanks in advance,
-
Hi ABours
This is not a invalid vcf and not something to worry about.. ValidateVariants is going beyond vcf specs. I agree with you and we are working on creating better ValidateVariants messaging for this issue.
We have created a ticket for this and you can follow its progress here: https://github.com/broadinstitute/gatk/issues/6630
-
Hi Bhanu,
Thanks for your reply and making a ticket.
But like I said, you already have a ticket for this ValidateVariants issue, to allow these lines. However, I just wanted to point out that GenotypeGVCFs created this incorrectly formatted site, thus maybe you can also have a look at why GenotypeGVCFs keeps these sites in the vcf while there is no alternative allele except the spanning deletion, which already should be in the vcf as mark by deletion.
I'm sorry if this point/comment got lost in my previous post.
Best,
-
Hi ABours
Apologies I seemed to have missed that info. Let me look into this and get back to you.
PS: We are facing high volume of questions on the forum lately and it might take me a few days to get back to you. Rest assured this is on my radar.
-
Hi ABours
We would like to investigate by reproducing the error caused by GenotypeGVCFs. Can you please share a snippet of the data to reproduce? You can find information to share data here: https://gatk.zendesk.com/hc/en-us/articles/360035889671
-
Hi Bhanu,
I believe I shared the data (files_to_share_with_gatk_ABours.tar.gz), can you please confirm that you have?
Within the snippet I ensured that the three different version of the formatting error occurs.
Best,
-
Hi ABours
We are looking into it and will get back to you shortly.
-
Hi ABours
We looked into it and we don't think this is necessarily an unwanted behaviour. This call is trying to tell us that something is going on in this region. You can see that the PL values have 3 0s which means that it is equally likely that this variant could be alt1, alt2 or non-ref but it chose to represent it as hom alt1. This could be due to repeats in that region. This may not be the most accurate call but the tool is indicating that something is going on in this region and it should be inspected instead of not calling anything at all.
For more info on how PL ia calculated: https://gatk.broadinstitute.org/hc/en-us/articles/360035890451-Calculation-of-PL-and-GQ-by-HaplotypeCaller-and-GenotypeGVCFs
-
Hi Bhanu,
Thank you for checking it, and it's good to know what's happening here. It's just striking that I didn't have this on a similar run with a different reference.
Best,
Please sign in to leave a comment.
8 comments