Some <NON_REF> alleles remain after GenotypeGVCFs when using --include-non-variant-sites
Can you please provide
a) GATK version used
4.1.4.1
b) Exact GATK commands used
gatk GenotypeGVCFs -R ref/11-ref.fa -V 11.gvcf.gz -L "11" -O 11-all.vcf.gz --include-non-variant-sites
gatk GenotypeGVCFs -R ref/11-ref.fa -V 11.gvcf.gz -L "11" -O 11.vcf.gz
Possibly related to the following threads:
Hello,
I'm trying to generate a VCF file from a single sample that includes every site in the reference genome for iterative mapping and pseudo-assembly. This requires me to inject variants called back into the reference AND to soft-mask sites where there is not enough information to make a call. This means I need GATK to provide me with read depth and confidence at non-variant sites.
However, I'm running into an issue when using --include-non-variant-sites in the GenotypeGVCFs step: for some sites, the <NON_REF> allele remains after this step.
Specifically, in a VCF file with 122052550 sites (including non-variant sites), 9804 sites still have the <NON_REF> allele. If I run GenotypeGVCFs without --include-non-variant-sites, there are 0 sites which still have <NON_REF> afterwards.
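For anyone wanting to reproduce this kind of count, here is a minimal Python sketch. The function name `count_non_ref_sites` is made up for illustration, and it assumes a standard tab-delimited VCF where ALT is the fifth column. It only tallies records whose ALT list contains <NON_REF> and is not part of GATK.

```python
import gzip

NON_REF = "<NON_REF>"


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def count_non_ref_sites(lines):
    """Return (total_records, records_with_non_ref) for an iterable of VCF lines."""
    total = 0
    with_non_ref = 0
    for line in lines:
        # Skip blank lines and header lines (## metadata and the #CHROM line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        # ALT is the 5th column; alleles are comma-separated.
        if NON_REF in fields[4].split(","):
            with_non_ref += 1
    return total, with_non_ref
```

To run it on the output above, one would open the file with `open_text("11-all.vcf.gz")` and pass the handle to `count_non_ref_sites`.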
Here are some of the lines that still have <NON_REF> after GenotypeGVCFs with --include-non-variant-sites:
11 3227415 . G A,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3227415_G_A:3227415
11 3244678 . T TAC,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3244641_C_CTCTT:3244641
11 3256761 . A G,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3256761_A_G:3256761
11 3256764 . T C,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3256761_A_G:3256761
11 3291976 . TGTGGGG T,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291977 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291978 . T *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291979 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291980 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
And here are the corresponding lines in the GVCF file:
11 3227415 . G A,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3227415_G_A:3227415
11 3244678 . T TAC,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3244641_C_CTCTT:3244641
11 3256761 . A G,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3256761_A_G:3256761
11 3256764 . T C,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3256761_A_G:3256761
11 3291976 . TGTGGGG T,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.
(The last four sites, 3291977-80, were not called as variants in the GVCF.)
While the <NON_REF> strings do cause problems with downstream tools, they are easy enough to remove. But it's disconcerting that this would seemingly result in more variants being called with --include-non-variant-sites than without. So any explanation would be helpful!
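As a rough illustration of the removal workaround, the sketch below drops <NON_REF> from the ALT column and trims the matching last entry of each sample's AD field. The function name `strip_non_ref` is hypothetical. It assumes <NON_REF> is the last ALT allele, as in the records shown above, and it does not touch other per-allele annotations (for example PL), which would need their own handling.

```python
NON_REF = "<NON_REF>"


def strip_non_ref(line):
    """Drop a trailing <NON_REF> from ALT and the matching last AD value."""
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return line
    alts = fields[4].split(",")
    # Assumption: <NON_REF>, when present, is the last ALT allele.
    if alts[-1] != NON_REF:
        return line
    # If <NON_REF> was the only ALT allele, the site becomes non-variant (".").
    fields[4] = ",".join(alts[:-1]) or "."
    if len(fields) > 9:
        keys = fields[8].split(":")
        if "AD" in keys:
            ad_index = keys.index("AD")
            for i in range(9, len(fields)):
                values = fields[i].split(":")
                if ad_index < len(values) and values[ad_index] != ".":
                    # AD has one value per allele (REF first), so drop the last.
                    depths = values[ad_index].split(",")
                    values[ad_index] = ",".join(depths[:-1])
                    fields[i] = ":".join(values)
    return "\t".join(fields) + "\n"
```

Applied to the record at 3291976 above, this would leave `T` as the only ALT allele and `0,0` as the AD value.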
Another thing I noticed while investigating this is that the VCF file that results from --include-non-variant-sites has fewer sites than the reference FASTA file that was used to generate the GVCF:
122082543 total sites in FA
122052550 total sites in VCF
I would expect them to have equal numbers of sites. Any insights on that would be useful as well.
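The two counts differ by 29,993 sites. One way to see where those sites fall is to list the position ranges that have no VCF record, as in the sketch below. The helper names `fasta_length` and `missing_intervals` are invented for this example. It assumes the VCF is sorted by position and treats each record as covering only its POS, so positions spanned by a multi-base REF allele are not counted as covered.

```python
def fasta_length(lines, contig):
    """Count sequence characters for `contig` in an iterable of FASTA lines."""
    length = 0
    in_contig = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # The contig name is the first whitespace-delimited token after ">".
            name = line[1:].split()
            in_contig = bool(name) and name[0] == contig
        elif in_contig:
            length += len(line)
    return length


def missing_intervals(vcf_lines, contig, contig_length):
    """Return 1-based inclusive (start, end) ranges with no record at any POS."""
    gaps = []
    expected = 1  # next position we expect a record for
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if chrom != contig:
            continue
        pos = int(pos)
        if pos > expected:
            gaps.append((expected, pos - 1))
        expected = max(expected, pos + 1)
    # Anything after the last record up to the contig end is also missing.
    if expected <= contig_length:
        gaps.append((expected, contig_length))
    return gaps
```

Summing `end - start + 1` over the returned ranges gives the number of positions without a record, which can be compared against the difference between the FASTA and VCF counts.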
Thanks so much!
-
@gwct We merged a fix for this earlier today. It will be in the GATK release planned for next week.
-
Awesome, thanks!
Were the variants at the sites not in the GVCF being added in erroneously? I ask so I can know if I need to re-run some things once the next release is out.
-
Hi gwct
It does not affect the genotype calls. The issue is only with the representation and is not an erroneous result, so you don't need to re-run the samples.