Some <NON_REF> alleles remain after GenotypeGVCFs when using --include-non-variant-sites
Can you please provide
a) GATK version used
b) Exact GATK commands used
gatk GenotypeGVCFs -R ref/11-ref.fa -V 11.gvcf.gz -L "11" -O 11-all.vcf.gz --include-non-variant-sites
gatk GenotypeGVCFs -R ref/11-ref.fa -V 11.gvcf.gz -L "11" -O 11.vcf.gz
Possibly related to the following threads:
I'm trying to generate a VCF file from a single sample that includes every site in the reference genome, for iterative mapping and pseudo-assembly. This requires me to inject the called variants back into the reference AND to soft-mask sites where there is not enough information to make a call. This means I need GATK to report read depth and confidence at non-variant sites.
However, I'm running into an issue when using --include-non-variant-sites in the GenotypeGVCFs step: for some sites, the <NON_REF> allele remains after this step.
Specifically, in a VCF file with 122052550 sites (including non-variant sites), 9804 sites still have the <NON_REF> allele. If I run GenotypeGVCFs without --include-non-variant-sites, there are 0 sites which still have <NON_REF> afterwards.
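For anyone who wants to reproduce these counts, here is a minimal sketch of how they could be tallied. The function name `count_non_ref_sites` is hypothetical, and it assumes standard tab-separated VCF records with ALT in the fifth column; it is not necessarily how the numbers above were obtained.

```python
def count_non_ref_sites(lines):
    """Return (total_records, records_with_non_ref) for an iterable of VCF lines.

    Assumes tab-separated VCF records with ALT in column 5.
    Header lines (starting with '#') and blank lines are skipped.
    """
    total = 0
    with_non_ref = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        # ALT is a comma-separated allele list; match the symbolic allele exactly.
        if "<NON_REF>" in fields[4].split(","):
            with_non_ref += 1
    return total, with_non_ref
```

It could be run on the compressed output by passing a text handle, e.g. one from `gzip.open("11-all.vcf.gz", "rt")`.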
Here are some of the lines that still have <NON_REF> after GenotypeGVCFs with --include-non-variant-sites:
11 3227415 . G A,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3227415_G_A:3227415
11 3244678 . T TAC,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3244641_C_CTCTT:3244641
11 3256761 . A G,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3256761_A_G:3256761
11 3256764 . T C,<NON_REF> . . . GT:AD:PGT:PID:PS .|.:0,0,0:0|1:3256761_A_G:3256761
11 3291976 . TGTGGGG T,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291977 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291978 . T *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291979 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
11 3291980 . G *,<NON_REF> . . . GT:AD ./.:0,0,0
And here are the corresponding lines in the GVCF file:
11 3227415 . G A,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3227415_G_A:3227415
11 3244678 . T TAC,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3244641_C_CTCTT:3244641
11 3256761 . A G,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3256761_A_G:3256761
11 3256764 . T C,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID:PS .|.:0|1:3256761_A_G:3256761
11 3291976 . TGTGGGG T,<NON_REF> 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.
(The last four sites, 3291977-80, were not called as variants in the GVCF.)
While the <NON_REF> strings do cause problems with downstream tools, they are easy enough to remove. But it's disconcerting that this would seemingly result in more variants being called with --include-non-variant-sites than without. So any explanation would be helpful!
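As a rough illustration of the kind of cleanup meant here, the sketch below drops <NON_REF> from the ALT column of a single record. The function name `strip_non_ref` is hypothetical and tab-separated records are assumed. Note that it only edits ALT: per-allele FORMAT fields such as AD still carry a value for the removed allele, so this is a string-level workaround rather than a proper re-genotyping.

```python
def strip_non_ref(line):
    """Remove the <NON_REF> symbolic allele from the ALT column of one VCF line.

    Header lines are returned unchanged (minus the trailing newline).
    Per-allele fields such as AD are NOT adjusted; only ALT is edited.
    """
    line = line.rstrip("\n")
    if line.startswith("#") or not line:
        return line
    fields = line.split("\t")
    alts = [allele for allele in fields[4].split(",") if allele != "<NON_REF>"]
    # An empty ALT list is written as '.' (no alternate allele).
    fields[4] = ",".join(alts) if alts else "."
    return "\t".join(fields)
```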
Another thing I noticed while investigating this is that the VCF file that results from --include-non-variant-sites has fewer sites than the reference FASTA file that was used to generate the GVCF:
122082543 total sites in FA
122052550 total sites in VCF
I would expect them to have equal numbers of sites. Any insights on that would be useful as well.
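The two totals differ by 122082543 - 122052550 = 29993 sites. One way to see where those positions fall would be to list the stretches of the contig that have no record. The sketch below is only an illustration: `find_gaps` is a hypothetical helper, it expects the POS values of the records and the contig length as plain integers, and it treats each record as covering only its POS (it ignores multi-base REF alleles).

```python
def find_gaps(positions, contig_length):
    """Return 1-based inclusive (start, end) intervals with no record.

    `positions` must be sorted ascending; duplicates are tolerated.
    Each record is treated as covering only its own POS.
    """
    gaps = []
    expected = 1  # next position we expect to see a record for
    for pos in positions:
        if pos > expected:
            gaps.append((expected, pos - 1))
        expected = max(expected, pos + 1)
    # Anything after the last record up to the contig end is also a gap.
    if expected <= contig_length:
        gaps.append((expected, contig_length))
    return gaps
```

Summing `end - start + 1` over the returned intervals gives the number of positions without a record, which can then be compared against the difference above.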
Thanks so much!
@gwct We merged a fix for this earlier today. It will be in the GATK release planned for next week.
Awesome, thanks!
Were the variants at the sites not in the GVCF being added in erroneously? I ask so I can know if I need to re-run some things once the next release is out.
Hi gwct
It does not affect the genotype calls. The issue is only with the representation and is not an erroneous result, so you don't need to re-run the samples.