Missing PS field in the VCF file produced by GenotypeGVCFs
NOTE: This question was originally posted by user svitlana on our previous forum at Tuesday, 7 January 2020 at 14:26 GMT. Since the previous forum has been closed for posting, this question has been moved here.
- - - - - - - - - - - - - - - - - -
Hello,
I followed GATK best practices to produce a VCF file for 20 individuals. GATK version is 4.1.0.0.
The BAM files were all verified by ValidateSamFile, no errors or warnings were detected.
HaplotypeCaller was run with --emit-ref-confidence, on each file individually. All the files were then merged with CombineGVCFs and finally the resulting gVCF file was used to produce my VCF file with GenotypeGVCFs.
Ah the end, I tried to run Beagle on the VCF file but I got the following error:
IllegalArgumentException: ERROR: inconsistent number of alleles for sample SAMPLE at marker [ctg9 107774 . C T]
The VCF entry for this call is the following:
ctg9 107774 . C T 4311.43 . AC=14;AF=0.350;AN=40;BaseQRankSum=0.198;DP=295;ExcessHet=0.3091;FS=0.611;InbreedingCoeff=0.3322;MLEAC=14;MLEAF=0.350;MQ=59.85;MQRankSum=0.00;QD=27.29;ReadPosRankSum=0.361;SOR=0.618
GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,19:19:66:1|1:107737_C_T:990,66,0:107737 0/0:18,0:18:54:.:.:0,54,691 1|1:0,18:18:54:1|1:107774_C_T:806,54,0:107774 1|1:0,13:13:39:1|1:107774_C_T:581,39,0:107774 0/0:10,0:10:30:.:.:0,30,321 0/0:10,0:10:27:.:.:0,27,405 0/0:9,0:9:27:.:.:0,27,294 0/0:18,0:18:54:.:.:0,54,704 0|1:16,2:18:36:0|1:107774_C_T:36,0,666:107774 0/0:13,0:13:39:.:.:0,39,509 0|1:5,9:14:99:0|1:107774_C_T:363,0,183:107774 0|1:19,8:27:99:0|1:107774_C_T:279,0,758:107774 0/0:12,0:12:36:.:.:0,36,501 0/0:16,0:16:45:.:.:0,45,675 0|1:7,4:11:99:0|1:107774_C_T:144,0,324:107774 0|1:6,7:13:99:0|1:107774_C_T:276,0,231:107774 0/0:13,0:13:39:.:.:0,39,497 0|1:5,3:8:99:0|1:107774_C_T:111,0,201:107774 1|1:1,16:17:12:1|1:107774_C_T:765,12,0:107774 0/0:9,0:9:27:.:.:0,27,342
The SAMPLE indicated by Beagle corresponds to the last entry (0/0:9,0:9:27:.:.:0,27,342), which doesn't correspond to the format (GT:AD:DP:GQ:PGT:PID:PL:PS), the last field is missing.
If it can help, below are shown the entries from the gVCF file for the last sample:
ctg9 107769 . T <NON_REF> . . END=107782 GT:DP:GQ:MIN_DP:PL 0/0:11:27:9:0,27,342
ctg9 107783 . A <NON_REF> . . END=107792 GT:DP:GQ:MIN_DP:PL 0/0:9:24:9:0,24,360
And here is the entry for the same variant in another sample, which has the correct number of fields in the VCF.
ctg9 107774 . C T,<NON_REF> 751.10 . BaseQRankSum=-0.182;DP=19;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;MQRankSum=0.000;RAW_MQandDP=68400,19;ReadPosRankSum=0.511 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:1,16,0:17:12:0|1:107774_C_T:765,12,0,768,54,810:107774:1,0,5,11
I suppose that this problem is related to the physical phasing done by HaplotypeCaller.. so it could maybe be corrected by rerunning HaplotypeCaller with --do-not-run-physical-phasing option. But it will take several weeks to run on my data, so maybe there is another way to fix this problem?
Thank you in advance.
-
Hi, svitlana. The VCF output you have shown is the expected output — HaplotypeCaller does not output the Phase Set information that you are looking for, so you will need to troubleshoot the issue with Beagle.
-
Hello Derek!
Thank you for your answer!
I was expecting that sample columns in the VCF would always satisfy the format given by FORMAT field, so in this case they will all have 8 elements separated by ":" and if some values are not applicable, they would be replaced by dots. I suppose that this is the reason of Beagle failure: it wasn't able to find the eighth field for the last sample.
And so the expected behavior of GenotypeGVCFs is to remove the last element for samples for which it is not applicable instead of replacing it by a dot?
-
So if I am understanding correctly, you actually do have a PS field with your data, and it is only after using GenotypeGVCF that you lose it?
Could you confirm the version number of the build you are using is the most recent one? There was a known bug with GenotypeGVCF that would lead to empty VCF fields, but it was fixed several years ago. I don't know if this is related, but I will see what I can do.
-
The version I use is 4.1.0.0, released one year ago. Unfortunately I cannot install a newer version on our cluster.
As shown in the examples above, for the given variant (ctg9 107774) some gVCF files produced by HaplotypeCaller contain an explicit entry for the PS field and some of them don't contain it.
In the VCF file produced by GenotypeGVCFs from these gVCF files, the FORMAT field contains "PS" for that variant, but in the columns corresponding to my samples the value for PS is missing for the samples that don't contain it in their gVCF files (by missing I mean it is removed instead of being replaced by a dot as it is done for other unknown fields, and so the number of fields for this entry does not correspond to the number of field specified by FORMAT).
Concrete example: for the FORMATGT:AD:DP:GQ:PGT:PID:PL:PS
I have entries like that:
1|1:1,16:17:12:1|1:107774_C_T:765,12,0:107774
0/0:9,0:9:27:.:.:0,27,342
In the first entry there are 8 fields because all phasing-related values are known in the corresponding gVCF file. In the last entry, there is no information about phasing in the gVCF file, so values for PGT and PID were replaced by dots, but the value for PS field was removed instead of being replaced by a dot.
-
Using GenotypeGVCFs, it is normal behavior for a trailing field to be dropped if it is empty.
So, if this is causing you issues downstream of your pipeline, a workaround would be to add a filler field at the end of your files (ie. "...:PID:PL:PS:X") with a known non-empty value (ie. 1). Even if the PS field is empty, it will still remain visible because the last field X is non-empty.
On the other hand, if you have proof that an expected PS field is being deleted, then this is a bug and will need to be addressed. Try searching through your files to verify whether the PS field is being truncated even when there is data in the field (or artificially inject fake values into the file and see if it gets deleted). -
Thank you Derek!
I think your solution of adding another field at the end will solve my problem. For the moment I don't have any proof of real values being deleted but if I find it I'll let you know.
Please sign in to leave a comment.
6 comments