Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Missing PS field in the VCF file produced by GenotypeGVCFs

6 comments

  • Derek Caetano-Anolles

    Hi, svitlana. The VCF output you have shown is the expected output: HaplotypeCaller does not output the Phase Set (PS) information that you are looking for, so you will need to troubleshoot the issue with Beagle.

  • svitlana

    Hello Derek!

    Thank you for your answer!

    I was expecting the sample columns in the VCF to always match the format given by the FORMAT field, so in this case they would all have 8 elements separated by ":", with dots in place of any values that are not applicable. I suppose this is the reason for the Beagle failure: it was not able to find the eighth field for the last sample.

    So is the expected behavior of GenotypeGVCFs to remove the last element for samples where it is not applicable, instead of replacing it with a dot?

  • Derek Caetano-Anolles

    So if I am understanding correctly, you actually do have a PS field in your data, and it is only after using GenotypeGVCFs that you lose it?

    Could you confirm whether the build you are using is the most recent one? There was a known bug in GenotypeGVCFs that would lead to empty VCF fields, but it was fixed several years ago. I don't know if this is related, but I will see what I can do.

  • svitlana

    The version I use is 4.1.0.0, released one year ago. Unfortunately I cannot install a newer version on our cluster.

    As shown in the examples above, for the given variant (ctg9 107774) some gVCF files produced by HaplotypeCaller contain an explicit entry for the PS field and some of them don't contain it.

    In the VCF file produced by GenotypeGVCFs from these gVCF files, the FORMAT field contains "PS" for that variant, but the PS value is missing from the columns of the samples whose gVCF files do not contain it. By missing I mean that it is removed rather than replaced by a dot, as is done for other unknown fields, so the number of fields in the entry does not match the number of fields specified by FORMAT.

    Concrete example: for the FORMAT

    GT:AD:DP:GQ:PGT:PID:PL:PS

    I have entries like this:

    1|1:1,16:17:12:1|1:107774_C_T:765,12,0:107774
    0/0:9,0:9:27:.:.:0,27,342

    In the first entry there are 8 fields because all phasing-related values are known in the corresponding gVCF file. In the last entry, there is no information about phasing in the gVCF file, so values for PGT and PID were replaced by dots, but the value for PS field was removed instead of being replaced by a dot.

  • Derek Caetano-Anolles

    Using GenotypeGVCFs, it is normal behavior for a trailing field to be dropped if it is empty.

    So, if this is causing you issues downstream in your pipeline, a workaround would be to add a filler field at the end of the FORMAT string and of each sample column (e.g. "...:PID:PL:PS:X") with a known non-empty value (e.g. 1). Even if the PS field is empty, it will still remain visible because the last field X is non-empty.

    On the other hand, if you have proof that an expected PS field is being deleted, then this is a bug and will need to be addressed. Try searching through your files to verify whether the PS field is being truncated even when there is data in the field (or artificially inject fake values into the file and see if it gets deleted).

  • svitlana

    Thank you Derek!

    I think your solution of adding another field at the end will solve my problem. For the moment I don't have any proof of real values being deleted but if I find it I'll let you know.

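For readers who hit the same Beagle failure, an alternative to the filler-field workaround discussed in the thread is to pad the truncated sample columns with dots before passing the file downstream. The Python sketch below is only an illustration and is not part of GATK or Beagle: the function names `pad_sample_column` and `pad_vcf_line` are hypothetical, and it assumes an uncompressed, tab-separated VCF in which the FORMAT column is the ninth column. As far as I know, the VCF specification allows trailing sample fields to be dropped, so the padded and unpadded forms should carry the same information.

```python
def pad_sample_column(format_keys, sample):
    """Pad one sample column with '.' so it has one value per FORMAT key."""
    values = sample.split(":")
    if len(values) > len(format_keys):
        raise ValueError("sample column has more fields than FORMAT")
    missing = len(format_keys) - len(values)
    return ":".join(values + ["."] * missing)


def pad_vcf_line(line):
    """Return a VCF line with every sample column padded to the FORMAT length.

    Header lines and lines without sample columns are returned unchanged.
    """
    if line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:  # no FORMAT column plus at least one sample column
        return line
    format_keys = cols[8].split(":")
    cols[9:] = [pad_sample_column(format_keys, s) for s in cols[9:]]
    return "\t".join(cols) + "\n"


# The truncated entry quoted in the thread, padded to 8 fields.
fmt = "GT:AD:DP:GQ:PGT:PID:PL:PS".split(":")
print(pad_sample_column(fmt, "0/0:9,0:9:27:.:.:0,27,342"))
# prints 0/0:9,0:9:27:.:.:0,27,342:.
```

Applying `pad_vcf_line` to each line of the GenotypeGVCFs output would leave complete entries untouched and only append dots where trailing fields were dropped, so real PS values are never altered.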
