Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs Output no call as reference genotypes

Answered
2

17 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi chen xx,

    Thanks for writing in! There were a couple bugs with GenotypeGVCFs in 4.2.5.0 and so we will be releasing a new GATK version to fix these bugs in the next few days. Could you try the new version once it is out and see if it solves your issue?

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    chen xx

    Hi Genevieve Brandt,

    Thanks for reply, I will test it later

    0
    Comment actions Permalink
  • Avatar
    Luca Marcolungo

    Hi Genevieve Brandt (she/her),

    I had the same issue and i will wait for the new release to test. Is it already planned the release date?

    Thanks in advance,

    Luca

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Luca,

    It should be this week. We ran into an issue last Thursday and are working to resolve it.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Luca Marcolungo

    Great, thanks Genevieve Brandt (she/her) for the quick reply.

    0
    Comment actions Permalink
  • Avatar
    chen xx

    Hi Genevieve Brandt,

     

    I have test it, and find it still not change. Just output 0/0:0,0:0:0:0,0,0 for no call samples.

    0
    Comment actions Permalink
  • Avatar
    Luca Marcolungo

    Hi Genevieve Brandt (she/her),

    It is the same also for me, i get 0/0:0,0:0:0:0,0,0 for no call positions.

    I am still using the database built using gatk 4.2.5. Should i rebuild the database using the latest version?

    Thanks in advance,

    Luca

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Luca Marcolungo yes, could you test a small sample to recreate this issue on a database built with 4.2.6.0?

    chen xx to confirm, these calls are completely from 4.2.6.0? 

    0
    Comment actions Permalink
  • Avatar
    Luca Marcolungo

    Hi Genevieve Brandt (she/her),

    I tested the issue on a small database generate using GATK 4.2.6.0 and I can confirm that I still get 0/0:0,0:0:0:0,0,0 for not genotyped positions.

    Let me know if you need any further information,

    Best,

    Luca

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Luca! Can you upload this test database and all other files necessary to a bug report? It will be very helpful for us to fix this issue.

    Here are the instructions: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671. Please let me know once the file is there.

    0
    Comment actions Permalink
  • Avatar
    Luca Marcolungo

    Hi Genevieve Brandt (she/her),

    I uploaded a zip file named 'GenotypeGVCFs_Output_no_call_as_reference_genotypes.zip' in the ftp. Inside the zip file you can find a README text file reporting all the command line used to generate the database and the vcf file. 

    Just to let you know, i also tried to generate the database and the vcf file using the latest version (4.2.6.1) but the problem is still there.

    Please let me know if you need any further question

    Thanks in advance,

    Luca

    0
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Hello,

    I wanted to also report that GATK 4.2.6.1 seems to only rarely designate genotypes as "missing" at an individual, even when there are no reads at the site in the individual.  I have a small test data set that I use for testing a snakemake workflow.  It is a standard bwa-mem mapping -> MarkDuplicates -> HaplotypeCaller -> GenomicsDBImport -> GenotypeGVCFs workflow. 

    Since it is a heavily whittled down data set, there are a lot of individuals with no reads at quite a few sites, but they almost all come out with genotype calls. For example:

     CM031202.1      2749372 .       TAAATGA T       85.01   .       AC=4;AF=0.286;AN=14;DP=3;ExcessHet=0;FS=0;MLEAC=4;MLEAF=0.286;MQ=60;QD=28.31;SOR=2.303  GT:AD:DP:GQ:PGT:PID:PL:PS       1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367     0/0:0,0:0:0:.:.:0,0,0:. 0/0:0,0:0:0:.:.:0,0,0:. 0/0:0,0:0:0:.:.:0,0,0:. .|.:0,0:.:.:0|1:2749367_T_TA:0,0,0:2749367      0/0:1,0:1:3:.:.:0,3,42:.        0/0:0,0:0:0:.:.:0,0,0:. 1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367

    Out of 5238 sites, only 29 of them include genotypes that are designated as missing (.|. or ./.). Here are five of them. It seems to me that they are only ever called as missing if they are considered phased sites:

    bcftools view -H -i 'GT~"\."' results/vcf/all.vcf.gz | head -n 5
    CM031202.1    2749342    .    AT    A    131.59    .    AC=4;AF=0.286;AN=14;DP=4;ExcessHet=0;FS=0;MLEAC=5;MLEAF=0.357;MQ=60;QD=29.53;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS    1|1:0,1:1:3:1|1:2749342_AT_A:45,3,0:2749342    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    .|.:0,0:.:.:0|1:2749342_AT_A:0,0,0:2749342    0/0:1,0:1:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    1|1:0,2:2:6:1|1:2749342_AT_A:90,6,0:2749342
    CM031202.1    2749345    .    T    A    131.67    .    AC=4;AF=0.286;AN=14;DP=4;ExcessHet=0;FS=0;MLEAC=5;MLEAF=0.357;MQ=60;QD=32.63;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS    1|1:0,1:1:3:1|1:2749342_AT_A:45,3,0:2749342    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    .|.:0,0:.:.:0|1:2749342_AT_A:0,0,0:2749342    0/0:1,0:1:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    1|1:0,2:2:6:1|1:2749342_AT_A:90,6,0:2749342
    CM031202.1    2749367    .    T    TA    126.59    .    AC=4;AF=0.286;AN=14;DP=4;ExcessHet=0;FS=0;MLEAC=5;MLEAF=0.357;MQ=60;QD=28.13;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS    1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    .|.:0,0:.:.:0|1:2749367_T_TA:0,0,0:2749367    0/0:1,0:1:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    1|1:0,2:2:6:1|1:2749367_T_TA:85,6,0:2749367
    CM031202.1    2749369    .    TC    T    126.59    .    AC=4;AF=0.286;AN=14;DP=4;ExcessHet=0;FS=0;MLEAC=5;MLEAF=0.357;MQ=60;QD=27.97;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS    1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    .|.:0,0:.:.:0|1:2749367_T_TA:0,0,0:2749367    0/0:1,0:1:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    1|1:0,2:2:6:1|1:2749367_T_TA:85,6,0:2749367
    CM031202.1    2749372    .    TAAATGA    T    85.01    .    AC=4;AF=0.286;AN=14;DP=3;ExcessHet=0;FS=0;MLEAC=4;MLEAF=0.286;MQ=60;QD=28.31;SOR=2.303    GT:AD:DP:GQ:PGT:PID:PL:PS    1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    0/0:0,0:0:0:.:.:0,0,0:.    .|.:0,0:.:.:0|1:2749367_T_TA:0,0,0:2749367    0/0:1,0:1:3:.:.:0,3,42:.    0/0:0,0:0:0:.:.:0,0,0:.    1|1:0,1:1:3:1|1:2749367_T_TA:45,3,0:2749367

    I cannot imagine that the expected behavior of GATK is to provide genotype calls where there are no reads.  It seems like this would be a severe problem for people doing low coverage whole genome sequencing.

    Please let me know if it would be helpful for you to have any of my test data.  It would be easy to send that off to you.

    Thanks so much for all you do!

    eric

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi all (chen xx Luca Marcolungo Eric C. Anderson),

    Thank you for your help and patience while I looked into this issue with the newest GATK versions. I spoke with our developers who have been doing this work on GenotypeGVCFs. This difference you are all showing here is actually a change we made on purpose to improve efficiency for large scale joint calling. 

    You can use DP in the FORMAT field to determine that these sites are no-calls instead of the missing allele in the genotype. 

    This is also described by our dev team in a similar question on github: https://github.com/broadinstitute/gatk/issues/7792.

    Please let me know if you have any other questions about this.

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Thank you for looking into that Genevieve.  I appreciate that.  

    Moving forward with VCF files produced by GATK, it would not be hard for me to identify missing genotypes as those that have DP=0 or sum of AD=0, etc.

    But, I am surprised that this change was made, apparently, without any announcement in the release notes when that change was made.  Were these rather dramatic changes announced in any release notes?  I am assuming that they probably were not, because if they were, I would have expected to be sent to those release notes for a discussion of the change, rather than to the GitHub issues page about it.  

    It seems to me that anyone using the recommendations from the bcftools manual on how to identify/filter/handle missing data, i.e.:

    bcftools view -i 'F_MISSING < 0.3' file.vcf
    bcftools view -i 'GT="mis"' file.vcf
    bcftools view -i 'GT[*]~"\."' file.vcf

    etc., is going to be in for a nasty surprise when that doesn't actually do what they think it is going to do.

    Those are things that can be dealt with by the user who has been made aware of the changes.  What will be more problematic is dealing with downstream analysis programs that identify missing data as ".|." or "./.", and do so in a way that is "baked into their code."  For example, BEAGLE version 4 allows for imputing missing genotypes from the variable-length Markov chain model used for the haplotype structure in the data.  Clearly, imputation at the sites where DP=0 would be desired, but it is not clear to me that this would be possible with one of the newer VCF files without an intermediate step in which the genotypes were converted to "./."'s.

    It is great to hear that this change can increase efficiency in large-scale joint calling.  You and the developers there are constantly in the trenches having to keep pace with the ridiculous (and wonderful!!) increases in sequencing capacity that we are seeing, and those efficiency gains will be universally appreciated.  However, in the interests of being able to use VCF files produced by GATK in myriad analysis pipelines, I would love to be able to keep in touch with y'all there about two things that I think would be very important:

    1. A GATK tool that can take as input a VCF file produced by GATK >= 4.2.6.1, and produce, as output, a VCF file that uses the missing field indicator as intended by the VCF 4.2 specification, and as was produced by many versions of GATK in the past.

    2. A more formal announcement that this change has been made to the VCF files produced by GATK.  It may be that the change does not affect users with very high read depth from model organisms, but I have numerous colleagues that use GATK for work on non-model organisms, and this change could be dynamite (in a negative way...) for many of their downstream analyses.

    Thanks so much for all that you do,

    eric

     

    0
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Just answered my issue #1 (above) using a solution with bcftools. 

    If anyone is wondering how to code genotypes with DP=0 as missing, an easy solution with the bcftools setGT plugin is:

    bcftools +setGT results/vcf/all.vcf.gz -- -t q -n . -i 'FMT/DP=0'

    If anyone had any code that took as input parsed allele depths and was expecting a "." for cases with no reads, you might have to still update that to test for both "." and "0,0". 

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thank you Eric C. Anderson for the feedback! We definitely don't want this change to cause a lot of issues for our users. We wrote up a blog post help more users that may not have been aware of the change: https://gatk.broadinstitute.org/hc/en-us/articles/6012243429531

    If you have any other concerns, feel free to let us know so that we can keep those in mind for future GATK versions.

    0
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Thanks a lot Genevieve!

    That is a really helpful blog post.  It's exciting to hear about all the backend changes and efficiency increases happening.  

    Best regards,

    eric

     

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk