Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Missing samples from the output vcf file produced with GenotypeGVCFs

5 comments

  • Bhanu Gandham

    Hi Patrícia H. Brito,

    That's an interesting problem. Can you run SelectVariants to convert the GenomicsDB workspace to a VCF and check whether all the samples are present in that VCF? You could do this on a subset of variants. More information on how to use the tool is here: https://gatk.broadinstitute.org/hc/en-us/articles/360051305531-SelectVariants
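A minimal sketch of that check, restricted to one small interval so the export stays quick. The reference, workspace name, interval, and output file are hypothetical placeholders, and the `run` wrapper is only an illustration that prints the command by default instead of executing it.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical placeholders; replace with real paths before running.
REF="reference.fasta"
WORKSPACE="my_database"        # GenomicsDB workspace directory
INTERVAL="chr20:1-1000000"     # a small region is enough to list the samples
OUT="subset.vcf.gz"

# The gendb:// prefix tells GATK to read variants from a GenomicsDB workspace.
run gatk SelectVariants \
    -R "$REF" \
    -V "gendb://$WORKSPACE" \
    -L "$INTERVAL" \
    -O "$OUT"
```

Setting DRY_RUN=0 executes the command for real. The sample names in the resulting file can then be listed with bcftools query -l.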

  • Patrícia H. Brito

    Hi,

    I ran SelectVariants on my database of 113 samples, and the output is still a VCF file with only 100 samples.
    I am counting the samples with: bcftools query -l filename.vcf | wc -l
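The count shows that 13 samples are missing but not which ones. A possible way to name them, sketched below, is to compare the VCF header against a list of expected sample names. The file expected_samples.txt (one name per line) and both function names are hypothetical; the header is parsed with awk here, although bcftools query -l would give the same list.

```shell
# List the sample names from the #CHROM header line of a VCF (plain or .gz).
# Columns 1-9 are the fixed VCF fields; samples start at column 10.
vcf_samples() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk -F '\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }
    '
}

# Print the names that are in the expected list but absent from the VCF.
#   $1: text file with one expected sample name per line
#   $2: VCF file to inspect
missing_samples() {
    tmp=$(mktemp) || return 1
    vcf_samples "$2" | sort -u > "$tmp"
    # comm -23 keeps lines unique to the first input (the expected names).
    sort -u "$1" | comm -23 - "$tmp"
    rm -f "$tmp"
}

# Example (hypothetical file names):
# missing_samples expected_samples.txt filename.vcf
```

If a sample map was used for the import, its first column can serve as the expected list, for example via cut -f1.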

  • Bhanu Gandham

    Hi Patrícia H. Brito

    1. Are you on a shared NFS?
    2. How much physical memory do you have on your machine? I suspect that the tool is running out of memory which is why it is behaving this way. Can you also share the specs of the machine you are using?
    3. Take a look at this thread and try the steps described there for fixing GenomicsDB memory issues: https://gatk.broadinstitute.org/hc/en-us/community/posts/360074224671-GenomicsDBImport-running-out-of-memory-
    4. Can you try to recreate this issue with a smaller number of samples? Maybe try with 10 samples and let me know if the issue persists.
  • Patrícia H. Brito

    Hi Bhanu Gandham,

    After fighting with my pipeline, I realized that there was an error in the sampleMap for some of the samples. I also added “--consolidate --reader-threads 4” to the GenomicsDBImport command line. Together, these changes thankfully fixed the problem, and my pipeline is working fine now. Sorry for the time you spent on this. Anyway, the GATK helpdesk is great, with many good ideas for troubleshooting!
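The post does not say what the sample map error was, so the following is only a sketch of sanity checks that could catch common problems before the import. It assumes the usual tab-separated layout of sample name and file path, with an optional third column for an index path. The function name check_sample_map and all file names are hypothetical.

```shell
# Check a GenomicsDBImport sample map: "sample<TAB>path[<TAB>index]" per line.
# Reports malformed lines, duplicate sample names, and unreadable files.
# Returns non-zero if any problem is found.
check_sample_map() {
    map="$1"
    status=0

    # Lines without 2 or 3 tab-separated fields, or with an empty name/path.
    bad=$(awk -F '\t' 'NF < 2 || NF > 3 || $1 == "" || $2 == "" { print NR }' "$map")
    if [ -n "$bad" ]; then
        echo "Malformed line(s): $(echo $bad)" >&2
        status=1
    fi

    # Sample names that occur more than once.
    dups=$(cut -f1 "$map" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "Duplicate sample name(s): $(echo $dups)" >&2
        status=1
    fi

    # Paths in the second column that are not readable files.
    missing=$(cut -f2 "$map" | while IFS= read -r path; do
        [ -r "$path" ] || printf '%s\n' "$path"
    done)
    if [ -n "$missing" ]; then
        echo "Unreadable file(s): $(echo $missing)" >&2
        status=1
    fi

    return "$status"
}

# Example (hypothetical file names); the last two flags are the ones quoted
# in the post above:
# check_sample_map cohort.sample_map && \
#   gatk GenomicsDBImport --sample-name-map cohort.sample_map \
#       --genomicsdb-workspace-path my_database -L chr20 \
#       --consolidate --reader-threads 4
```

Comparing the number of lines in the map with the expected number of samples (113 in this thread) is another quick check.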

  • Bhanu Gandham

    Good to know you were able to solve the issue, and thanks for posting the solution!
