Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Cannot create an index file for my gVCF

6 comments

  • Gökalp Çelik

    Hi Julio Bonilla

    htsjdk.samtools.FileTruncatedException: Premature end of file:

    This error message indicates that the file you are trying to index contains corrupt gzip block information, so it cannot be indexed. Can you check whether the file decompresses without errors?

    If it does not, you may need to find an uncorrupted copy of the file.

    Regards. 
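
One way to run the check suggested above is `gzip -t`, which reads the whole compressed stream and exits non-zero if it is truncated or corrupt. The sketch below assumes the gVCF is gzip- or bgzip-compressed; the function name `check_gz` and the file pattern in the usage comment are illustrative, not part of GATK. A file cut exactly at a block boundary may still pass this test, so treat a pass as necessary rather than sufficient.

```shell
#!/usr/bin/env bash
# Report whether a gzip/bgzip-compressed file decompresses cleanly.
# check_gz is a hypothetical helper name, not a GATK command.
check_gz() {
    local file="$1"
    # gzip -t tests integrity without writing any decompressed output.
    if gzip -t "$file" 2>/dev/null; then
        echo "OK: $file"
    else
        echo "CORRUPT: $file"
        return 1
    fi
}

# Hypothetical usage: check every gVCF in the current directory.
# for f in *.g.vcf.gz; do check_gz "$f"; done
```

Any file reported as CORRUPT would need to be regenerated or copied again from an intact source before it can be indexed.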

  • Julio Bonilla

    Hi Gökalp Çelik,

    Following your answer, I first checked the BAM files for errors using the Picard ValidateSamFile command. No BAM file reported an error.

    I then used the gatk ValidateVariants command to check all the g.vcf.gz files. A couple of them did not pass, so I deleted them and rebuilt the database with the GenomicsDBImport command.

    Then I tried to run the gatk GenotypeGVCFs command, but it stops suddenly without any error or message. It just returns to the prompt.

    Here is the code I am using. It may be important to mention I am running GATK from the Docker container.

    gatk --java-options "-Xms15G -Xmx15G -XX:ParallelGCThreads=2" GenotypeGVCFs -R ../../../5.Ref_genome/Matina/GCA_000403535-chromosomes.fasta -V gendb://matina.genomicsDB -O matina.genotypes.vcf.gz -OVI true

    At first I thought the problem might be memory use: even though I specified a maximum of 15 GB (I have 50 GB available), the system showed the process using all 50 GB. I therefore reduced the settings to -Xms2G -Xmx2G, but the problem remained.

    It starts running correctly, increasing the use of RAM progressively until it is using it all. I repeated this many times and it always stops around 8-10 hours. There are 10 chromosomes, and it usually stops at the 4th chromosome.

    Sorry if this changes the whole problem I began with.
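
The validation step described above could be scripted roughly as follows. This is only a sketch: the helper names `run` and `validate_gvcfs` and the file names in the usage comment are made up for illustration, and it assumes the `-gvcf` option of ValidateVariants is appropriate for these files. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Validate each gVCF against the reference and flag the ones that fail.
# Arguments: reference FASTA, then one or more gVCF paths.
# validate_gvcfs is a hypothetical helper name, not a GATK command.
validate_gvcfs() {
    local ref="$1"
    shift
    local f
    for f in "$@"; do
        run gatk ValidateVariants -R "$ref" -V "$f" -gvcf || echo "FAILED: $f"
    done
}

# Hypothetical usage:
# DRY_RUN=0 validate_gvcfs reference.fasta sample1.g.vcf.gz sample2.g.vcf.gz
```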

  • Gökalp Çelik

    Hi Julio Bonilla

    What is the ploidy of your samples?

  • Julio Bonilla

    Hi Gökalp Çelik,

    It is diploid. I am working with Theobroma cacao.

     

  • Gökalp Çelik

    Hi again.

    Can you try running it on single chromosomes by providing the parameter -L to see if any particular chromosome is causing the issue?

    There could be a memory leak, but that is not expected: our tests with human data always use sensible amounts of memory and complete the whole task. One thing we do is run GenotypeGVCFs in shards split at hardmasked regions.

    Let us know how it goes.

    Regards. 
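
A per-chromosome run with -L, as suggested above, might look like the sketch below. The helper names, the output prefix, and the contig names (chr1, chr2) are placeholders; the real names must match the contigs in the reference FASTA. Java memory options are left out to keep the printed commands simple. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run GenotypeGVCFs once per chromosome, restricting each run with -L.
# Arguments: reference FASTA, GenomicsDB workspace, output prefix, contig names.
# genotype_per_chrom is a hypothetical helper name, not a GATK command.
genotype_per_chrom() {
    local ref="$1" db="$2" prefix="$3"
    shift 3
    local chrom
    for chrom in "$@"; do
        run gatk GenotypeGVCFs \
            -R "$ref" \
            -V "gendb://$db" \
            -L "$chrom" \
            -O "$prefix.$chrom.vcf.gz"
    done
}

# Hypothetical usage (contig names are placeholders):
# DRY_RUN=0 genotype_per_chrom reference.fasta matina.genomicsDB matina chr1 chr2
```

If one chromosome fails while the others finish, that narrows the problem to the data for that contig.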

  • Julio Bonilla

    Hi Gökalp Çelik,

    I tried what you suggested and it worked. I ran the GenotypeGVCFs command on each chromosome separately and the process ran smoothly. I then used the GatherVcfs command to create a single VCF file and continued with variant filtering.

    Thank you for solving this problem. I greatly appreciate it.

    JB
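
For readers following the same route, the gather step could be sketched as below. GatherVcfs expects its inputs in reference order with no overlapping records, so the per-chromosome files are passed in chromosome order. The helper names and file names are illustrative only. By default the command is only printed; set DRY_RUN=0 to execute it.

```shell
#!/usr/bin/env bash
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Concatenate per-chromosome VCFs into one file with GatherVcfs.
# Arguments: output path, then input VCFs in reference order.
# gather_vcfs is a hypothetical helper name, not a GATK command.
gather_vcfs() {
    local out="$1"
    shift
    local n=$#
    local f
    # Append "-I <file>" for every input, then drop the original arguments.
    for f in "$@"; do
        set -- "$@" -I "$f"
    done
    shift "$n"
    run gatk GatherVcfs "$@" -O "$out"
}

# Hypothetical usage (file names are placeholders):
# DRY_RUN=0 gather_vcfs matina.genotypes.vcf.gz matina.chr1.vcf.gz matina.chr2.vcf.gz
```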

