Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data



Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDBImport and CombineGVCFs do not show variants from ~500 Mbp onwards, although GVCF files from HaplotypeCaller report variants

9 comments

  • Pamela Bretscher

    Hi Hanin,

    It's puzzling that you were able to get variants from HaplotypeCaller beyond the ~520 Mbp region, but that there seems to be a length limit somewhere in the pipeline preventing you from getting variants beyond this point from GenomicsDB. Some index formats apparently have a limit of ~536 Mbp on single-chromosome length, which may be causing this issue. Could you please try running SelectVariants on one of your VCF files, specifying an interval beyond about 540 Mbp, to see whether you get any variants? The GATK team may need to look further into which point in the pipeline runs into this length limit, but the solution for you may be to split your chromosomes into smaller units to get all variants from GenomicsDB.
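
    For example, something like the following (file names and the exact interval are placeholders to adapt to your data):

        # query a region beyond the suspected ~536 Mbp index limit
        gatk SelectVariants \
            -R reference.fasta \
            -V sample.g.vcf.gz \
            -L chr1:540000000-560000000 \
            -O subset.vcf.gz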

    Kind regards,

    Pamela

  • Hanin

    Hi Pamela,

    The problem is in the zipped (g.vcf.gz) files: for big genomes (chromosomes larger than ~500 Mbp), the TBI index format can only handle chromosomes up to ~530 Mbp. Indexing with CSI is the option for genomes with large chromosomes, but unfortunately CombineGVCFs and GenomicsDBImport don't accept it.

    The only solution I found is to work with unzipped GVCF files (g.vcf).

    Indexing with tabix is mandatory for gzipped VCF files (g.vcf.gz); if the files are unzipped, TBI indexing is not required. And it seems that GenomicsDBImport and CombineGVCFs work fine with unzipped GVCF files.

    Of course, the .idx index is still necessary, but a .tbi is not required for unzipped VCF files.
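
    A minimal sketch of that workflow, assuming GATK 4 (file names are placeholders):

        # decompress the GVCFs and build the .idx indexes GATK expects
        gunzip sample1.g.vcf.gz sample2.g.vcf.gz
        gatk IndexFeatureFile -I sample1.g.vcf
        gatk IndexFeatureFile -I sample2.g.vcf

        # CombineGVCFs then works on the plain-text GVCFs
        gatk CombineGVCFs \
            -R reference.fasta \
            -V sample1.g.vcf \
            -V sample2.g.vcf \
            -O combined.g.vcf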
     

    Thanks!

  • Pamela Bretscher

    Hi Hanin,

    I'm glad that you were able to successfully work with the unzipped g.vcf files for your analysis. I do believe that downstream analysis in GATK should be compatible with CSI indexing, as this is a common solution for people working on large chromosomes. Please let me know if you have any questions or need my help with anything else.

    Kind regards,

    Pamela

  • Hanin

    Hi Pamela,
    Thank you very much!
    CSI indexing is apparently compatible with HaplotypeCaller: indexing BAM files with CSI works fine (BAI indexing for BAM files has the limitation for large chromosomes). However, the CSI index does not help with GenomicsDBImport or CombineGVCFs for larger chromosomes; there it's still only the TBI index that is accepted, which has the limitation on chromosome size. See:

    https://github.com/broadinstitute/gatk/issues/6110
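
    For reference, CSI indexes can be created like this (file names are placeholders; requires a reasonably recent samtools/htslib):

        # CSI handles contigs longer than the ~536 Mbp (2^29) BAI/TBI limit
        samtools index -c input.bam            # writes input.bam.csi
        tabix --csi -p vcf sample.g.vcf.gz     # writes sample.g.vcf.gz.csi

    As noted above, though, GenomicsDBImport and CombineGVCFs will not read the .csi files.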

  • Pamela Bretscher

    Hi Hanin,

    Okay, I see, thank you for explaining. Unfortunately, I am not aware of any current method for getting around the indexing issues with larger chromosomes. The GATK team does not support samtools or other indexing methods, but there may be other users who can provide some insight into solutions for working with large genomes. It seems that you were still able to complete your analysis using the unzipped files; however, I can submit a ticket if you would like the GATK team to look into potential workarounds for you.

    Kind regards,

    Pamela

  • Chen-Jui Yang

    Does the later version (4.3) of GATK support CSI indexing in GenomicsDBImport or CombineGVCFs now?

    I have a similar question when using AddOrReplaceReadGroups in Picard. It stopped running when it reached the length limit (536,083,229 bp in my case) and showed the following message:

    Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index E100059690L1C011R02001607887 1/2 150b aligned to chr1.a:536870772-536870921.
            at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:141)
            at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
            at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:185)
            at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
            at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
            at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
    Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record E100059690L1C011R02001607887 1/2 150b aligned to chr1.a:536870772-536870921.
            at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:132)
            at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:138)
            ... 5 more
    Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.
            at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1708)
            at htsjdk.samtools.BAMIndexer$BAMIndexBuilder$1.getIndexingBin(BAMIndexer.java:250)
            at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:96)
            at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:238)
            at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:130)
            ... 6 more

    I could probably add the @RG line through samtools, but I suspect the following analyses may run into a similar length issue.
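
    A sketch of the samtools alternative (read-group fields and file names are placeholders): samtools does not build a BAI while writing, and the output can then be CSI-indexed explicitly:

        # add/replace the read group without triggering BAI creation
        samtools addreplacerg \
            -r 'ID:sample1' -r 'SM:sample1' -r 'PL:ILLUMINA' \
            -o with_rg.bam input.bam

        # index with CSI, which supports the long contig
        samtools index -c with_rg.bam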

     

  • Louis Bergelson

    Neither GATK nor GenomicsDB supports CSI indexes. To process long contigs you unfortunately have to split them up. It's something we've wanted to add support for, but it's low priority since we're primarily focused on human genomics and we don't have sufficient resources to do it right now.
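
    One way to split a long contig, sketched with samtools (coordinates, contig names, and file names are placeholders; reads must be re-aligned to the split reference, and variant coordinates on the second piece must be shifted back afterwards):

        # cut the long contig into sub-536 Mbp pieces
        samtools faidx reference.fasta 'chr1.a:1-500000000' > chr1.a_part1.fasta
        samtools faidx reference.fasta 'chr1.a:500000001-600000000' > chr1.a_part2.fasta

        # give each piece its own contig name
        sed -i 's/^>.*/>chr1.a_part1/' chr1.a_part1.fasta
        sed -i 's/^>.*/>chr1.a_part2/' chr1.a_part2.fasta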

  • ritesh yadav

    Hi Hanin, 

    You wrote: "The problem is in the zipped (g.vcf.gz) files: for big genomes (chromosomes larger than ~500 Mbp), the TBI index format can only handle chromosomes up to ~530 Mbp. Indexing with CSI is the option for genomes with large chromosomes, but unfortunately CombineGVCFs and GenomicsDBImport don't accept it."

    Were you able to figure out how to make the GATK pipeline work for large genomes like wheat? If yes, could you please share how you optimized the process?

     

  • Gökalp Çelik

    Hi ritesh yadav,

    You may wish to check options such as GLnexus to combine wheat GVCFs and genotype them. GLnexus uses htslib to read files, so a CSI index is not a problem for it. We cannot provide support for GLnexus since it is not our product, but currently it may be your only option.
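
    A minimal sketch of a GLnexus run (third-party tool; file names are placeholders, and the exact invocation may differ by version):

        # GLnexus reads GVCFs through htslib, so CSI-indexed .g.vcf.gz files work
        glnexus_cli --config gatk sample1.g.vcf.gz sample2.g.vcf.gz > combined.bcf
        bcftools view combined.bcf -Oz -o combined.vcf.gz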

    Another option would be to split the genome contigs into smaller pieces and later combine the variants into a single VCF, but this may take some tooling and scripting your way in and out, so your mileage may vary.

    Regards. 

