Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenomicsDBimport and CombineGVCF does not show variants at ~500 Mbp onwards, although gvcf files from HapolypeCaller report variants

0

5 comments

  • Avatar
    Pamela Bretscher

    Hi Hanin,

    It's puzzling that you were able to get variants from HaplotypeCaller beyond the ~520bp region but that there seems to be a length limit somewhere in the pipeline that is preventing you from getting variants beyond this point from GenomicsDB. Some index formats apparently have a limit of 536mbp for single chromosome length which may be causing this issue. Could you please try running SelectVariants on one of your vcf files and specifying an interval beyond about 540bp to see if you are able to get any variants? The GATK team may need to look further into which point in the pipeline is running into this length limit, but it may be that the solution for you would be to split your chromosomes into smaller units to get all variants from GenomicsDB. 

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Hanin

    Hi Pamela,

    The problem is in the zipped (g.vcf.gz) files, for big genomes (chromosomes with size of > 500 mbp), the tbi index format can handle chromosomes up to ~ 530 Mbp. Indexing with csi is the option for large genomes with large chromosomes, but unfortunately, CombineGVCF and GenomicsDB don't accept it.

    The only solution I found, is to work with unzipped g.vcf files (g.vcf)

    Indexing with tabix is mandatory for gzipped vcf files (g.vcf.gz).
    If files are unzipped, tbi indexing is not required. And it seems that GenomicsDB and combineGVCFs work ok with unzipped gVCF files.

    Of course, the .idx is necessary. But .tbi is not required for unzipped vcf files.
     

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Hanin,

    I'm glad that you were able to successfully work with the unzipped g.vcf files for your analysis. I do believe that downstream analysis in GATK should be compatible with csi indexing, as this is a common solution for people working on large chromosomes. Please let me know if you have any questions or need my help with anything else. 

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Hanin

    Hi Pamela,
    Thank you very much!
    csi indexing apparently compatible for HaplotypeCaller, indexing bam files with CSI works ok (bai indexing for bam file, has the limitation for large chromosome). However,
    csi index is not helping in GenomicsDB or CombineGVCF for larger chromosome and it’s still that only tbi index is acceptable, which has the limitation for larger chromosome size, check this:

    https://github.com/broadinstitute/gatk/issues/6110

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi Hanin,

    Okay, I see, thank you for explaining. Unfortunately, I am not aware of any current method for getting around the issues with larger chromosomes and indexing. The GATK team does not support samtools or other indexing methods, but there may be other users that can provide some insight on solutions to working with large genomes. It seems that you were able to still complete your analysis using the unzipped files, however, I can submit a ticket if you would like for the GATK team to look into potential workarounds for you to use. 

    Kind regards,

    Pamela

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk