Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Issue with GenomicsDBImport on very large chromosomes

3 comments

  • Louis Bergelson

    Hello,

    You've unfortunately hit an annoying problem that's hard to work around. A .bai index limits chromosome size to 2^29-1 (536870911 bp) and can't handle anything longer than that. The CSI index format supports longer chromosomes, but we don't have good support for it in GATK. Splitting the chromosome is the right idea, but you have to split it at the alignment stage into two differently named contigs; unfortunately, you can't simply import half of it at a time into GenomicsDB. I'm sorry to bring bad news. I know that's a huge hassle since you already have the BAMs.

    Louis

  • Shaun Clare

    I did find this post, where someone says they found a workaround using unzipped g.vcf files, but I'm not sure if that's just to get into GenomicsDB. I was working with unzipped g.vcf files anyway and it didn't work.

    https://gatk.broadinstitute.org/hc/en-us/community/posts/4407400443803-GenomicsDBimport-and-CombineGVCF-does-not-show-variants-at-500-Mbp-onwards-although-gvcf-files-from-HapolypeCaller-report-variants

    That's unfortunate. It'll be a big pain to adjust all the positions afterwards to get the real positions, but thank you for the clarification.

  • Louis Bergelson

    It's possible there is a workaround using unzipped VCFs, since they use a different index format that might have different limits. Unzipped gVCFs are generally so unwieldy that I don't have much experience with them.

    I'm sorry it's such a pain. Creating a liftover file between the split reference and the actual reference might be a good solution, but you'd probably have to use non-GATK/Picard tools to handle it. I'd like to fix it, but it's a non-trivial change and we have a lot of higher-priority issues ahead of it.
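
The position adjustment discussed above amounts to adding a fixed offset per split contig. The following is a minimal Python sketch of that idea, not a GATK or Picard feature. The contig names (`chr1_part1`, `chr1_part2`), the split point, and the helper names `to_original` and `lift_vcf_line` are all hypothetical and only serve the illustration.

```python
# Hypothetical split: the original "chr1" was cut after base 400,000,000.
# Each entry maps a split contig name to (original contig, offset), where
# offset is the number of bases that precede that part in the original.
SPLIT_MAP = {
    "chr1_part1": ("chr1", 0),
    "chr1_part2": ("chr1", 400_000_000),
}


def to_original(contig, pos, split_map=SPLIT_MAP):
    """Map a 1-based position on a split contig back to the original contig."""
    if contig not in split_map:
        return contig, pos  # this contig was never split
    original, offset = split_map[contig]
    return original, pos + offset


def lift_vcf_line(line, split_map=SPLIT_MAP):
    """Rewrite CHROM and POS of one VCF data line; header lines pass through."""
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    chrom, pos = to_original(fields[0], int(fields[1]), split_map)
    fields[0], fields[1] = chrom, str(pos)
    return "\t".join(fields)
```

This sketch only touches the CHROM and POS columns. A real conversion would also need to shift any `END=` values in the INFO column (gVCF reference blocks use them) and rewrite the `##contig` header lines, and the result could only be indexed with a format that supports the full chromosome length.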

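
To see in advance which contigs run into the 2^29-1 limit on .bai indexes mentioned in the first reply, one option is to scan the reference's .fai index before aligning. The sketch below assumes a standard `samtools faidx` index (tab-separated, contig name in the first column and length in the second); the function name `contigs_over_bai_limit` is made up for this example.

```python
BAI_MAX_LENGTH = 2**29 - 1  # 536870911 bp, the .bai limit quoted in the thread


def contigs_over_bai_limit(fai_lines, limit=BAI_MAX_LENGTH):
    """Return (name, length) for every contig longer than `limit`.

    `fai_lines` is any iterable of lines from a .fai file, for example
    an open file handle.
    """
    too_long = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], int(fields[1])
        if length > limit:
            too_long.append((name, length))
    return too_long
```

Typical use would be `with open("reference.fa.fai") as fh: print(contigs_over_bai_limit(fh))`, where the file name is a placeholder. Any contig it reports would need to be split into differently named contigs before alignment, as described in the thread.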
