GenomicsDBImport and CombineGVCFs do not show variants from ~500 Mbp onwards, although the GVCF files from HaplotypeCaller report variants
I am calling SNPs from whole-genome sequencing of 219 diploid wheat samples.
The genome size is ~5.2 Gbp, and all 7 chromosomes are longer than 600 Mbp.
HaplotypeCaller runs smoothly without any errors. However, when I run GenomicsDBImport, the chunks from ~520 Mbp onwards on every chromosome do not import any information from the GVCFs (without showing any error in the log file) and therefore show no variants in the genotyped VCF files, even though the GVCF files from HaplotypeCaller contain variants there.
On the other hand, the other intervals (from the start of each chromosome up to ~500 Mbp) import successfully, and we can see variants.
I used two different GATK versions: 4.1.8.0 and the latest one (4.2.2.0).
The same issue happens with CombineGVCFs: no variants in any chromosome interval from ~500 Mbp onwards.
What could be wrong? The GVCF files from HaplotypeCaller look normal.
Command line for HaplotypeCaller:
time -p gatk HaplotypeCaller --reference $REF --input $BAM/$PREFIX.rgroup.bam --native-pair-hmm-threads $CORES --emit-ref-confidence GVCF --output $VCF/$PREFIX.snps.indels.g.vcf
Command line for GenomicsDBImport:
time -p gatk GenomicsDBImport --variant $INPUT --genomicsdb-workspace-path $gVCF/$ChrName.$size --intervals $ChrName:$Start-$End --reader-threads $CORES
Command line for GenotypeGVCFs:
time -p gatk GenotypeGVCFs --variant gendb://$gVCF/$ChrName.$size --reference $REF --output $gVCF/$ChrName.$size.vcf.gz --intervals $ChrName:$Start-$End
-
Hi Hanin,
It's puzzling that you are able to get variants from HaplotypeCaller beyond the ~520 Mbp region, but that there seems to be a length limit somewhere in the pipeline preventing you from getting variants beyond this point from GenomicsDB. Some index formats apparently have a single-chromosome length limit of ~536 Mbp, which may be causing this issue. Could you please try running SelectVariants on one of your VCF files, specifying an interval beyond about 540 Mbp, to see whether you get any variants? The GATK team may need to look further into which point in the pipeline runs into this length limit, but the solution for you may be to split your chromosomes into smaller units to get all variants from GenomicsDB.
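For example, a minimal sketch of that check, following the command style above (the interval bounds and output name are illustrative):
# Query a region beyond the suspected ~536 Mbp index limit:
gatk SelectVariants --reference $REF --variant $VCF/$PREFIX.snps.indels.g.vcf --intervals $ChrName:540000000-600000000 --output $VCF/$PREFIX.region_check.vcf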
Kind regards,
Pamela
-
Hi Pamela,
The problem is in the bgzipped (g.vcf.gz) files: for big genomes, the .tbi index format can only handle chromosomes up to 2^29-1 bp (~536 Mbp). Indexing with .csi is the usual option for genomes with large chromosomes, but unfortunately CombineGVCFs and GenomicsDBImport don't accept it.
The only solution I found is to work with unzipped GVCF files (g.vcf). Tabix indexing is mandatory only for bgzipped files (g.vcf.gz); if the files are unzipped, a .tbi index is not required, and GenomicsDBImport and CombineGVCFs seem to work fine with unzipped GVCFs. Of course, the .idx file is still necessary, but .tbi is not required for unzipped VCF files.
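A minimal sketch of that workaround, assuming a recent htslib (paths follow the commands above):
# Decompress the bgzipped GVCF; -k keeps the original .g.vcf.gz:
bgzip -d -k $VCF/$PREFIX.snps.indels.g.vcf.gz
# Build the .idx index GATK uses for plain-text feature files
# (older GATK 4 releases take -F/--feature-file instead of -I):
gatk IndexFeatureFile -I $VCF/$PREFIX.snps.indels.g.vcf
# For comparison, tabix can build a .csi for the bgzipped file,
# but CombineGVCFs/GenomicsDBImport will not read it:
tabix --csi -p vcf $VCF/$PREFIX.snps.indels.g.vcf.gz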
Thanks!
-
Hi Hanin,
I'm glad that you were able to successfully work with the unzipped g.vcf files for your analysis. I do believe that downstream analysis in GATK should be compatible with CSI indexing, as this is a common solution for people working on large chromosomes. Please let me know if you have any questions or need my help with anything else.
Kind regards,
Pamela
-
Hi Pamela,
Thank you very much!
CSI indexing is apparently compatible with HaplotypeCaller: indexing BAM files with CSI works fine (BAI indexing of BAM files has the same limitation for large chromosomes). However, a CSI index does not help with GenomicsDBImport or CombineGVCFs on larger chromosomes; only a .tbi index is accepted there, and that has the chromosome-size limitation.
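A minimal sketch of the BAM side, assuming samtools is available:
# -c writes a .csi index in place of the size-limited .bai:
samtools index -c $BAM/$PREFIX.rgroup.bam
-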
Hi Hanin,
Okay, I see, thank you for explaining. Unfortunately, I am not aware of any current method for getting around the issues with larger chromosomes and indexing. The GATK team does not support samtools or other indexing methods, but there may be other users who can provide some insight on solutions for working with large genomes. It seems that you were still able to complete your analysis using the unzipped files; however, I can submit a ticket if you would like the GATK team to look into potential workarounds for you.
Kind regards,
Pamela
-
Does the later version of GATK (4.3) support CSI indexing in GenomicsDBImport or CombineGVCFs now?
I have a similar question about AddOrReplaceReadGroups in Picard. It stopped running when it reached the length limit (536,083,229 bp in my case) and showed the following message:
Exception in thread "main" htsjdk.samtools.SAMException: Exception when processing alignment for BAM index E100059690L1C011R02001607887 1/2 150b aligned to chr1.a:536870772-536870921.
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:141)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
    at picard.sam.AddOrReplaceReadGroups.doWork(AddOrReplaceReadGroups.java:185)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Caused by: htsjdk.samtools.SAMException: Exception creating BAM index for record E100059690L1C011R02001607887 1/2 150b aligned to chr1.a:536870772-536870921.
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:132)
    at htsjdk.samtools.BAMFileWriter.writeAlignment(BAMFileWriter.java:138)
    ... 5 more
Caused by: java.lang.IllegalStateException: Read position too high for BAI bin indexing.
    at htsjdk.samtools.SAMRecord.computeIndexingBin(SAMRecord.java:1708)
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder$1.getIndexingBin(BAMIndexer.java:250)
    at htsjdk.samtools.BinningIndexBuilder.processFeature(BinningIndexBuilder.java:96)
    at htsjdk.samtools.BAMIndexer$BAMIndexBuilder.processAlignment(BAMIndexer.java:238)
    at htsjdk.samtools.BAMIndexer.processAlignment(BAMIndexer.java:130)
    ... 6 more
I could probably add the @RG line through samtools instead, but I wonder whether the following analyses would hit a similar length issue.
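A hedged sketch of that samtools route (read-group fields and file names are illustrative):
# addreplacerg adds the @RG line without triggering BAI creation:
samtools addreplacerg -r 'ID:sample1' -r 'SM:sample1' -r 'PL:ILLUMINA' -o $PREFIX.rgroup.bam $PREFIX.bam
# index -c builds a CSI index, which tolerates contigs > 2^29 bp:
samtools index -c $PREFIX.rgroup.bam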
-
Neither GATK nor GenomicsDB supports CSI indexes. To process long contigs, you unfortunately have to split them up. It's something we've wanted to add support for, but it's low priority since we're primarily focused on human genomics and we don't have sufficient resources right now.
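As a hedged sketch, one way to split long contigs is to cut the reference itself and realign against the pieces (contig name and coordinates are illustrative; downstream coordinates would need to be lifted back afterwards):
# Extract sub-contigs below the ~536 Mbp index limit:
samtools faidx $REF chr1.a:1-500000000 > chr1.a_part1.fa
samtools faidx $REF chr1.a:500000001-600000000 > chr1.a_part2.fa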
-
Hi Hanin,
You wrote: "The problem is in the bgzipped (g.vcf.gz) files... the .tbi index format can only handle chromosomes up to ~536 Mbp... unfortunately CombineGVCFs and GenomicsDBImport don't accept [.csi]." Were you able to figure out how to make the GATK pipeline work for large genomes like wheat? If yes, could you please share how you optimized the process?
-
Hi ritesh yadav,
You may wish to check options such as GLnexus to combine your wheat GVCFs and genotype them. GLnexus uses htslib to read files, so a CSI index is not a problem for it. We cannot provide support for GLnexus since it is not our product, but currently it may be your only option.
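For example, a hedged sketch assuming the glnexus_cli binary and its GATK preset (file names are illustrative):
# GLnexus reads CSI-indexed .g.vcf.gz files and writes BCF to stdout:
glnexus_cli --config gatk sample1.g.vcf.gz sample2.g.vcf.gz > combined.bcf
# Convert to a bgzipped VCF with bcftools:
bcftools view combined.bcf -Oz -o combined.vcf.gz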
Another option would be to split the genome contigs into smaller pieces and later combine the variants into a single VCF, but this may take some tooling and scripting on your way in and out, so your mileage may vary.
Regards.