GenomicsDBimport and CombineGVCF does not show variants at ~500 Mbp onwards, although gvcf files from HapolypeCaller report variants
I am calling SNPs from 219 diploid wheat samples of whole-genome sequencing.
The genome size is ~ 5.2 Gbp. And all the 7 chromosomes have a size of > 600 mbp.
The HaplotypeCaller runs smoothly without any errors. However, when I run GenomicDB, the chunks at ~ 520 Mbp onwards from all chromosomes didn't import the information from GVCF (without showing any error in the log file), and thus don't show any variants in the genotyped vcf files, even though the GVCF files from the haplotypeCaller contain variants.
On the other hand, importing from other intervals (from the start of chromosomes until ~ 500 Mbp) were successfully imported and we can see variants.
I used two different versions: 4.1.8.0 and the latest one (4.2.2.0)
The same issue happened when using CombineGVCF. No variants in all chromosome intervals at ~ 500 Mbp onwards.
What could be wrong? The haplotype GVCF files are normal.
Commandline for HaplotypCaller:
time -p gatk HaplotypeCaller --reference $REF --input $BAM/$PREFIX.rgroup.bam --native-pair-hmm-threads $CORES --emit-ref-confidence GVCF --output $VCF/$PREFIX.snps.indels.g.vcf
Commandline for GenomicDB:
time -p gatk GenomicsDBImport --variant $INPUT --genomicsdb-workspace-path $gVCF/$ChrName.$size --intervals $ChrName:$Start-$End --reader-threads $CORES
Commandline for GenotypeGVCFs:
time -p gatk GenotypeGVCFs --variant gendb://$gVCF/$ChrName.$size --reference $REF --output $gVCF/$ChrName.$size.vcf.gz --intervals $ChrName:$Start-$End
-
Hi Hanin,
It's puzzling that you were able to get variants from HaplotypeCaller beyond the ~520bp region but that there seems to be a length limit somewhere in the pipeline that is preventing you from getting variants beyond this point from GenomicsDB. Some index formats apparently have a limit of 536mbp for single chromosome length which may be causing this issue. Could you please try running SelectVariants on one of your vcf files and specifying an interval beyond about 540bp to see if you are able to get any variants? The GATK team may need to look further into which point in the pipeline is running into this length limit, but it may be that the solution for you would be to split your chromosomes into smaller units to get all variants from GenomicsDB.
Kind regards,
Pamela
-
Hi Pamela,
The problem is in the zipped (g.vcf.gz) files, for big genomes (chromosomes with size of > 500 mbp), the tbi index format can handle chromosomes up to ~ 530 Mbp. Indexing with csi is the option for large genomes with large chromosomes, but unfortunately, CombineGVCF and GenomicsDB don't accept it.
The only solution I found, is to work with unzipped g.vcf files (g.vcf)Indexing with tabix is mandatory for gzipped vcf files (g.vcf.gz).
If files are unzipped, tbi indexing is not required. And it seems that GenomicsDB and combineGVCFs work ok with unzipped gVCF files.Of course, the .idx is necessary. But .tbi is not required for unzipped vcf files.
Thanks!
-
Hi Hanin,
I'm glad that you were able to successfully work with the unzipped g.vcf files for your analysis. I do believe that downstream analysis in GATK should be compatible with csi indexing, as this is a common solution for people working on large chromosomes. Please let me know if you have any questions or need my help with anything else.
Kind regards,
Pamela
-
Hi Pamela,
Thank you very much!
csi indexing apparently compatible for HaplotypeCaller, indexing bam files with CSI works ok (bai indexing for bam file, has the limitation for large chromosome). However, csi index is not helping in GenomicsDB or CombineGVCF for larger chromosome and it’s still that only tbi index is acceptable, which has the limitation for larger chromosome size, check this: -
Hi Hanin,
Okay, I see, thank you for explaining. Unfortunately, I am not aware of any current method for getting around the issues with larger chromosomes and indexing. The GATK team does not support samtools or other indexing methods, but there may be other users that can provide some insight on solutions to working with large genomes. It seems that you were able to still complete your analysis using the unzipped files, however, I can submit a ticket if you would like for the GATK team to look into potential workarounds for you to use.
Kind regards,
Pamela
Please sign in to leave a comment.
5 comments