CombineGVCFs vs GenomicsDBImport for target sequencing data
REQUIRED for all errors and issues:
a) GATK version used: 4.5.0.0
b) Exact command used:
GATK_HEAP_SIZE=200g
REFERENCE="/scratch/genomics/holmk/bongo_genome/reference/HiCref_annot/barney_pseudo2.1_HiC.fasta"
rungatk CombineGVCFs -R $REFERENCE -O bongo_seqcap_raw.vcf \
--variant gvcfs.list
I am trying to combine my GVCF files from HaplotypeCaller. I first tried GenomicsDBImport, but it ran out of memory 10 days into the run. This variant calling is for target capture data, and I used my target BED file as the -L intervals, which may not have been correct.
mkdir -p temp/$JOB_ID
GATK_HEAP_SIZE=200g
Commands:
gatk GenomicsDBImport \
--genomicsdb-workspace-path genomicsworkspace \
--batch-size 50 \
--sample-name-map sample_map.txt \
--max-num-intervals-to-import-in-parallel 3 \
--tmp-dir temp/$JOB_ID \
--intervals ../../bwa-mem/SNPs_target_mod.bed
So I tried using CombineGVCFs instead because of the memory issue, and after 3 days I got this error in the log file.
22:55:35.867 INFO CombineGVCFs - Done initializing engine
22:55:36.181 INFO ProgressMeter - Starting traversal
22:55:36.182 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
22:55:38.265 WARN CombineGVCFs - Error: The requested interval contained no data in source VCF files
22:55:38.266 INFO ProgressMeter - unmapped 0.0 0 0.0
22:55:38.266 INFO ProgressMeter - Traversal complete. Processed 0 total variants in 0.0 minutes.
22:55:38.266 WARN CombineGVCFs - Error: The requested interval contained no data in source VCF files
22:55:38.302 INFO CombineGVCFs - Shutting down engine
[September 16, 2024 at 10:55:38 PM EDT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 4,897.95 minutes.
Runtime.totalMemory()=23175397376
= Mon Sep 16 22:55:38 EDT 2024 job bongo_combinegVCF done
I did not use an interval list for this run, and I am not actually sure what the interval list should be: I have over 35,000 contigs, some of them chromosome length. It did produce a VCF file; I am just not sure it is correct.
Thanks for any insight! Karen
-
Hi Karen Holm,
Does your organism of interest have contigs longer than 2^29 bases? If so, the bad news is that none of our tools can work with contigs that long unless you split them into smaller parts, which may not be something you want to do.

Also, 35,000 contigs is far too many for these tools to handle at the same time. What we recommend is importing variants into a separate import instance per contig (or per part of a contig); the process will run much faster if those imports are also run simultaneously. Additionally, if your ploidy is greater than 2, we recommend trying to reduce the number of alleles per site so that importing finishes with a sane duration and sane resources: the higher the ploidy, the slower and more resource-hungry the import will be.

Also, don't forget to leave plenty of memory for the native GenomicsDBImport library, as it works outside of the Java heap and may fail if there is not enough memory spared for it.
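To make the per-contig approach more concrete, here is a rough sketch using the reference path and sample_map.txt from your commands above. The workspace names, the -Xmx value, and the serial loop are only illustrative; in practice you would submit the contigs as separate or array jobs so the imports run at the same time.

#!/bin/bash
# Rough sketch only: per-contig GenomicsDBImport runs.
REFERENCE="/scratch/genomics/holmk/bongo_genome/reference/HiCref_annot/barney_pseudo2.1_HiC.fasta"

# One whole-contig interval per line, taken from the reference .fai index.
cut -f1 "${REFERENCE}.fai" > contigs.list

while read -r CONTIG; do
    # Keep -Xmx well below the memory requested from the scheduler so the
    # native GenomicsDB library has room to work outside the Java heap.
    gatk --java-options "-Xmx24g" GenomicsDBImport \
        --genomicsdb-workspace-path "genomicsdb_${CONTIG}" \
        --sample-name-map sample_map.txt \
        --batch-size 50 \
        -L "${CONTIG}" \
        --tmp-dir "temp/${JOB_ID}"
done < contigs.list

# Each workspace can then be genotyped on its own, e.g.:
#   gatk GenotypeGVCFs -R "$REFERENCE" -V gendb://genomicsdb_<contig> -O <contig>.vcf.gz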
Regards.