atk-4.1.4/gatk-126.96.36.199/gatk GenomicsDBImport --java-options "-Xmx50g -Xms50g -Djava.io.tmpdir=/data/work/tmp" --sample-name-map samples.block0.map -L block0.bed --batch-size 68 --tmp-dir=/data/work/tmp --genomicsdb-workspace-path cohort.block0
I'm running GenomicsDBImport on 68 samples for a genome of ~7.5GB consisting of 442020 contigs. I have split these contigs into 100 BED files (each containing about 4420 contigs) and am running one job per file. These 100 jobs have been running for over 10 days at this point. I have used GenomicsDBImport before on other large genomes (4.5GB, for instance) that consisted of full chromosomes instead of small contigs, and it finished much faster (hours rather than weeks).
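For illustration, a split like the one described above can be sketched as follows. This is only a minimal sketch under assumptions: it starts from a samtools-style FASTA index (the file name ref.fa.fai is hypothetical), assigns contigs to blocks in contiguous chunks, and writes whole-contig BED intervals. The function name split_contigs and the block<i>.bed naming are illustrative choices, not necessarily how the original BED files were made.

```shell
#!/usr/bin/env bash
# Sketch: split the contigs listed in a FASTA index (.fai) into about N BED files.
# Assumptions: the .fai path and the "<prefix><i>.bed" naming are hypothetical.
set -euo pipefail

split_contigs() {
    local fai="$1"      # FASTA index: contig name in column 1, length in column 2
    local nblocks="$2"  # requested number of BED files
    local prefix="$3"   # output prefix, e.g. "block" gives block0.bed, block1.bed, ...
    local total per

    total=$(wc -l < "$fai")
    # Contigs per block, rounded up so that at most nblocks files are written
    # (fewer files can result when total is small relative to nblocks).
    per=$(( (total + nblocks - 1) / nblocks ))

    awk -v per="$per" -v p="$prefix" 'BEGIN { OFS = "\t" }
        {
            out = p int((NR - 1) / per) ".bed"
            # Keep only one output file open at a time.
            if (out != prev) { if (prev != "") close(prev); prev = out }
            # BED is 0-based and half-open: a whole contig is [0, length).
            print $1, 0, $2 > out
        }' "$fai"
}

# Example (hypothetical file names):
#   split_contigs ref.fa.fai 100 block
```

Each resulting file can then be passed to -L in the same way as block0.bed in the command above.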
I have also tried --merge-input-intervals true, but that did not seem to speed anything up significantly. This might have to do with the intervals spanning the entire contig/genome, whereas an exome only covers a fraction of the entire genome.
Could anyone explain why this process is so much slower when there are more contigs, and is there anything I could do about this? Right now I'm testing gatk 4.1.6 to see if that resolves anything.
Thanks in advance!