Difference between merge-input-intervals and merge-contigs-into-num-partitions for GenomicsDBImport
I am following the GATK Best Practices pipeline. Having previously run HaplotypeCaller with -ERC GVCF, I am now running GenomicsDBImport for each of the 11 chromosomes and for all scaffolds in the reference genome. There are over 1,000 scaffolds, but their total length is relatively small (23,244,284 bp), so I want to group all of the scaffolds into a single import. The job script I used is below.
Based on what I read online, I used --merge-contigs-into-num-partitions 1 to achieve this grouping; however, the log (also below) suggests that I should have used --merge-input-intervals instead. Can anyone explain the difference between these options and which one is more appropriate in my situation?
Thank you in advance!
a) GATK version used:
4.2.6.1
b) Exact command used:
Job script:
#!/bin/bash
#SBATCH --partition=x
#SBATCH --job-name=scaffold_import
#SBATCH --output=%x_%j.out # Output file (stdout)
#SBATCH --error=%x_%j.err # Error file (stderr)
#SBATCH --mail-type=ALL # Email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=x # Email address for notifications
#SBATCH --time=7-0:00:00 # job time limit D-HH:MM:SS
#SBATCH --nodes=1 # Number of nodes per instance
#SBATCH --ntasks-per-node=1 # Number of tasks per node
#SBATCH --cpus-per-task=8 # Number of cores per task (threads)
#SBATCH --mem=64G # Memory usage
# Arguments:
# $1 - Path to the scaffold list file
# $2 - Path to the sample ID & vcf list file
# Load required modules
module load gatk/4.2.6.1
# Source for scratch/output (be sure to run 00_prepare_directories.sh first)
source ./scratch_env.sh
# Capture the arguments passed to this script
SCAF_LIST=$1
SAMPLE_MAP=$2
# Set directory variables
TMP_DIR=$SCRATCH_DIR/tmp
RESULTS_DIR=./genomics_db
SCAF_DIR=$RESULTS_DIR/scaffolds
# Make directories
mkdir -p $TMP_DIR $RESULTS_DIR
# Run GenomicsDBImport for the scaffold intervals
gatk --java-options "-Xmx52g -Xms52g" \
GenomicsDBImport \
--genomicsdb-workspace-path $SCAF_DIR \
--batch-size 50 \
-L $SCAF_LIST \
--sample-name-map $SAMPLE_MAP \
--merge-contigs-into-num-partitions 1 \
--tmp-dir $TMP_DIR \
--reader-threads 8
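For reference, the scaffold list passed via -L is assumed here to be a plain-text intervals file with one whole scaffold per line. A minimal sketch of how such a .list file could be built from the reference FASTA index, assuming the index is named reference.fa.fai and the scaffold names share a "scaffold" prefix (both hypothetical; adjust to the actual reference):
# Build a GATK-style .list intervals file with one whole scaffold per line.
# "reference.fa.fai" and the "scaffold" name prefix are assumptions; adjust
# them to match the actual reference and contig naming.
awk '$1 ~ /^scaffold/ {print $1}' reference.fa.fai > scaffolds.list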
c) Entire program log:
14:13:55.145 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/weka/apps/gatk/4.2.6.1/gatk-4.2.6.1/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:13:55.289 INFO GenomicsDBImport - ------------------------------------------------------------
14:13:55.289 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.6.1
14:13:55.289 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
14:13:55.290 INFO GenomicsDBImport - Executing as brianna.banting@cn130 on Linux v5.14.0-427.42.1.el9_4.x86_64 amd64
14:13:55.290 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_92-b14
14:13:55.290 INFO GenomicsDBImport - Start Date/Time: 12 March 2025 14:13:55 PDT
14:13:55.290 INFO GenomicsDBImport - ------------------------------------------------------------
14:13:55.290 INFO GenomicsDBImport - ------------------------------------------------------------
14:13:55.290 INFO GenomicsDBImport - HTSJDK Version: 2.24.1
14:13:55.290 INFO GenomicsDBImport - Picard Version: 2.27.1
14:13:55.291 INFO GenomicsDBImport - Built for Spark Version: 2.4.5
14:13:55.291 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:13:55.291 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:13:55.291 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:13:55.291 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:13:55.291 INFO GenomicsDBImport - Deflater: IntelDeflater
14:13:55.291 INFO GenomicsDBImport - Inflater: IntelInflater
14:13:55.291 INFO GenomicsDBImport - GCS max retries/reopens: 20
14:13:55.291 INFO GenomicsDBImport - Requester pays: disabled
14:13:55.291 INFO GenomicsDBImport - Initializing engine
14:13:55.660 INFO IntervalArgumentCollection - Processing 23244284 bp from intervals
14:13:55.662 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
14:13:55.664 INFO GenomicsDBImport - Done initializing engine
14:13:55.880 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
14:13:55.884 INFO GenomicsDBImport - Vid Map JSON file will be written to /weka/data/lab/porter/gen_cost_dom_B/phaseolus_vulgaris/./genomics_db/scaffolds/vidmap.json
14:13:55.884 INFO GenomicsDBImport - Callset Map JSON file will be written to /weka/data/lab/porter/gen_cost_dom_B/phaseolus_vulgaris/./genomics_db/scaffolds/callset.json
14:13:55.884 INFO GenomicsDBImport - Complete VCF Header will be written to /weka/data/lab/porter/gen_cost_dom_B/phaseolus_vulgaris/./genomics_db/scaffolds/vcfheader.vcf
14:13:55.884 INFO GenomicsDBImport - Importing to workspace - /weka/data/lab/porter/gen_cost_dom_B/phaseolus_vulgaris/./genomics_db/scaffolds
14:13:55.884 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
14:13:57.061 INFO GenomicsDBImport - Importing batch 1 with 50 samples
-
Your situation matches exactly the use case for the parameter the tool recommends. The option you used, --merge-contigs-into-num-partitions, only applies when the intervals supplied are whole contigs, whereas --merge-input-intervals is intended for the case where many intervals are supplied at once. The imported data will be the same whether or not you use the recommended parameter; --merge-input-intervals merges all of the intervals into a single import operation, so it may give you a performance boost, although the gain depends on the number of variants.
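As a minimal sketch, assuming the same variables and paths as in your job script, the call with the recommended flag would look like this (the only change is replacing --merge-contigs-into-num-partitions 1 with --merge-input-intervals):
# Same import as in the job script, but aggregating the scaffold intervals
# with --merge-input-intervals instead of --merge-contigs-into-num-partitions.
gatk --java-options "-Xmx52g -Xms52g" \
GenomicsDBImport \
--genomicsdb-workspace-path $SCAF_DIR \
--batch-size 50 \
-L $SCAF_LIST \
--sample-name-map $SAMPLE_MAP \
--merge-input-intervals \
--tmp-dir $TMP_DIR \
--reader-threads 8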
I hope this helps.
Regards.