GenomiCDBImport issue
REQUIRED for all errors and issues:
a) GATK version used:GATK4
This script is asking me some more parameter. Kindly check the script and suggest.
#!/bin/bash
# "combined_GVCF_genomicDB_normal_rns_sh" RN Sarma 17.6.2024
# Define directories
GVCF_DIR="/home/rnsarma/Desktop/Mungbean/GVCF_output"
REF_GENOME="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
OUTPUT_DIR="/home/rnsarma/Desktop/Mungbean/Combined_GVCF"
DB_PATH="${OUTPUT_DIR}/genomicsdb"
LOG_FILE="${OUTPUT_DIR}/script_output.log"
# Redirect stdout and stderr to log file
exec > >(tee -i $LOG_FILE)
exec 2>&1
# Create output directory if it doesn't exist
mkdir -p $OUTPUT_DIR
# Generate the sample map file
SAMPLE_MAP="${OUTPUT_DIR}/cohort.sample_map"
> $SAMPLE_MAP # Ensure the file is empty before appending
for FILE in $GVCF_DIR/*_raw_variants.g.vcf; do
SAMPLE_NAME=$(basename $FILE _raw_variants.g.vcf)
echo -e "$SAMPLE_NAME\t$FILE" >> $SAMPLE_MAP
done
# Define the intervals
for INTERVAL in {1..11};
do
# Load the necessary environment
source /usr/local/bin/bioconda3.rc
conda activate gbs
# Run GenomicsDBImport
gatk GenomicsDBImport \
--genomicsdb-workspace-path ${DB_PATH}_${INTERVAL} \
--batch-size 50 \
-L $INTERVAL \
--sample-name-map $SAMPLE_MAP \
--tmp-dir /tmp \
--reader-threads 8
# Check if GenomicsDBImport was successful
if [ $? -ne 0 ]; then
echo "GenomicsDBImport failed for interval $INTERVAL"
exit 1
fi
# Combine GVCFs into a single VCF
gatk GenotypeGVCFs \
-R $REF_GENOME \
-V gendb://${DB_PATH}_${INTERVAL} \
-O ${OUTPUT_DIR}/combined_${INTERVAL}.vcf
# Check if GenotypeGVCFs was successful
if [ $? -ne 0 ]; then
echo "GenotypeGVCFs failed for interval $INTERVAL"
exit 1
fi
# Print a message indicating the completion of the current interval
echo "Completed processing interval $INTERVAL"
done
# Print a message indicating job completion
echo "All intervals processed and combined VCF files generated"
-
Can you post the log generated by the tool so that we can check?
-
The script ended abruptly with this line at the end "
A USER ERROR has occurred: genomicsdb-workspace is not a recognized option
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace. -
It may be due to a typo in the script causing parameters to be interpretted/passed improperly. If not it could be a python interpreter issue which causes parameters not to be parsed properly. Can you share more information about what java and python versions are active in this compute environment?
-
Java version: openjdk version "11.0.23" 2024-04-16
OpenJDK Runtime Environment (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)Python version: Python 3.10.12
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/bioconda3/envs/gbs/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar -version
The Genome Analysis Toolkit (GATK) v4.5.0.0
HTSJDK Version: 4.1.0
Picard Version: 3.1.1.The script used:
#!/bin/bash
# "combined_GVCF_genomicDB_normal_rns_sh" RN Sarma 17.6.2024# Define directories
GVCF_DIR="/home/rnsarma/Desktop/Mungbean/GVCF_output"
REF_GENOME="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
OUTPUT_DIR="/home/rnsarma/Desktop/Mungbean/Combined_GVCF"
DB_PATH="${OUTPUT_DIR}/genomicsdb"
LOG_FILE="${OUTPUT_DIR}/script_output.log"
TMP_DIR="/home/rnsarma/tmp"# Create the temporary directory if it doesn't exist
mkdir -p "$TMP_DIR"# Redirect stdout and stderr to log file
exec > >(tee -i "$LOG_FILE")
exec 2>&1# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"# Generate the sample map file
SAMPLE_MAP="${OUTPUT_DIR}/cohort.sample_map"
> "$SAMPLE_MAP" # Ensure the file is empty before appending
for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
SAMPLE_NAME=$(basename "$FILE" _raw_variants.g.vcf)
echo -e "$SAMPLE_NAME\t$FILE" >> "$SAMPLE_MAP"
done# Load the necessary environment
source /usr/local/bin/bioconda3.rc
conda activate gbs# Define the intervals
for INTERVAL in {1..11}; do
# Define the workspace path for the current interval
INTERVAL_DB_PATH="${DB_PATH}_${INTERVAL}"# Create a list of GVCF files for the current interval
GVCF_FILES=()
for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
GVCF_FILES+=("-V $FILE")
done# Log the GVCF files being processed
echo "Processing the following GVCF files for interval $INTERVAL:"
for GVCF_FILE in "${GVCF_FILES[@]}"; do
echo "$GVCF_FILE"
done# Run GenomicsDBImport for the current interval
gatk --java-options "-Xmx4g" GenomicsDBImport \
--genomicsdb-workspace "${INTERVAL_DB_PATH}" \
"${GVCF_FILES[@]}"done # End of for INTERVAL loop
# Deactivate conda environment
conda deactivate# End of script
-
These 2 scripts are not the same. If the error message occurred when you used the second script it is clear that you used wrong parameter name for the GenomicsDBImport tool.
gatk --java-options "-Xmx4g" GenomicsDBImport \
--genomicsdb-workspace "${INTERVAL_DB_PATH}" \
"${GVCF_FILES[@]}"the correct parameter for the import path is
--genomicsdb-workspace-path
Regards.
-
My script, as shown below, has produced folders like "genomicsdb_1" to "genomicsdb_11" in the directory "/home/rnsarma/Desktop/Mungbean/Combined_GVCF". How to get one VCF file for all chromosomal regions from 1 to 11? I am new to GATK.
#!/bin/bash
# "combined_GVCF_genomicDB_normal_rns_sh" RN Sarma 22.6.2024
#@# modified based on suggestion of GATK help forum# Define directories
GVCF_DIR="/home/rnsarma/Desktop/Mungbean/GATK_best_practice/GVCF_output"
REF_GENOME="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
OUTPUT_DIR="/home/rnsarma/Desktop/Mungbean/Combined_GVCF"
DB_PATH="${OUTPUT_DIR}/genomicsdb"
LOG_FILE="${OUTPUT_DIR}/script_output.log"
TMP_DIR="/home/rnsarma/tmp"# Create the temporary directory if it doesn't exist
mkdir -p "$TMP_DIR"# Redirect stdout and stderr to log file using tee
exec > >(tee -i "$LOG_FILE") 2>&1# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"# Log the contents of the GVCF directory for debugging
echo "Contents of GVCF_DIR ($GVCF_DIR):"
ls -l "$GVCF_DIR"# Generate the sample map file
SAMPLE_MAP="${OUTPUT_DIR}/cohort.sample_map"
> "$SAMPLE_MAP" # Ensure the file is empty before appending# Check if GVCF files exist and are readable
for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
if [[ ! -f "$FILE" ]]; then
echo "Error: Cannot find file $FILE"
exit 1
fi
if [[ ! -r "$FILE" ]]; then
echo "Error: Cannot read $FILE"
exit 1
fi
SAMPLE_NAME=$(basename "$FILE" _raw_variants.g.vcf)
echo -e "$SAMPLE_NAME\t$FILE" >> "$SAMPLE_MAP"
done# Log sample map contents for debugging
echo "Sample map contents:"
cat "$SAMPLE_MAP"# Load the necessary environment
source /usr/local/bin/bioconda3.rc
conda activate gbs# Define the intervals
for INTERVAL in {1..11}; do
# Define the workspace path for the current interval
INTERVAL_DB_PATH="${DB_PATH}_${INTERVAL}"# Remove existing GenomicsDB workspace directory if it exists
if [[ -d "$INTERVAL_DB_PATH" ]]; then
echo "Removing existing GenomicsDB workspace directory: $INTERVAL_DB_PATH"
rm -rf "$INTERVAL_DB_PATH"
fi# Log the current interval processing
echo "Processing interval ${INTERVAL}..."# Run GenomicsDBImport for the current interval
gatk --java-options "-Xmx4g" GenomicsDBImport \
--genomicsdb-workspace-path "${INTERVAL_DB_PATH}" \
--sample-name-map "$SAMPLE_MAP" \
--tmp-dir "$TMP_DIR" \
-L "${INTERVAL}"# Check if the command was successful
if [[ $? -ne 0 ]]; then
echo "Error: GenomicsDBImport failed for interval ${INTERVAL}"
exit 1
fi# Define the output VCF file for the current interval
OUTPUT_VCF="${OUTPUT_DIR}/combined_interval_${INTERVAL}.vcf"# Run GenotypeGVCFs to produce the combined VCF file
gatk --java-options "-Xmx4g" GenotypeGVCFs \
-R "$REF_GENOME" \
-V gendb://${INTERVAL_DB_PATH} \
-O "${OUTPUT_VCF}"# Check if the command was successful
if [[ $? -ne 0 ]]; then
echo "Error: GenotypeGVCFs failed for interval ${INTERVAL}"
exit 1
fidone # End of for INTERVAL loop
# Deactivate conda environment
conda deactivate# End of script
-
Hi again.
Once you genotype all GenomicsDB instances you will receive a VCF file per instance and then you can use
gatk GatherVcfs
tool to combine them into a single VCF file for all chromosomes.
I hope this helps.
Regards.
-
The usage of gatk GatherVcfs is:
gatk GatherVcfsCloud \
-I cohortA_chr1.vcf.gz \
-I cohortA_chr2.vcf.gz \
-O cohortA_chr1chr2.vcf.gz1. Why is there is GAtherVcfCloud, as my output files are directories like "genomicsdb_1" to "genomicsdb_11" in the main directory of "/home/rnsarma/Desktop/Mungbean/Combined_GVCF"?
Another fie is "cohort.sample_map" there. PLease suggest how I can put -I cohortA_chr1.vcf.gz in the command line?
-
No the tool that you need to use is
gatk GatherVcfs
not GatherVcfsCloud. These 2 tools are different.
Please sign in to leave a comment.
9 comments