Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenomiCDBImport issue

0

9 comments

  • Avatar
    Gökalp Çelik

    Can you post the log generated by the tool so that we can check?

    0
    Comment actions Permalink
  • Avatar
    RN Sarma

    The script ended abruptly with this line at the end "

    A USER ERROR has occurred: genomicsdb-workspace is not a recognized option

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    It may be due to a typo in the script causing parameters to be interpretted/passed improperly. If not it could be a python interpreter issue which causes parameters not to be parsed properly. Can you share more information about what java and python versions are active in this compute environment?

    0
    Comment actions Permalink
  • Avatar
    RN Sarma

    Java version: openjdk version "11.0.23" 2024-04-16
    OpenJDK Runtime Environment (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1)
    OpenJDK 64-Bit Server VM (build 11.0.23+9-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)

    Python version: Python 3.10.12

    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/bioconda3/envs/gbs/share/gatk4-4.5.0.0-0/gatk-package-4.5.0.0-local.jar -version
    The Genome Analysis Toolkit (GATK) v4.5.0.0
    HTSJDK Version: 4.1.0
    Picard Version: 3.1.1.

    The script used: 

    #!/bin/bash
    # "combined_GVCF_genomicDB_normal_rns_sh" RN Sarma 17.6.2024

    # Define directories
    GVCF_DIR="/home/rnsarma/Desktop/Mungbean/GVCF_output"
    REF_GENOME="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
    OUTPUT_DIR="/home/rnsarma/Desktop/Mungbean/Combined_GVCF"
    DB_PATH="${OUTPUT_DIR}/genomicsdb"
    LOG_FILE="${OUTPUT_DIR}/script_output.log"
    TMP_DIR="/home/rnsarma/tmp"

    # Create the temporary directory if it doesn't exist
    mkdir -p "$TMP_DIR"

    # Redirect stdout and stderr to log file
    exec > >(tee -i "$LOG_FILE")
    exec 2>&1

    # Create output directory if it doesn't exist
    mkdir -p "$OUTPUT_DIR"

    # Generate the sample map file
    SAMPLE_MAP="${OUTPUT_DIR}/cohort.sample_map"
    > "$SAMPLE_MAP"  # Ensure the file is empty before appending
    for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
        SAMPLE_NAME=$(basename "$FILE" _raw_variants.g.vcf)
        echo -e "$SAMPLE_NAME\t$FILE" >> "$SAMPLE_MAP"
    done

    # Load the necessary environment
    source /usr/local/bin/bioconda3.rc
    conda activate gbs

    # Define the intervals
    for INTERVAL in {1..11}; do
      # Define the workspace path for the current interval
      INTERVAL_DB_PATH="${DB_PATH}_${INTERVAL}"

      # Create a list of GVCF files for the current interval
      GVCF_FILES=()
      for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
        GVCF_FILES+=("-V $FILE")
      done

      # Log the GVCF files being processed
      echo "Processing the following GVCF files for interval $INTERVAL:"
      for GVCF_FILE in "${GVCF_FILES[@]}"; do
        echo "$GVCF_FILE"
      done

      # Run GenomicsDBImport for the current interval
      gatk --java-options "-Xmx4g" GenomicsDBImport \
        --genomicsdb-workspace "${INTERVAL_DB_PATH}" \
        "${GVCF_FILES[@]}"

    done  # End of for INTERVAL loop

    # Deactivate conda environment
    conda deactivate

    # End of script

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    These 2 scripts are not the same. If the error message occurred when you used the second script it is clear that you used wrong parameter name for the GenomicsDBImport tool.

    gatk --java-options "-Xmx4g" GenomicsDBImport \
        --genomicsdb-workspace "${INTERVAL_DB_PATH}" \
        "${GVCF_FILES[@]}"

    the correct parameter for the import path is 

    --genomicsdb-workspace-path

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    RN Sarma

    My script, as shown below, has produced folders  like "genomicsdb_1" to "genomicsdb_11" in the directory "/home/rnsarma/Desktop/Mungbean/Combined_GVCF".  How to get one VCF file for all chromosomal regions from 1 to 11? I am new to GATK.

     

    #!/bin/bash
    # "combined_GVCF_genomicDB_normal_rns_sh" RN Sarma 22.6.2024
    #@# modified based on suggestion of GATK help forum

    # Define directories
    GVCF_DIR="/home/rnsarma/Desktop/Mungbean/GATK_best_practice/GVCF_output"
    REF_GENOME="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
    OUTPUT_DIR="/home/rnsarma/Desktop/Mungbean/Combined_GVCF"
    DB_PATH="${OUTPUT_DIR}/genomicsdb"
    LOG_FILE="${OUTPUT_DIR}/script_output.log"
    TMP_DIR="/home/rnsarma/tmp"

    # Create the temporary directory if it doesn't exist
    mkdir -p "$TMP_DIR"

    # Redirect stdout and stderr to log file using tee
    exec > >(tee -i "$LOG_FILE") 2>&1

    # Create output directory if it doesn't exist
    mkdir -p "$OUTPUT_DIR"

    # Log the contents of the GVCF directory for debugging
    echo "Contents of GVCF_DIR ($GVCF_DIR):"
    ls -l "$GVCF_DIR"

    # Generate the sample map file
    SAMPLE_MAP="${OUTPUT_DIR}/cohort.sample_map"
    > "$SAMPLE_MAP"  # Ensure the file is empty before appending

    # Check if GVCF files exist and are readable
    for FILE in "$GVCF_DIR"/*_raw_variants.g.vcf; do
        if [[ ! -f "$FILE" ]]; then
            echo "Error: Cannot find file $FILE"
            exit 1
        fi
        if [[ ! -r "$FILE" ]]; then
            echo "Error: Cannot read $FILE"
            exit 1
        fi
        SAMPLE_NAME=$(basename "$FILE" _raw_variants.g.vcf)
        echo -e "$SAMPLE_NAME\t$FILE" >> "$SAMPLE_MAP"
    done

    # Log sample map contents for debugging
    echo "Sample map contents:"
    cat "$SAMPLE_MAP"

    # Load the necessary environment
    source /usr/local/bin/bioconda3.rc
    conda activate gbs

    # Define the intervals
    for INTERVAL in {1..11}; do
      # Define the workspace path for the current interval
      INTERVAL_DB_PATH="${DB_PATH}_${INTERVAL}"

      # Remove existing GenomicsDB workspace directory if it exists
      if [[ -d "$INTERVAL_DB_PATH" ]]; then
        echo "Removing existing GenomicsDB workspace directory: $INTERVAL_DB_PATH"
        rm -rf "$INTERVAL_DB_PATH"
      fi

      # Log the current interval processing
      echo "Processing interval ${INTERVAL}..."

      # Run GenomicsDBImport for the current interval
      gatk --java-options "-Xmx4g" GenomicsDBImport \
        --genomicsdb-workspace-path "${INTERVAL_DB_PATH}" \
        --sample-name-map "$SAMPLE_MAP" \
        --tmp-dir "$TMP_DIR" \
        -L "${INTERVAL}"

      # Check if the command was successful
      if [[ $? -ne 0 ]]; then
        echo "Error: GenomicsDBImport failed for interval ${INTERVAL}"
        exit 1
      fi

      # Define the output VCF file for the current interval
      OUTPUT_VCF="${OUTPUT_DIR}/combined_interval_${INTERVAL}.vcf"

      # Run GenotypeGVCFs to produce the combined VCF file
      gatk --java-options "-Xmx4g" GenotypeGVCFs \
        -R "$REF_GENOME" \
        -V gendb://${INTERVAL_DB_PATH} \
        -O "${OUTPUT_VCF}"

      # Check if the command was successful
      if [[ $? -ne 0 ]]; then
        echo "Error: GenotypeGVCFs failed for interval ${INTERVAL}"
        exit 1
      fi

    done  # End of for INTERVAL loop

    # Deactivate conda environment
    conda deactivate

    # End of script

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again.

    Once you genotype all GenomicsDB instances you will receive a VCF file per instance and then you can use 

    gatk GatherVcfs

    tool to combine them into a single VCF file for all chromosomes.

    I hope this helps.

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    RN Sarma

    The usage of gatk GatherVcfs is:
    gatk GatherVcfsCloud \
         -I cohortA_chr1.vcf.gz \
         -I cohortA_chr2.vcf.gz \
         -O cohortA_chr1chr2.vcf.gz

    1. Why is there is GAtherVcfCloud, as my output files are directories like "genomicsdb_1" to "genomicsdb_11" in the main directory of "/home/rnsarma/Desktop/Mungbean/Combined_GVCF"?

    Another fie is "cohort.sample_map" there.  PLease suggest how I can put -I cohortA_chr1.vcf.gz in the command line?

     

     

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    No the tool that you need to use is 

    gatk GatherVcfs

    not GatherVcfsCloud. These 2 tools are different. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk