GATG-4.8
My little slurm script gives me very low number of variant. Please suggest how to improve my GATK script to catch all variants
#!/bin/bash
#@(#)GATK_4.50-slurm_bioconda3_rns.sh 2024-05-15 R.N.Sarma
## #SBATCH -n 8 instead of 4 CPU 2027-05-27 R.N.Sarma
## The main script was submitted prefixed sbatch -w brahma to ensure jobs are submitted to the node brahma 2027-05-27 R.N.Sarma
## Added "StandardAnnotation", "min/max-alleles 2" and "EMIT_ALL_CONFIDENT_SITES" in the code.
# Path to the reference genome
reference="/home/rnsarma/Desktop/Mungbean/mungbean_genome/Vigna_radiata.Vradiata_ver6.dna.toplevel.fa"
# Output VCF file
output_vcf="../raw_variants_mungbean_new_slurm_1.vcf"
# Temporary directory
tmp_dir="/var/tmp"
# Directory containing *.sorted.bam files
input_directory="/home/rnsarma/Desktop/Mungbean/mapping_new_bioconda_3"
# Output directory
OUTPUT="GATK_4.50-slurm_bioconda3_rns"
mkdir -p "$OUTPUT" || { echo "Error: Failed to create output directory."; exit 1; }
cd "$OUTPUT" || { echo "Error: Failed to change directory."; exit 1; }
# Make a list of samples for loop
# The list is created based on file names in the directory
if [ ! -d "$input_directory" ]; then
echo "Error: Input directory '$input_directory' does not exist."
exit 1
fi
(cd $input_directory; ls) | grep -E '^([^.]+)\.' | sed 's/\.[^.]*$//' > Samplelist.$$
# Check if the sample list file exists
sample_list="Samplelist.$$"
if [[ ! -f $sample_list ]]; then
echo "Error: Sample list file $sample_list not found."
exit 1
fi
# Initialize an array to hold Slurm job IDs
declare -a JOB_IDS
# Submit one sbatch job per sample
while IFS= read -r i; do
cat <<EOF > main_script.sh
#!/bin/bash
#SBATCH --mem=64G
# Run in (gbs) bioconda3 environment
source /usr/local/bin/bioconda3.rc
conda activate gbs
# Change directory to the input directory
cd "$input_directory" || { echo "Error: Failed to change directory."; exit 1; }
# Create an array to store the file paths
declare -a input_files
# Read *.sorted.bam file paths and append them with "-I" flag
for file in *.sorted.bam; do
input_files+=("-I" "\$file")
done
# Construct the command array
command_args=("gatk" "HaplotypeCaller" "-R" "$reference")
command_args+=("\${input_files[@]}")
command_args+=("-O" "$output_vcf" "-G" "StandardAnnotation" "--output-mode" "EMIT_ALL_CONFIDENT_SITES" "--tmp-dir" "$tmp_dir")
# Print the command
echo "\${command_args[@]}"
# Run the command
"\${command_args[@]}"
EOF
# Submit the main Slurm job script
sbatch main_script.sh
done < "$sample_list"
# Clean up temporary files
rm "$sample_list" main_script.sh
# Print a message indicating job submission
echo "Submitted job for generating the VCF file: $output_vcf"
-
Hi RN Sarma
We cannot directly act on your slurm job script however it may be better for you to run your individual files one by one to see how it goes and whether you receive any error messages during the run.
Default parameters of HaplotypeCaller should suffice for most germline diploid cases however if your organism has a different ploidy then you need to adjust those parameters accordingly. If your organism of interest is diploid then we suggest you to run HaplotypeCaller with default parameters to see what comes out.
You also need to pay attention to the experimental setup you have. Depending on the type of experiment ( whole genome short reads, long reads, amplicon targetted sequencing etc.) the number of variants you may catch differs therefore it is up to your samples to provide that information to you.
I hope these help.
Regards.
Post is closed for comments.
1 comment