A USER ERROR has occurred: Input files reference and reads have incompatible contigs: Found contigs with the same name but different lengths:
REQUIRED for all errors and issues:
a) GATK version used: GATK/4.3.0.0-GCCcore-11.3.0-Java-11
b) Exact command used:
#!/bin/bash
#SBATCH --job-name=gatk_haplotypecaller
#SBATCH --output=gatk_haplotypecaller.out
#SBATCH --error=gatk_haplotypecaller.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=96:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=danielle_becker@uri.edu
#SBATCH --account=putnamlab
#SBATCH --mem=120G
# Load required modules
module load GATK/4.3.0.0-GCCcore-11.3.0-Java-11
module load SAMtools/1.16.1-GCC-11.3.0
# Set input and output directories
INPUT_DIR="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM"
OUTPUT_DIR="/data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/gatk_haplotypecaller"
mkdir -p $OUTPUT_DIR
# Reference transcriptome
REF_TRANS="/data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta"
# Index the reference FASTA
samtools faidx $REF_TRANS
# Create sequence dictionary for the reference
gatk CreateSequenceDictionary -R $REF_TRANS
# Run HaplotypeCaller for each BAM file
for bam in ${INPUT_DIR}/*_sorted.bam
do
sample_name=$(basename $bam _sorted.bam)
gatk HaplotypeCaller \
-R $REF_TRANS \
-I $bam \
-O ${OUTPUT_DIR}/${sample_name}.g.vcf.gz \
-ERC GVCF
done
# Combine GVCFs
gatk CombineGVCFs \
-R $REF_TRANS \
$(for vcf in ${OUTPUT_DIR}/*.g.vcf.gz; do echo -n "-V $vcf "; done) \
-O ${OUTPUT_DIR}/combined.g.vcf.gz
# GenotypeGVCFs
gatk GenotypeGVCFs \
-R $REF_TRANS \
-V ${OUTPUT_DIR}/combined.g.vcf.gz \
-O ${OUTPUT_DIR}/output.vcf.gz
# Filter variants
gatk VariantFiltration \
-R $REF_TRANS \
-V ${OUTPUT_DIR}/output.vcf.gz \
-O ${OUTPUT_DIR}/filtered_output.vcf.gz \
--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "my_filter"
echo "GATK HaplotypeCaller pipeline complete. Output files are in $OUTPUT_DIR"
# List the output files
echo "Output files:"
ls -lh $OUTPUT_DIR
c) Entire program log:
Using GATK jar /opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar CreateSequenceDictionary -R /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta
INFO 2024-09-29 17:00:34 CreateSequenceDictionary Output dictionary will be written in /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.dict
17:00:34.238 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Sep 29 17:00:34 EDT 2024] CreateSequenceDictionary --REFERENCE /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sun Sep 29 17:00:34 EDT 2024] Executing as danielle_becker@n064.cluster.com on Linux 3.10.0-1160.119.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.2+9; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
[Sun Sep 29 17:00:34 EDT 2024] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2105540608
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.dict already exists. Delete this file and try again, or specify a different output file.
at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.fasta -I /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/stringTie/BAM/12_sorted.bam -O /data/putnamlab/dbecks/Heatwave_A.pul_2022Project/data/SNP_calling/gatk_haplotypecaller/12.g.vcf.gz -ERC GVCF
17:00:36.751 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/software/GATK/4.3.0.0-GCCcore-11.3.0-Java-11/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:00:36.882 INFO HaplotypeCaller - ------------------------------------------------------------
17:00:36.883 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
17:00:36.883 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:00:36.883 INFO HaplotypeCaller - Executing as danielle_becker@n064.cluster.com on Linux v3.10.0-1160.119.1.el7.x86_64 amd64
17:00:36.883 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.2+9
17:00:36.884 INFO HaplotypeCaller - Start Date/Time: September 29, 2024 at 5:00:36 PM EDT
17:00:36.884 INFO HaplotypeCaller - ------------------------------------------------------------
17:00:36.884 INFO HaplotypeCaller - ------------------------------------------------------------
17:00:36.885 INFO HaplotypeCaller - HTSJDK Version: 3.0.1
17:00:36.885 INFO HaplotypeCaller - Picard Version: 2.27.5
17:00:36.885 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:00:36.886 INFO HaplotypeCaller - Deflater: IntelDeflater
17:00:36.886 INFO HaplotypeCaller - Inflater: IntelInflater
17:00:36.886 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:00:36.884 INFO HaplotypeCaller - ------------------------------------------------------------
17:00:36.885 INFO HaplotypeCaller - HTSJDK Version: 3.0.1
17:00:36.885 INFO HaplotypeCaller - Picard Version: 2.27.5
17:00:36.885 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:00:36.885 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:00:36.886 INFO HaplotypeCaller - Deflater: IntelDeflater
17:00:36.886 INFO HaplotypeCaller - Inflater: IntelInflater
17:00:36.886 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:00:36.886 INFO HaplotypeCaller - Requester pays: disabled
17:00:36.886 INFO HaplotypeCaller - Initializing engine
17:01:06.974 INFO HaplotypeCaller - Shutting down engine
[September 29, 2024 at 5:01:06 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.50 minutes.
Runtime.totalMemory()=12746489856
***********************************************************************
A USER ERROR has occurred: Input files reference and reads have incompatible contigs: Found contigs with the same name but different lengths:
contig reference = TRINITY_DN115924_c0_g1_i1 / 222
contig reads = TRINITY_DN115924_c0_g1_i1 / 663.
reference contigs = [TRINITY_DN100801_c0_g1_i2, TRINITY_DN100801_c0_g1_i3, TRINITY_DN100801_c0_g2_i2, TRINITY_DN100801_c0_g2_i3, TRINITY_DN100801_c0_g2_i4, TRINITY_DN100801_c0_g3_i1, TRINITY_DN100801_c0_g4_i1, TRINITY_DN100801_c0_g4_i2, TRINITY_DN100801_c1_g1_i1, TRINITY_DN100801_c2_g1_i1, TRINITY_DN100754_c0_g1_i1, TRINITY_DN100802_c0_g1_i1, TRINITY_DN100802_c0_g1_i2, TRINITY_DN100802_c0_g1_i3, TRINITY_DN100802_c2_g1_i1, TRINITY_DN100802_c2_g1_i2, TRINITY_DN100802_c2_g1_i3,
-
This is calling variants from a de novo transcriptome RNAseq data
-
It looks like the bam is mapped onto a reference genome that is different from what you are using to call variants. You need to make sure that you use the same reference genome that your reads are aligned to.
Regards.
-
Hi, thank you for your response. I re did this with the correct reference and got the same errors. I am using a de novo transcriptome as the reference not a genome, can GATK work with them?
-
Hi again.
Yes GATK can work with any reference sequence you provide to it (given that contig sizes are less than 2^29-1 for proper indexing of output files).
Can you compare the contig lengths in your bam header with the sequence dictionary of the genome you are using to call variants?
Regards.
-
Quick note, I saw the following in the logs:
picard.PicardException: /data/putnamlab/dbecks/DeNovo_transcriptome/2023_A.pul/output/stranded_output/trinity_out_dir.Trinity.dict already exists. Delete this file and try again, or specify a different output file.
As Gokalp said, the best way to check whether this might be the cause would be to compare this sequence dictionary with the bam header. There might be a chance that even though you've switched to the correct reference (whether genome or transcriptome), the dictionary might correspond to the old reference and CreateSequenceDictionary is unable to overwrite unless you manually delete the old one first.
Please sign in to leave a comment.
5 comments