GATK MarkDuplicates Missing output file
I am getting no output file from my GATK run, although the input appears to be read correctly.
a) GATK version used:
module load gatk/4.1.8.1-gcc-2onk
b) Exact command used:
gatk MarkDuplicates -I $OUTPUT_DIR/bam_sorted/${SAMPLE_NAME}_trimmed_bwaMem_readGroups_sorted.bam -O $OUTPUT_DIR/bam_dedup/${SAMPLE_NAME}_trimmed_bwaMem_readGroups_sorted_dedup.bam -M $OUTPUT_DIR/bam_dedup/${SAMPLE_NAME}_dedup_metrics.txt
c) Entire program log:
[Wed Aug 30 12:20:38 CDT 2023] MarkDuplicates --INPUT /mmfs1/projects/lab.name/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_sorted/RSSW_15_0043_trimmed_bwaMem_readGroups_sorted.bam --OUTPUT /mmfs1/projects/lab.name/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_dedup/RSSW_15_0043_trimmed_bwaMem_readGroups_sorted_dedup.bam --METRICS_FILE /mmfs1/projects/lab.name/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_dedup/RSSW_15_0043_dedup_metrics.txt --VALIDATION_STRINGENCY STRICT --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Aug 30, 2023 12:20:38 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Wed Aug 30 12:20:38 CDT 2023] Executing as first.name@cmp0023 on Linux 4.18.0-348.12.2.el8_5.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_322-b06; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.1
INFO 2023-08-30 12:20:38 MarkDuplicates Start of doWork freeMemory: 2461802472; totalMemory: 2486697984; maxMemory: 28631367680
INFO 2023-08-30 12:20:38 MarkDuplicates Reading input file and constructing read end information.
INFO 2023-08-30 12:20:38 MarkDuplicates Will retain up to 103736839 data points before spilling to disk.
[Wed Aug 30 12:20:39 CDT 2023] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2651848704
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
-
Hi Gilgamesh
Can you check whether your input BAM file has any reads in it? From this log it looks like the file is empty, so the tool finished immediately after starting. Also, do the output folders named in the command already exist before the tool starts running?
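Both checks can be scripted. This is only a minimal sketch: the function name `check_markdup_inputs` and its two arguments are hypothetical, and it assumes `samtools` is on the PATH (the read count is skipped if it is not).

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check a BAM and an output directory
# before running MarkDuplicates.
check_markdup_inputs() {
    local bam="$1" out_dir="$2"

    # The input must exist and be non-empty.
    if [ ! -s "$bam" ]; then
        echo "input BAM missing or empty: $bam" >&2
        return 1
    fi

    # The output directory must already exist and be writable.
    if [ ! -d "$out_dir" ] || [ ! -w "$out_dir" ]; then
        echo "output directory missing or not writable: $out_dir" >&2
        return 2
    fi

    # Count alignment records if samtools is installed. Zero records would
    # explain a MarkDuplicates run that finishes almost instantly.
    if command -v samtools >/dev/null 2>&1; then
        local n
        n=$(samtools view -c "$bam") || return 3
        echo "records in $bam: $n"
        [ "$n" -gt 0 ] || return 4
    fi
    return 0
}
```

It could be called as `check_markdup_inputs "$OUTPUT_DIR/bam_sorted/${SAMPLE_NAME}_trimmed_bwaMem_readGroups_sorted.bam" "$OUTPUT_DIR/bam_dedup"` just before the `gatk MarkDuplicates` line. A non-zero return means one of the checks failed.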
-
Hi Gökalp,
Thanks for the reply. The input file is 115 KB and looks right. The paths also resolve correctly in the debug window.
Speaking of the input file, here are the first few lines, followed by a jump to around line 2500:
BAMÈ∞ @HD VN:1.6 SO:coordinate
@SQ SN:Ha412HOChr01 LN:159217232
@SQ SN:Ha412HOChr02 LN:184765313
@SQ SN:Ha412HOChr00c25016 LN:
@PG ID:samtools PN:samtools VN:1.15 CL:samtools sort -o /mmfs1/projects/super.visor/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_sorted/RSSW_15_0043_trimmed_bwaMem_readGroups_sorted.bam -O BAM -T tmp -
…a Ha412HOChr01Pv} Ha412HOChr02ÅK Ha412HOChr03ÜÚ⁄
Ha412HOChr04`U: Ha412HOChr05s¥+ Ha412HOChr06ÅÉc Ha412HOChr07»ø® Ha412HOChr08,õ
Ha412HOChr09ê£Ô Ha412HOChr10ï1a Ha412HOChr11p…„ Ha412HOChr12Üu´
Ha412HOChr13i≤x
Here is the code for the prior step; it should be working.
bwa mem -t 16 -R '@RG\tID:RSSW\tSM:RSSW_15_0048\tPL:ILLUMINA' Ha412HOv2.0-20181130.fasta $OUTPUT_DIR/trimmed/${SAMPLE_NAME}_trimmed.fastq |\
samtools sort -o $OUTPUT_DIR/bam_sorted/${SAMPLE_NAME}_trimmed_bwaMem_readGroups_sorted.bam -O BAM -T tmp -
-
Hi Gilgamesh
Can you run ValidateSamFile on your input BAM? Since the BAM file is so small, would it be possible for you to share one so we can try to replicate the problem?
-
Hi Gökalp,
It should be noted that I am getting a core dump partway through the BWA command, run with 32 threads and 128 GB of memory, so the issue may be hard to replicate. Interestingly, piping suppresses the core dump error but doesn't stop BWA from crashing. The core dump occurs at iteration 25016, and in the piped output the @PG ID:samtools PN:samtools VN:1.15 CL:samtools sort -o line also appears at that same point.
See the following output from ValidateSamFile.
09:08:17.420 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mmfs1/apps/spack/0.16.1/linux-rhel8-zen2/gcc-10.2.0/gatk-4.1.8.1-2onk7tepebfsn3qaco7vyj5d2eiogkbn/bin/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Aug 31 09:08:17 CDT 2023] ValidateSamFile --INPUT /mmfs1/projects/brent.hulke/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_sorted/RSSW_15_0043_trimmed_bwaMem_readGroups_sorted.bam --MODE SUMMARY --MAX_OUTPUT 100 --IGNORE_WARNINGS false --VALIDATE_INDEX true --INDEX_VALIDATION_STRINGENCY EXHAUSTIVE --IS_BISULFITE_SEQUENCED false --MAX_OPEN_TEMP_FILES 8000 --SKIP_MATE_VALIDATION false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Aug 31, 2023 9:08:17 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Thu Aug 31 09:08:17 CDT 2023] Executing as ethan.risman@cmp0023 on Linux 4.18.0-348.12.2.el8_5.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_322-b06; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.1
WARNING 2023-08-31 09:08:17 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
ERROR 2023-08-31 09:08:17 ValidateSamFile Number of sequences in text header (25032) != number of sequences in binary header (25033) for file /mmfs1/projects/brent.hulke/RSSW_paper/RSSW_original_fastq/provingGrounds/bam_sorted/RSSW_15_0043_trimmed_bwaMem_readGroups_sorted.bam
[Thu Aug 31 09:08:17 CDT 2023] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2293760000
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Tool returned:
-1
Also, I can certainly provide the file. Would you like it copied and pasted here, or emailed somewhere?
-
Hi Gilgamesh
The core dump explains the issue. Your SAM file is corrupted partway through, so you have an incomplete file to begin with. It may be worth checking your FASTQ files as well, since the problem could stem from them being corrupt too. If they are fine, try running bwa with fewer threads. You could also try a non-piped workflow to see which step is really failing. Also, to make things more consistent, can you add
set -o pipefail
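before you start your pipe. With this option set, a failure in any stage of the pipe, such as the core dump in bwa, is reported as a failure of the whole pipeline instead of being masked by the exit status of the last command.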
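To see the difference, here is a small sketch that needs neither bwa nor samtools. The functions `fake_bwa` and `fake_sort` are hypothetical stand-ins for the two stages of the pipe: the first writes some output and then fails, as a crashing aligner would, and the second simply consumes its input and succeeds. The exit code 134 is an arbitrary choice for illustration.

```shell
#!/usr/bin/env bash
# Stand-ins for the real tools (hypothetical, for illustration only).
fake_bwa()  { echo "partial SAM output"; return 134; }
fake_sort() { cat > /dev/null; }

# Without pipefail the pipeline reports only the last command's status,
# so the failure in the first stage goes unnoticed.
set +o pipefail
if fake_bwa | fake_sort; then status_default=0; else status_default=$?; fi

# With pipefail the pipeline reports the failing stage's status.
set -o pipefail
if fake_bwa | fake_sort; then status_pipefail=0; else status_pipefail=$?; fi
set +o pipefail

echo "without pipefail: $status_default"    # prints 0
echo "with pipefail:    $status_pipefail"   # prints 134
```

In a real script, pairing `set -o pipefail` with `set -e` (or an explicit check of `$?` after the pipe) would stop the run before MarkDuplicates is handed a truncated BAM.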
I hope this helps.