No duplicates found with MarkDuplicates on uBAM made from shotgun Illumina PE plant data, although FastQC and ParDRe show that duplicates are present
FastQC reports duplicates, which we are trying to remove with Picard (installed via conda).
uBAM created using this guide: https://gatk.broadinstitute.org/hc/en-us/articles/4403687183515--How-to-Generate-an-unmapped-BAM-from-FASTQ-or-aligned-BAM
The final duplication metrics (attached below) show only unmapped reads (58161732).
Running deduplication with ParDRe (https://sourceforge.net/projects/pardre/) gives "Non duplicated paired reads 28712319/29080866 (98.73 %)", so duplicates are certainly present.
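As a quick cross-check of the counts quoted above, the following Python sketch does nothing more than arithmetic on the numbers from the ParDRe output and the MarkDuplicates metrics. The variable names are illustrative and not part of either tool.

```python
# Counts copied from the post; variable names are illustrative.
total_pairs = 29_080_866       # read pairs reported by ParDRe
non_dup_pairs = 28_712_319     # non-duplicated pairs reported by ParDRe
unmapped_records = 58_161_732  # UNMAPPED_READS in the MarkDuplicates metrics

# Pairs that ParDRe considered duplicated, as a count and a fraction.
dup_pairs = total_pairs - non_dup_pairs
dup_fraction = dup_pairs / total_pairs

# Each read pair contributes two records to the uBAM.
records_from_pairs = 2 * total_pairs

print(f"duplicate pairs by sequence: {dup_pairs} ({dup_fraction:.2%})")
print(f"all uBAM records counted as unmapped: {records_from_pairs == unmapped_records}")
```

With these numbers, ParDRe flags 368547 pairs (about 1.27 %) as duplicates, and the 58161732 unmapped records are exactly twice the number of pairs, so every record in the uBAM was counted as unmapped and none was examined for duplication.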
Is there a mistake in how the uBAM was created? Or are there any parameters that can be adjusted?
Full logs below, many thanks!
a) Picard version: 3.3.0
b) Exact command used:
picard FastqToSam F1=P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_1_noad.fq.gz F2=P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_2_noad.fq.gz O=BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam SM=BG0041 RG=FKDL240099841-1A_HKHJFDSXC_L3 LIBRARY_NAME=P07 PLATFORM=illumina
c) Entire program log:
INFO 2025-03-20 12:11:03 FastqToSam
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
**********
https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** FastqToSam -F1 P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_1_noad.fq.gz -F2 P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_2_noad.fq.gz -O BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam -SM BG0041 -RG FKDL240099841-1A_HKHJFDSXC_L3 -LIBRARY_NAME P07 -PLATFORM illumina
**********
12:11:03.148 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/omk/minicondanew/envs/picard/share/picard-3.3.0-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Mar 20 12:11:03 CET 2025] FastqToSam FASTQ=P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_1_noad.fq.gz FASTQ2=P07_BG0041_FKDL240099841-1A_HKHJFDSXC_L3_2_noad.fq.gz OUTPUT=BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam READ_GROUP_NAME=FKDL240099841-1A_HKHJFDSXC_L3 SAMPLE_NAME=BG0041 LIBRARY_NAME=P07 PLATFORM=illumina USE_SEQUENTIAL_FASTQS=false SORT_ORDER=queryname MIN_Q=0 MAX_Q=93 STRIP_UNPAIRED_MATE_NUMBER=false ALLOW_AND_IGNORE_EMPTY_LINES=false ALLOW_EMPTY_FASTQ=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Mar 20 12:11:03 CET 2025] Executing as omk@gcfmax on Linux 6.5.6-76060506-generic amd64; OpenJDK 64-Bit Server VM 23.0.2-internal-adhoc.conda.src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.3.0
INFO 2025-03-20 12:11:03 FastqToSam Auto-detected quality format as: Standard.
INFO 2025-03-20 12:11:06 FastqToSam Processed 1,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 2s. Last read position: */*
INFO 2025-03-20 12:11:09 FastqToSam Processed 2,000,000 records. Elapsed time: 00:00:06s. Time for last 1,000,000: 3s. Last read position: */*
=
=
=
INFO 2025-03-20 12:14:21 FastqToSam Processed 57,000,000 records. Elapsed time: 00:03:17s. Time for last 1,000,000: 3s. Last read position: */*
INFO 2025-03-20 12:14:24 FastqToSam Processed 58,000,000 records. Elapsed time: 00:03:21s. Time for last 1,000,000: 3s. Last read position: */*
INFO 2025-03-20 12:14:25 FastqToSam Processed 29080866 fastq reads
[Thu Mar 20 12:19:18 CET 2025] picard.sam.FastqToSam done. Elapsed time: 8.25 minutes.
Then followed by:
a) Picard version: 3.3.0
b) Exact command used:
picard MarkDuplicates --INPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam --OUTPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped.sam --REMOVE_DUPLICATES true --METRICS_FILE BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped_log.txt
c) Entire program log:
12:26:59.113 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/omk/minicondanew/envs/picard/share/picard-3.3.0-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Mar 20 12:26:59 CET 2025] MarkDuplicates --INPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam --OUTPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped.sam --METRICS_FILE BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped_log.txt --REMOVE_DUPLICATES true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_DUP_STRATEGY FLOW_QUALITY_SUM_STRATEGY --FLOW_USE_END_IN_UNPAIRED_READS false --FLOW_USE_UNPAIRED_CLIPPED_END false --FLOW_UNPAIRED_END_UNCERTAINTY 0 --FLOW_UNPAIRED_START_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Mar 20 12:26:59 CET 2025] Executing as omk@gcfmax on Linux 6.5.6-76060506-generic amd64; OpenJDK 64-Bit Server VM 23.0.2-internal-adhoc.conda.src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:3.3.0
INFO 2025-03-20 12:26:59 MarkDuplicates Start of doWork freeMemory: 527025104; totalMemory: 536870912; maxMemory: 2147483648
INFO 2025-03-20 12:26:59 MarkDuplicates Reading input file and constructing read end information.
INFO 2025-03-20 12:26:59 MarkDuplicates Will retain up to 7780737 data points before spilling to disk.
INFO 2025-03-20 12:27:00 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:00s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:00 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:00 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:01s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:00 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:01 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:01 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:02 MarkDuplicates Read 4,000,000 records. Elapsed time: 00:00:03s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:02 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:03 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:00:03s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:03 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
=
=
=
INFO 2025-03-20 12:27:39 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:40 MarkDuplicates Read 56,000,000 records. Elapsed time: 00:00:41s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:40 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:41 MarkDuplicates Read 57,000,000 records. Elapsed time: 00:00:41s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:41 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:41 MarkDuplicates Read 58,000,000 records. Elapsed time: 00:00:42s. Time for last 1,000,000: 0s. Last read position: */*
INFO 2025-03-20 12:27:41 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2025-03-20 12:27:42 MarkDuplicates Read 58161732 records. 0 pairs never matched.
INFO 2025-03-20 12:27:42 MarkDuplicates After buildSortedReadEndLists freeMemory: 463646208; totalMemory: 536870912; maxMemory: 2147483648
INFO 2025-03-20 12:27:42 MarkDuplicates Will retain up to 67108864 duplicate indices before spilling to disk.
INFO 2025-03-20 12:27:42 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2025-03-20 12:27:42 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2025-03-20 12:27:42 MarkDuplicates Sorting list of duplicate records.
INFO 2025-03-20 12:27:42 MarkDuplicates After generateDuplicateIndexes freeMemory: 525508688; totalMemory: 1073741824; maxMemory: 2147483648
INFO 2025-03-20 12:27:42 MarkDuplicates Marking 0 records as duplicates.
INFO 2025-03-20 12:27:42 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2025-03-20 12:27:42 MarkDuplicates Reads are assumed to be ordered by: queryname
INFO 2025-03-20 12:27:59 MarkDuplicates Written 10,000,000 records. Elapsed time: 00:00:17s. Time for last 10,000,000: 17s. Last read position: */*
INFO 2025-03-20 12:28:16 MarkDuplicates Written 20,000,000 records. Elapsed time: 00:00:34s. Time for last 10,000,000: 17s. Last read position: */*
INFO 2025-03-20 12:28:33 MarkDuplicates Written 30,000,000 records. Elapsed time: 00:00:51s. Time for last 10,000,000: 17s. Last read position: */*
INFO 2025-03-20 12:28:51 MarkDuplicates Written 40,000,000 records. Elapsed time: 00:01:08s. Time for last 10,000,000: 17s. Last read position: */*
INFO 2025-03-20 12:29:08 MarkDuplicates Written 50,000,000 records. Elapsed time: 00:01:25s. Time for last 10,000,000: 17s. Last read position: */*
INFO 2025-03-20 12:29:22 MarkDuplicates Writing complete. Closing input iterator.
INFO 2025-03-20 12:29:22 MarkDuplicates Duplicate Index cleanup.
INFO 2025-03-20 12:29:22 MarkDuplicates Getting Memory Stats.
INFO 2025-03-20 12:29:22 MarkDuplicates Before output close freeMemory: 526335448; totalMemory: 536870912; maxMemory: 2147483648
INFO 2025-03-20 12:29:22 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2025-03-20 12:29:22 MarkDuplicates After output close freeMemory: 526335448; totalMemory: 536870912; maxMemory: 2147483648
[Thu Mar 20 12:29:22 CET 2025] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 2.39 minutes.
Runtime.totalMemory()=536870912
METRICS_FILE BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped_log.txt
## htsjdk.samtools.metrics.StringHeader
# MarkDuplicates --INPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs.bam --OUTPUT BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped.sam --METRICS_FILE BG0041_FKDL240099841-1A_HKHJFDSXC_L3_unaligned_read_pairs_deduped_log.txt --REMOVE_DUPLICATES true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_DUP_STRATEGY FLOW_QUALITY_SUM_STRATEGY --FLOW_USE_END_IN_UNPAIRED_READS false --FLOW_USE_UNPAIRED_CLIPPED_END false --FLOW_UNPAIRED_END_UNCERTAINTY 0 --FLOW_UNPAIRED_START_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
## htsjdk.samtools.metrics.StringHeader
# Started on: Thu Mar 20 12:26:59 CET 2025
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
P07 0 0 0 58161732 0 0 0 0
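To read the metrics row above without counting columns by eye, a small Python sketch like the following may help. It assumes the metrics file on disk is tab-delimited, as Picard metrics files normally are, and that the last column (ESTIMATED_LIBRARY_SIZE) is empty in this run. The function name `read_duplication_metrics` and the embedded sample text are illustrative only.

```python
import csv
import io

# The metrics table from the post, re-typed with tab separators.
# Assumption: the file on disk is tab-delimited and ESTIMATED_LIBRARY_SIZE is empty.
METRICS_TEXT = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tUNPAIRED_READS_EXAMINED\tREAD_PAIRS_EXAMINED\t"
    "SECONDARY_OR_SUPPLEMENTARY_RDS\tUNMAPPED_READS\tUNPAIRED_READ_DUPLICATES\t"
    "READ_PAIR_DUPLICATES\tREAD_PAIR_OPTICAL_DUPLICATES\tPERCENT_DUPLICATION\t"
    "ESTIMATED_LIBRARY_SIZE\n"
    "P07\t0\t0\t0\t58161732\t0\t0\t0\t0\t\n"
)


def read_duplication_metrics(text):
    """Return the rows of the first metrics table as dictionaries (hypothetical helper)."""
    lines = text.splitlines()
    # The table header is the line right after the '## METRICS CLASS' line.
    start = next(
        i for i, line in enumerate(lines) if line.startswith("## METRICS CLASS")
    ) + 1
    table = []
    for line in lines[start:]:
        if not line.strip():
            break  # a blank line ends the table
        table.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(table)), delimiter="\t"))


row = read_duplication_metrics(METRICS_TEXT)[0]
# Reads that were actually examined for duplication: single reads plus both mates of each pair.
examined = int(row["UNPAIRED_READS_EXAMINED"]) + 2 * int(row["READ_PAIRS_EXAMINED"])
print(row["LIBRARY"], "examined:", examined, "unmapped:", row["UNMAPPED_READS"])
```

For this run the number of examined reads is 0 while UNMAPPED_READS is 58161732, which means no read was ever compared against another.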
-
Hi Om Kulkarni
Picard MarkDuplicates requires mapped reads: it identifies duplicates by comparing the alignment positions of reads and read pairs, so it cannot work with unmapped reads. You need to map your reads before using this tool.
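To illustrate why an unmapped BAM yields zero duplicates, here is a deliberately simplified Python model of coordinate-based duplicate counting next to a naive sequence-based count. It is a toy sketch, not Picard's or ParDRe's actual algorithm: the real tools work on read pairs, choose which copy to keep, and (for sequence-based tools) may tolerate mismatches. The `Read` type, both function names, and the three example reads are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Read(NamedTuple):
    name: str
    seq: str
    chrom: Optional[str] = None   # None means the read is unmapped
    pos: Optional[int] = None     # 5' alignment position
    strand: Optional[str] = None


def count_position_duplicates(reads):
    """Toy coordinate-based counting: group mapped reads by (chrom, pos, strand)."""
    groups = defaultdict(int)
    unmapped = 0
    for read in reads:
        if read.chrom is None:
            unmapped += 1  # no coordinates to compare, so the read is only counted
            continue
        groups[(read.chrom, read.pos, read.strand)] += 1
    # Every read beyond the first in a group counts as a duplicate.
    duplicates = sum(n - 1 for n in groups.values())
    return duplicates, unmapped


def count_sequence_duplicates(reads):
    """Toy sequence-based counting: an exact repeat of a seen sequence is a duplicate."""
    seen = set()
    duplicates = 0
    for read in reads:
        if read.seq in seen:
            duplicates += 1
        else:
            seen.add(read.seq)
    return duplicates


# The same three made-up reads, before and after a hypothetical alignment.
unmapped_reads = [Read("r1", "ACGT"), Read("r2", "ACGT"), Read("r3", "TTGA")]
mapped_reads = [
    Read("r1", "ACGT", "chr1", 100, "+"),
    Read("r2", "ACGT", "chr1", 100, "+"),
    Read("r3", "TTGA", "chr1", 250, "+"),
]

print(count_position_duplicates(unmapped_reads))  # (duplicates, unmapped)
print(count_position_duplicates(mapped_reads))
print(count_sequence_duplicates(unmapped_reads))
```

In this toy model the unmapped input gives no position-based duplicates and three unmapped reads, mirroring the metrics in the question, while the same reads with coordinates give one duplicate. A sequence-based comparison finds that duplicate without any alignment, which is consistent with a tool run on the raw reads reporting duplicates that MarkDuplicates does not see in a uBAM.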
I hope this helps.
Regards.