MarkDuplicates TAG_DUPLICATE_SET_MEMBERS=true does not add DS and DI tags in the output BAM file
Hello GATK community,
I am running:
java -jar picard.2.26.3.jar MarkDuplicates I=input.coordsorted.bam O=output.markduplicates.bam M=metrics.txt TAG_DUPLICATE_SET_MEMBERS=true
My goal is to get a DI tag associated with each duplicate (I want to know which duplicates originate from the same molecule, including the primary duplicate).
I cannot find the DI tag (nor the DS tag) in the output.bam. Picard does not return any error message.
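For anyone wanting to reproduce this check, here is a minimal sketch of one way to count how many records carry the DI and DS tags. It assumes the BAM has been converted to SAM text (for example with `samtools view -h output.markduplicates.bam`), and the function name `count_optional_tags` is made up for illustration. It only reports whether the tags are present; it says nothing about why they might be missing.

```python
def count_optional_tags(sam_lines, tags=("DI", "DS")):
    """Count alignment records that carry each of the given optional tags.

    sam_lines: any iterable of SAM text lines (header lines are allowed).
    Returns (number_of_alignment_records, {tag: records_with_that_tag}).
    """
    counts = {tag: 0 for tag in tags}
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        total += 1
        fields = line.rstrip("\n").split("\t")
        # The first 11 columns are mandatory; optional fields follow
        # and have the form TAG:TYPE:VALUE.
        present = {field.split(":", 1)[0] for field in fields[11:]}
        for tag in tags:
            if tag in present:
                counts[tag] += 1
    return total, counts
```

The function can be fed `sys.stdin` from a small wrapper script when piping in the output of `samtools view`. A count of zero for both tags across all records would match what I am seeing.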
ValidateSamFile confirms that the input has no errors.
Does anyone know why the DI tag is not present, or how I could get it?
Many thanks
Adeline
-
Hi Adeline Morez,
Thanks for writing into the forum! This seems like it could be a bug. Could you share your log from the MarkDuplicates command to confirm that something else strange isn't happening?
Best,
Genevieve
-
Hi Genevieve,
Many thanks for your reply. You can find the log below.
Best,
Adeline
INFO 2021-10-27 14:53:43 MarkDuplicates
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** MarkDuplicates -I input.coordsorted.bam -O output.markduplicates.bam -M metrics.txt -TAG_DUPLICATE_SET_MEMBERS true
**********
14:53:44.571 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/nspamore/software/picard-2.26.3/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Oct 27 14:53:44 BST 2021] MarkDuplicates TAG_DUPLICATE_SET_MEMBERS=true INPUT=[input.coordsorted.bam] OUTPUT=output.markduplicates.bam METRICS_FILE=metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Oct 27 14:53:44 BST 2021] Executing as nspamore@genome.jmu.ac.uk on Linux 3.10.0-1160.25.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_312-b07; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.26.3
INFO 2021-10-27 14:53:44 MarkDuplicates Start of doWork freeMemory: 2037170912; totalMemory: 2058354688; maxMemory: 28631367680
INFO 2021-10-27 14:53:44 MarkDuplicates Reading input file and constructing read end information.
INFO 2021-10-27 14:53:44 MarkDuplicates Will retain up to 103736839 data points before spilling to disk.
INFO 2021-10-27 14:53:51 MarkDuplicates Read 83777 records. 0 pairs never matched.
INFO 2021-10-27 14:53:52 MarkDuplicates After buildSortedReadEndLists freeMemory: 1207695960; totalMemory: 2058354688; maxMemory: 28631367680
INFO 2021-10-27 14:53:52 MarkDuplicates Will retain up to 447365120 duplicate indices before spilling to disk.
INFO 2021-10-27 14:53:56 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2021-10-27 14:53:56 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2021-10-27 14:53:57 MarkDuplicates Sorting list of duplicate records.
INFO 2021-10-27 14:53:58 MarkDuplicates After generateDuplicateIndexes freeMemory: 2570584264; totalMemory: 7964983296; maxMemory: 28631367680
INFO 2021-10-27 14:53:58 MarkDuplicates Marking 56068 records as duplicates.
INFO 2021-10-27 14:53:58 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2021-10-27 14:53:58 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2021-10-27 14:54:00 MarkDuplicates Writing complete. Closing input iterator.
INFO 2021-10-27 14:54:00 MarkDuplicates Duplicate Index cleanup.
INFO 2021-10-27 14:54:00 MarkDuplicates Representative read Index cleanup.
INFO 2021-10-27 14:54:00 MarkDuplicates Getting Memory Stats.
INFO 2021-10-27 14:54:01 MarkDuplicates Before output close freeMemory: 6149059176; totalMemory: 7964983296; maxMemory: 28631367680
INFO 2021-10-27 14:54:01 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2021-10-27 14:54:04 MarkDuplicates After output close freeMemory: 6149432648; totalMemory: 7964983296; maxMemory: 28631367680
[Wed Oct 27 14:54:04 BST 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.34 minutes.
Runtime.totalMemory()=7964983296
-
Thanks Adeline Morez.
I have created an issue ticket in the Picard repository here: https://github.com/broadinstitute/picard/issues/1741. The developers will take a closer look and provide fixes to this bug there.
Thank you for bringing this to our attention!
Genevieve