MarkDuplicates not checking sequences
I'm using the MarkDuplicates tool on a SAM file, and it looks like the tool does not compare the sequences of the reads but only their starting alignment positions.
These are the reads of the original SAM file:
After running MarkDuplicates (with no optional flags), read 2 is marked as a duplicate (as you can see, the flag is now 1040 = 1024 + 16):
The MarkDuplicates documentation (MarkDuplicates (Picard) – GATK (broadinstitute.org)) says: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file". As I understand it, the two reads are candidate duplicates based on position: read 2 is at position 9995 with 3 soft-clipped bases, so 9995 - 3 = 9992, which is the alignment position of read 1. But once we compare the sequences (AACTGAGTAC in one case and CCCCCCCCCC in the other), the two reads should not be considered duplicates.
Do I misunderstand the definition explained in the post (and there is no need for comparing the sequences) or is there something else?
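Both pieces of arithmetic in the question, the flag decomposition (1040 = 1024 + 16) and the clipping adjustment (9995 - 3 = 9992), can be checked with a short Python sketch. The flag bit names follow the SAM specification. The CIGAR strings `10M` and `3S7M` are assumptions chosen to match the description, since the original SAM records are not shown, and the function names are purely illustrative.

```python
import re

# Flag bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def unclipped_start(pos, cigar):
    """Shift POS left by any leading soft- or hard-clipped bases."""
    clipped = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op not in "SH":
            break  # stop at the first operation that is not a clip
        clipped += int(length)
    return pos - clipped


print(decode_flag(1040))              # ['read reverse strand', 'PCR or optical duplicate']
print(unclipped_start(9992, "10M"))   # read 1 (hypothetical CIGAR) -> 9992
print(unclipped_start(9995, "3S7M"))  # read 2 (hypothetical CIGAR) -> 9992
```

This reproduces the question's arithmetic only. For a reverse-strand read the 5' end lies at the opposite end of the alignment, which this sketch does not handle.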
Hi Naomie Abecassis,
MarkDuplicates does not check whether the sequences match; it only checks other information about the reads. Thanks for pointing out that ambiguous documentation. I'll put in a request for our team to get that changed.
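One way to picture duplicate detection that ignores sequence is to group reads by a key that does not include the bases. The Python sketch below is a simplified illustration, not Picard's actual implementation. The grouping key (reference, unclipped 5' coordinate, strand) is an assumption based on the documentation's mention of 5 prime positions. The contig name `chr1`, the strand of read 1, and which read carries which sequence are hypothetical, because the original records are not shown.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    reference: str       # hypothetical contig name
    five_prime_pos: int  # unclipped 5' coordinate, assumed precomputed
    reverse: bool        # True if aligned to the reverse strand
    sequence: str


def duplicate_groups(reads):
    """Group read names by (reference, 5' position, strand)."""
    groups = defaultdict(list)
    for read in reads:
        # The sequence is deliberately left out of the key.
        key = (read.reference, read.five_prime_pos, read.reverse)
        groups[key].append(read.name)
    # Only groups with more than one read contain candidate duplicates.
    return [names for names in groups.values() if len(names) > 1]


reads = [
    Read("read1", "chr1", 9992, True, "AACTGAGTAC"),
    Read("read2", "chr1", 9992, True, "CCCCCCCCCC"),
]
print(duplicate_groups(reads))  # [['read1', 'read2']]
```

Under these assumptions the two reads fall into the same group even though their sequences differ completely, which is consistent with the behavior reported in the question.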
Do you have standard Illumina data or some other type of sequencing? You should not use MarkDuplicates with amplicon data, because all the sequences will get marked as duplicates.
Hope this helps!
Thank you Genevieve Brandt (she/her) for this clarification!