I'm using the MarkDuplicates tool on a SAM file, and it looks like the tool does not compare the sequences of the reads but only their alignment start positions.
These are the reads in the original SAM file:
After running MarkDuplicates (with no optional flags), read 2 is marked as a duplicate (as you can see, its flag is now 1040 = 1024 + 16):
The documentation for MarkDuplicates (MarkDuplicates (Picard) – GATK (broadinstitute.org)) says: "The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file". As I understand it, the two reads are candidate duplicates based on position: read 2 has 3 soft-clipped bases and position 9995, so 9995 - 3 = 9992, which is the alignment position of read 1. But once the sequences are compared (AACTGAGTAC in one case and CCCCCCCCCC in the other), I would expect the two reads not to be considered duplicates.
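To make the position arithmetic concrete, here is a minimal Python sketch of how I worked out the clipping-adjusted start of each read. The helper name `unclipped_start` and the CIGAR strings (`10M` for read 1, `3S7M` for read 2) are assumptions for illustration only; this is not Picard's code, and it only handles clipping at the left end of the alignment.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_start(pos: int, cigar: str) -> int:
    """Return POS shifted left by the leading soft/hard-clipped bases."""
    clipped = 0
    for length, op in _CIGAR_OP.findall(cigar):
        if op not in "SH":
            break  # first non-clipping operation: stop counting
        clipped += int(length)
    return pos - clipped


# Hypothetical CIGARs matching the description of the two reads.
print(unclipped_start(9992, "10M"))   # read 1: no clipping
print(unclipped_start(9995, "3S7M"))  # read 2: 3 soft-clipped bases
```

Under these assumptions both reads get the same adjusted position (9992), which is why I think they are compared at all, even though their sequences differ.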
Am I misunderstanding the definition given on that page (so that the sequences are not actually compared), or is something else going on?