MarkDuplicates has different duplication metrics than EstimateLibraryComplexity
Why do I see different QC metrics from MarkDuplicates and EstimateLibraryComplexity when I run both on the same file? From the documentation, they seem to do the same thing and both report DuplicationMetrics, but I am seeing very different results for my BAM files. My samples are RNA-seq BAM files that have been sorted by coordinate. I also read that EstimateLibraryComplexity sorts reads by their first N bases before calculating library complexity, which I don't believe MarkDuplicates does. Does it make sense for the results to differ this starkly because of that difference in sorting? If so, which results should I use when assessing library size and duplication percentage? I am using Picard version 2.23.8.
For example,
EstimateLibraryComplexity QC output:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown 0 3046602 0 0 0 721095 21475 0.236688 5487329
MarkDuplicates QC output:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 14398412 7743002 2806514 30095057 13735236 6371215 83668 0.886002 1377076
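As a sanity check on the two outputs above, the PERCENT_DUPLICATION value in each row can be reproduced from the other columns, assuming it is duplicate reads divided by examined reads with each read pair counted as two reads. This is a minimal sketch; the helper name `percent_duplication` is hypothetical and not part of Picard.

```python
def percent_duplication(unpaired_examined, pairs_examined,
                        unpaired_dups, pair_dups):
    """Fraction of examined reads flagged as duplicates.

    Assumption: a read pair contributes two reads to both the
    numerator and the denominator.
    """
    examined = unpaired_examined + 2 * pairs_examined
    duplicates = unpaired_dups + 2 * pair_dups
    return duplicates / examined if examined else 0.0


# Column values copied from the two metrics rows in the post.
elc = percent_duplication(0, 3046602, 0, 721095)
md = percent_duplication(14398412, 7743002, 13735236, 6371215)

print(f"EstimateLibraryComplexity: {elc:.6f}")
print(f"MarkDuplicates:            {md:.6f}")
```

Both values match the reported 0.236688 and 0.886002, so the gap comes from the counts themselves rather than from how the percentage is calculated. The rows also show that the two tools examined different sets of reads: EstimateLibraryComplexity reports 0 unpaired reads and about 3.0 million pairs examined, while MarkDuplicates reports about 14.4 million unpaired reads and 7.7 million pairs.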
The exact commands I am using are
java -jar ~/Documents/scripts/picard.jar EstimateLibraryComplexity -I 1_shox2.bam -O 1_shox2.txt
java -jar ~/Documents/scripts/picard.jar MarkDuplicates -I 1_shox2.bam -O 1_shox2-nodups.bam -M 1-shox2.log -REMOVE_DUPLICATES true
-
Hi Ravi Mandla, we would expect some differences in these metrics because the two tools work differently. MarkDuplicates uses alignment information to identify duplicates. EstimateLibraryComplexity ignores the reference and identifies duplicates from the bases of the reads themselves, allowing for some error. Hopefully this helps clarify these differences for you!
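To make the distinction concrete, here is a toy sketch of the two strategies on single-end reads. Neither function is Picard's implementation; the `Read` type, the function names, and the `prefix_len` and `max_diff_rate` parameters are all illustrative assumptions. It only shows that keying on alignment position and keying on read bases can flag different reads.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int      # 5' alignment position
    strand: str
    seq: str


def dups_by_alignment(reads):
    """Names of reads sharing (chrom, pos, strand) with an earlier read."""
    seen, dups = set(), []
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        if key in seen:
            dups.append(r.name)
        else:
            seen.add(key)
    return dups


def dups_by_sequence(reads, prefix_len=5, max_diff_rate=0.03):
    """Names of reads whose bases match an earlier read within a mismatch budget.

    Reads are bucketed by their first prefix_len bases, then compared
    base by base; alignment information is never consulted.
    """
    buckets = defaultdict(list)  # prefix -> sequences already seen
    dups = []
    for r in reads:
        seen_seqs = buckets[r.seq[:prefix_len]]
        for s in seen_seqs:
            mismatches = sum(a != b for a, b in zip(s, r.seq))
            if mismatches <= max_diff_rate * min(len(s), len(r.seq)):
                dups.append(r.name)
                break
        else:  # no earlier sequence was close enough
            seen_seqs.append(r.seq)
    return dups


def mutate(seq, positions):
    """Return seq with the bases at the given positions changed."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i in positions else b for i, b in enumerate(seq))


base = "ACGTACGTAC" * 4  # 40 bases
reads = [
    Read("r1", "chr1", 100, "+", base),
    Read("r2", "chr1", 100, "+", mutate(base, {20, 25, 30})),  # same position, 3 mismatches
    Read("r3", "chr2", 500, "+", mutate(base, {20})),          # other position, 1 mismatch
    Read("r4", "chr3", 900, "-", "TTGCA" + "G" * 35),
]

print("alignment-based duplicates:", dups_by_alignment(reads))
print("sequence-based duplicates: ", dups_by_sequence(reads))
```

In this toy input the alignment-based pass flags `r2` (same position as `r1`, despite three mismatches), while the sequence-based pass flags `r3` (nearly identical bases to `r1`, aligned elsewhere). Real data adds paired ends, base qualities, and optical duplicates, so this only illustrates why the two sets of metrics need not agree.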