Confusion regarding the Picard EstimateLibraryComplexity utility
Hello,
I am writing to inquire about the Picard EstimateLibraryComplexity utility.
The documentation for this tool describes how the method determines whether read pairs are duplicates and how it filters low-quality reads, but it does not seem to describe the general algorithm by which library complexity is estimated from the resulting set of high-quality reads marked as either duplicate or non-duplicate.
Any extra information on how the core algorithm for this method works would be greatly appreciated.
Best wishes,
Thomas
REQUIRED for all errors and issues:
a) GATK version used: Picard 3.0.0
b) Exact command used: 'picard EstimateLibraryComplexity I={input} O={output}'
c) Entire program log: N/A
-
Hi Thomas Bradley,
Once the number of unique read pairs and the total number of read pairs are known, the following equation is used:
C/X = 1 - exp( -N/X )
where
X = number of distinct molecules in library ("library complexity")
N = total number of read pairs
C = total number of unique read pairs

This equation comes from modeling the sequencing process as selection with replacement from a large number of equally sized populations of fragments. Modeling as selection with replacement is valid when the number of copies of each molecule is very large, so that a few being selected for sequencing has no significant effect on the overall distribution of DNA molecules. Under those assumptions, the probability of at least one copy of a particular molecule being selected for sequencing is 1 - (1 - 1/X)^N. In the limit X >> 1, this becomes 1 - exp(-N/X). The assumptions are certainly oversimplifications, but in general this equation still gives useful QC and library-related information.
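As a quick sanity check on the limiting step above, the sketch below compares the exact probability 1 - (1 - 1/X)^N with its large-X approximation 1 - exp(-N/X). The values of X and N here are made up purely for illustration, not taken from any real library:

```python
import math

# Illustrative (made-up) numbers: a library of one million distinct
# molecules from which 250,000 read pairs were sequenced.
X = 1_000_000   # distinct molecules in the library
N = 250_000     # total read pairs sequenced

# Exact probability that a given molecule is sequenced at least once
exact = 1 - (1 - 1 / X) ** N

# Large-X approximation used in the equation above
approx = 1 - math.exp(-N / X)

print(exact, approx)  # the two values agree very closely when X >> 1
```

The relative error of the approximation shrinks roughly like N/X^2, which is why it is negligible for realistic library sizes.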
Since this equation cannot be solved analytically for X, Picard uses the bisection method, running at most 40 iterations, to find a solution.
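The bisection step can be sketched as follows. This is a minimal illustration of the approach described above, not Picard's actual implementation; the function name, bracketing strategy, and variable names are my own:

```python
import math

def estimate_library_size(total_pairs, unique_pairs, iterations=40):
    """Solve C/X = 1 - exp(-N/X) for X by bisection (a sketch)."""
    n, c = float(total_pairs), float(unique_pairs)
    if not 0 < c < n:
        raise ValueError("need 0 < unique pairs < total pairs")

    # g(X) = C/X - (1 - exp(-N/X)).
    # g(C) = exp(-N/C) > 0, and g(X) -> (C - N)/X < 0 as X grows,
    # so the root (the estimated library size) lies between the brackets.
    def g(x):
        return c / x - (1 - math.exp(-n / x))

    lo, hi = c, 2 * c
    while g(hi) > 0:          # expand until the bracket straddles the root
        hi *= 2

    for _ in range(iterations):   # the reply mentions at most 40 iterations
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, a library of about one million molecules sequenced to 250,000 read pairs yields roughly 221,199 unique pairs, and `estimate_library_size(250_000, 221_199)` recovers an estimate close to one million.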