Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Confusion regarding the Picard EstimateLibraryComplexity utility

    Chris Kachulis

    Hi Thomas Bradley,

     

    Once the number of unique read pairs and the total number of read pairs are found, the following equation is used:
    C/X = 1 - exp( -N/X )
    where
    X = number of distinct molecules in library ("library complexity")
    N = total number of read pairs
    C = total number of unique read pairs

    This equation comes from modeling sequencing as selection with replacement from a large number of equally sized populations of fragments, one population per distinct molecule. Selection with replacement is a valid model if the number of copies of each molecule is very large, so that a few copies being selected for sequencing has no significant effect on the overall distribution of DNA molecules. Under those assumptions, the probability that at least one copy of a particular molecule is selected for sequencing is 1 - (1 - 1/X)^N. In the limit X >> 1, this becomes 1 - exp(-N/X). Since C/X is the fraction of distinct molecules observed at least once, equating it to this probability gives the equation above. The assumptions are certainly oversimplifications, but in general this equation still gives useful QC and library-related information.
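
To see how quickly the large-X limit takes hold, the exact probability and its exponential approximation can be compared numerically. This is only an illustrative sketch; the function names `p_seen_exact` and `p_seen_limit` are made up for this example and are not part of Picard.

```python
import math


def p_seen_exact(n_pairs: int, x_molecules: int) -> float:
    """Probability that a given molecule is drawn at least once in
    n_pairs draws with replacement from x_molecules equally likely molecules."""
    return 1.0 - (1.0 - 1.0 / x_molecules) ** n_pairs


def p_seen_limit(n_pairs: int, x_molecules: int) -> float:
    """Large-X approximation 1 - exp(-N/X); expm1 avoids cancellation for small N/X."""
    return -math.expm1(-n_pairs / x_molecules)


if __name__ == "__main__":
    # Hypothetical library sizes, each sequenced to N = X/2 read pairs.
    for x in (2, 100, 10_000, 1_000_000):
        n = x // 2
        exact = p_seen_exact(n, x)
        approx = p_seen_limit(n, x)
        print(f"X={x:>9}  N={n:>7}  exact={exact:.8f}  limit={approx:.8f}")
```

For a tiny library the two expressions differ noticeably, but for library sizes in the millions the difference is far below anything that would matter for QC.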

    Since this equation cannot be solved for X in closed form with elementary functions, Picard uses the bisection method, for at most 40 iterations, to find a solution numerically.
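
A minimal sketch of that approach, in Python rather than Picard's Java, might look like the following. It is not Picard's actual code: the function names, the choice of C as the lower bound, and the doubling strategy used to find an upper bound are all assumptions made for illustration. Only the equation and the cap of 40 bisection steps come from the description above.

```python
import math


def residual(x: float, c: float, n: float) -> float:
    """C/X - 1 + exp(-N/X); this is zero when C/X = 1 - exp(-N/X)."""
    return c / x - 1.0 + math.exp(-n / x)


def estimate_library_size(n_pairs: float, unique_pairs: float, max_iter: int = 40) -> float:
    """Solve C/X = 1 - exp(-N/X) for X by bisection (illustrative sketch)."""
    if not 0 < unique_pairs < n_pairs:
        # With no duplicates (C == N) the equation has no finite solution.
        raise ValueError("need 0 < unique_pairs < n_pairs")

    # Assumed bracketing: the residual is positive at X = C and becomes
    # negative once X exceeds the root, so double the upper bound until
    # the sign changes.
    lo = float(unique_pairs)
    hi = 2.0 * lo
    while residual(hi, unique_pairs, n_pairs) > 0:
        hi *= 2.0

    # Bisection: halve the bracket, keeping the sign change inside it.
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        r = residual(mid, unique_pairs, n_pairs)
        if r > 0:
            lo = mid
        elif r < 0:
            hi = mid
        else:
            return mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical counts: 500,000 read pairs, of which 393,469 are unique.
    print(round(estimate_library_size(500_000, 393_469)))
```

With the hypothetical counts in the usage example, N/X comes out close to 0.5, so the estimate is a library of roughly one million distinct molecules. Each bisection step halves the bracket, so 40 steps shrink it by a factor of about 10^12, which is ample precision for a QC metric.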
