EstimateLibraryComplexityGATK (BETA) – GATK

Estimate library complexity from the sequence of read pairs

Category Diagnostics and Quality Control

Overview

Estimate library complexity from the sequence of read pairs

The estimation is done by sorting all reads by the first N bases of each read (N is set by --min-identical-bases, default 5) and then comparing reads whose first N bases are identical to find duplicates. Reads are considered duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to --max-diff-rate (0.03 by default). Unlike the library complexity estimate produced by Picard MarkDuplicates, this approach does not depend on alignment.
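
The grouping and comparison described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function names are hypothetical, the grouping key is assumed to use the first N bases of both reads in a pair, and each read pair is compared only against the first member of an existing duplicate set.

```python
from collections import defaultdict

def group_by_prefix(pairs, min_identical_bases=5):
    """Bucket read pairs by the first N bases of each read (assumed key)."""
    groups = defaultdict(list)
    for read1, read2 in pairs:
        key = (read1[:min_identical_bases], read2[:min_identical_bases])
        groups[key].append((read1, read2))
    return groups

def is_duplicate(pair_a, pair_b, max_diff_rate=0.03):
    """Ungapped comparison; mismatch rate is taken over both reads combined."""
    compared = 0
    mismatches = 0
    for a, b in zip(pair_a, pair_b):  # first reads, then second reads
        n = min(len(a), len(b))
        compared += n
        mismatches += sum(1 for x, y in zip(a[:n], b[:n]) if x != y)
    return compared > 0 and mismatches / compared <= max_diff_rate

def duplicate_sets(pairs, min_identical_bases=5, max_diff_rate=0.03):
    """Greedy clustering within each prefix group (a simplification)."""
    all_sets = []
    for group in group_by_prefix(pairs, min_identical_bases).values():
        group_sets = []
        for pair in group:
            for dup_set in group_sets:
                if is_duplicate(dup_set[0], pair, max_diff_rate):
                    dup_set.append(pair)
                    break
            else:
                group_sets.append([pair])
        all_sets.extend(group_sets)
    return all_sets
```

With 40-base reads, a pair spans 80 compared bases, so two mismatches (rate 0.025) still count as a duplicate while three (rate 0.0375) do not.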

Reads of poor quality are filtered out to provide a more accurate estimate. The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than --min-mean-quality (default 20) across either the first or second read. Unpaired reads are ignored in this computation.
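
A minimal sketch of this filter, assuming no-calls are represented as the base N and qualities are given as lists of Phred scores. The function names are hypothetical and the tool's actual filtering code may differ in detail.

```python
def mean_quality(quals):
    """Arithmetic mean of a list of Phred scores (0.0 for an empty list)."""
    return sum(quals) / len(quals) if quals else 0.0

def passes_filters(read1, quals1, read2, quals2,
                   min_identical_bases=5, min_mean_quality=20):
    """Return True if the read pair would be kept under the stated rules."""
    # Reject any no-call ('N') within the first N bases of either read.
    for bases in (read1, read2):
        if "N" in bases[:min_identical_bases].upper():
            return False
    # Reject if either read's mean base quality is below the threshold.
    if mean_quality(quals1) < min_mean_quality:
        return False
    if mean_quality(quals2) < min_mean_quality:
        return False
    return True
```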

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. Also, since there is no alignment to screen out technical reads, one further filter is applied to the data. After all reads have been examined, a histogram is built of [#reads in duplicate set -> #of duplicate sets]; all bins that contain exactly one duplicate set are then removed from the histogram as outliers before library size is estimated.
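
The histogram filter can be illustrated as follows. This is a sketch of the rule as stated above, with hypothetical function names; it takes the sizes of the duplicate sets as input and does not attempt the library size estimate itself.

```python
from collections import Counter

def duplicate_set_histogram(set_sizes):
    """Map 'reads in duplicate set' -> 'number of duplicate sets'."""
    return Counter(set_sizes)

def drop_singleton_bins(histogram):
    """Remove every bin that holds exactly one duplicate set."""
    return {size: count for size, count in histogram.items() if count != 1}

if __name__ == "__main__":
    sizes = [1, 1, 1, 2, 2, 3, 57]
    hist = duplicate_set_histogram(sizes)
    print(dict(hist))
    print(drop_singleton_bins(hist))
```

In this example the bins for set sizes 3 and 57 each contain a single duplicate set, so both are dropped, leaving only the bins for sizes 1 and 2.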

Input

One or more BAM or CRAM files containing read data. Reads can be mapped or unmapped.

Output

A text file with per-library complexity metrics

Usage Example

   gatk EstimateLibraryComplexityGATK \
     -I input.bam \
     -O metrics.txt
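
For scripted runs, the same command line can be assembled programmatically. The sketch below only builds the argument list; it uses flags listed in the argument table on this page and assumes that multiple inputs are given by repeating -I. The helper name and file names are hypothetical.

```python
import shlex

def build_command(inputs, output, min_identical_bases=None,
                  max_diff_rate=None, min_mean_quality=None):
    """Assemble a gatk EstimateLibraryComplexityGATK argument list."""
    cmd = ["gatk", "EstimateLibraryComplexityGATK"]
    for path in inputs:  # assumption: one -I per input file
        cmd += ["-I", path]
    cmd += ["-O", output]
    optional = {
        "--min-identical-bases": min_identical_bases,
        "--max-diff-rate": max_diff_rate,
        "--min-mean-quality": min_mean_quality,
    }
    for flag, value in optional.items():
        if value is not None:  # leave unset options at the tool defaults
            cmd += [flag, str(value)]
    return cmd

if __name__ == "__main__":
    cmd = build_command(["lib1.bam", "lib2.bam"], "metrics.txt",
                        max_diff_rate=0.05)
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where gatk is installed.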

EstimateLibraryComplexityGATK specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s)	Default value	Summary
Required Arguments
--input -I	[]	One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.
--output -O	null	Output file to write per-library metrics to.
Optional Tool Arguments
--arguments_file	[]	read one or more arguments files and add them to the command line
--gcs-max-retries -gcs-retries	20	If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--help -h	false	display the help message
--max-diff-rate	0.03	The maximum rate of differences between two reads to call them identical.
--max-group-ratio	500	Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10 million read pairs and --min-identical-bases is set to 5, then the mean expected group size would be approximately 10 reads.
--min-identical-bases	5	The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.
--min-mean-quality	20	The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.
--OPTICAL_DUPLICATE_PIXEL_DISTANCE	100	The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal.
--READ_NAME_REGEX	[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*	Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
--version	false	display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL	5	Compression level for all compressed files created (e.g. BAM and GELI).
--CREATE_INDEX	false	Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE	false	Whether to create an MD5 digest for any BAM or FASTQ files created.
--gatk-config-file	null	A configuration file to use with the GATK.
--MAX_RECORDS_IN_RAM	6197833	When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.
--QUIET	false	Whether to suppress job-summary info on System.err.
--reference -R	null	Reference sequence file.
--TMP_DIR	[]	Undocumented option
--use-jdk-deflater -jdk-deflater	false	Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater -jdk-inflater	false	Whether to use the JdkInflater (as opposed to IntelInflater)
--VALIDATION_STRINGENCY	STRICT	Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--verbosity	INFO	Control verbosity of logging.
Advanced Arguments
--showHidden	false	display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--arguments_file / NA

read one or more arguments files and add them to the command line

List[File] []

--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and GELI).

int 5 [ [ -∞ ∞ ] ]

--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean false

--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean false

--gatk-config-file / NA

A configuration file to use with the GATK.

String null

--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int 20 [ [ -∞ ∞ ] ]

--help / -h

display the help message

boolean false

--input / -I

One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.

R List[File] []

--max-diff-rate / NA

The maximum rate of differences between two reads to call them identical.

double 0.03 [ [ -∞ ∞ ] ]

--max-group-ratio / NA

Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10 million read pairs and --min-identical-bases is set to 5, then the mean expected group size would be approximately 10 reads.

int 500 [ [ -∞ ∞ ] ]
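
The figure of approximately 10 reads can be reproduced with a short calculation. It works out if the grouping key is assumed to combine the first N bases of both reads in a pair, giving 4^(2N) possible keys; that reading, and the way the ratio is turned into a cutoff below, are assumptions rather than documented behavior.

```python
def mean_expected_group_size(read_pairs, min_identical_bases=5):
    """Read pairs divided by the number of possible keys (assumed 4^(2N))."""
    return read_pairs / 4 ** (2 * min_identical_bases)

def group_size_cutoff(read_pairs, min_identical_bases=5, max_group_ratio=500):
    """Group size above which a group would be skipped (assumed rule)."""
    return max_group_ratio * mean_expected_group_size(
        read_pairs, min_identical_bases)

if __name__ == "__main__":
    print(mean_expected_group_size(10_000_000))  # about 9.54
    print(group_size_cutoff(10_000_000))         # about 4768
```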

--MAX_RECORDS_IN_RAM / NA

When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.

Integer 6197833 [ [ -∞ ∞ ] ]

--min-identical-bases / NA

The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.

int 5 [ [ -∞ ∞ ] ]

--min-mean-quality / NA

The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.

int 20 [ [ -∞ ∞ ] ]

--OPTICAL_DUPLICATE_PIXEL_DISTANCE / NA

The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal.

int 100 [ [ -∞ ∞ ] ]
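
As an illustration of how such a distance threshold might be applied, the sketch below treats two clusters as optical duplicates when they lie on the same tile and their x and y offsets are both within the threshold. The Cluster type and the per-axis comparison are assumptions made for this example.

```python
from typing import NamedTuple

class Cluster(NamedTuple):
    """Flow cell position parsed from a read name (hypothetical type)."""
    tile: int
    x: int
    y: int

def is_optical_duplicate(a, b, pixel_distance=100):
    """True if both clusters share a tile and are close on both axes."""
    return (a.tile == b.tile
            and abs(a.x - b.x) <= pixel_distance
            and abs(a.y - b.y) <= pixel_distance)
```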

--output / -O

Output file to write per-library metrics to.

R File null

--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean false

--READ_NAME_REGEX / NA

Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.

String [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
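
Both parsing strategies described above can be sketched in Python: a full-match regular expression with three capture groups, and the colon split used for 5-element and 7-element names. The function names and read names are made up for illustration, and the split-based parser assumes the last three fields are purely numeric.

```python
import re

DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"

def parse_with_regex(name, pattern=DEFAULT_READ_NAME_REGEX):
    """Return (tile, x, y) if the whole read name matches, else None."""
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y

def parse_by_colon_split(name):
    """Return (tile, x, y) for 5- or 7-element names, else None."""
    fields = name.split(":")
    if len(fields) == 5:
        tile, x, y = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        tile, x, y = fields[4:7]   # 5th, 6th and 7th elements
    else:
        return None
    # Raises ValueError if a field is not purely numeric.
    return int(tile), int(x), int(y)
```

Note that the default regular expression, applied literally, does not match 7-element names, which is consistent with the colon split being used instead when the default is in effect.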

--reference / -R

Reference sequence file.

File null

--showHidden / -showHidden

display hidden arguments

boolean false

--TMP_DIR / NA

Undocumented option

List[File] []

--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean false

--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean false

--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency STRICT

--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel INFO

--version / NA

display the version number for this tool

boolean false

GATK version 4.0.2.0 built at 02-13-2019 02:13:49.
