Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

EstimateLibraryComplexityGATK (BETA)


    Category: Diagnostics and Quality Control


    Overview

    Estimate library complexity from the sequence of read pairs

    The estimate is computed by sorting all reads by their first N bases (N is set by --min-identical-bases, 5 by default) and then comparing reads whose first N bases are identical to find duplicates. Reads are considered duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to --max-diff-rate (0.03 by default). Unlike Picard MarkDuplicates, this approach estimates library complexity without using alignment information.
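The grouping and comparison described above can be sketched in Python. This is a minimal illustration, not the tool's implementation: the function names are hypothetical, the grouping key is assumed to be the first N bases of both reads in a pair, the mismatch rate is assumed to be pooled over both reads, and the clustering within a group is a simple greedy pass.

```python
from collections import defaultdict


def pair_mismatch_rate(pair_a, pair_b):
    """Overall ungapped mismatch rate across both reads of two read pairs."""
    mismatches = 0
    compared = 0
    for read_a, read_b in zip(pair_a, pair_b):
        # zip() stops at the shorter read, so no gaps are ever introduced.
        for base_a, base_b in zip(read_a, read_b):
            compared += 1
            if base_a != base_b:
                mismatches += 1
    return mismatches / compared if compared else 0.0


def count_duplicate_set_sizes(pairs, min_identical_bases=5, max_diff_rate=0.03):
    """Return the size of every duplicate set found in `pairs`.

    `pairs` is a list of (read1_bases, read2_bases) tuples.
    """
    # Step 1: bucket pairs by the first N bases of each read (assumed key).
    groups = defaultdict(list)
    for read1, read2 in pairs:
        key = (read1[:min_identical_bases], read2[:min_identical_bases])
        groups[key].append((read1, read2))

    # Step 2: inside each bucket, greedily collect pairs that match a seed.
    set_sizes = []
    for members in groups.values():
        remaining = list(members)
        while remaining:
            seed = remaining.pop(0)
            size = 1
            unmatched = []
            for other in remaining:
                if pair_mismatch_rate(seed, other) <= max_diff_rate:
                    size += 1
                else:
                    unmatched.append(other)
            remaining = unmatched
            set_sizes.append(size)
    return set_sizes
```

Only pairs that share a prefix key are ever compared with each other, which is what keeps the number of comparisons manageable on large inputs.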

    Poor-quality reads are filtered out to give a more accurate estimate. The filter removes reads with any no-calls in the first N bases or with a mean base quality lower than --min-mean-quality (20 by default) across either the first or second read. Unpaired reads are ignored in this computation.
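The quality filter can be illustrated with a short Python sketch. The function name and the (bases, qualities) tuple layout are assumptions made for this example, and a no-call is assumed to be represented by the base "N".

```python
def passes_quality_filter(read1, read2, min_identical_bases=5, min_mean_quality=20):
    """Decide whether a read pair is kept for the complexity estimate.

    Each read is a (bases, quals) tuple, where quals is a list of Phred
    scores. Pass None for a missing mate to represent an unpaired read.
    """
    if read1 is None or read2 is None:
        return False  # unpaired reads are ignored
    for bases, quals in (read1, read2):
        # Reject any no-call within the first N bases.
        if "N" in bases[:min_identical_bases].upper():
            return False
        # Reject if either read's mean base quality is below the threshold.
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            return False
    return True
```

A no-call outside the first N bases does not fail the filter in this sketch, following the wording of the description above.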

    The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. In addition, because there is no alignment step to screen out technical reads, one further filter is applied to the data. After all reads have been examined, a histogram is built of [number of reads in duplicate set -> number of duplicate sets]; all bins that contain exactly one duplicate set are then removed from the histogram as outliers before library size is estimated.
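The histogram step can be sketched as follows. This is a hedged illustration of the outlier rule only; the function name is hypothetical, and the library size estimation that follows it in the tool is not reproduced here.

```python
from collections import Counter


def duplication_histogram(set_sizes):
    """Build {reads in duplicate set: number of duplicate sets}.

    Bins that hold exactly one duplicate set are dropped as outliers,
    as described in the text above.
    """
    histogram = Counter(set_sizes)
    return {size: count for size, count in histogram.items() if count != 1}
```

For example, given set sizes `[1, 1, 1, 2, 2, 3, 50]`, the bins for sizes 3 and 50 each hold a single set and are removed, leaving `{1: 3, 2: 2}`.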

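    Input

    One or more input files of reads (for example, BAM files) to combine and estimate library complexity from. Reads can be mapped or unmapped.

    Output

    A text file of per-library metrics.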

    Usage Example

       gatk EstimateLibraryComplexityGATK \
         -I input.bam \
         -O metrics.txt
     

    EstimateLibraryComplexityGATK specific arguments

    This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

    Argument name(s) | Default value | Summary

    Required Arguments
    --input, -I | [] | One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.
    --output, -O | null | Output file to write per-library metrics to.

    Optional Tool Arguments
    --arguments_file | [] | Read one or more arguments files and add them to the command line.
    --gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
    --help, -h | false | Display the help message.
    --max-diff-rate | 0.03 | The maximum rate of differences between two reads to call them identical.
    --max-group-ratio | 500 | Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10 million read pairs and --min-identical-bases is set to 5, then the mean expected group size would be approximately 10 reads.
    --min-identical-bases | 5 | The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^N reads (where N is the value of this argument) will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.
    --min-mean-quality | 20 | The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.
    --OPTICAL_DUPLICATE_PIXEL_DISTANCE | 100 | The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal.
    --READ_NAME_REGEX | [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* | Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done; instead, the read name is split on the colon character. For 5-element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
    --version | false | Display the version number for this tool.

    Optional Common Arguments
    --COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and GELI).
    --CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
    --CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
    --gatk-config-file | null | A configuration file to use with the GATK.
    --MAX_RECORDS_IN_RAM | 6197833 | When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.
    --QUIET | false | Whether to suppress job-summary info on System.err.
    --reference, -R | null | Reference sequence file.
    --TMP_DIR | [] | Undocumented option
    --use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater).
    --use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater).
    --VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
    --verbosity | INFO | Control verbosity of logging.

    Advanced Arguments
    --showHidden | false | Display hidden arguments.

    Argument details

    Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see "Optional Common Arguments" in the table above.


    --arguments_file / NA

    read one or more arguments files and add them to the command line

    List[File]  []


    --COMPRESSION_LEVEL / NA

    Compression level for all compressed files created (e.g. BAM and GELI).

    int  5  [ [ -∞  ∞ ] ]


    --CREATE_INDEX / NA

    Whether to create a BAM index when writing a coordinate-sorted BAM file.

    Boolean  false


    --CREATE_MD5_FILE / NA

    Whether to create an MD5 digest for any BAM or FASTQ files created.

    boolean  false


    --gatk-config-file / NA

    A configuration file to use with the GATK.

    String  null


    --gcs-max-retries / -gcs-retries

    If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

    int  20  [ [ -∞  ∞ ] ]


    --help / -h

    display the help message

    boolean  false


    --input / -I

    One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.

    R List[File]  []


    --max-diff-rate / NA

    The maximum rate of differences between two reads to call them identical.

    double  0.03  [ [ -∞  ∞ ] ]


    --max-group-ratio / NA

    Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10 million read pairs and --min-identical-bases is set to 5, then the mean expected group size would be approximately 10 reads.

    int  500  [ [ -∞  ∞ ] ]
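The "approximately 10 reads" figure can be reproduced with a short calculation. The sketch below assumes that groups are keyed on the first N bases of both reads in a pair, giving 4^(2N) possible keys; that assumption is not stated in the text, but it is the reading under which the quoted number works out. The function names are hypothetical.

```python
def mean_expected_group_size(read_pairs, min_identical_bases):
    """Mean pairs per group if prefixes were uniformly distributed.

    Assumes the key covers N bases from each of the two reads,
    so there are 4 ** (2 * N) possible keys.
    """
    possible_keys = 4 ** (2 * min_identical_bases)
    return read_pairs / possible_keys


def max_group_size(read_pairs, min_identical_bases, max_group_ratio=500):
    """Largest group that would still be processed under --max-group-ratio."""
    return max_group_ratio * mean_expected_group_size(read_pairs, min_identical_bases)
```

With 10 million read pairs and N = 5, there are 4^10 = 1,048,576 keys, so the mean group size is about 9.54 and, at the default ratio of 500, groups larger than roughly 4,768 pairs would be skipped.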


    --MAX_RECORDS_IN_RAM / NA

    When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed.

    Integer  6197833  [ [ -∞  ∞ ] ]


    --min-identical-bases / NA

    The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^N reads (where N is the value of this argument) will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.

    int  5  [ [ -∞  ∞ ] ]


    --min-mean-quality / NA

    The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.

    int  20  [ [ -∞  ∞ ] ]


    --OPTICAL_DUPLICATE_PIXEL_DISTANCE / NA

    The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal.

    int  100  [ [ -∞  ∞ ] ]
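As a rough illustration of how a pixel distance threshold might be applied, the sketch below treats two duplicate clusters as optical duplicates when they lie on the same tile and their x and y offsets are both within the threshold. The per-axis comparison and the function name are assumptions made for this example; the description above only says "maximum offset".

```python
def is_optical_duplicate(location_a, location_b, pixel_distance=100):
    """Check two (tile, x, y) cluster locations against the pixel threshold."""
    tile_a, x_a, y_a = location_a
    tile_b, x_b, y_b = location_b
    return (
        tile_a == tile_b
        and abs(x_a - x_b) <= pixel_distance
        and abs(y_a - y_b) <= pixel_distance
    )
```

Clusters on different tiles are never treated as optical duplicates in this sketch, however close their coordinates are.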


    --output / -O

    Output file to write per-library metrics to.

    R File  null


    --QUIET / NA

    Whether to suppress job-summary info on System.err.

    Boolean  false


    --READ_NAME_REGEX / NA

    Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done; instead, the read name is split on the colon character. For 5-element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.

    String  [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
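The colon-splitting behaviour described for the default setting can be sketched as below. The function names are hypothetical, and the handling of trailing non-digit suffixes such as "/1" on the last field is an assumption made so the example copes with common read names.

```python
import itertools


def leading_int(text):
    """Parse the leading run of digits in `text`, or return None."""
    digits = "".join(itertools.takewhile(str.isdigit, text))
    return int(digits) if digits else None


def parse_location(read_name):
    """Return (tile, x, y) from a colon-delimited read name, or None."""
    fields = read_name.split(":")
    if len(fields) == 5:
        raw = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        raw = fields[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None
    values = [leading_int(field) for field in raw]
    if any(value is None for value in values):
        return None
    return tuple(values)
```

Names that do not have 5 or 7 colon-separated elements yield no location here, so such reads could not take part in optical duplicate detection in this sketch.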


    --reference / -R

    Reference sequence file.

    File  null


    --showHidden / -showHidden

    display hidden arguments

    boolean  false


    --TMP_DIR / NA

    Undocumented option

    List[File]  []


    --use-jdk-deflater / -jdk-deflater

    Whether to use the JdkDeflater (as opposed to IntelDeflater)

    boolean  false


    --use-jdk-inflater / -jdk-inflater

    Whether to use the JdkInflater (as opposed to IntelInflater)

    boolean  false


    --VALIDATION_STRINGENCY / NA

    Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

    The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

    STRICT
    LENIENT
    SILENT

    ValidationStringency  STRICT


    --verbosity / -verbosity

    Control verbosity of logging.

    The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

    ERROR
    WARNING
    INFO
    DEBUG

    LogLevel  INFO


    --version / NA

    display the version number for this tool

    boolean  false



    GATK version 4.0.1.2 built at 25-21-2019 04:21:59.
