Program to collect error metrics on bases stratified in various ways.
Sequencing errors come in different 'flavors'. For example, some occur during sequencing while others happen during library construction, prior to sequencing. They may be correlated with various aspects of the sequencing experiment: position in the read, base context, length of insert and so on.
This program collects two different kinds of error metrics (one which attempts to distinguish between pre- and post-sequencer errors, and one which doesn't) and a collection of 'stratifiers', each of which assigns bases into various bins. The stratifiers can be used together to generate a composite stratification.
For example:
The BASE_QUALITY stratifier will place bases in bins according to their declared base quality. The READ_ORDINALITY stratifier will place bases in one of two bins depending on whether their read is 'first' or 'second'. One could generate a composite stratifier BASE_QUALITY:READ_ORDINALITY which will do both stratifications at the same time.
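The idea of a composite stratifier can be sketched in a few lines of Python. This is only an illustration of the concept, not the tool's implementation: the `Base` record, the function names and the bin labels are all hypothetical.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Base(NamedTuple):
    """Hypothetical per-base record (not a Picard type)."""
    quality: int
    first_of_pair: bool
    is_error: bool


Stratifier = Callable[[Base], object]


def base_quality(base: Base) -> int:
    # Bin by the declared base quality.
    return base.quality


def read_ordinality(base: Base) -> str:
    # Bin by whether the read is 'first' or 'second'.
    return "FIRST" if base.first_of_pair else "SECOND"


def composite(*stratifiers: Stratifier) -> Stratifier:
    # The composite bin is the tuple of the individual bins.
    return lambda base: tuple(s(base) for s in stratifiers)


def tally(bases, stratifier):
    # Count total bases and erroneous bases in each bin.
    totals, errors = Counter(), Counter()
    for b in bases:
        key = stratifier(b)
        totals[key] += 1
        errors[key] += int(b.is_error)
    return totals, errors


bases = [
    Base(30, True, False),
    Base(30, True, True),
    Base(30, False, False),
    Base(20, False, True),
]
totals, errors = tally(bases, composite(base_quality, read_ordinality))
```

Each key in `totals` is a `(base_quality, read_ordinality)` pair, which mirrors how BASE_QUALITY:READ_ORDINALITY would report one row per combination of the two bins.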
The resulting metric file will be named according to a provided prefix and a suffix which is generated automatically according to the error metric. The tool can collect multiple metrics in a single pass and there should be hardly any performance loss when specifying multiple metrics at the same time; the default includes a large collection of metrics.
To estimate the error rate, the tool assumes that all differences from the reference are errors. For this to be a reasonable assumption, the tool needs to know the sites at which the sample is actually polymorphic, and a set of high-confidence genomic intervals within which the user is relatively certain that the polymorphic sites are known and accurate. These two inputs are provided through the VCF and INTERVALS arguments. The program will only process sites that are in the intersection of the interval lists in the INTERVALS argument, as long as they are not polymorphic in the VCF.
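The site selection described above amounts to "intersect all interval lists, then subtract the known polymorphic sites". A minimal sketch, assuming 1-based inclusive intervals and small inputs that can be expanded into per-position sets (the function name and data layout are hypothetical, and the tool itself does not work this way internally):

```python
def sites_to_process(interval_lists, polymorphic_sites):
    """Return the set of (contig, position) sites eligible for error counting.

    interval_lists: list of interval lists, each a list of
        (contig, start, end) tuples, 1-based and inclusive.
    polymorphic_sites: set of (contig, position) taken from the VCF.
    """
    def expand(intervals):
        # Expand intervals into individual positions (fine for toy inputs only).
        return {(c, p) for c, s, e in intervals for p in range(s, e + 1)}

    if not interval_lists:
        return set()

    selected = expand(interval_lists[0])
    for intervals in interval_lists[1:]:
        selected &= expand(intervals)  # intersect across all lists

    return selected - polymorphic_sites  # skip known variant sites


eligible = sites_to_process(
    [[("chr1", 1, 10)], [("chr1", 5, 15)]],
    {("chr1", 7)},
)
```

Here only positions 5, 6, 8, 9 and 10 on `chr1` remain: they lie in both interval lists and are not polymorphic.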
Category Diagnostics and Quality Control
Overview
Program to collect error metrics on bases stratified in various ways.
CollectSamErrorMetrics (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | | Input SAM or BAM file.
--OUTPUT -O | | Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRIC of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.
--REFERENCE_SEQUENCE -R | | Reference sequence file.
--VCF -V | | VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.
Optional Tool Arguments | |
--arguments_file | | read one or more arguments files and add them to the command line
--ERROR_METRICS | [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR] | Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.
--ERROR_VALUE | | A fake argument used to show the options of ERROR (in ERROR_METRICS).
--help -h | false | display the help message
--INTERVAL_ITERATOR | false | Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed.
--INTERVALS -L | | Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will *intersect* inputs if multiple are given. When this argument is supplied, the VCF provided must be *indexed*.
--LOCATION_BIN_SIZE -LBS | 2500 | Size of location bins. Used by the FLOWCELL_X and FLOWCELL_Y stratifiers.
--LONG_HOMOPOLYMER -LH | 6 | Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.
--MAX_LOCI -MAX | 0 | Maximum number of loci to process (or unlimited if 0).
--MIN_BASE_Q -BQ | 20 | Minimum base quality to include base.
--MIN_MAPPING_Q -MQ | 20 | Minimum mapping quality to include read.
--PRIOR_Q -PE | 30 | The prior error, in phred-scale (used for calculating empirical error rates).
--PROBABILITY -P | 1.0 | The probability of selecting a locus for analysis (for downsampling).
--PROGRESS_STEP_INTERVAL | 100000 | The interval between which progress will be displayed.
--STRATIFIER_VALUE | | A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX
Whether to create an index when writing VCF or coordinate sorted BAM output.
Boolean false
--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--ERROR_METRICS
Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.
List[String] [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR]
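To make the "ERROR(:STRATIFIER)*" form concrete, here is a small sketch that splits one directive into its error type and stratifiers and checks both against the values listed under ERROR_VALUE and STRATIFIER_VALUE. The function name is hypothetical and this is not how the tool itself parses the argument.

```python
# Values copied from the ERROR_VALUE and STRATIFIER_VALUE sections of this page.
ERRORS = {"ERROR", "OVERLAPPING_ERROR", "INDEL_ERROR"}
STRATIFIERS = {
    "ALL", "GC_CONTENT", "READ_ORDINALITY", "READ_BASE", "READ_DIRECTION",
    "PAIR_ORIENTATION", "PAIR_PROPERNESS", "REFERENCE_BASE", "PRE_DINUC",
    "POST_DINUC", "HOMOPOLYMER_LENGTH", "HOMOPOLYMER", "BINNED_HOMOPOLYMER",
    "FLOWCELL_TILE", "FLOWCELL_Y", "FLOWCELL_X", "READ_GROUP", "CYCLE",
    "BINNED_CYCLE", "SOFT_CLIPS", "INSERT_LENGTH", "BASE_QUALITY",
    "MAPPING_QUALITY", "MISMATCHES_IN_READ", "ONE_BASE_PADDED_CONTEXT",
    "TWO_BASE_PADDED_CONTEXT", "CONSENSUS", "NS_IN_READ",
    "INSERTIONS_IN_READ", "DELETIONS_IN_READ", "INDELS_IN_READ",
    "INDEL_LENGTH",
}


def parse_error_metric(directive):
    """Split 'ERROR(:STRATIFIER)*' into (error, [stratifiers]) and validate."""
    error, *stratifiers = directive.split(":")
    if error not in ERRORS:
        raise ValueError(f"unknown error type: {error!r}")
    unknown = [s for s in stratifiers if s not in STRATIFIERS]
    if unknown:
        raise ValueError(f"unknown stratifier(s): {unknown}")
    return error, stratifiers


parsed = parse_error_metric("ERROR:READ_ORDINALITY:CYCLE")
```

A directive with no stratifier, such as `INDEL_ERROR`, simply yields an empty stratifier list in this sketch.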
--ERROR_VALUE
A fake argument used to show the options of ERROR (in ERROR_METRICS).
The --ERROR_VALUE argument is an enumerated type (ErrorType), which can have one of the following values:
- ERROR
- Collects the average (SNP) error at the bases provided. Suffix is: 'error'.
- OVERLAPPING_ERROR
- Only considers bases from the overlapping parts of reads from the same template. For those bases, it calculates the error that can be attributable to pre-sequencing, versus during-sequencing. Suffix is: 'overlapping_error'.
- INDEL_ERROR
- Collects insertion and deletion errors at the bases provided. Suffix is: 'indel_error'.
ErrorType null
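One way to read the OVERLAPPING_ERROR description is that, where two reads from the same template cover the same base, the mate acts as a second observation of the original molecule. The sketch below encodes that reasoning only; the category labels and function name are hypothetical, and the tool's actual counting rules may differ.

```python
def classify_overlap(read_base, mate_base, ref_base):
    """Classify one overlapping base given its mate's base and the reference.

    Conceptual reading only (labels are hypothetical):
    - both reads agree on a non-reference base: the change was likely present
      in the molecule before sequencing (pre-sequencing error);
    - the read differs while the mate matches the reference: the change likely
      arose while sequencing this read;
    - all three differ: no clear attribution.
    """
    if read_base == ref_base:
        return "MATCH"
    if read_base == mate_base:
        return "PRE_SEQUENCING"
    if mate_base == ref_base:
        return "DURING_SEQUENCING"
    return "THREE_WAY_DISAGREEMENT"


example = classify_overlap("T", "T", "C")
```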
--GA4GH_CLIENT_SECRETS
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INPUT / -I
Input SAM or BAM file.
R String null
--INTERVAL_ITERATOR
Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed.
boolean false
--INTERVALS / -L
Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will *intersect* inputs if multiple are given. When this argument is supplied, the VCF provided must be *indexed*.
List[File] []
--LOCATION_BIN_SIZE / -LBS
Size of location bins. Used by the FLOWCELL_X and FLOWCELL_Y stratifiers.
int 2500 [ [ -∞ ∞ ] ]
--LONG_HOMOPOLYMER / -LH
Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.
int 6 [ [ -∞ ∞ ] ]
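Because the value is the shortest length that counts as long, the threshold is inclusive. A tiny sketch of that binning, with hypothetical labels and function name:

```python
def homopolymer_scale(length, long_homopolymer=6):
    """Bin a homopolymer length as long or short (labels are hypothetical).

    long_homopolymer is the shortest length considered long, so the
    comparison is inclusive.
    """
    return "LONG" if length >= long_homopolymer else "SHORT"


scales = [homopolymer_scale(n) for n in (5, 6, 7)]
```

With the default of 6, a run of 5 bases is short while runs of 6 or more are long.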
--MAX_LOCI / -MAX
Maximum number of loci to process (or unlimited if 0).
long 0 [ [ -∞ ∞ ] ]
--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MIN_BASE_Q / -BQ
Minimum base quality to include base.
int 20 [ [ -∞ ∞ ] ]
--MIN_MAPPING_Q / -MQ
Minimum mapping quality to include read.
int 20 [ [ -∞ ∞ ] ]
--OUTPUT / -O
Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRIC of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.
R File null
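The naming rule can be reproduced with a short helper. This is a sketch for predicting file names, not the tool's code: the function name is hypothetical, only a subset of the documented suffixes is included, and the fallback to the ALL stratifier for a directive without stratifiers is an assumption.

```python
# Suffixes copied from the ERROR_VALUE and STRATIFIER_VALUE sections (subset).
ERROR_SUFFIX = {
    "ERROR": "error",
    "OVERLAPPING_ERROR": "overlapping_error",
    "INDEL_ERROR": "indel_error",
}
STRATIFIER_SUFFIX = {
    "ALL": "all",
    "BASE_QUALITY": "base_quality",
    "GC_CONTENT": "gc",
    "READ_ORDINALITY": "read_ordinality",
    "CYCLE": "cycle",
    "INSERT_LENGTH": "insert_length",
    "HOMOPOLYMER": "homopolymer_and_following_ref_base",
}


def output_extension(directive):
    """Build '.error_by_stratifier[_and_stratifier]*' for one directive."""
    error, *stratifiers = directive.split(":")
    if not stratifiers:
        # Assumption: a bare error type is treated as stratified by ALL.
        stratifiers = ["ALL"]
    parts = [STRATIFIER_SUFFIX[s] for s in stratifiers]
    return "." + ERROR_SUFFIX[error] + "_by_" + "_and_".join(parts)


ext = output_extension("ERROR:BASE_QUALITY:GC_CONTENT")
```

For the example given in the description, `ext` is `.error_by_base_quality_and_gc`, so an OUTPUT of `sample` would correspond to a file named `sample.error_by_base_quality_and_gc`.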
--PRIOR_Q / -PE
The prior error, in phred-scale (used for calculating empirical error rates).
int 30 [ [ -∞ ∞ ] ]
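For reference, a phred-scaled value Q corresponds to an error probability of 10^(-Q/10), so the default of 30 means a prior error of one in a thousand. The conversion below is the standard phred definition; how the tool folds this prior into its empirical rates is not shown here.

```python
import math


def phred_to_probability(q):
    """Convert a phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to the phred scale."""
    return -10 * math.log10(p)


prior = phred_to_probability(30)
```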
--PROBABILITY / -P
The probability of selecting a locus for analysis (for downsampling).
double 1.0 [ [ -∞ ∞ ] ]
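Downsampling by probability can be pictured as an independent coin flip per locus. A minimal sketch, assuming each locus is kept independently with the given probability (the function name and the `seed` parameter are hypothetical, and the tool's own random number handling is not described on this page):

```python
import random


def downsample_loci(loci, probability, seed=None):
    """Keep each locus independently with the given probability."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so 1.0 keeps everything and 0.0 keeps nothing.
    return [locus for locus in loci if rng.random() < probability]


kept = downsample_loci(range(10000), 0.25, seed=1)
```

With a probability of 0.25, roughly a quarter of the loci are retained; the default of 1.0 keeps every locus.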
--PROGRESS_STEP_INTERVAL
The interval between which progress will be displayed.
int 100000 [ [ -∞ ∞ ] ]
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null
--showHidden / -showHidden
display hidden arguments
boolean false
--STRATIFIER_VALUE
A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).
The --STRATIFIER_VALUE argument is an enumerated type (Stratifier), which can have one of the following values:
- ALL
- Puts all bases in the same stratum. Suffix is 'all'.
- GC_CONTENT
- The GC-content of the read. Suffix is 'gc'.
- READ_ORDINALITY
- The read ordinality (i.e. first or second). Suffix is 'read_ordinality'.
- READ_BASE
- The base in the original reading direction. Suffix is 'read_base'.
- READ_DIRECTION
- The alignment direction of the read (encoded as + or -). Suffix is 'read_direction'.
- PAIR_ORIENTATION
- The read-pair's orientation (encoded as '[FR]1[FR]2'). Suffix is 'pair_orientation'.
- PAIR_PROPERNESS
- The properness of the read-pair's alignment. Looks for indications of chimerism. Suffix is 'pair_proper'.
- REFERENCE_BASE
- The reference base in the read's direction. Suffix is 'ref_base'.
- PRE_DINUC
- The read base at the previous cycle, and the current reference base. Suffix is 'pre_dinuc'.
- POST_DINUC
- The read base at the subsequent cycle, and the current reference base. Suffix is 'post_dinuc'.
- HOMOPOLYMER_LENGTH
- The length of homopolymer the base is part of (only accounts for bases that were read prior to the current base). Suffix is 'homopolymer_length'.
- HOMOPOLYMER
- The length of homopolymer, the base that the homopolymer is comprised of, and the reference base. Suffix is 'homopolymer_and_following_ref_base'.
- BINNED_HOMOPOLYMER
- The scale of homopolymer (long or short), the base that the homopolymer is comprised of, and the reference base. Suffix is 'binned_length_homopolymer_and_following_ref_base'.
- FLOWCELL_TILE
- The flowcell and tile where the base was read (taken from the read name). Suffix is 'tile'.
- FLOWCELL_Y
- The y-coordinate of the read (taken from the read name). Suffix is 'y'.
- FLOWCELL_X
- The x-coordinate of the read (taken from the read name). Suffix is 'x'.
- READ_GROUP
- The read-group id of the read. Suffix is 'read_group'.
- CYCLE
- The machine cycle during which the base was read. Suffix is 'cycle'.
- BINNED_CYCLE
- The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads. Suffix is 'binned_cycle'.
- SOFT_CLIPS
- The number of softclipped bases the read has. Suffix is 'softclipped_bases'.
- INSERT_LENGTH
- The insert size of the template the base came from (taken from the TLEN field). Suffix is 'insert_length'.
- BASE_QUALITY
- The base quality. Suffix is 'base_quality'.
- MAPPING_QUALITY
- The read's mapping quality. Suffix is 'mapping_quality'.
- MISMATCHES_IN_READ
- The number of bases in the read that mismatch the reference, excluding the current base. This stratifier requires the NM tag. Suffix is 'mismatches_in_read'.
- ONE_BASE_PADDED_CONTEXT
- The current reference base and a one base padded region from the read resulting in a 3-base context. Suffix is 'one_base_padded_context'.
- TWO_BASE_PADDED_CONTEXT
- The current reference base and a two base padded region from the read resulting in a 5-base context. Suffix is 'two_base_padded_context'.
- CONSENSUS
- Whether or not duplicate reads were used to form a consensus read. This stratifier makes use of the aD, bD, and cD tags for duplex consensus reads. If the reads are single index consensus, only the cD tags are used. Suffix is 'consensus'.
- NS_IN_READ
- The number of Ns in the read. Suffix is 'ns_in_read'.
- INSERTIONS_IN_READ
- The number of Insertions in the read cigar. Suffix is 'cigar_elements_I_in_read'.
- DELETIONS_IN_READ
- The number of Deletions in the read cigar. Suffix is 'cigar_elements_D_in_read'.
- INDELS_IN_READ
- The number of INDELs in the read cigar. Suffix is 'indels_in_read'.
- INDEL_LENGTH
- The number of bases in an indel. Suffix is 'indel_length'.
Stratifier null
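The caveat on BINNED_CYCLE is easier to see with numbers. The sketch below assumes the five bins are obtained by scaling a 1-based cycle by the length of its own read; the exact formula used by the tool is not given on this page, so treat the function as hypothetical.

```python
def binned_cycle(cycle, read_length, bins=5):
    """Map a 1-based cycle to one of `bins` evenly spaced ranges (0-based bin).

    Assumption: bins are relative to the length of the read the base is in.
    """
    if not 1 <= cycle <= read_length:
        raise ValueError("cycle must be between 1 and read_length")
    return (cycle - 1) * bins // read_length


in_long_read = binned_cycle(50, 100)
in_short_read = binned_cycle(50, 50)
```

Under this assumption, cycle 50 falls in the middle bin of a 100-base read but in the last bin of a 50-base read, which is why mixing read lengths can make the binned results hard to interpret.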
--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VCF / -V
VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.
R String null
--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.2.4.0-SNAPSHOT built at Thu, 16 Dec 2021 11:57:48 -0800.