Program to collect error metrics on bases stratified in various ways.
Sequencing errors come in different 'flavors'. For example, some occur during sequencing while others happen during library construction, prior to sequencing. They may be correlated with various aspects of the sequencing experiment: position in the read, base context, length of insert and so on.
This program collects two different kinds of error metrics (one which attempts to distinguish between pre- and post-sequencer errors, and one which doesn't) and provides a collection of 'stratifiers', each of which assigns bases to bins. The stratifiers can be used together to generate a composite stratification.
For example:
The BASE_QUALITY stratifier will place bases in bins according to their declared base quality. The READ_ORDINALITY stratifier will place bases in one of two bins depending on whether their read is 'first' or 'second'. One could generate a composite stratifier BASE_QUALITY:READ_ORDINALITY which will do both stratifications at the same time.
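A composite stratifier can be pictured as applying each component stratifier to the same base and using the tuple of results as the bin key. The following is a minimal Python sketch of that idea, not the tool's implementation; the `Base` record and the helper functions are hypothetical names introduced only for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple, Sequence, Tuple


class Base(NamedTuple):
    """Hypothetical record for one sequenced base (not a Picard type)."""
    quality: int         # declared base quality
    first_of_pair: bool  # True if the base comes from the 'first' read


def base_quality(base: Base) -> str:
    # Bin by declared base quality.
    return str(base.quality)


def read_ordinality(base: Base) -> str:
    # Bin by whether the read is 'first' or 'second'.
    return "FIRST" if base.first_of_pair else "SECOND"


def composite(
    stratifiers: Sequence[Callable[[Base], str]],
) -> Callable[[Base], Tuple[str, ...]]:
    # The composite key holds one value per component stratifier.
    return lambda base: tuple(s(base) for s in stratifiers)


def count_bins(bases: Iterable[Base], stratifier) -> Counter:
    # Count how many bases fall into each bin.
    return Counter(stratifier(b) for b in bases)


bases = [Base(30, True), Base(30, False), Base(30, True), Base(20, False)]
bins = count_bins(bases, composite([base_quality, read_ordinality]))
```

Each key in `bins` is a (base quality, read ordinality) pair, which mirrors what BASE_QUALITY:READ_ORDINALITY describes.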
The resulting metric file will be named according to a provided prefix and a suffix which is generated automatically according to the error metric. The tool can collect multiple metrics in a single pass and there should be hardly any performance loss when specifying multiple metrics at the same time; the default includes a large collection of metrics.
To estimate the error rate the tool assumes that all differences from the reference are errors. For this to be a reasonable assumption the tool needs to know the sites at which the sample is actually polymorphic and a confidence interval where the user is relatively certain that the polymorphic sites are known and accurate. These two inputs are provided as a VCF and INTERVALS. The program will only process sites that are in the intersection of the interval lists in the INTERVALS argument as long as they are not polymorphic in the VCF.
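The site selection rule described above (intersection of the interval lists, minus polymorphic sites) can be sketched with plain sets. This is a toy model under stated assumptions: intervals are hypothetical `(contig, start, end)` tuples with 1-based inclusive coordinates, and expanding them position by position is only practical for small examples, not for whole genomes.

```python
from typing import Iterable, List, Set, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive
Site = Tuple[str, int]           # (contig, position)


def expand(intervals: Iterable[Interval]) -> Set[Site]:
    # Expand intervals into individual sites.
    return {(c, p) for c, s, e in intervals for p in range(s, e + 1)}


def sites_to_process(interval_lists: List[List[Interval]],
                     polymorphic: Set[Site]) -> List[Site]:
    # This sketch only models the case where at least one interval
    # list is supplied; it says nothing about the no-interval case.
    if not interval_lists:
        raise ValueError("at least one interval list is required in this sketch")
    selected = expand(interval_lists[0])
    for intervals in interval_lists[1:]:
        selected &= expand(intervals)      # intersect the interval lists
    return sorted(selected - polymorphic)  # drop known polymorphic sites


selected_sites = sites_to_process(
    [[("chr1", 1, 10)], [("chr1", 5, 15)]],
    polymorphic={("chr1", 7)},
)
```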
Category Diagnostics and Quality Control
Overview
Program to collect error metrics on bases stratified in various ways.
CollectSamErrorMetrics (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | null | Input SAM or BAM file.
--OUTPUT -O | null | Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRIC of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--VCF -V | null | VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--ERROR_METRICS | [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR] | Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.
--ERROR_VALUE | null | A fake argument used to show the options of ERROR (in ERROR_METRICS).
--help -h | false | display the help message
--INTERVAL_ITERATOR | false | Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed.
--INTERVALS -L | [] | Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will intersect inputs if multiple are given.
--LONG_HOMOPOLYMER -LH | 6 | Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.
--MAX_LOCI -MAX | 0 | Maximum number of loci to process (or unlimited if 0).
--MIN_BASE_Q -BQ | 20 | Minimum base quality to include base.
--MIN_MAPPING_Q -MQ | 20 | Minimum mapping quality to include read.
--PRIOR_Q -PE | 30 | The prior error, in phred-scale (used for calculating empirical error rates).
--PROBABILITY -P | 1.0 | The probability of selecting a locus for analysis (for downsampling).
--PROGRESS_STEP_INTERVAL | 100000 | The interval between which progress will be displayed.
--STRATIFIER_VALUE | null | A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
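To show how the required arguments fit together, here is a small Python sketch that assembles a command line from the argument names in the table. It only builds the argument list and does not run anything. The `gatk` launcher name and all file names are assumptions for illustration, and this page does not describe how explicitly passed ERROR_METRICS values interact with the default list.

```python
import shlex
from typing import List, Sequence


def build_command(input_bam: str, output_prefix: str, reference: str, vcf: str,
                  intervals: Sequence[str] = (),
                  error_metrics: Sequence[str] = ()) -> List[str]:
    # Required arguments, using the long names from the table.
    cmd = [
        "gatk", "CollectSamErrorMetrics",  # launcher name is an assumption
        "--INPUT", input_bam,
        "--OUTPUT", output_prefix,
        "--REFERENCE_SEQUENCE", reference,
        "--VCF", vcf,
    ]
    # List-valued arguments are passed once per value in this sketch.
    for interval in intervals:
        cmd += ["--INTERVALS", interval]
    for metric in error_metrics:
        cmd += ["--ERROR_METRICS", metric]
    return cmd


# Hypothetical file names, chosen only for the example.
command = build_command(
    "sample.bam", "sample_metrics", "reference.fasta", "known_sites.vcf",
    intervals=["confident.interval_list"],
    error_metrics=["ERROR:BASE_QUALITY:READ_ORDINALITY"],
)
command_line = shlex.join(command)  # printable form; requires Python 3.8+
```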
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--ERROR_METRICS / NA
Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.
List[String] [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR]
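The "ERROR(:STRATIFIER)*" form is a colon-separated string whose first field is the error type and whose remaining fields are stratifiers. A minimal Python sketch of splitting such a directive follows; `parse_directive` is a hypothetical helper, not part of the tool, and it validates only the error type against the three ERROR_VALUE options listed on this page.

```python
from typing import List, Tuple

# The three values documented for ERROR_VALUE.
ERROR_TYPES = {"ERROR", "OVERLAPPING_ERROR", "INDEL_ERROR"}


def parse_directive(directive: str) -> Tuple[str, List[str]]:
    # First field is the error type; any further fields are stratifiers.
    error, *stratifiers = directive.split(":")
    if error not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error!r}")
    return error, stratifiers
```

For example, `parse_directive("ERROR:READ_ORDINALITY:CYCLE")` yields the error type `ERROR` with the stratifiers `READ_ORDINALITY` and `CYCLE`.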
--ERROR_VALUE / NA
A fake argument used to show the options of ERROR (in ERROR_METRICS).
The --ERROR_VALUE argument is an enumerated type (ErrorType), which can have one of the following values:
- ERROR
- OVERLAPPING_ERROR
- INDEL_ERROR
ErrorType null
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INPUT / -I
Input SAM or BAM file.
R String null
--INTERVAL_ITERATOR / NA
Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed.
boolean false
--INTERVALS / -L
Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will intersect inputs if multiple are given.
List[File] []
--LONG_HOMOPOLYMER / -LH
Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.
int 6 [ [ -∞ ∞ ] ]
--MAX_LOCI / -MAX
Maximum number of loci to process (or unlimited if 0).
long 0 [ [ -∞ ∞ ] ]
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MIN_BASE_Q / -BQ
Minimum base quality to include base.
int 20 [ [ -∞ ∞ ] ]
--MIN_MAPPING_Q / -MQ
Minimum mapping quality to include read.
int 20 [ [ -∞ ∞ ] ]
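The two minimum-quality arguments act as filters on what is counted. As a rough sketch (assuming the thresholds are inclusive, which this page implies with "minimum" but does not state outright), the check could look like this in Python; `passes_filters` is a hypothetical helper.

```python
def passes_filters(base_quality: int, mapping_quality: int,
                   min_base_q: int = 20, min_mapping_q: int = 20) -> bool:
    # ASSUMPTION: a value equal to the minimum is still included.
    return base_quality >= min_base_q and mapping_quality >= min_mapping_q
```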
--OUTPUT / -O
Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRIC of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.
R File null
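The naming rule can be sketched in Python as follows. The lookup tables contain only the extension and suffixes that the example above confirms (`error`, `base_quality`, `gc`); the suffixes of the other error types and stratifiers are not listed on this page, so the sketch deliberately fails on them rather than guessing. It also refuses a directive with no stratifier, since the pattern above does not cover that case.

```python
# Only the mappings confirmed by the documented example are included.
ERROR_EXTENSIONS = {"ERROR": "error"}
STRATIFIER_SUFFIXES = {"BASE_QUALITY": "base_quality", "GC_CONTENT": "gc"}


def output_name(basename: str, directive: str) -> str:
    error, *stratifiers = directive.split(":")
    if not stratifiers:
        # The documented pattern always has at least one stratifier.
        raise ValueError("this sketch needs at least one stratifier")
    suffix = "_and_".join(STRATIFIER_SUFFIXES[s] for s in stratifiers)
    return f"{basename}.{ERROR_EXTENSIONS[error]}_by_{suffix}"
```

With a hypothetical basename `sample`, the directive `ERROR:BASE_QUALITY:GC_CONTENT` gives `sample.error_by_base_quality_and_gc`.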
--PRIOR_Q / -PE
The prior error, in phred-scale (used for calculating empirical error rates).
int 30 [ [ -∞ ∞ ] ]
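A phred-scaled value Q corresponds to an error probability of 10^(-Q/10), so the default of 30 is a prior error of 0.001. The Python sketch below shows the conversion in both directions. The `smoothed_error_rate` function is purely an assumption about how a prior might enter an empirical estimate (as a pseudo-count); this page does not give the tool's formula.

```python
import math


def phred_to_probability(q: float) -> float:
    # Standard phred definition: p = 10^(-Q/10).
    return 10.0 ** (-q / 10.0)


def probability_to_phred(p: float) -> float:
    # Inverse of the conversion above.
    return -10.0 * math.log10(p)


def smoothed_error_rate(errors: int, total: int, prior_q: float = 30) -> float:
    # ASSUMPTION (not taken from this page): the prior is added as a
    # pseudo-count so that an empty bin falls back to the prior itself.
    prior = phred_to_probability(prior_q)
    return (errors + prior) / (total + 1.0)
```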
--PROBABILITY / -P
The probability of selecting a locus for analysis (for downsampling).
double 1.0 [ [ -∞ ∞ ] ]
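Downsampling by a per-locus probability can be pictured as an independent random draw for each locus. The Python sketch below illustrates that idea only; the function name, the seed parameter, and the use of Python's `random` module are assumptions, not a description of the tool's internals.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def downsample(loci: Iterable[T], probability: float, seed: int = 0) -> Iterator[T]:
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    for locus in loci:
        # random() lies in [0, 1), so 1.0 keeps every locus and 0.0 keeps none.
        if rng.random() < probability:
            yield locus
```

With the default of 1.0 every locus is kept, which matches the tool running without downsampling.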
--PROGRESS_STEP_INTERVAL / NA
The interval between which progress will be displayed.
int 100000 [ [ -∞ ∞ ] ]
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null
--showHidden / -showHidden
display hidden arguments
boolean false
--STRATIFIER_VALUE / NA
A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).
The --STRATIFIER_VALUE argument is an enumerated type (Stratifier), which can have one of the following values:
- ALL
- GC_CONTENT
- READ_ORDINALITY
- READ_BASE
- READ_DIRECTION
- PAIR_ORIENTATION
- PAIR_PROPERNESS
- REFERENCE_BASE
- PRE_DINUC
- POST_DINUC
- HOMOPOLYMER_LENGTH
- HOMOPOLYMER
- BINNED_HOMOPOLYMER
- FLOWCELL_TILE
- READ_GROUP
- CYCLE
- BINNED_CYCLE
- SOFT_CLIPS
- INSERT_LENGTH
- BASE_QUALITY
- MAPPING_QUALITY
- MISMATCHES_IN_READ
- ONE_BASE_PADDED_CONTEXT
- TWO_BASE_PADDED_CONTEXT
- CONSENSUS
- NS_IN_READ
- INSERTIONS_IN_READ
- DELETIONS_IN_READ
- INDELS_IN_READ
- INDEL_LENGTH
Stratifier null
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VCF / -V
VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.
R String null
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.1.6.0-SNAPSHOT built at Thu, 2 Apr 2020 14:54:17 -0400.