Collect metrics to quantify single-base sequencing artifacts.
This tool examines two sources of sequencing errors associated with hybrid selection protocols. These errors are divided into two broad categories: pre-adapter and bait-bias. Pre-adapter errors can arise from laboratory manipulations of a nucleic acid sample (e.g., shearing) and occur prior to the ligation of adapters for PCR amplification, hence the name pre-adapter.
Bait-bias artifacts occur during or after the target selection step, and correlate with substitution rates that are 'biased', or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, during the target selection step, a (G>T) artifact might result in a higher substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive)/(G negative). This is known as the 'G-Ref' artifact.
For additional information on these types of artifacts, please see the corresponding GATK dictionary entries on bait-bias and pre-adapter artifacts.
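As a rough illustration of the bait-bias comparison described above, the sketch below computes a substitution rate for sites with a given reference base on the positive strand and for sites with the complementary base, then takes the difference. The function names and the counts are hypothetical, and this is not necessarily the exact formula the tool applies; the BaitBiasDetailMetrics documentation is the authoritative source.

```python
def substitution_rate(ref_count: int, alt_count: int) -> float:
    """Fraction of observed bases at these sites that show the substitution."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def strand_bias_excess(fwd_ref: int, fwd_alt: int, rev_ref: int, rev_alt: int) -> float:
    """Excess substitution rate at sites with the reference base on the
    positive strand, relative to sites with the complementary base there."""
    return substitution_rate(fwd_ref, fwd_alt) - substitution_rate(rev_ref, rev_alt)


# Hypothetical counts for a G>T check:
# 'fwd' = sites with G on the positive strand, 'rev' = sites with C there.
excess = strand_bias_excess(fwd_ref=99_000, fwd_alt=1_000, rev_ref=99_900, rev_alt=100)
print(f"excess G>T rate at G-ref sites: {excess:.4f}")
```

A clearly positive excess at G-ref sites is the pattern the text calls the 'G-Ref' artifact; an excess near zero suggests no strand-specific bias for that substitution.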
This tool produces four files: summary and detail metrics files for both pre-adapter and bait-bias artifacts. The detailed metrics show the error rates for each type of base substitution within every possible triplet base configuration. Error rates associated with these substitutions are Phred-scaled and provided as quality scores; the lower the value, the more likely it is that an alternate base call is due to an artifact. The summary metrics provide likelihood information on the 'worst-case' errors.
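To make the Phred scaling concrete, here is a minimal sketch of the standard conversion Q = -10 * log10(error rate) and its inverse. The helper names and the floor used to avoid taking the logarithm of zero are assumptions for illustration, not values taken from the tool.

```python
import math


def phred_from_error_rate(error_rate: float, floor: float = 1e-10) -> float:
    """Standard Phred scaling: Q = -10 * log10(error rate).

    The floor (an assumed value) keeps log10 defined when the rate is zero.
    """
    return -10.0 * math.log10(max(error_rate, floor))


def error_rate_from_phred(q: float) -> float:
    """Inverse conversion: error rate = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


for rate in (1e-2, 1e-3, 1e-4):
    print(f"error rate {rate:g} -> Q{phred_from_error_rate(rate):.0f}")
```

An error rate of 1 in 1,000 corresponds to Q30, and each drop of 10 in the score means a tenfold higher error rate, which is why lower scores point to a stronger artifact signal.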
Usage example:
java -jar picard.jar CollectSequencingArtifactMetrics \
I=input.bam \
O=artifact_metrics.txt \
R=reference_sequence.fasta
Please see the metrics at the following links: PreAdapterDetailMetrics, PreAdapterSummaryMetrics, BaitBiasDetailMetrics, and BaitBiasSummaryMetrics for complete descriptions of the output metrics produced by this tool.
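When the tool is run from a pipeline, the usage example can be assembled programmatically. The sketch below is one possible wrapper, not part of Picard: `build_command` is a hypothetical helper, the file names are the placeholders from the usage example, and it uses the same KEY=value form for optional arguments listed in the argument table.

```python
import shlex


def build_command(input_bam, output_prefix, reference, picard_jar="picard.jar", **options):
    """Return the argument list for the usage example plus optional KEY=value pairs."""
    cmd = [
        "java", "-jar", picard_jar, "CollectSequencingArtifactMetrics",
        f"I={input_bam}",
        f"O={output_prefix}",
        f"R={reference}",
    ]
    for key, value in options.items():
        if isinstance(value, bool):
            value = str(value).lower()  # render Python booleans as true/false
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_command(
    "input.bam", "artifact_metrics.txt", "reference_sequence.fasta",
    MINIMUM_MAPPING_QUALITY=30, INCLUDE_DUPLICATES=False,
)
print(shlex.join(cmd))
# To actually run it (requires Java and picard.jar):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell quoting problems when paths contain spaces.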
Category Diagnostics and Quality Control
Overview
Quantify substitution errors caused by mismatched base pairings during various stages of sample/library prep. We measure two distinct error types: artifacts that are introduced before the addition of the read1/read2 adapters ("pre adapter") and those that are introduced after target selection ("bait bias"). For each of these, we provide summary metrics as well as detail metrics broken down by reference context (the ref bases surrounding the substitution event). For a deeper explanation, see Costello et al. 2013: http://www.ncbi.nlm.nih.gov/pubmed/23303777
CollectSequencingArtifactMetrics (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | null | Input SAM or BAM file.
--OUTPUT -O | null | File to write the output to.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--ASSUME_SORTED -AS | true | If true (default), then the sort order in the header file will be ignored.
--CONTEXT_SIZE | 1 | The number of context bases to include on each side of the assayed base.
--CONTEXTS_TO_PRINT | [] | If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
--DB_SNP | null | VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
--FILE_EXTENSION -EXT | null | Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
--help -h | false | display the help message
--INCLUDE_DUPLICATES -DUPES | false | Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
--INCLUDE_NON_PF_READS -NON_PF | false | Whether or not to include non-PF reads.
--INCLUDE_UNPAIRED -UNPAIRED | false | Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.
--INTERVALS | null | An optional list of intervals to restrict analysis to.
--MAXIMUM_INSERT_SIZE -MAX_INS | 600 | The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
--MINIMUM_INSERT_SIZE -MIN_INS | 60 | The minimum insert size for a read to be included in analysis.
--MINIMUM_MAPPING_QUALITY -MQ | 30 | The minimum mapping quality score for a base to be included in analysis.
--MINIMUM_QUALITY_SCORE -Q | 20 | The minimum base quality score for a base to be included in analysis.
--STOP_AFTER | 0 | Stop after processing N reads, mainly for debugging.
--TANDEM_READS -TANDEM | false | Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
--USE_OQ | true | When available, use original quality scores for filtering.
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--ASSUME_SORTED / -AS
If true (default), then the sort order in the header file will be ignored.
boolean true
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CONTEXT_SIZE / NA
The number of context bases to include on each side of the assayed base.
int 1 [ [ -∞ ∞ ] ]
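To show what CONTEXT_SIZE implies for the size of the detail output, the sketch below enumerates every possible reference context with a given number of bases on each side of the assayed base. The function name is hypothetical and the enumeration order is arbitrary; it is only meant to illustrate how the number of contexts grows.

```python
from itertools import product
from typing import List

BASES = "ACGT"


def all_contexts(context_size: int = 1) -> List[str]:
    """Every context with `context_size` bases on each side of the assayed base."""
    if context_size < 0:
        raise ValueError("context_size must be non-negative")
    length = 2 * context_size + 1  # flanking bases plus the assayed base itself
    return ["".join(bases) for bases in product(BASES, repeat=length)]


for size in (0, 1, 2):
    print(size, len(all_contexts(size)))
```

With the default of 1 there are 4^3 = 64 triplets; a context size of 2 gives 4^5 = 1024 contexts, so the detail metrics grow quickly as the value increases.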
--CONTEXTS_TO_PRINT / NA
If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
Set[String] []
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--DB_SNP / NA
VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
File null
--FILE_EXTENSION / -EXT
Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
String null
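As an illustration of the naming pattern given in the description above (OUTPUT.pre_adapter_summary_metrics.EXT), the sketch below lists the four file names one would expect for a given OUTPUT value. Only the pre-adapter summary suffix appears in the description; the other three suffixes are assumed by analogy with the four outputs the tool reports. Whether the tool inserts a dot before the extension is not stated here, so the hypothetical helper appends the extension exactly as given.

```python
# Suffixes for the four metrics files; all but the first are assumptions.
SUFFIXES = (
    "pre_adapter_summary_metrics",
    "pre_adapter_detail_metrics",
    "bait_bias_summary_metrics",
    "bait_bias_detail_metrics",
)


def expected_outputs(output, file_extension=None):
    """Return the expected metrics file names for OUTPUT and FILE_EXTENSION."""
    ext = file_extension or ""  # null/None means nothing is appended
    return [f"{output}.{suffix}{ext}" for suffix in SUFFIXES]


for name in expected_outputs("artifact_metrics.txt", file_extension=".txt"):
    print(name)
```

Checking for these names after a run is a cheap way for a pipeline to confirm that all four metrics files were written.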
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INCLUDE_DUPLICATES / -DUPES
Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
boolean false
--INCLUDE_NON_PF_READS / -NON_PF
Whether or not to include non-PF reads.
boolean false
--INCLUDE_UNPAIRED / -UNPAIRED
Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.
boolean false
--INPUT / -I
Input SAM or BAM file.
R File null
--INTERVALS / NA
An optional list of intervals to restrict analysis to.
File null
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MAXIMUM_INSERT_SIZE / -MAX_INS
The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
int 600 [ [ -∞ ∞ ] ]
--MINIMUM_INSERT_SIZE / -MIN_INS
The minimum insert size for a read to be included in analysis.
int 60 [ [ -∞ ∞ ] ]
--MINIMUM_MAPPING_QUALITY / -MQ
The minimum mapping quality score for a base to be included in analysis.
int 30 [ [ -∞ ∞ ] ]
--MINIMUM_QUALITY_SCORE / -Q
The minimum base quality score for a base to be included in analysis.
int 20 [ [ -∞ ∞ ] ]
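The insert-size and quality thresholds above can be read as a set of simple filters. The sketch below is an assumed interpretation, not the tool's source code: the class and function names are hypothetical, the comparisons are taken to be inclusive, and a MAXIMUM_INSERT_SIZE of 0 is treated as "no maximum" as the description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    minimum_insert_size: int = 60
    maximum_insert_size: int = 600  # 0 means no maximum
    minimum_mapping_quality: int = 30
    minimum_quality_score: int = 20


DEFAULTS = Thresholds()


def read_passes(insert_size: int, mapping_quality: int, t: Thresholds = DEFAULTS) -> bool:
    """Read-level checks: insert size window and mapping quality."""
    size = abs(insert_size)  # insert sizes can be recorded as negative values
    if size < t.minimum_insert_size:
        return False
    if t.maximum_insert_size > 0 and size > t.maximum_insert_size:
        return False
    return mapping_quality >= t.minimum_mapping_quality


def base_passes(base_quality: int, t: Thresholds = DEFAULTS) -> bool:
    """Base-level check: base quality score."""
    return base_quality >= t.minimum_quality_score


print(read_passes(insert_size=300, mapping_quality=60))
print(read_passes(insert_size=700, mapping_quality=60))
print(read_passes(insert_size=700, mapping_quality=60, t=Thresholds(maximum_insert_size=0)))
```

Separating read-level from base-level checks mirrors the way the arguments are described: insert size and mapping quality belong to the read, while MINIMUM_QUALITY_SCORE applies to each base.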
--OUTPUT / -O
File to write the output to.
R File null
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null
--showHidden / -showHidden
display hidden arguments
boolean false
--STOP_AFTER / NA
Stop after processing N reads, mainly for debugging.
long 0 [ [ -∞ ∞ ] ]
--TANDEM_READS / -TANDEM
Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
boolean false
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--USE_OQ / NA
When available, use original quality scores for filtering.
boolean true
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.0.3.0 built at 02-29-2019 02:29:33.