Collects hybrid-selection (HS) metrics for a SAM or BAM file.
Category Diagnostics and Quality Control
Overview
This tool takes a SAM/BAM file input and collects metrics that are specific for sequence datasets generated through hybrid-selection. Hybrid-selection (HS) is the most commonly used technique to capture exon-specific sequences for targeted sequencing experiments such as exome sequencing; for more information, please see the corresponding GATK Dictionary entry.
This tool requires an aligned SAM or BAM file as well as bait and target interval files in Picard interval_list format. You should use the bait and target interval files that correspond to the capture kit that was used to generate the capture libraries for sequencing; these can generally be obtained from the kit manufacturer. If the bait and target intervals are provided in BED format, you can convert them to the Picard interval_list format using Picard's BedToIntervalList tool.
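To make the difference between the two formats concrete, here is a minimal sketch of the coordinate shift involved: BED rows use 0-based, end-exclusive coordinates, while interval_list rows use 1-based, inclusive coordinates followed by strand and name. The function name `bed_to_interval_rows`, the sample intervals, and the fallback values (`+` for a missing strand, `.` for a missing name) are assumptions for illustration. The sketch does not write the SAM-style sequence dictionary header that a real interval_list file needs, so use Picard's conversion tool for actual data.

```shell
#!/bin/sh
# Illustration only: converts BED rows to interval_list-style rows.
# It does NOT emit the sequence dictionary header a real interval_list requires.
bed_to_interval_rows() {
    # BED columns: chrom, 0-based start, exclusive end, optional name (4), optional strand (6).
    # Output columns: chrom, 1-based start, inclusive end, strand, name.
    awk 'BEGIN { FS = OFS = "\t" }
         /^(track|browser|#)/ { next }   # skip BED header/comment lines
         {
             name   = (NF >= 4 && $4 != "") ? $4 : "."
             strand = (NF >= 6 && ($6 == "+" || $6 == "-")) ? $6 : "+"
             print $1, $2 + 1, $3, strand, name
         }'
}

# Hypothetical two-interval BED input.
printf 'chr1\t99\t200\texon1\nchr2\t0\t50\texon2\t0\t-\n' | bed_to_interval_rows
```

Only the start coordinate changes (it is incremented by one); the end coordinate is numerically the same in both conventions.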
If a reference sequence is provided, this program will calculate both AT_DROPOUT and GC_DROPOUT metrics. Dropout metrics attempt to measure the reduced representation of reads in regions that deviate from 50% G/C content. This reduction in the number of aligned reads is due to the increased number of errors associated with sequencing regions that have excessive or deficient numbers of G/C bases, which ultimately leads to poor mapping efficiency and low coverage in the affected regions.
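As a reminder of what "G/C content" means for a region, the sketch below computes the G/C fraction of a short sequence string. It is only an illustration of the quantity, not the tool's dropout calculation; the function name `gc_fraction` and the example sequences are hypothetical, and ignoring bases other than A, C, G, and T is an assumption of this sketch.

```shell
#!/bin/sh
# Prints the G/C fraction of a nucleotide string (case-insensitive).
# Bases other than A, C, G, T (for example N) are ignored in this sketch.
gc_fraction() {
    printf '%s\n' "$1" | awk '{
        seq = toupper($0)
        gc = gsub(/[GC]/, "", seq)   # gsub returns the number of replacements made
        at = gsub(/[AT]/, "", seq)
        if (gc + at == 0) { print "NA"; exit }
        printf "%.2f\n", gc / (gc + at)
    }'
}

gc_fraction "ATGCGCGC"   # 6 of 8 bases are G or C
gc_fraction "ATATATGC"   # 2 of 8 bases are G or C
```

Regions whose fraction lies far from 0.50 in either direction are the ones the dropout metrics are concerned with.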
If you are interested in getting G/C content and mean sequence depth information for every target interval, use the PER_TARGET_COVERAGE option.
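Once a per-target file has been written, it can be filtered with standard tools. The sketch below lists targets whose mean coverage falls below a threshold. It assumes the file is tab-separated with a header row containing columns named `name`, `%gc`, and `mean_coverage`; check these names against your own output before relying on it. The function name `low_coverage_targets` and the sample rows are hypothetical.

```shell
#!/bin/sh
# Lists name, G/C fraction, and mean coverage for targets below a coverage threshold.
# $1: per-target coverage file (tab-separated, header row assumed)
# $2: mean-coverage threshold
low_coverage_targets() {
    awk -v min="$2" 'BEGIN { FS = OFS = "\t" }
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
        ($(col["mean_coverage"]) + 0) < (min + 0) {
            print $(col["name"]), $(col["%gc"]), $(col["mean_coverage"])
        }' "$1"
}

# Hypothetical, truncated per-target file used only to exercise the function.
tmp="$(mktemp)"
printf 'chrom\tstart\tend\tlength\tname\t%%gc\tmean_coverage\n' > "$tmp"
printf 'chr1\t100\t200\t101\texon1\t0.52\t85.3\n' >> "$tmp"
printf 'chr1\t500\t600\t101\texon2\t0.78\t12.5\n' >> "$tmp"
low_coverage_targets "$tmp" 20
rm -f "$tmp"
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between tool versions.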
Note: Metrics labeled as percentages are actually expressed as fractions!
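For example, a value of 0.853 in a field whose name starts with PCT_ should be read as 85.3%. A minimal sketch of the conversion, with a hypothetical helper name `fraction_to_percent`:

```shell
#!/bin/sh
# Converts a fraction (as reported in the metrics file) to a percentage string.
fraction_to_percent() {
    awk -v f="$1" 'BEGIN { printf "%.1f%%\n", f * 100 }'
}

fraction_to_percent 0.853
```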
Usage Example:
java -jar picard.jar CollectHsMetrics \
I=input_reads.bam \
O=output_hs_metrics.txt \
R=reference.fasta \
BAIT_INTERVALS=bait.interval_list \
TARGET_INTERVALS=target.interval_list
Please see CollectHsMetrics for detailed descriptions of the output metrics produced by this tool.
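The output file is easier to read when each metric is printed on its own line. The sketch below assumes the usual Picard metrics layout: comment lines starting with `#`, a `## METRICS CLASS` line, a tab-separated header row, one or more value rows, then a blank line before any histogram section. The function name `print_hs_metrics` and the sample file contents are hypothetical.

```shell
#!/bin/sh
# Prints NAME=value pairs from the metrics table of a Picard-style metrics file.
print_hs_metrics() {
    awk 'BEGIN { FS = "\t" }
         /^## METRICS CLASS/ { in_metrics = 1; next }
         in_metrics && /^$/  { exit }                              # blank line ends the table
         in_metrics && !n    { n = split($0, name, FS); next }     # header row
         in_metrics          { for (i = 1; i <= n; i++) print name[i] "=" $i }
    ' "$1"
}

# Hypothetical, heavily truncated metrics file used only to exercise the function.
tmp="$(mktemp)"
printf '## METRICS CLASS\tpicard.analysis.directed.HsMetrics\n' > "$tmp"
printf 'BAIT_SET\tPCT_SELECTED_BASES\n' >> "$tmp"
printf 'my_baits\t0.85\n' >> "$tmp"
printf '\n## HISTOGRAM\tjava.lang.Integer\n' >> "$tmp"
print_hs_metrics "$tmp"
rm -f "$tmp"
```

If several accumulation levels were requested, the table has one value row per level and the sketch prints one set of pairs for each row.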
See HsMetricCollector and CollectTargetedMetrics for more details.
CollectHsMetrics (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--BAIT_INTERVALS -BI | [] | An interval list file that contains the locations of the baits used.
--INPUT -I | null | An aligned SAM or BAM file.
--OUTPUT -O | null | The output file to write the metrics to.
--TARGET_INTERVALS -TI | [] | An interval list file that contains the locations of the targets.
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--BAIT_SET_NAME -N | null | Bait set name. If not provided it is inferred from the filename of the bait intervals.
--CLIP_OVERLAPPING_READS | true | True if we are to clip overlapping reads, false otherwise.
--COVERAGE_CAP -covMax | 200 | Parameter to set a max coverage limit for Theoretical Sensitivity calculations. Default is 200.
--help -h | false | display the help message
--METRIC_ACCUMULATION_LEVEL -LEVEL | [ALL_READS] | The level(s) at which to accumulate metrics.
--MINIMUM_BASE_QUALITY -Q | 20 | Minimum base quality for a base to contribute coverage.
--MINIMUM_MAPPING_QUALITY -MQ | 20 | Minimum mapping quality for a read to contribute coverage.
--NEAR_DISTANCE | 250 | The maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered 'near probe' and included in percent selected.
--PER_BASE_COVERAGE | null | An optional file to output per base coverage information to. The per-base file contains one line per target base and can grow very large. It is not recommended for use with large target sets.
--PER_TARGET_COVERAGE | null | An optional file to output per target coverage information to.
--SAMPLE_SIZE | 10000 | Sample Size used for Theoretical Het Sensitivity sampling. Default is 10000.
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--BAIT_INTERVALS / -BI
An interval list file that contains the locations of the baits used.
R List[File] []
--BAIT_SET_NAME / -N
Bait set name. If not provided it is inferred from the filename of the bait intervals.
String null
--CLIP_OVERLAPPING_READS / NA
True if we are to clip overlapping reads, false otherwise.
boolean true
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--COVERAGE_CAP / -covMax
Parameter to set a max coverage limit for Theoretical Sensitivity calculations. Default is 200.
int 200 [ [ -∞ ∞ ] ]
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INPUT / -I
An aligned SAM or BAM file.
R File null
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--METRIC_ACCUMULATION_LEVEL / -LEVEL
The level(s) at which to accumulate metrics.
Set[MetricAccumulationLevel] [ALL_READS]
--MINIMUM_BASE_QUALITY / -Q
Minimum base quality for a base to contribute coverage.
int 20 [ [ -∞ ∞ ] ]
--MINIMUM_MAPPING_QUALITY / -MQ
Minimum mapping quality for a read to contribute coverage.
int 20 [ [ -∞ ∞ ] ]
--NEAR_DISTANCE / NA
The maximum distance between a read and the nearest probe/bait/amplicon for the read to be considered 'near probe' and included in percent selected.
int 250 [ [ -∞ ∞ ] ]
--OUTPUT / -O
The output file to write the metrics to.
R File null
--PER_BASE_COVERAGE / NA
An optional file to output per base coverage information to. The per-base file contains one line per target base and can grow very large. It is not recommended for use with large target sets.
File null
--PER_TARGET_COVERAGE / NA
An optional file to output per target coverage information to.
File null
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE / -R
Reference sequence file.
File null
--SAMPLE_SIZE / NA
Sample Size used for Theoretical Het Sensitivity sampling. Default is 10000.
int 10000 [ [ -∞ ∞ ] ]
--showHidden / -showHidden
display hidden arguments
boolean false
--TARGET_INTERVALS / -TI
An interval list file that contains the locations of the targets.
R List[File] []
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.0.5.1 built at 25-59-2019 01:59:53.