Determine callable status of loci
Category: Coverage Analysis
Overview
Collect statistics on callable, uncallable, poorly mapped, and other parts of the genome.
A very common question about an NGS set of reads is which areas of the genome are considered callable. This tool considers the coverage at each locus and emits either a per-base state or a summary interval BED file that partitions the genomic intervals into the following callable states:
- REF_N
- The reference base was an N, which is not considered callable by the GATK
- PASS
- The base satisfied the minimum depth for calling (--min-depth) but had less than --max-depth coverage, so it avoids EXCESSIVE_COVERAGE
- NO_COVERAGE
- Absolutely no reads were seen at this locus, regardless of the filtering parameters
- LOW_COVERAGE
- There were fewer than --min-depth bases at the locus, after applying filters
- EXCESSIVE_COVERAGE
- More than --max-depth reads were seen at the locus, indicating some sort of mapping problem
- POOR_MAPPING_QUALITY
- More than --max-fraction-of-reads-with-low-mapq of the reads at the locus have low mapping quality, indicating that the reads are poorly mapped
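The state definitions above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not the tool's actual implementation: the function name `callable_state`, its inputs, the order in which the checks are applied, and whether each comparison is strict or inclusive are all assumptions made for illustration. The default thresholds are the documented defaults of --min-depth, --min-depth-for-low-mapq and --max-fraction-of-reads-with-low-mapq; --max-depth has no default, so it is modeled as `None`.

```python
from typing import Optional


def callable_state(
    ref_is_n: bool,
    raw_depth: int,          # all reads at the locus, before quality thresholds
    qc_depth: int,           # QC+ bases (pass --min-mapping-quality / --min-base-quality)
    low_mapq_reads: int,     # reads with MAPQ at or below --max-low-mapq
    min_depth: int = 4,
    max_depth: Optional[int] = None,
    min_depth_for_low_mapq: int = 10,
    max_fraction_low_mapq: float = 0.1,
) -> str:
    """Hypothetical helper: assign one callable state to a single locus."""
    if ref_is_n:
        return "REF_N"
    if raw_depth == 0:
        return "NO_COVERAGE"
    # Assumed order: the poor-mapping check comes before the depth checks.
    if (raw_depth >= min_depth_for_low_mapq
            and low_mapq_reads / raw_depth > max_fraction_low_mapq):
        return "POOR_MAPPING_QUALITY"
    if qc_depth < min_depth:
        return "LOW_COVERAGE"
    if max_depth is not None and qc_depth > max_depth:
        return "EXCESSIVE_COVERAGE"
    return "PASS"


if __name__ == "__main__":
    print(callable_state(False, raw_depth=30, qc_depth=28, low_mapq_reads=1))
    print(callable_state(False, raw_depth=20, qc_depth=15, low_mapq_reads=5))
```

Passing counts rather than reads keeps the sketch independent of how individual reads are binned by MAPQ and base quality.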
Input
A BAM file containing exactly one sample.
Output
A file with the callable status covering each base, and a summary table giving the count of examined bases in each callable state.
Usage example
    gatk CallableLoci \
      -I myreads.bam \
      -R myreference.fasta \
      -O callable_status.bed \
      --summary table.txt
would produce a BED file that looks like:
    20 10000000 10000864 PASS
    20 10000865 10000985 POOR_MAPPING_QUALITY
    20 10000986 10001138 PASS
    20 10001139 10001254 POOR_MAPPING_QUALITY
    20 10001255 10012255 PASS
    20 10012256 10012259 POOR_MAPPING_QUALITY
    20 10012260 10012263 PASS
    20 10012264 10012328 POOR_MAPPING_QUALITY
    20 10012329 10012550 PASS
    20 10012551 10012551 LOW_COVERAGE
    20 10012552 10012554 PASS
    20 10012555 10012557 LOW_COVERAGE
    20 10012558 10012558 PASS
as well as a summary table that looks like:
    state nBases
    REF_N 0
    PASS 996046
    NO_COVERAGE 121
    LOW_COVERAGE 928
    EXCESSIVE_COVERAGE 0
    POOR_MAPPING_QUALITY 2906
Author: Mark DePristo / Jonn Smith. Since: May 7, 2010 / Nov 1, 2024.
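For downstream checks it can be handy to re-derive per-state base counts from the interval output. The sketch below is a hypothetical Python helper (`tally_states` is not part of GATK) that sums interval lengths per state from whitespace-separated rows of contig, start, end and state. The coordinate convention is an assumption: standard BED is half-open, but consecutive rows in the example above start one past the previous end, so an `end_inclusive` switch is provided rather than guessing which convention applies.

```python
from collections import Counter
from typing import Iterable


def tally_states(bed_lines: Iterable[str], end_inclusive: bool = False) -> Counter:
    """Sum the number of bases per callable state from BED-like rows."""
    totals: Counter = Counter()
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        start, end, state = int(fields[1]), int(fields[2]), fields[3]
        # Half-open intervals span end - start bases; closed ones span one more.
        totals[state] += end - start + (1 if end_inclusive else 0)
    return totals


example = """\
20 10000000 10000864 PASS
20 10000865 10000985 POOR_MAPPING_QUALITY
20 10000986 10001138 PASS
"""

if __name__ == "__main__":
    for state, n_bases in tally_states(example.splitlines()).items():
        print(state, n_bases)
```

Comparing these totals against the file written by --summary is one way to find out which convention a given output follows.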
Additional Information
Read filters
These Read Filters are automatically applied to the data by the Engine before processing by CallableLoci.
- GoodCigarReadFilter
- PassesVendorQualityCheckReadFilter
- MappedReadFilter
- NotDuplicateReadFilter
- PrimaryLineReadFilter
- WellformedReadFilter
CallableLoci specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--input -I | | BAM/SAM/CRAM file containing reads
--output -O | | Output file (BED or per-base format)
--reference -R | | Reference sequence file
--summary | | Name of file for output summary
Optional Tool Arguments | |
--arguments_file | | read one or more arguments files and add them to the command line
--cloud-index-prefetch-buffer -CIPB | -1 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
--cloud-prefetch-buffer -CPB | 40 | Size of the cloud-only prefetch buffer (in MB; 0 to disable).
--disable-bam-index-caching -DBIC | false | If true, don't cache BAM indexes; this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
--disable-sequence-dictionary-validation | false | If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
--gcs-max-retries -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
--help -h | false | display the help message
--interval-merging-rule -imr | ALL | Interval merging rule for abutting intervals
--intervals -L | | One or more genomic intervals over which to operate
--max-depth | | Maximum read depth before a locus is considered poorly mapped
--max-depth-per-sample | 0 | Maximum number of reads to retain per sample per locus. Reads above this threshold will be downsampled. Set to 0 to disable.
--max-fraction-of-reads-with-low-mapq -frlmq | 0.1 | If the fraction of reads at a base with low mapping quality exceeds this value, the site may be poorly mapped
--max-low-mapq -mlmq | 1 | Maximum value for MAPQ to be considered a problematic mapped read
--min-base-quality -mbq | 20 | Minimum quality of bases to count towards depth
--min-mapping-quality -mmq | 10 | Minimum mapping quality of reads to count towards depth
--sites-only-vcf-output | false | If true, don't emit genotype fields when writing VCF file output.
--version | false | display the version number for this tool
Optional Common Arguments | |
--add-output-sam-program-record | true | If true, adds a PG tag to created SAM/BAM/CRAM files.
--add-output-vcf-command-line | true | If true, adds a command line header line to created VCF files.
--create-output-bam-index -OBI | true | If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
--create-output-bam-md5 -OBM | false | If true, create an MD5 digest for any BAM/SAM/CRAM file created
--create-output-variant-index -OVI | true | If true, create a VCF index when writing a coordinate-sorted VCF file.
--create-output-variant-md5 -OVM | false | If true, create an MD5 digest for any VCF file created.
--disable-read-filter -DF | | Read filters to be disabled before analysis
--disable-tool-default-read-filters | false | Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
--exclude-intervals -XL | | One or more genomic intervals to exclude from processing
--gatk-config-file | | A configuration file to use with the GATK.
--interval-exclusion-padding -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding -ip | 0 | Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule -isr | UNION | Set merging approach to use for combining interval inputs
--inverted-read-filter -XRF | | Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).
--lenient -LE | false | Lenient processing of VCF files
--max-variants-per-shard | 0 | If non-zero, partitions VCF output into shards, each containing up to the given number of records.
--QUIET | false | Whether to suppress job-summary info on System.err.
--read-filter -RF | | Read filters to be applied before analysis
--read-index | | Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
--read-validation-stringency -VS | SILENT | Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--seconds-between-progress-updates | 10.0 | Output traversal statistics every time this many seconds elapse
--sequence-dictionary | | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
--tmp-dir | | Temp directory to use.
--use-jdk-deflater -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity | INFO | Control verbosity of logging.
Advanced Arguments | |
--format | BED | Output format
--min-depth | 4 | Minimum QC+ read depth before a locus is considered callable
--min-depth-for-low-mapq -mdflmq | 10 | Minimum read depth before a locus is considered a potential candidate for poorly mapped
--showHidden | false | display hidden arguments
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--add-output-sam-program-record / -add-output-sam-program-record
If true, adds a PG tag to created SAM/BAM/CRAM files.
boolean true
--add-output-vcf-command-line / -add-output-vcf-command-line
If true, adds a command line header line to created VCF files.
boolean true
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--cloud-index-prefetch-buffer / -CIPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
int -1 [ [ -∞ ∞ ] ]
--cloud-prefetch-buffer / -CPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable).
int 40 [ [ -∞ ∞ ] ]
--create-output-bam-index / -OBI
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
boolean true
--create-output-bam-md5 / -OBM
If true, create an MD5 digest for any BAM/SAM/CRAM file created
boolean false
--create-output-variant-index / -OVI
If true, create a VCF index when writing a coordinate-sorted VCF file.
boolean true
--create-output-variant-md5 / -OVM
If true, create an MD5 digest for any VCF file created.
boolean false
--disable-bam-index-caching / -DBIC
If true, don't cache BAM indexes; this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
boolean false
--disable-read-filter / -DF
Read filters to be disabled before analysis
List[String] []
--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation
If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
boolean false
--disable-tool-default-read-filters / -disable-tool-default-read-filters
Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
boolean false
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).
List[String] []
--format / -format
Output format
The output of this tool will be written in this format. The recommended option is BED.
The --format argument is an enumerated type (OutputFormat), which can have one of the following values:
- BED
- STATE_PER_BASE
OutputFormat BED
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--help / -h
display the help message
boolean false
--input / -I
BAM/SAM/CRAM file containing reads
R List[GATKPath] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
- ALL
- OVERLAPPING_ONLY
IntervalMergingRule ALL
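The difference between the two merging rules can be illustrated with a minimal Python sketch. It assumes 1-based, inclusive (start, end) pairs on a single contig; `merge_intervals` is a hypothetical helper written for this illustration and is not GATK code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on one contig


def merge_intervals(intervals: List[Interval], rule: str = "ALL") -> List[Interval]:
    """Merge overlapping intervals; with rule 'ALL', also merge abutting ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged:
            last_start, last_end = merged[-1]
            overlaps = start <= last_end
            abuts = start == last_end + 1  # side by side, no shared base
            if overlaps or (rule == "ALL" and abuts):
                merged[-1] = (last_start, max(last_end, end))
                continue
        merged.append((start, end))
    return merged


if __name__ == "__main__":
    regions = [(1, 100), (101, 200), (150, 300)]
    print(merge_intervals(regions, rule="ALL"))
    print(merge_intervals(regions, rule="OVERLAPPING_ONLY"))
```

Under this reading, 1-100 and 101-200 collapse into one interval only with ALL, while 101-200 and 150-300 are merged under either rule because they genuinely overlap.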
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
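The padding arithmetic in the example above ('-L 1:100' with padding 20 becoming '-L 1:80-120') can be sketched as follows. `pad_interval` is a hypothetical Python helper; clamping the result to the contig boundaries is an assumption added so the sketch never yields a position below 1, and it is not described in the text above.

```python
from typing import Optional, Tuple


def pad_interval(start: int, end: int, padding: int,
                 contig_length: Optional[int] = None) -> Tuple[int, int]:
    """Extend a 1-based inclusive interval by `padding` bp on both sides."""
    padded_start = max(1, start - padding)  # assumed: never go below position 1
    padded_end = end + padding
    if contig_length is not None:
        padded_end = min(contig_length, padded_end)  # assumed: stop at contig end
    return padded_start, padded_end


if __name__ == "__main__":
    # The single position 1:100 with 20 bp of padding.
    print(pad_interval(100, 100, 20))
```

The same arithmetic would apply to intervals given with -XL when --interval-exclusion-padding is used.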
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
- UNION
- Take the union of all intervals
- INTERSECTION
- Take the intersection of intervals (the subset that overlaps all intervals specified)
IntervalSetRule UNION
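A minimal Python sketch of how the set rule and -XL subtraction combine, under the simplifying assumption that each interval input is expanded into a set of positions on one contig (practical only for tiny examples). `resolve_intervals` is a hypothetical name, and the real tool works on interval objects rather than position sets.

```python
from typing import Iterable, List


def resolve_intervals(include_sets: Iterable[Iterable[int]],
                      exclude_sets: Iterable[Iterable[int]] = (),
                      rule: str = "UNION") -> List[int]:
    """Combine -L inputs by `rule`, then subtract the union of all -XL inputs."""
    includes = [set(s) for s in include_sets]
    if not includes:
        included = set()
    elif rule == "INTERSECTION":
        included = set.intersection(*includes)
    else:
        included = set.union(*includes)
    # Exclusions are always merged with UNION, as described above.
    excluded = set().union(*exclude_sets)
    return sorted(included - excluded)


if __name__ == "__main__":
    exome_targets = range(1, 11)   # stands in for -L exomes.intervals
    chromosome = range(5, 16)      # stands in for a second -L input
    print(resolve_intervals([exome_targets, chromosome], rule="INTERSECTION"))
    print(resolve_intervals([exome_targets, chromosome], [range(8, 10)],
                            rule="INTERSECTION"))
```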
--intervals / -L
One or more genomic intervals over which to operate
List[String] []
--inverted-read-filter / -XRF
Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).
List[String] []
--lenient / -LE
Lenient processing of VCF files
boolean false
--max-depth / -max-depth
Maximum read depth before a locus is considered poorly mapped
If the QC+ depth exceeds this value, the site is considered to have EXCESSIVE_COVERAGE
Integer null
--max-depth-per-sample / -max-depth-per-sample
Maximum number of reads to retain per sample per locus. Reads above this threshold will be downsampled. Set to 0 to disable.
int 0 [ [ -∞ ∞ ] ]
--max-fraction-of-reads-with-low-mapq / -frlmq
If the fraction of reads at a base with low mapping quality exceeds this value, the site may be poorly mapped
If the number of reads at this site is greater than minDepthForLowMAPQ and the fraction of reads with low mapping quality
exceeds this fraction then the site has POOR_MAPPING_QUALITY.
double 0.1 [ [ -∞ ∞ ] ]
--max-low-mapq / -mlmq
Maximum value for MAPQ to be considered a problematic mapped read
Reads with MAPQ in the gap between this value and mmq are not sufficiently well mapped for calling but
aren't indicative of mapping problems. For example, if maxLowMAPQ = 1 and mmq = 20, then reads with
MAPQ == 0 are poorly mapped, reads with MAPQ >= 20 are considered as contributing to calling, whereas
reads with MAPQ >= 1 and < 20 are not bad in and of themselves but aren't sufficiently good to contribute to
calling. In effect these reads are invisible, driving the base to the NO_ or LOW_COVERAGE states
int 1 [ [ 0 255 ] ]
--max-variants-per-shard
If non-zero, partitions VCF output into shards, each containing up to the given number of records.
int 0 [ [ 0 ∞ ] ]
--min-base-quality / -mbq
Minimum quality of bases to count towards depth
Bases with less than minBaseQuality are viewed as not sufficiently high quality to contribute to the PASS state
int 20 [ [ 0 255 ] ]
--min-depth / -min-depth
Minimum QC+ read depth before a locus is considered callable
If the number of QC+ bases (on reads with MAPQ > minMappingQuality and with base quality > minBaseQuality) exceeds this
value and is less than maxDepth the site is considered PASS.
int 4 [ [ 0 ∞ ] ]
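As an illustration of what "QC+ depth" means here, the Python sketch below counts the bases in a pileup whose read MAPQ and base quality both clear their thresholds. The pileup representation (a list of `(mapq, base_quality)` pairs) and the function name `qc_depth` are assumptions for this example. The strict `>` comparisons follow the wording above literally; the tool may treat values equal to the threshold as passing.

```python
from typing import Iterable, Tuple


def qc_depth(pileup: Iterable[Tuple[int, int]],
             min_mapping_quality: int = 10,
             min_base_quality: int = 20) -> int:
    """Count pileup bases that pass both quality thresholds."""
    return sum(
        1
        for mapq, base_quality in pileup
        # Strict comparisons mirror the description; inclusivity is an assumption.
        if mapq > min_mapping_quality and base_quality > min_base_quality
    )


if __name__ == "__main__":
    # (read MAPQ, base quality) for each read covering one locus
    locus = [(60, 30), (60, 10), (5, 30), (10, 30), (20, 21)]
    depth = qc_depth(locus)
    print(depth, "QC+ bases;", "meets" if depth >= 4 else "below", "--min-depth 4")
```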
--min-depth-for-low-mapq / -mdflmq
Minimum read depth before a locus is considered a potential candidate for poorly mapped
We don't want to consider a site as POOR_MAPPING_QUALITY just because it has two reads, and one has a low MAPQ. We
won't assign a site to the POOR_MAPPING_QUALITY state unless there are at least minDepthForLowMAPQ reads
covering the site.
int 10 [ [ -∞ ∞ ] ]
--min-mapping-quality / -mmq
Minimum mapping quality of reads to count towards depth
Reads with MAPQ > minMappingQuality are treated as usable for variation detection, contributing to the PASS
state.
int 10 [ [ 0 255 ] ]
--output / -O
Output file (BED or per-base format)
R GATKPath null
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--read-filter / -RF
Read filters to be applied before analysis
List[String] []
--read-index / -read-index
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
List[GATKPath] []
--read-validation-stringency / -VS
Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency SILENT
--reference / -R
Reference sequence file
R GATKPath null
--seconds-between-progress-updates / -seconds-between-progress-updates
Output traversal statistics every time this many seconds elapse
double 10.0 [ [ -∞ ∞ ] ]
--sequence-dictionary / -sequence-dictionary
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
GATKPath null
--showHidden / -showHidden
display hidden arguments
boolean false
--sites-only-vcf-output
If true, don't emit genotype fields when writing vcf file output.
boolean false
--summary
Name of file for output summary
Callable loci summary counts will be written to this file.
R GATKPath null
--tmp-dir
Temp directory to use.
GATKPath null
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.6.2.0 built at Sun, 13 Apr 2025 15:34:15 -0400.