Tabulates pileup metrics for inferring contamination
Category Coverage Analysis
Overview
Summarizes counts of reads that support reference, alternate and other alleles for given sites. Results can be used with CalculateContamination.
The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
This tool is featured in the Somatic Short Mutation calling Best Practice Workflow. See Tutorial#11136 for a step-by-step description of the workflow and Article#11127 for an overview of what traditional somatic calling entails. For the latest pipeline scripts, see the Mutect2 WDL scripts directory. In particular, the mutect_resources.wdl script prepares a suitable resource from a larger dataset. An example excerpt is shown.
    #CHROM  POS       ID           REF  ALT  QUAL      FILTER                      INFO
    chr6    29942512  .            G    C    2974860   VQSRTrancheSNP99.80to99.90  AF=0.063
    chr6    29942517  .            C    A    2975860   VQSRTrancheSNP99.80to99.90  AF=0.062
    chr6    29942525  .            G    C    2975600   VQSRTrancheSNP99.60to99.80  AF=0.063
    chr6    29942547  rs114945359  G    C    15667700  PASS                        AF=0.077
Usage examples
    gatk GetPileupSummaries \
        -I tumor.bam \
        -V common_biallelic.vcf.gz \
        -L common_biallelic.vcf.gz \
        -O pileups.table

    gatk GetPileupSummaries \
        -I normal.bam \
        -V common_biallelic.vcf.gz \
        -L common_biallelic.vcf.gz \
        -O pileups.table

Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. For example,
    gatk GetPileupSummaries \
        -I normal.bam \
        -V gnomad.vcf.gz \
        -L common_snps.interval_list \
        -O pileups.table

attempts to get pileups at a list of common SNPs and emits output for those sites that are present in gnomAD, using the allele frequencies from gnomAD. Note that the sites may be a subset of the variants, the variants may be a subset of the sites, or they may overlap partially. In all cases pileup summaries are emitted for the overlap and nowhere else. The most common use case in which sites and variants differ is when the variants resource is a large file and the sites resource is an interval list subset from that file.
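The overlap rule described above can be pictured with a small shell sketch. It is not how GetPileupSummaries is implemented, and the helper name `overlap_sites` and the simplified two-column sites file (contig and position, rather than a real interval list) are assumptions made for illustration only.

```shell
# Hypothetical helper: print positions that appear in BOTH a simplified sites
# list ("contig position" per line) and a variants VCF. This only illustrates
# the "overlap and nowhere else" rule; it is not the tool's own logic.
# Usage: overlap_sites SITES_FILE VARIANTS_VCF
overlap_sites() {
  awk '
    NR == FNR { wanted[$1 ":" $2] = 1; next }   # first file: requested sites
    /^#/      { next }                          # second file: skip VCF header lines
    {
      key = $1 ":" $2
      if (key in wanted) print $1, $2           # site is in both inputs
    }
  ' "$1" "$2"
}

# Toy inputs written to a temporary directory.
demo_dir="$(mktemp -d)"
printf '%s\n' 'chr6 29942512' 'chr6 29942600' > "$demo_dir/sites.txt"
printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'chr6 29942512 . G C 2974860 VQSRTrancheSNP99.80to99.90 AF=0.063' \
  'chr6 29942517 . C A 2975860 VQSRTrancheSNP99.80to99.90 AF=0.062' \
  > "$demo_dir/variants.vcf"

overlap_sites "$demo_dir/sites.txt" "$demo_dir/variants.vcf"   # prints: chr6 29942512
rm -r "$demo_dir"
```

Only chr6:29942512 is printed because it is the one position present in both inputs; the site-only and variant-only positions are dropped.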
GetPileupSummaries tabulates results into six columns as shown below. The alt_count and allele_frequency correspond to the ALT allele in the germline resource.
    contig  position  ref_count  alt_count  other_alt_count  allele_frequency
    chr6    29942512  9          0          0                0.063
    chr6    29942517  13         1          0                0.062
    chr6    29942525  13         7          0                0.063
    chr6    29942547  36         0          0                0.077
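As a quick way to eyeball such a table, the sketch below derives the total depth and the fraction of reads supporting the ALT allele at each site. The helper name `alt_fraction` is hypothetical, and this per-site ratio is only a sanity check, not the model that CalculateContamination applies. Lines starting with `#` and the column header row are skipped, on the assumption that the file may carry such lines before the data.

```shell
# Hypothetical helper: read a pileup summary table on stdin and print
# contig, position, total depth, and ALT read fraction (tab-separated).
alt_fraction() {
  awk '
    /^#/ || $1 == "contig" { next }            # skip comment and header lines
    {
      depth = $3 + $4 + $5                     # ref + alt + other_alt counts
      if (depth > 0)
        printf "%s\t%s\t%d\t%.3f\n", $1, $2, depth, $4 / depth
    }'
}

# Usage with the excerpt shown above.
printf '%s\n' \
  'contig position ref_count alt_count other_alt_count allele_frequency' \
  'chr6 29942512 9 0 0 0.063' \
  'chr6 29942517 13 1 0 0.062' \
  'chr6 29942525 13 7 0 0.063' \
  'chr6 29942547 36 0 0 0.077' |
  alt_fraction
```

Sites with zero depth are omitted from the output to avoid dividing by zero.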
Note the default maximum population AF (--maximum-population-allele-frequency or -max-af) is set to 0.2, which limits the sites the tool considers to those in the variants resource file that have AF of 0.2 or less. Likewise, the default minimum population AF (--minimum-population-allele-frequency or -min-af) is set to 0.01, which limits the sites the tool considers to those in the variants resource file that have AF of 0.01 or more.
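To preview which resource sites fall inside the default AF window, a sketch along the following lines may help. The helper name `filter_by_af` is hypothetical; it assumes each record carries a single `AF=` value in the INFO column (as in a biallelic resource) and treats both bounds as inclusive, matching the "0.2 or less" and "0.01 or more" wording above. GetPileupSummaries applies these thresholds itself, so this is for inspection only.

```shell
# Hypothetical helper: list VCF records on stdin whose INFO AF lies within
# [min_af, max_af]. Defaults mirror the tool defaults of 0.01 and 0.2.
# Usage: filter_by_af [MIN_AF] [MAX_AF] < sites.vcf
filter_by_af() {
  awk -v min="${1:-0.01}" -v max="${2:-0.2}" '
    /^#/ { next }                               # skip header lines
    {
      af = ""
      n = split($8, info, ";")                  # INFO is the eighth VCF column
      for (i = 1; i <= n; i++)
        if (info[i] ~ /^AF=/) af = substr(info[i], 4)
      if (af != "" && af + 0 >= min + 0 && af + 0 <= max + 0)
        print $1, $2, af
    }'
}

# Usage with two records from the resource excerpt.
printf '%s\n' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'chr6 29942512 . G C 2974860 VQSRTrancheSNP99.80to99.90 AF=0.063' \
  'chr6 29942547 rs114945359 G C 15667700 PASS AF=0.077' |
  filter_by_af 0.01 0.2
```

Both example records are printed because their AF values (0.063 and 0.077) lie between the two thresholds.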
Additional Information
Read filters
These Read Filters are automatically applied to the data by the Engine before processing by GetPileupSummaries.
- GoodCigarReadFilter
- NonZeroReferenceLengthAlignmentReadFilter
- PassesVendorQualityCheckReadFilter
- MappedReadFilter
- MappingQualityAvailableReadFilter
- NotDuplicateReadFilter
- PrimaryLineReadFilter
- MateOnSameContigOrNoMappedMateReadFilter
- MappingQualityNotZeroReadFilter
- WellformedReadFilter
GetPileupSummaries specific arguments
This table summarizes the command-line arguments available for this tool. For more details on each argument, see the Argument details list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--input / -I | [] | BAM/SAM/CRAM file containing reads
--intervals / -L | [] | One or more genomic intervals over which to operate
--output / -O | null | The output table
--variant / -V | null | A VCF file containing variants and allele frequencies
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--cloud-index-prefetch-buffer / -CIPB | -1 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
--cloud-prefetch-buffer / -CPB | 40 | Size of the cloud-only prefetch buffer (in MB; 0 to disable).
--disable-bam-index-caching / -DBIC | false | If true, don't cache BAM indexes. This reduces memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
--disable-sequence-dictionary-validation | false | If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
--gcs-max-retries / -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays | "" | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
--help / -h | false | display the help message
--interval-merging-rule / -imr | ALL | Interval merging rule for abutting intervals
--max-depth-per-sample | 0 | Maximum number of reads to retain per sample per locus. Reads above this threshold will be downsampled. Set to 0 to disable.
--maximum-population-allele-frequency / -max-af | 0.2 | Maximum population allele frequency of sites to consider.
--min-mapping-quality / -mmq | 50 | Minimum read mapping quality
--minimum-population-allele-frequency / -min-af | 0.01 | Minimum population allele frequency of sites to consider. A low value increases accuracy at the expense of speed.
--reference / -R | null | Reference sequence
--sites-only-vcf-output | false | If true, don't emit genotype fields when writing vcf file output.
--version | false | display the version number for this tool
Optional Common Arguments | |
--add-output-sam-program-record | true | If true, adds a PG tag to created SAM/BAM/CRAM files.
--add-output-vcf-command-line | true | If true, adds a command line header line to created VCF files.
--create-output-bam-index / -OBI | true | If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
--create-output-bam-md5 / -OBM | false | If true, create an MD5 digest for any BAM/SAM/CRAM file created.
--create-output-variant-index / -OVI | true | If true, create a VCF index when writing a coordinate-sorted VCF file.
--create-output-variant-md5 / -OVM | false | If true, create an MD5 digest for any VCF file created.
--disable-read-filter / -DF | [] | Read filters to be disabled before analysis
--disable-tool-default-read-filters | false | Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
--exclude-intervals / -XL | [] | One or more genomic intervals to exclude from processing
--gatk-config-file | null | A configuration file to use with the GATK.
--interval-exclusion-padding / -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding / -ip | 0 | Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule / -isr | UNION | Set merging approach to use for combining interval inputs
--lenient / -LE | false | Lenient processing of VCF files
--QUIET | false | Whether to suppress job-summary info on System.err.
--read-filter / -RF | [] | Read filters to be applied before analysis
--read-index | [] | Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
--read-validation-stringency / -VS | SILENT | Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--seconds-between-progress-updates | 10.0 | Output traversal statistics every time this many seconds elapse
--sequence-dictionary | null | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
--tmp-dir | null | Temp directory to use.
--use-jdk-deflater / -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater / -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Argument details
This list describes each argument from the table above. Keep in mind that some of these arguments are shared with other GATK tools; they appear under Optional Common Arguments in the table.
--add-output-sam-program-record / -add-output-sam-program-record
If true, adds a PG tag to created SAM/BAM/CRAM files.
boolean true
--add-output-vcf-command-line / -add-output-vcf-command-line
If true, adds a command line header line to created VCF files.
boolean true
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--cloud-index-prefetch-buffer / -CIPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
int -1 [ [ -∞ ∞ ] ]
--cloud-prefetch-buffer / -CPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable).
int 40 [ [ -∞ ∞ ] ]
--create-output-bam-index / -OBI
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
boolean true
--create-output-bam-md5 / -OBM
If true, create an MD5 digest for any BAM/SAM/CRAM file created.
boolean false
--create-output-variant-index / -OVI
If true, create a VCF index when writing a coordinate-sorted VCF file.
boolean true
--create-output-variant-md5 / -OVM
If true, create an MD5 digest for any VCF file created.
boolean false
--disable-bam-index-caching / -DBIC
If true, don't cache BAM indexes. This reduces memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
boolean false
--disable-read-filter / -DF
Read filters to be disabled before analysis
List[String] []
--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation
If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
boolean false
--disable-tool-default-read-filters / -disable-tool-default-read-filters
Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
boolean false
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).
List[String] []
--gatk-config-file / NA
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--gcs-project-for-requester-pays / NA
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
String ""
--help / -h
display the help message
boolean false
--input / -I
BAM/SAM/CRAM file containing reads
R List[String] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
- ALL
- OVERLAPPING_ONLY
IntervalMergingRule ALL
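The difference between the two merging rules can be illustrated with a short sketch. The helper `merge_intervals` is hypothetical and works on a simplified, coordinate-sorted "contig start end" listing with inclusive 1-based coordinates; it mimics the described behavior and is not GATK's own interval code.

```shell
# Hypothetical helper: merge sorted "contig start end" intervals from stdin.
# Rule ALL merges overlapping and abutting intervals; OVERLAPPING_ONLY merges
# only intervals that actually overlap.
# Usage: merge_intervals ALL|OVERLAPPING_ONLY < intervals.txt
merge_intervals() {
  awk -v rule="$1" '
    BEGIN { gap = (rule == "ALL") ? 1 : 0 }     # abutting means start == previous end + 1
    NR > 1 && $1 == contig && $2 <= last + gap {
      if ($3 + 0 > last) last = $3 + 0          # extend the current merged interval
      next
    }
    NR > 1 { print contig, first, last }        # previous interval is finished
    { contig = $1; first = $2 + 0; last = $3 + 0 }
    END { if (NR > 0) print contig, first, last }'
}

# Three intervals on contig 1: the first two abut, the last two overlap.
printf '%s\n' '1 100 200' '1 201 300' '1 250 400' | merge_intervals ALL
printf '%s\n' '1 100 200' '1 201 300' '1 250 400' | merge_intervals OVERLAPPING_ONLY
```

With ALL the three intervals collapse into 1:100-400, while OVERLAPPING_ONLY keeps 1:100-200 separate and merges only the overlapping pair into 1:201-400.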
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
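The padding arithmetic from the example above can be sketched as follows. The helper `pad_interval` is hypothetical; it assumes contig names contain no colon and that the padded start should not drop below position 1. It does not know contig lengths, so the end coordinate is not clamped.

```shell
# Hypothetical helper: apply symmetric padding to "contig:pos" or
# "contig:start-end" and print the padded interval.
# Usage: pad_interval INTERVAL PADDING_BP
pad_interval() {
  printf '%s\n' "$1" | awk -v pad="$2" '
    {
      split($0, parts, ":")                     # parts[1] = contig, parts[2] = coordinates
      n = split(parts[2], range, "-")
      first = range[1] - pad
      last = (n > 1 ? range[2] : range[1]) + pad
      if (first < 1) first = 1                  # assumed lower bound for coordinates
      print parts[1] ":" first "-" last
    }'
}

pad_interval 1:100 20        # prints: 1:80-120
pad_interval chr6:50-60 5    # prints: chr6:45-65
```

The same arithmetic applies to --interval-exclusion-padding, with -XL intervals in place of -L intervals.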
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
- UNION: Take the union of all intervals
- INTERSECTION: Take the intersection of intervals (the subset that overlaps all intervals specified)
IntervalSetRule UNION
--intervals / -L
One or more genomic intervals over which to operate
R List[String] []
--lenient / -LE
Lenient processing of VCF files
boolean false
--max-depth-per-sample / -max-depth-per-sample
Maximum number of reads to retain per sample per locus. Reads above this threshold will be downsampled. Set to 0 to disable.
int 0 [ [ -∞ ∞ ] ]
--maximum-population-allele-frequency / -max-af
Maximum population allele frequency of sites to consider.
double 0.2 [ [ -∞ ∞ ] ]
--min-mapping-quality / -mmq
Minimum read mapping quality
int 50 [ [ -∞ ∞ ] ]
--minimum-population-allele-frequency / -min-af
Minimum population allele frequency of sites to consider. A low value increases accuracy at the expense of speed.
double 0.01 [ [ -∞ ∞ ] ]
--output / -O
The output table
R File null
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--read-filter / -RF
Read filters to be applied before analysis
List[String] []
--read-index / -read-index
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
List[String] []
--read-validation-stringency / -VS
Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency SILENT
--reference / -R
Reference sequence
String null
--seconds-between-progress-updates / -seconds-between-progress-updates
Output traversal statistics every time this many seconds elapse
double 10.0 [ [ -∞ ∞ ] ]
--sequence-dictionary / -sequence-dictionary
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
String null
--showHidden / -showHidden
display hidden arguments
boolean false
--sites-only-vcf-output / NA
If true, don't emit genotype fields when writing vcf file output.
boolean false
--tmp-dir / NA
Temp directory to use.
GATKPathSpecifier null
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--variant / -V
A VCF file containing variants and allele frequencies
R FeatureInput[VariantContext] null
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.1.4.1 built at Thu, 5 Dec 2019 09:51:56 -0500.