Evaluate gene expression from RNA-seq reads aligned to genome.
Category Coverage Analysis
Overview
Evaluate gene expression from RNA-seq reads aligned to genome.This tool counts fragments to evaluate gene expression from RNA-seq reads aligned to the genome. Features to evaluate expression over are defined in an input annotation file in gff3 fomat (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). Output is a tsv listing sense and antisense expression for all stranded grouping features, and expression (labeled as sense) for all unstranded grouping features.
Input
- BAM file of RNA-seq reads
- Gff3 file of feature annotations
Output
TSV file of gene expression
Usage Examples
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv
Reads are assumed to be paired-end. Reads which are in a "good pair" are counted once together as a single fragment. Reads which are not in a "good pair" are each counted separately. Fragment size is not explicitly considered in the "good pair" decision, instead we rely on the aligner to take that into consideration intelligently when deciding whether to set the properly paired flag. A "good pair" is defined as:
Reads can be either from spliced or unspliced RNA. If from spliced RNA, alignment blocks of reads are taken as their coverage. If from unspliced RNA, the entire region from the most 5' base of either read to most 3' base of either read is taken as the fragment coverage. Splice status is set through the command line. By default, splice status is taken to be "spliced."
For spliced RNA
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv
For unspliced RNA
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --unspliced
Gene expression is aggregated over "groupingType" features, and overlap of fragments with features is determined by "overlapType" features. By default, "groupingType" is genes, and "overlapType" is exons. Additional grouping_types and overlap_types can be used as well.
Use gene and pseudogene as grouping types
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --grouping-type gene --grouping-type pseudogene
The read orientation can be set using the read-strand argument. By default, it is assumed that read1 will be on the forward strand, and read2 on the reverse strand.
Set read strands to read1 reverse, read2 forward
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --read-strands REVERSE-FORWARD
Set read1 and read2 to both be forward strand
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --read-strands FORWARD-FORWARD
Multi-overlapping fragments (fragment alignments which overlap multiple grouping features) can be handled in two ways. Equal weight can be given to each grouping feature, in which case each is given weight 1/N, for N overlapping grouping features.
Equal weight for multi-overlapping fragments
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --multi-overlap-method EQUAL
Multi-overlapping fragments can also have weight distributed according to how much of the fragment overlaps each feature. In this case, each grouping feature is given an unnormalized weight corresponding to the fraction of the fragment that overlaps its overlapping features. A "non-overlapping" option is also given an unnormalized weight corresponding the the fraction of the fragment that overlaps no features. Final weights for each feature are found by normalizing by the sum of unnormalized weights. This is the default behavior.
Proportional weight for multi-overlapping fragments
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --multi-overlap-method PROPORTIONAL
Multi-mapping fragments (fragments whose reads map to multiple locations) can also be handled in two ways. They can be ignored, in which case only fragments with a single mapping are counted. This is default behavior.
Ignore multi-mapping fragments
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --multi-map-method IGNORE
Multi-mapping fragments can alternatively have their weight distributed equally between the different alignments. If run in this setting, minimum-mapping-quality will be set to 0, regardless of its value on the command line.
Equally weight each alignment of multi-mapping fragments
gatk GeneExpressionEvaluation -I input.bam -G geneAnnotations.gff3 -O output.tsv --multi-map-method EQUAL
The number of mapping for a particular fragment is extracted via the NH tag.
Currently this tool only works on a single sample BAM. If a readgroup header line indicates multiple samples in the same BAM, the tool will throw an exception.
Additional Information
Read filters
These Read Filters are automatically applied to the data by the Engine before processing by GeneExpressionEvaluation.
- ValidAlignmentStartReadFilter
- NonZeroReferenceLengthAlignmentReadFilter
- MatchingBasesAndQualsReadFilter
- SeqIsStoredReadFilter
- AlignmentAgreesWithHeaderReadFilter
- MappedReadFilter
- ValidAlignmentEndReadFilter
- NotDuplicateReadFilter
- MappingQualityReadFilter
- ReadLengthEqualsCigarLengthReadFilter
GeneExpressionEvaluation specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary | |
---|---|---|---|
Required Arguments | |||
--gff-file -G |
Gff3 file containing feature annotations | ||
--input -I |
BAM/SAM/CRAM file containing reads | ||
--output -O |
Output file for gene expression. | ||
Optional Tool Arguments | |||
--arguments_file |
read one or more arguments files and add them to the command line | ||
--cloud-index-prefetch-buffer -CIPB |
-1 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset. | |
--cloud-prefetch-buffer -CPB |
40 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). | |
--disable-bam-index-caching -DBIC |
false | If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified. | |
--disable-sequence-dictionary-validation |
false | If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk! | |
--feature-label-key |
NAME | Whether to label features by ID or Name | |
--gcs-max-retries -gcs-retries |
20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection | |
--gcs-project-for-requester-pays |
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed. | ||
--grouping-type |
[gene] | Feature types to group by | |
--help -h |
false | display the help message | |
--interval-merging-rule -imr |
ALL | Interval merging rule for abutting intervals | |
--intervals -L |
One or more genomic intervals over which to operate | ||
--multi-map-method |
IGNORE | How to distribute weight of reads with multiple alignments | |
--multi-overlap-method |
PROPORTIONAL | How to distribute weight of alignments which overlap multiple features | |
--overlap-type |
[exon] | Feature overlap types | |
--read-strands |
FORWARD_REVERSE | Which strands (forward or reverse) each read is expected to be on | |
--reference -R |
Reference sequence | ||
--sites-only-vcf-output |
false | If true, don't emit genotype fields when writing vcf file output. | |
--unspliced |
false | Whether the rna is unspliced. If spliced, alignments must be from an aligner run in a splice-aware mode. If unspliced, alignments must be from an aligner run in a non-splicing mode. | |
--version |
false | display the version number for this tool | |
Optional Common Arguments | |||
--add-output-sam-program-record |
true | If true, adds a PG tag to created SAM/BAM/CRAM files. | |
--add-output-vcf-command-line |
true | If true, adds a command line header line to created VCF files. | |
--create-output-bam-index -OBI |
true | If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file. | |
--create-output-bam-md5 -OBM |
false | If true, create a MD5 digest for any BAM/SAM/CRAM file created | |
--create-output-variant-index -OVI |
true | If true, create a VCF index when writing a coordinate-sorted VCF file. | |
--create-output-variant-md5 -OVM |
false | If true, create a a MD5 digest any VCF file created. | |
--disable-read-filter -DF |
Read filters to be disabled before analysis | ||
--disable-tool-default-read-filters |
false | Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on) | |
--exclude-intervals -XL |
One or more genomic intervals to exclude from processing | ||
--gatk-config-file |
A configuration file to use with the GATK. | ||
--interval-exclusion-padding -ixp |
0 | Amount of padding (in bp) to add to each interval you are excluding. | |
--interval-padding -ip |
0 | Amount of padding (in bp) to add to each interval you are including. | |
--interval-set-rule -isr |
UNION | Set merging approach to use for combining interval inputs | |
--lenient -LE |
false | Lenient processing of VCF files | |
--max-variants-per-shard |
0 | If non-zero, partitions VCF output into shards, each containing up to the given number of records. | |
--QUIET |
false | Whether to suppress job-summary info on System.err. | |
--read-filter -RF |
Read filters to be applied before analysis | ||
--read-index |
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically. | ||
--read-validation-stringency -VS |
SILENT | Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. | |
--seconds-between-progress-updates |
10.0 | Output traversal statistics every time this many seconds elapse | |
--sequence-dictionary |
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. | ||
--tmp-dir |
Temp directory to use. | ||
--use-jdk-deflater -jdk-deflater |
false | Whether to use the JdkDeflater (as opposed to IntelDeflater) | |
--use-jdk-inflater -jdk-inflater |
false | Whether to use the JdkInflater (as opposed to IntelInflater) | |
--verbosity |
INFO | Control verbosity of logging. | |
Advanced Arguments | |||
--showHidden |
false | display hidden arguments |
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--add-output-sam-program-record / -add-output-sam-program-record
If true, adds a PG tag to created SAM/BAM/CRAM files.
boolean true
--add-output-vcf-command-line / -add-output-vcf-command-line
If true, adds a command line header line to created VCF files.
boolean true
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--cloud-index-prefetch-buffer / -CIPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
int -1 [ [ -∞ ∞ ] ]
--cloud-prefetch-buffer / -CPB
Size of the cloud-only prefetch buffer (in MB; 0 to disable).
int 40 [ [ -∞ ∞ ] ]
--create-output-bam-index / -OBI
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
boolean true
--create-output-bam-md5 / -OBM
If true, create a MD5 digest for any BAM/SAM/CRAM file created
boolean false
--create-output-variant-index / -OVI
If true, create a VCF index when writing a coordinate-sorted VCF file.
boolean true
--create-output-variant-md5 / -OVM
If true, create a a MD5 digest any VCF file created.
boolean false
--disable-bam-index-caching / -DBIC
If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
boolean false
--disable-read-filter / -DF
Read filters to be disabled before analysis
List[String] []
--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation
If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
boolean false
--disable-tool-default-read-filters / -disable-tool-default-read-filters
Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
boolean false
--exclude-intervals / -XL
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite).
This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the
command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals
(e.g. -XL myFile.intervals).
List[String] []
--feature-label-key
Whether to label features by ID or Name
The --feature-label-key argument is an enumerated type (FeatureLabelType), which can have one of the following values:
- NAME
- ID
FeatureLabelType NAME
--gatk-config-file
A configuration file to use with the GATK.
String null
--gcs-max-retries / -gcs-retries
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
--gff-file / -G
Gff3 file containing feature annotations
R FeatureInput[Gff3Feature] null
--grouping-type
Feature types to group by
Set[String] [gene]
--help / -h
display the help message
boolean false
--input / -I
BAM/SAM/CRAM file containing reads
R List[GATKPath] []
--interval-exclusion-padding / -ixp
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-merging-rule / -imr
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
- ALL
- OVERLAPPING_ONLY
IntervalMergingRule ALL
--interval-padding / -ip
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
--interval-set-rule / -isr
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
- UNION
- Take the union of all intervals
- INTERSECTION
- Take the intersection of intervals (the subset that overlaps all intervals specified)
IntervalSetRule UNION
--intervals / -L
One or more genomic intervals over which to operate
List[String] []
--lenient / -LE
Lenient processing of VCF files
boolean false
--max-variants-per-shard
If non-zero, partitions VCF output into shards, each containing up to the given number of records.
int 0 [ [ 0 ∞ ] ]
--multi-map-method
How to distribute weight of reads with multiple alignments
The --multi-map-method argument is an enumerated type (MultiMapMethod), which can have one of the following values:
- IGNORE
- EQUAL
MultiMapMethod IGNORE
--multi-overlap-method
How to distribute weight of alignments which overlap multiple features
The --multi-overlap-method argument is an enumerated type (MultiOverlapMethod), which can have one of the following values:
- EQUAL
- PROPORTIONAL
MultiOverlapMethod PROPORTIONAL
--output / -O
Output file for gene expression.
R File null
--overlap-type
Feature overlap types
Set[String] [exon]
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--read-filter / -RF
Read filters to be applied before analysis
List[String] []
--read-index / -read-index
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
List[GATKPath] []
--read-strands
Which strands (forward or reverse) each read is expected to be on
The --read-strands argument is an enumerated type (ReadStrands), which can have one of the following values:
- FORWARD_FORWARD
- FORWARD_REVERSE
- REVERSE_FORWARD
- REVERSE_REVERSE
ReadStrands FORWARD_REVERSE
--read-validation-stringency / -VS
Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency SILENT
--reference / -R
Reference sequence
GATKPath null
--seconds-between-progress-updates / -seconds-between-progress-updates
Output traversal statistics every time this many seconds elapse
double 10.0 [ [ -∞ ∞ ] ]
--sequence-dictionary / -sequence-dictionary
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
GATKPath null
--showHidden / -showHidden
display hidden arguments
boolean false
--sites-only-vcf-output
If true, don't emit genotype fields when writing vcf file output.
boolean false
--tmp-dir
Temp directory to use.
GATKPath null
--unspliced
Whether the rna is unspliced. If spliced, alignments must be from an aligner run in a splice-aware mode. If unspliced, alignments must be from an aligner run in a non-splicing mode.
boolean false
--use-jdk-deflater / -jdk-deflater
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
--use-jdk-inflater / -jdk-inflater
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
--verbosity / -verbosity
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.2.2.0-SNAPSHOT built at Thu, 19 Aug 2021 09:49:28 -0700.
0 comments
Please sign in to leave a comment.