Identifies duplicate reads using information from read positions and UMIs.
Category: Read Data Manipulation
Overview
This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. It is based on the MarkDuplicatesWithMateCigar tool, with added logic to leverage Unique Molecular Identifier (UMI) information.
It makes use of the fact that duplicate sets with UMIs can be broken up into subsets based on information contained in the UMI. In addition to assuming that all members of a duplicate set must have the same start and end position, it requires that they also have sufficiently similar UMIs. In this context, 'sufficiently similar' is parameterized by the command line argument MAX_EDIT_DISTANCE_TO_JOIN, which sets the maximum edit distance between UMIs that are still considered to come from the same original molecule. This logic allows for sequencing errors in UMIs.
If UMIs contain dashes, the dashes will be ignored. If UMIs contain Ns, these UMIs will not contribute to UMI metrics associated with each record. If the MAX_EDIT_DISTANCE_TO_JOIN allows, UMIs with Ns will be included in the duplicate set and the UMI metrics associated with each duplicate set. Ns are counted as an edit distance from other bases {ATCG}, but are not considered different from each other.
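The comparison rules above can be illustrated with a minimal Python sketch. It is not the tool's implementation: it assumes equal-length UMIs, a per-position mismatch count as the edit distance, and single-linkage grouping, none of which this page states explicitly. The function names are hypothetical.

```python
from itertools import combinations


def normalize_umi(umi: str) -> str:
    """Drop dashes, which are ignored when UMIs are compared."""
    return umi.replace("-", "").upper()


def umi_distance(a: str, b: str) -> int:
    """Count mismatching positions between two UMIs (assumed equal length).

    A plain character comparison makes an N differ from A, T, C and G
    but not from another N, matching the rule described above.
    """
    a, b = normalize_umi(a), normalize_umi(b)
    if len(a) != len(b):
        raise ValueError("this sketch only compares UMIs of equal length")
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis: list[str], max_edit_distance: int = 1) -> list[set[str]]:
    """Group UMIs so that any two within the threshold share a group.

    Assumption: single-linkage grouping via union-find.
    """
    parent = {u: u for u in umis}

    def find(u: str) -> str:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in combinations(parent, 2):
        if umi_distance(a, b) <= max_edit_distance:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for u in parent:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())
```

With the default threshold of 1, `group_umis(["AAAA", "AAAT", "GGGG"])` places `AAAA` and `AAAT` in one group and `GGGG` in another.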
This tool is NOT intended to be used on data without UMIs; for marking duplicates in non-UMI data, see MarkDuplicates or MarkDuplicatesWithMateCigar. Mixed data (where some reads have UMIs and others do not) is not supported.
Note also that this tool will not work with alignments that have large gaps or deletions, such as those from RNA-seq data. The tool buffers small genomic windows to ensure the integrity of the duplicate marking, and large skips in the alignment records (e.g. skipped introns) would force that window to become very large, exhausting memory.
Note: Metrics labeled as percentages are actually expressed as fractions!
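As a small illustration of the note above, a value read from a percentage-labeled metrics column has to be multiplied by 100 before it is displayed as a percentage. The helper name below is hypothetical, and the sample value is made up for the example.

```python
def as_percent(fraction: float, digits: int = 2) -> str:
    """Format a fraction in the range 0 to 1 as a percentage string."""
    return f"{fraction * 100:.{digits}f}%"


# A metrics value of 0.125 means 12.5 percent, not 0.125 percent.
print(as_percent(0.125))
```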
Usage example:
java -jar picard.jar UmiAwareMarkDuplicatesWithMateCigar \
I=input.bam \
O=output.bam \
M=output_duplicate_metrics.txt \
UMI_METRICS=output_umi_metrics.txt
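When the tool is launched from a script, the same command can be assembled as an argument list. The sketch below only builds the list and does not run anything. The function name, the default jar path, and the file names are assumptions for illustration; the argument names are the ones documented on this page, written in the same KEY=value style as the usage example.

```python
import shlex


def build_command(input_bam: str, output_bam: str, metrics: str,
                  umi_metrics: str, max_edit_distance: int = 1,
                  umi_tag: str = "RX", jar: str = "picard.jar") -> list[str]:
    """Build the argument list for UmiAwareMarkDuplicatesWithMateCigar."""
    return [
        "java", "-jar", jar, "UmiAwareMarkDuplicatesWithMateCigar",
        f"I={input_bam}",
        f"O={output_bam}",
        f"M={metrics}",
        f"UMI_METRICS={umi_metrics}",
        f"MAX_EDIT_DISTANCE_TO_JOIN={max_edit_distance}",
        f"UMI_TAG_NAME={umi_tag}",
    ]


cmd = build_command("input.bam", "output.bam",
                    "output_duplicate_metrics.txt", "output_umi_metrics.txt")
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.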
UmiAwareMarkDuplicatesWithMateCigar (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | | One or more input SAM or BAM files to analyze. Must be coordinate sorted.
--METRICS_FILE -M | | File to write duplication metrics to
--OUTPUT -O | | The output file to write marked records to
--UMI_METRICS_FILE -UMI_METRICS | | UMI Metrics
Optional Tool Arguments | |
--ALLOW_MISSING_UMIS | false | FOR TESTING ONLY: allow for missing UMIs if data doesn't have UMIs. This option is intended to be used ONLY for testing the code. Use MarkDuplicatesWithMateCigar if data has no UMIs. Mixed data (where some reads have UMIs and others do not) is not supported.
--arguments_file | | read one or more arguments files and add them to the command line
--ASSUME_SORT_ORDER -ASO | | If not null, assume that the input file has this order even if the header says otherwise.
--BARCODE_TAG | | Barcode SAM tag (ex. BC for 10X Genomics)
--CLEAR_DT | true | Clear DT tag from input SAM records. Should be set to false if input SAM doesn't have this tag. Default true
--COMMENT -CO | | Comment(s) to include in the output file's header.
--DUPLEX_UMI | false | Treat UMIs as being duplex stranded. This option requires that the UMI consist of two equal length strings that are separated by a hyphen (e.g. 'ATC-GTC'). Reads are considered duplicates if, in addition to meeting the standard definition, they have identical normalized UMIs. A UMI from the 'bottom' strand is normalized by swapping its content around the hyphen (e.g. ATC-GTC becomes GTC-ATC). A UMI from the 'top' strand is already normalized as it is. Both reads from a read pair are considered top strand if the read 1 unclipped 5' coordinate is less than the read 2 unclipped 5' coordinate. All chimeric reads and read fragments are treated as having come from the top strand. With this option it is required that the BARCODE_TAG hold non-normalized UMIs. Default false.
--DUPLICATE_SCORING_STRATEGY -DS | SUM_OF_BASE_QUALITIES | The scoring strategy for choosing the non-duplicate among candidates.
--help -h | false | display the help message
--MAX_EDIT_DISTANCE_TO_JOIN | 1 | Largest edit distance at which two UMIs are still considered to come from the same source molecule; UMIs that differ by more than this are treated as coming from distinct source molecules.
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP -MAX_FILE_HANDLES | 8000 | Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system.
--MAX_OPTICAL_DUPLICATE_SET_SIZE | 300000 | This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.
--MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP -MAX_SEQS | 50000 | This option is obsolete. ReadEnds will always be spilled to disk.
--MOLECULAR_IDENTIFIER_TAG | | SAM tag to uniquely identify the molecule from which a read was derived. Use of this option requires that the BARCODE_TAG option be set to a non null value. Default null.
--OPTICAL_DUPLICATE_PIXEL_DISTANCE | 100 | The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
--PROGRAM_GROUP_COMMAND_LINE -PG_COMMAND | | Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically.
--PROGRAM_GROUP_NAME -PG_NAME | UmiAwareMarkDuplicatesWithMateCigar | Value of PN tag of PG record to be created.
--PROGRAM_GROUP_VERSION -PG_VERSION | | Value of VN tag of PG record to be created. If not specified, the version will be detected automatically.
--PROGRAM_RECORD_ID -PG | MarkDuplicates | The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs.
--READ_NAME_REGEX | | MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate. If the read name does not follow a standard Illumina colon-separation convention, but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables: tile/region, x coordinate and y coordinate from a read name. The regular expression must contain three capture groups for the three variables, in order. It must match the entire read name. For example, if field names were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
--READ_ONE_BARCODE_TAG | | Read one barcode SAM tag (ex. BX for 10X Genomics)
--READ_TWO_BARCODE_TAG | | Read two barcode SAM tag (ex. BX for 10X Genomics)
--REMOVE_DUPLICATES | false | If true do not write duplicates to the output file instead of writing them with appropriate flags set.
--REMOVE_SEQUENCING_DUPLICATES | false | If true remove 'optical' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored.
--SORTING_COLLECTION_SIZE_RATIO | 0.25 | This number, together with the maximum RAM available to the JVM, determines the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number.
--TAG_DUPLICATE_SET_MEMBERS | false | If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2, which occurs when two reads map to the same portion of the reference and only one of them is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set.
--TAGGING_POLICY | DontTag | Determines how duplicate types are recorded in the DT optional attribute.
--UMI_TAG_NAME | RX | Tag name to use for UMI
--version | false | display the version number for this tool
Optional Common Arguments | |
--ADD_PG_TAG_TO_READS | true | Add PG tag to each read in a SAM or BAM
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R | | Reference sequence file.
--TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Deprecated Arguments | |
--ASSUME_SORTED -AS | false | If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead.
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ADD_PG_TAG_TO_READS
Add PG tag to each read in a SAM or BAM
boolean true
--ALLOW_MISSING_UMIS
FOR TESTING ONLY: allow for missing UMIs if data doesn't have UMIs. This option is intended to be used ONLY for testing the code. Use MarkDuplicatesWithMateCigar if data has no UMIs. Mixed data (where some reads have UMIs and others do not) is not supported.
boolean false
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--ASSUME_SORT_ORDER / -ASO
If not null, assume that the input file has this order even if the header says otherwise.
Exclusion: This argument cannot be used at the same time as ASSUME_SORTED.
The --ASSUME_SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values:
- unsorted
- queryname
- coordinate
- duplicate
- unknown
SortOrder null
--ASSUME_SORTED / -AS
If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead.
Exclusion: This argument cannot be used at the same time as ASSUME_SORT_ORDER.
boolean false
--BARCODE_TAG
Barcode SAM tag (ex. BC for 10X Genomics)
String null
--CLEAR_DT
Clear DT tag from input SAM records. Should be set to false if input SAM doesn't have this tag. Default true
boolean true
--COMMENT / -CO
Comment(s) to include in the output file's header.
List[String] []
--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX
Whether to create an index when writing VCF or coordinate sorted BAM output.
Boolean false
--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--DUPLEX_UMI
Treat UMIs as being duplex stranded. This option requires that the UMI consist of two equal length strings that are separated by a hyphen (e.g. 'ATC-GTC'). Reads are considered duplicates if, in addition to meeting the standard definition, they have identical normalized UMIs. A UMI from the 'bottom' strand is normalized by swapping its content around the hyphen (e.g. ATC-GTC becomes GTC-ATC). A UMI from the 'top' strand is already normalized as it is. Both reads from a read pair are considered top strand if the read 1 unclipped 5' coordinate is less than the read 2 unclipped 5' coordinate. All chimeric reads and read fragments are treated as having come from the top strand. With this option it is required that the BARCODE_TAG hold non-normalized UMIs. Default false.
boolean false
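The normalization rule for duplex UMIs can be sketched as follows. This is an illustration of the description above, not the tool's code; the function names are hypothetical.

```python
def pair_is_top_strand(read1_unclipped_5prime: int,
                       read2_unclipped_5prime: int) -> bool:
    """A pair is top strand if read 1 starts before read 2 (unclipped 5' ends)."""
    return read1_unclipped_5prime < read2_unclipped_5prime


def normalize_duplex_umi(umi: str, top_strand: bool) -> str:
    """Swap the two halves of a bottom-strand UMI; leave a top-strand UMI as is."""
    first, sep, second = umi.partition("-")
    if not sep or "-" in second or len(first) != len(second):
        raise ValueError("expected two equal-length strings separated by one hyphen")
    return umi if top_strand else f"{second}-{first}"
```

For example, `normalize_duplex_umi("ATC-GTC", top_strand=False)` returns `"GTC-ATC"`, so a bottom-strand read carrying `ATC-GTC` and a top-strand read carrying `GTC-ATC` end up with identical normalized UMIs.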
--DUPLICATE_SCORING_STRATEGY / -DS
The scoring strategy for choosing the non-duplicate among candidates.
The --DUPLICATE_SCORING_STRATEGY argument is an enumerated type (ScoringStrategy), which can have one of the following values:
- SUM_OF_BASE_QUALITIES
- TOTAL_MAPPED_REFERENCE_LENGTH
- RANDOM
ScoringStrategy SUM_OF_BASE_QUALITIES
--GA4GH_CLIENT_SECRETS
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INPUT / -I
One or more input SAM or BAM files to analyze. Must be coordinate sorted.
R List[String] []
--MAX_EDIT_DISTANCE_TO_JOIN / -MAX_EDIT_DISTANCE_TO_JOIN
Largest edit distance at which two UMIs are still considered to come from the same source molecule; UMIs that differ by more than this are treated as coming from distinct source molecules.
int 1 [ [ -∞ ∞ ] ]
--MAX_FILE_HANDLES_FOR_READ_ENDS_MAP / -MAX_FILE_HANDLES
Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system.
int 8000 [ [ -∞ ∞ ] ]
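On Unix systems the limit reported by 'ulimit -n' can also be read programmatically. The sketch below derives a value "a little lower" than that limit. The headroom of 200 handles is an arbitrary assumption, not a documented recommendation, and the function name is hypothetical.

```python
import resource  # Unix only
from typing import Optional


def suggested_max_file_handles(soft_limit: Optional[int] = None,
                               headroom: int = 200) -> int:
    """Return a value slightly below the per-process open-file limit."""
    if soft_limit is None:
        # Same number that 'ulimit -n' reports for the current process.
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return 8000  # no limit set: fall back to the tool's default
    return max(1, soft_limit - headroom)


print(suggested_max_file_handles())
```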
--MAX_OPTICAL_DUPLICATE_SET_SIZE
This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.
long 300000 [ [ -∞ ∞ ] ]
--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP / -MAX_SEQS
This option is obsolete. ReadEnds will always be spilled to disk.
If more than this many sequences in SAM file, don't spill to disk because there will not be enough file handles.
int 50000 [ [ -∞ ∞ ] ]
--METRICS_FILE / -M
File to write duplication metrics to
R File null
--MOLECULAR_IDENTIFIER_TAG
SAM tag to uniquely identify the molecule from which a read was derived. Use of this option requires that the BARCODE_TAG option be set to a non null value. Default null.
String null
--OPTICAL_DUPLICATE_PIXEL_DISTANCE
The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
int 100 [ [ -∞ ∞ ] ]
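A rough sketch of how such a pixel-distance check might look is given below. It assumes that the offset is measured separately along x and y and that both clusters must lie on the same tile; the description above does not spell out the exact metric, so treat this as an illustration only. The `Cluster` type and function name are hypothetical.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    tile: int
    x: int
    y: int


def within_optical_distance(a: Cluster, b: Cluster,
                            max_pixel_distance: int = 100) -> bool:
    """True if two duplicate clusters are close enough to count as optical."""
    return (a.tile == b.tile
            and abs(a.x - b.x) <= max_pixel_distance
            and abs(a.y - b.y) <= max_pixel_distance)
```

With the default of 100, clusters at (1101, 1000, 2000) and (1101, 1050, 2080) would be treated as optical duplicates of each other, while a pair 300 pixels apart in x would not.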
--OUTPUT / -O
The output file to write marked records to
R File null
--PROGRAM_GROUP_COMMAND_LINE / -PG_COMMAND
Value of CL tag of PG record to be created. If not supplied the command line will be detected automatically.
String null
--PROGRAM_GROUP_NAME / -PG_NAME
Value of PN tag of PG record to be created.
String UmiAwareMarkDuplicatesWithMateCigar
--PROGRAM_GROUP_VERSION / -PG_VERSION
Value of VN tag of PG record to be created. If not specified, the version will be detected automatically.
String null
--PROGRAM_RECORD_ID / -PG
The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs.
String MarkDuplicates
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--READ_NAME_REGEX
MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate. If the read name does not follow a standard Illumina colon-separation convention, but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables: tile/region, x coordinate and y coordinate from a read name. The regular expression must contain three capture groups for the three variables, in order. It must match the entire read name. For example, if field names were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
String
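The two extraction modes described above can be tried out in Python, whose `re` module accepts the example regex unchanged. This is an illustration of the described behavior rather than the tool's code; the function names and the sample read names are made up.

```python
import re
from typing import Optional, Tuple

Coordinates = Tuple[int, int, int]  # (tile, x, y)

# The example regex from the description, for ';'-separated read names.
SEMICOLON_REGEX = re.compile(
    r"(?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$")


def parse_with_regex(read_name: str) -> Optional[Coordinates]:
    """Extract tile, x, y via three capture groups; the whole name must match."""
    match = SEMICOLON_REGEX.fullmatch(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def parse_default(read_name: str) -> Optional[Coordinates]:
    """Split on ':' and pick fields by position for 5- or 7-element names."""
    fields = read_name.split(":")
    if len(fields) == 5:
        picked = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        picked = fields[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None
    try:
        tile, x, y = (int(value) for value in picked)
    except ValueError:
        return None
    return tile, x, y


print(parse_with_regex("inst;run42;FC;3;1101;1234;5678"))
print(parse_default("inst:42:FC:3:1101:1234:5678"))
```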
--READ_ONE_BARCODE_TAG
Read one barcode SAM tag (ex. BX for 10X Genomics)
String null
--READ_TWO_BARCODE_TAG
Read two barcode SAM tag (ex. BX for 10X Genomics)
String null
--REFERENCE_SEQUENCE / -R
Reference sequence file.
File null
--REMOVE_DUPLICATES
If true do not write duplicates to the output file instead of writing them with appropriate flags set.
boolean false
--REMOVE_SEQUENCING_DUPLICATES
If true remove 'optical' duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process, even if REMOVE_DUPLICATES is false. If REMOVE_DUPLICATES is true, all duplicates are removed and this option is ignored.
boolean false
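The interaction between REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES can be summarized as a small decision function. This is a sketch of the rules as described above; the function and parameter names are hypothetical.

```python
def should_write_record(is_duplicate: bool,
                        is_sequencing_duplicate: bool,
                        remove_duplicates: bool = False,
                        remove_sequencing_duplicates: bool = False) -> bool:
    """Decide whether a record is written to the output file."""
    if not is_duplicate:
        return True   # non-duplicates are always written
    if remove_duplicates:
        return False  # all duplicates dropped; the other option is ignored
    if remove_sequencing_duplicates and is_sequencing_duplicate:
        return False  # only optical/sequencing duplicates dropped
    return True       # duplicate is kept and written with its flags set
```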
--showHidden / -showHidden
display hidden arguments
boolean false
--SORTING_COLLECTION_SIZE_RATIO
This number, together with the maximum RAM available to the JVM, determines the memory footprint used by some of the sorting collections. If you are running out of memory, try reducing this number.
double 0.25 [ [ -∞ ∞ ] ]
--TAG_DUPLICATE_SET_MEMBERS
If a read appears in a duplicate set, add two tags. The first tag, DUPLICATE_SET_SIZE_TAG (DS), indicates the size of the duplicate set. The smallest possible DS value is 2, which occurs when two reads map to the same portion of the reference and only one of them is marked as duplicate. The second tag, DUPLICATE_SET_INDEX_TAG (DI), represents a unique identifier for the duplicate set to which the record belongs. This identifier is the index-in-file of the representative read that was selected out of the duplicate set.
boolean false
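To make the meaning of the two tags concrete, the sketch below assigns DS and DI values to the members of already-computed duplicate sets. The input layout (a representative index plus a list of member indices) and the function name are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def duplicate_set_tags(
        duplicate_sets: List[Tuple[int, List[int]]]) -> Dict[int, Dict[str, int]]:
    """Map each read's index-in-file to its DS and DI tag values.

    Each entry is (representative_index, member_indices), where the members
    include the representative itself.
    """
    tags: Dict[int, Dict[str, int]] = {}
    for representative, members in duplicate_sets:
        if len(members) < 2:
            continue  # a lone read is not in a duplicate set, so it gets no tags
        for index in members:
            tags[index] = {"DS": len(members), "DI": representative}
    return tags
```

For two reads at file indices 3 and 7 with read 3 chosen as representative, both receive DS=2 and DI=3.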
--TAGGING_POLICY
Determines how duplicate types are recorded in the DT optional attribute.
The --TAGGING_POLICY argument is an enumerated type (DuplicateTaggingPolicy), which can have one of the following values:
- DontTag
- OpticalOnly
- All
DuplicateTaggingPolicy DontTag
--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--UMI_METRICS_FILE / -UMI_METRICS
UMI Metrics
R File null
--UMI_TAG_NAME / -UMI_TAG_NAME
Tag name to use for UMI
String RX
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.2.2.0-SNAPSHOT built at Thu, 19 Aug 2021 09:49:28 -0700.