Merge alignment data from a SAM or BAM with data in an unmapped BAM file.
Category Read Data Manipulation
Overview
Summary
A command-line tool for merging BAM/SAM alignment info from a third-party aligner with the data in an unmapped BAM file, producing a third BAM file that has alignment data (from the aligner) and all the remaining data from the unmapped BAM. Quick note: this is not a tool for taking multiple SAM files and creating a bigger file by merging them. For that use case, see MergeSamFiles.
Details
Many alignment tools (still!) require FASTQ format input. The unmapped BAM may contain useful information that will be lost in the conversion to FASTQ (metadata like sample alias, library, barcodes, etc., and read-level tags). This tool takes an unaligned BAM with metadata, and the aligned BAM produced by calling SamToFastq and then passing the result to an aligner/mapper. It produces a new SAM file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM (attributes that are otherwise lost in the process of converting to FASTQ). The resulting file will be valid for use by Picard and GATK tools. The output may be coordinate-sorted, in which case the NM, MD, and UQ tags will be calculated and populated, or queryname-sorted, in which case these tags will not be calculated or populated.
Usage example:
java -jar picard.jar MergeBamAlignment \
      ALIGNED=aligned.bam \
      UNMAPPED=unmapped.bam \
      O=merge_alignments.bam \
      R=reference_sequence.fasta
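The workflow described in the Details can be sketched as three steps. This is only an illustration: the file names, the use of bwa as the aligner, and the interleaved FASTQ layout are assumptions, not part of this tool's documentation. The script prints the commands rather than running them, so it can be inspected without Picard or an aligner installed.

```shell
# Dry-run sketch: prints the three steps instead of executing them.
# File names and the choice of bwa as the aligner are assumptions.
ref=reference_sequence.fasta
ubam=unmapped.bam

# 1. Convert the unmapped BAM to interleaved FASTQ for the aligner.
step1="java -jar picard.jar SamToFastq I=$ubam FASTQ=reads.fastq INTERLEAVE=true"

# 2. Align; bwa mem -p treats the input as interleaved pairs and writes SAM to stdout.
step2="bwa mem -p $ref reads.fastq > aligned.sam"

# 3. Merge the alignments with the metadata kept in the unmapped BAM.
step3="java -jar picard.jar MergeBamAlignment ALIGNED=aligned.sam UNMAPPED=$ubam O=merge_alignments.bam R=$ref"

printf '%s\n' "$step1" "$step2" "$step3"
```

The unmapped BAM is never modified; it is read once for FASTQ conversion and again during the merge, which is how its read-level tags reach the output.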
Caveats
This tool has been in development for a while and many arguments have been added to it over the years. You may be particularly interested in the following (partial) list:
- CLIP_ADAPTERS -- Whether to (soft-)clip the ends of the reads that are identified as belonging to adapters
- IS_BISULFITE_SEQUENCE -- Whether the sequencing originated from bisulfite sequencing, in which case NM will be calculated differently
- ALIGNER_PROPER_PAIR_FLAGS -- Whether to keep the "Proper pair" flag as set by the aligner. When false (the default), the tool sets this flag itself based on orientation and distance between pairs, which is useful if the aligner cannot be trusted to set it.
- ADD_MATE_CIGAR -- Whether to use this opportunity to add the MC tag to each read.
- UNMAP_CONTAMINANT_READS (and MIN_UNCLIPPED_BASES) -- Whether to identify extremely short alignments (with clipping on both sides) as cross-species contamination and unmap the reads.
MergeBamAlignment (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--OUTPUT -O | | Merged SAM or BAM file to write to.
--REFERENCE_SEQUENCE -R | | Reference sequence file.
--UNMAPPED_BAM -UNMAPPED | | Original SAM or BAM file of unmapped reads, which must be in queryname order. Reads MUST be unmapped.
Optional Tool Arguments | |
--ADD_MATE_CIGAR -MC | true | Adds the mate CIGAR tag (MC) if true, does not if false.
--ALIGNED_BAM -ALIGNED | | SAM or BAM file(s) with alignment data.
--ALIGNED_READS_ONLY | false | Whether to output only aligned reads.
--ALIGNER_PROPER_PAIR_FLAGS | false | Use the aligner's idea of what a proper pair is rather than computing in this program.
--arguments_file | | read one or more arguments files and add them to the command line
--ATTRIBUTES_TO_REMOVE | | Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.
--ATTRIBUTES_TO_RETAIN | | Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.
--ATTRIBUTES_TO_REVERSE -RV | [OQ, U2] | Attributes on negative strand reads that need to be reversed.
--ATTRIBUTES_TO_REVERSE_COMPLEMENT -RC | [E2, SQ] | Attributes on negative strand reads that need to be reverse complemented.
--CLIP_ADAPTERS | true | Whether to clip adapters where identified.
--CLIP_OVERLAPPING_READS | true | For paired reads, clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate. Reads are first soft clipped so that the 3' aligned end of each read does not extend past the 5' aligned end of its mate. If HARD_CLIP_OVERLAPPING_READS is also true, then reads are additionally hard clipped so that the 3' unclipped end of each read does not extend past the 5' unclipped end of its mate. Hard clipped bases and their qualities are stored in the XB and XQ tags, respectively.
--EXPECTED_ORIENTATIONS -ORIENTATIONS | | The expected orientation of proper read pairs. Replaces JUMP_SIZE
--HARD_CLIP_OVERLAPPING_READS | false | If true, hard clipping will be applied to overlapping reads. By default, soft clipping is used.
--help -h | false | display the help message
--INCLUDE_SECONDARY_ALIGNMENTS | true | If false, do not write secondary alignments to output.
--IS_BISULFITE_SEQUENCE | false | Whether the lane is bisulfite sequence (used when calculating the NM tag).
--MATCHING_DICTIONARY_TAGS | [M5, LN] | List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.
--MAX_INSERTIONS_OR_DELETIONS -MAX_GAPS | 1 | The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.
--MIN_UNCLIPPED_BASES | 32 | If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.
--PRIMARY_ALIGNMENT_STRATEGY | BestMapq | Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.
--PROGRAM_GROUP_COMMAND_LINE -PG_COMMAND | | The command line of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_NAME -PG_NAME | | The name of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_VERSION -PG_VERSION | | The version of the program group (if not supplied by the aligned file).
--PROGRAM_RECORD_ID -PG | | The program group ID of the aligner (if not supplied by the aligned file).
--READ1_ALIGNED_BAM -R1_ALIGNED | | SAM or BAM file(s) with alignment data from the first read of a pair.
--READ1_TRIM -R1_TRIM | 0 | The number of bases trimmed from the beginning of read 1 prior to alignment
--READ2_ALIGNED_BAM -R2_ALIGNED | | SAM or BAM file(s) with alignment data from the second read of a pair.
--READ2_TRIM -R2_TRIM | 0 | The number of bases trimmed from the beginning of read 2 prior to alignment
--SORT_ORDER -SO | coordinate | The order in which the merged reads should be output.
--UNMAP_CONTAMINANT_READS -UNMAP_CONTAM | false | Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.
--UNMAPPED_READ_STRATEGY | DO_NOT_CHANGE | How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination.) Currently ignored unless UNMAP_CONTAMINANT_READS = true. Note that the DO_NOT_CHANGE strategy will actually reset the cigar and set the mapping quality on unmapped reads since otherwise the result will be an invalid record. To force no change use the DO_NOT_CHANGE_INVALID strategy.
--version | false | display the version number for this tool
Optional Common Arguments | |
--ADD_PG_TAG_TO_READS | true | Add PG tag to each read in a SAM or BAM
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Deprecated Arguments | |
--JUMP_SIZE -JUMP | | The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead
--PAIRED_RUN -PE | true | DEPRECATED. This argument is ignored and will be removed.
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ADD_MATE_CIGAR / -MC
Adds the mate CIGAR tag (MC) if true, does not if false.
Boolean true
--ADD_PG_TAG_TO_READS
Add PG tag to each read in a SAM or BAM
boolean true
--ALIGNED_BAM / -ALIGNED
SAM or BAM file(s) with alignment data.
Exclusion: This argument cannot be used at the same time as READ1_ALIGNED_BAM, READ2_ALIGNED_BAM.
List[File] []
--ALIGNED_READS_ONLY
Whether to output only aligned reads.
boolean false
--ALIGNER_PROPER_PAIR_FLAGS
Use the aligner's idea of what a proper pair is rather than computing in this program.
boolean false
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--ATTRIBUTES_TO_REMOVE
Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.
List[String] []
--ATTRIBUTES_TO_RETAIN
Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.
List[String] []
--ATTRIBUTES_TO_REVERSE / -RV
Attributes on negative strand reads that need to be reversed.
Set[String] [OQ, U2]
--ATTRIBUTES_TO_REVERSE_COMPLEMENT / -RC
Attributes on negative strand reads that need to be reverse complemented.
Set[String] [E2, SQ]
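To illustrate what reversing and reverse complementing mean for tag values on negative strand reads, here is a minimal sketch. The tag values are made up, and the helper functions `reverse` and `reverse_complement` are hypothetical names for this example only; they are not part of Picard.

```shell
# Reverse a string one character at a time (portable awk, no `rev` needed).
reverse() {
  printf '%s\n' "$1" |
    awk '{ s = ""; for (i = length($0); i > 0; i--) s = s substr($0, i, 1); print s }'
}

# Reverse, then swap each base for its complement.
reverse_complement() {
  reverse "$1" | tr 'ACGTacgt' 'TGCAtgca'
}

oq="IIIIHHG#"   # hypothetical OQ (original base qualities) value
e2="ACCGT"      # hypothetical E2 (base sequence) value

oq_rev=$(reverse "$oq")
e2_rc=$(reverse_complement "$e2")

echo "OQ: $oq -> $oq_rev"
echo "E2: $e2 -> $e2_rc"
```

Quality-like strings only need their order flipped, while base-like strings also need each base complemented, which is why the two arguments keep separate lists.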
--CLIP_ADAPTERS
Whether to clip adapters where identified.
boolean true
--CLIP_OVERLAPPING_READS
For paired reads, clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate. Reads are first soft clipped so that the 3' aligned end of each read does not extend past the 5' aligned end of its mate. If HARD_CLIP_OVERLAPPING_READS is also true, then reads are additionally hard clipped so that the 3' unclipped end of each read does not extend past the 5' unclipped end of its mate. Hard clipped bases and their qualities are stored in the XB and XQ tags, respectively.
boolean true
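The soft-clipping rule above can be illustrated with simple coordinate arithmetic. This is a simplified sketch, not Picard's implementation: it assumes a forward/reverse (FR) pair, made-up 1-based inclusive reference coordinates, and no insertions or deletions in the overlap, so that reference positions map one-to-one to read bases.

```shell
# Hypothetical FR pair, 1-based inclusive reference coordinates.
fwd_start=100; fwd_end=250   # forward read: 5' end at start, 3' end at end
rev_start=150; rev_end=220   # reverse mate: 5' end at end, 3' end at start

# Forward read: bases aligned beyond the mate's 5' end (rev_end) get clipped.
fwd_clip=$(( fwd_end - rev_end ))
if [ "$fwd_clip" -lt 0 ]; then fwd_clip=0; fi

# Reverse read: bases aligned before the mate's 5' end (fwd_start) get clipped.
rev_clip=$(( fwd_start - rev_start ))
if [ "$rev_clip" -lt 0 ]; then rev_clip=0; fi

echo "forward read: soft clip $fwd_clip bases from the 3' end"
echo "reverse read: soft clip $rev_clip bases from the 3' end"
```

In this example only the forward read runs past its mate, so only it is clipped.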
--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX
Whether to create an index when writing VCF or coordinate sorted BAM output.
Boolean false
--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--EXPECTED_ORIENTATIONS / -ORIENTATIONS
The expected orientation of proper read pairs. Replaces JUMP_SIZE
Exclusion: This argument cannot be used at the same time as JUMP_SIZE.
The --EXPECTED_ORIENTATIONS argument is an enumerated type (List[PairOrientation]), which can have one of the following values:
- FR
- RF
- TANDEM
List[PairOrientation] []
--GA4GH_CLIENT_SECRETS
Google Genomics API client_secrets.json file path.
String client_secrets.json
--HARD_CLIP_OVERLAPPING_READS
If true, hard clipping will be applied to overlapping reads. By default, soft clipping is used.
boolean false
--help / -h
display the help message
boolean false
--INCLUDE_SECONDARY_ALIGNMENTS
If false, do not write secondary alignments to output.
boolean true
--IS_BISULFITE_SEQUENCE
Whether the lane is bisulfite sequence (used when calculating the NM tag).
boolean false
--JUMP_SIZE / -JUMP
The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead
Exclusion: This argument cannot be used at the same time as EXPECTED_ORIENTATIONS.
Integer null
--MATCHING_DICTIONARY_TAGS
List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.
List[String] [M5, LN]
--MAX_INSERTIONS_OR_DELETIONS / -MAX_GAPS
The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.
int 1 [ [ -∞ ∞ ] ]
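As a rough illustration of this filter, the sketch below counts insertion and deletion operators in a CIGAR string and compares the count with the limit. It assumes that each I or D operator counts once regardless of its length; the CIGAR string is made up, and the variable names are for this example only.

```shell
cigar="10S40M2I30M1D18M"   # hypothetical alignment with one insertion and one deletion
max_gaps=1                 # the default value of MAX_INSERTIONS_OR_DELETIONS

# Count CIGAR operators of type I (insertion) or D (deletion).
gaps=$(printf '%s\n' "$cigar" | grep -oE '[0-9]+[ID]' | wc -l | tr -d ' ')

# A limit of -1 disables the filter.
if [ "$max_gaps" -ne -1 ] && [ "$gaps" -gt "$max_gaps" ]; then
  verdict=ignored
else
  verdict=kept
fi

echo "$cigar has $gaps insertions/deletions: alignment $verdict"
```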
--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MIN_UNCLIPPED_BASES
If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.
int 32 [ [ -∞ ∞ ] ]
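The check can be sketched from a CIGAR string alone. This is an approximation under stated assumptions: "unclipped bases" is taken to mean read bases not covered by soft clips (operators M, I, = and X), and a read is only considered when it is soft clipped at both ends, as described in the Caveats. The exact counting rule used by the tool is not given here, and the CIGAR string is made up.

```shell
cigar="60S25M66S"    # hypothetical short alignment, soft clipped on both sides
min_unclipped=32     # the default value of MIN_UNCLIPPED_BASES

# Sum the lengths of the read-consuming, non-clipping operators.
unclipped=$(printf '%s\n' "$cigar" | grep -oE '[0-9]+[MI=X]' | tr -d 'MI=X' |
  awk '{ n += $1 } END { print n + 0 }')

# Flag the read only if it starts and ends with a soft clip and is too short.
if printf '%s\n' "$cigar" | grep -qE '^[0-9]+S.*[0-9]S$' &&
   [ "$unclipped" -lt "$min_unclipped" ]; then
  label=contaminant
else
  label=keep
fi

echo "$cigar: $unclipped unclipped bases -> $label"
```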
--OUTPUT / -O
Merged SAM or BAM file to write to.
R File null
--PAIRED_RUN / -PE
DEPRECATED. This argument is ignored and will be removed.
Boolean true
--PRIMARY_ALIGNMENT_STRATEGY
Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.
The --PRIMARY_ALIGNMENT_STRATEGY argument is an enumerated type (PrimaryAlignmentStrategy), which can have one of the following values:
- BestMapq -- Expects that multiple alignments will be correlated with HI tag, and prefers the pair of alignments with the largest MAPQ, in the absence of a primary selected by the aligner.
- EarliestFragment -- Prefers the alignment which maps the earliest base in the read. Note that EarliestFragment may not be used for paired reads.
- BestEndMapq -- Appropriate for cases in which the aligner is not pair-aware, and does not output the HI tag. It simply picks the alignment for each end with the highest MAPQ, and makes those alignments primary, regardless of whether the two alignments make sense together.
- MostDistant -- Appropriate for a non-pair-aware aligner. Picks the alignment pair with the largest insert size. If all alignments would be chimeric, it picks the alignments for each end with the best MAPQ.
PrimaryAlignmentStrategy BestMapq
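The BestEndMapq idea of picking the highest-MAPQ alignment for each end independently can be sketched on a small table. The table layout (end number, location, MAPQ) and its values are invented for this example; the sketch does not reproduce how the tool reads or stores alignments.

```shell
# Hypothetical candidate alignments: end number, location, MAPQ.
hits='1 chr1:1000 12
1 chr5:2000 40
2 chr1:1200 30
2 chr9:500 7'

# Sort by end, then by MAPQ descending, and keep the first row seen for each end.
primary=$(printf '%s\n' "$hits" | sort -k1,1n -k3,3nr | awk '!seen[$1]++ { print $1, $2 }')

printf '%s\n' "$primary"
```

Each end is chosen on its own, so the two selected alignments may land on different contigs, which matches the "regardless of whether the two alignments make sense together" behaviour described for this strategy.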
--PROGRAM_GROUP_COMMAND_LINE / -PG_COMMAND
The command line of the program group (if not supplied by the aligned file).
String null
--PROGRAM_GROUP_NAME / -PG_NAME
The name of the program group (if not supplied by the aligned file).
String null
--PROGRAM_GROUP_VERSION / -PG_VERSION
The version of the program group (if not supplied by the aligned file).
String null
--PROGRAM_RECORD_ID / -PG
The program group ID of the aligner (if not supplied by the aligned file).
String null
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--READ1_ALIGNED_BAM / -R1_ALIGNED
SAM or BAM file(s) with alignment data from the first read of a pair.
Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.
List[File] []
--READ1_TRIM / -R1_TRIM
The number of bases trimmed from the beginning of read 1 prior to alignment
int 0 [ [ -∞ ∞ ] ]
--READ2_ALIGNED_BAM / -R2_ALIGNED
SAM or BAM file(s) with alignment data from the second read of a pair.
Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.
List[File] []
--READ2_TRIM / -R2_TRIM
The number of bases trimmed from the beginning of read 2 prior to alignment
int 0 [ [ -∞ ∞ ] ]
--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null
--showHidden / -showHidden
display hidden arguments
boolean false
--SORT_ORDER / -SO
The order in which the merged reads should be output.
The --SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values:
- unsorted
- queryname
- coordinate
- duplicate
- unknown
SortOrder coordinate
--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--UNMAP_CONTAMINANT_READS / -UNMAP_CONTAM
Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.
boolean false
--UNMAPPED_BAM / -UNMAPPED
Original SAM or BAM file of unmapped reads, which must be in queryname order. Reads MUST be unmapped.
R File null
--UNMAPPED_READ_STRATEGY
How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination). Currently ignored unless UNMAP_CONTAMINANT_READS = true. Note that the DO_NOT_CHANGE strategy will actually reset the CIGAR and set the mapping quality on unmapped reads, since otherwise the result will be an invalid record. To force no change, use the DO_NOT_CHANGE_INVALID strategy.
The --UNMAPPED_READ_STRATEGY argument is an enumerated type (UnmappingReadStrategy), which can have one of the following values:
- COPY_TO_TAG
- DO_NOT_CHANGE
- DO_NOT_CHANGE_INVALID
- MOVE_TO_TAG
UnmappingReadStrategy DO_NOT_CHANGE
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.2.0.0-SNAPSHOT built at Mon, 22 Feb 2021 13:44:49 -0800.