Merge alignment data from a SAM or BAM with data in an unmapped BAM file.
Category Read Data Manipulation
Overview
Summary
A command-line tool for merging BAM/SAM alignment info from a third-party aligner with the data in an unmapped BAM file, producing a third BAM file that has alignment data (from the aligner) and all the remaining data from the unmapped BAM. Quick note: this is not a tool for taking multiple SAM files and creating a bigger file by merging them. For that use case, see MergeSamFiles.
Details
Many alignment tools (still!) require FASTQ-format input. The unmapped BAM may contain useful information that will be lost in the conversion to FASTQ (metadata like sample alias, library, barcodes, etc., and read-level tags). This tool takes an unaligned BAM with metadata, and the aligned BAM produced by calling SamToFastq and then passing the result to an aligner/mapper. It produces a new SAM or BAM file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM (attributes that are otherwise lost in the process of converting to FASTQ). The resulting file will be valid for use by Picard and GATK tools. The output may be coordinate-sorted, in which case the tags NM, MD, and UQ will be calculated and populated, or queryname-sorted, in which case those tags will not be calculated or populated.
Usage example:
java -jar picard.jar MergeBamAlignment \
    ALIGNED=aligned.bam \
    UNMAPPED=unmapped.bam \
    O=merge_alignments.bam \
    R=reference_sequence.fasta
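The merge described under Details can be pictured with a small Python sketch. This is an illustration of the idea only, not Picard's implementation: reads are modeled as plain dictionaries, pairing is ignored, and the field names (`name`, `tags`, `pos`, `cigar`) and the function `merge_records` are hypothetical.

```python
def merge_records(unmapped, aligned):
    """Combine tags from unmapped reads with positions reported by an aligner.

    Both arguments are lists of dicts, a hypothetical stand-in for SAM records.
    Reads are matched by name; reads the aligner did not return stay unmapped.
    """
    aligned_by_name = {rec["name"]: rec for rec in aligned}
    merged = []
    for read in unmapped:
        hit = aligned_by_name.get(read["name"])
        # Metadata always comes from the unmapped record.
        out = {"name": read["name"], "tags": dict(read["tags"])}
        if hit is None:
            out.update(pos=None, cigar=None)  # no alignment: keep as unaligned
        else:
            out.update(pos=hit["pos"], cigar=hit["cigar"])
        merged.append(out)
    return merged


unmapped = [
    {"name": "read1", "tags": {"RG": "lane1", "BC": "ACGT"}},
    {"name": "read2", "tags": {"RG": "lane1", "BC": "TTAG"}},
]
aligned = [{"name": "read1", "pos": 1042, "cigar": "50M"}]
merged = merge_records(unmapped, aligned)
```

Here `read1` ends up with both its alignment and its original `RG` and `BC` tags, while `read2` is carried through as an unaligned read with its tags intact.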
Caveats
This tool has been under development for a while, and many arguments have been added to it over the years. You may be particularly interested in the following (partial) list:
- CLIP_ADAPTERS -- Whether to (soft-)clip the ends of the reads that are identified as belonging to adapters
- IS_BISULFITE_SEQUENCE -- Whether the sequencing originated from bisulfite sequencing, in which case NM will be calculated differently
- ALIGNER_PROPER_PAIR_FLAGS -- Whether to trust the "Proper pair" flag set by the aligner. If false (the default), the tool sets this flag itself based on orientation and distance between pairs.
- ADD_MATE_CIGAR -- Whether to use this opportunity to add the MC tag to each read.
- UNMAP_CONTAMINANT_READS (and MIN_UNCLIPPED_BASES) -- Whether to identify extremely short alignments (with clipping on both sides) as cross-species contamination and unmap the reads.
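When the tool is driven from a script, the options listed above can be appended to the basic invocation. The following Python sketch only assembles the command line, using the KEY=VALUE syntax from the usage example; the helper name `build_merge_command` is hypothetical, `picard.jar` is assumed to be in the working directory, and nothing is executed.

```python
import shlex


def build_merge_command(aligned, unmapped, output, reference, **options):
    """Return the MergeBamAlignment command as a list of arguments.

    Extra keyword arguments are passed through as KEY=VALUE pairs;
    Python booleans are written as lowercase true/false.
    """
    cmd = [
        "java", "-jar", "picard.jar", "MergeBamAlignment",
        f"ALIGNED={aligned}",
        f"UNMAPPED={unmapped}",
        f"O={output}",
        f"R={reference}",
    ]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_merge_command(
    "aligned.bam", "unmapped.bam", "merge_alignments.bam",
    "reference_sequence.fasta",
    CLIP_ADAPTERS=False,
    UNMAP_CONTAMINANT_READS=True,
    MIN_UNCLIPPED_BASES=32,
)
print(shlex.join(cmd))  # a shell-quoted string, handy for logging
```

The resulting list could be handed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell-quoting problems with file paths.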
MergeBamAlignment (Picard) specific arguments
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--OUTPUT -O | null | Merged SAM or BAM file to write to.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--UNMAPPED_BAM -UNMAPPED | null | Original SAM or BAM file of unmapped reads, which must be in queryname order.
Optional Tool Arguments | |
--ADD_MATE_CIGAR -MC | true | Adds the mate CIGAR tag (MC) if true, does not if false.
--ALIGNED_BAM -ALIGNED | [] | SAM or BAM file(s) with alignment data.
--ALIGNED_READS_ONLY | false | Whether to output only aligned reads.
--ALIGNER_PROPER_PAIR_FLAGS | false | Use the aligner's idea of what a proper pair is rather than computing in this program.
--arguments_file | [] | read one or more arguments files and add them to the command line
--ATTRIBUTES_TO_REMOVE | [] | Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.
--ATTRIBUTES_TO_RETAIN | [] | Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.
--ATTRIBUTES_TO_REVERSE -RV | [OQ, U2] | Attributes on negative strand reads that need to be reversed.
--ATTRIBUTES_TO_REVERSE_COMPLEMENT -RC | [E2, SQ] | Attributes on negative strand reads that need to be reverse complemented.
--CLIP_ADAPTERS | true | Whether to clip adapters where identified.
--CLIP_OVERLAPPING_READS | true | For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate.
--EXPECTED_ORIENTATIONS -ORIENTATIONS | [] | The expected orientation of proper read pairs. Replaces JUMP_SIZE
--help -h | false | display the help message
--INCLUDE_SECONDARY_ALIGNMENTS | true | If false, do not write secondary alignments to output.
--IS_BISULFITE_SEQUENCE | false | Whether the lane is bisulfite sequence (used when calculating the NM tag).
--MATCHING_DICTIONARY_TAGS | [M5, LN] | List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.
--MAX_INSERTIONS_OR_DELETIONS -MAX_GAPS | 1 | The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.
--MIN_UNCLIPPED_BASES | 32 | If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.
--PRIMARY_ALIGNMENT_STRATEGY | BestMapq | Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.
--PROGRAM_GROUP_COMMAND_LINE -PG_COMMAND | null | The command line of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_NAME -PG_NAME | null | The name of the program group (if not supplied by the aligned file).
--PROGRAM_GROUP_VERSION -PG_VERSION | null | The version of the program group (if not supplied by the aligned file).
--PROGRAM_RECORD_ID -PG | null | The program group ID of the aligner (if not supplied by the aligned file).
--READ1_ALIGNED_BAM -R1_ALIGNED | [] | SAM or BAM file(s) with alignment data from the first read of a pair.
--READ1_TRIM -R1_TRIM | 0 | The number of bases trimmed from the beginning of read 1 prior to alignment
--READ2_ALIGNED_BAM -R2_ALIGNED | [] | SAM or BAM file(s) with alignment data from the second read of a pair.
--READ2_TRIM -R2_TRIM | 0 | The number of bases trimmed from the beginning of read 2 prior to alignment
--SORT_ORDER -SO | coordinate | The order in which the merged reads should be output.
--UNMAP_CONTAMINANT_READS -UNMAP_CONTAM | false | Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.
--UNMAPPED_READ_STRATEGY | DO_NOT_CHANGE | How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination). Currently ignored unless UNMAP_CONTAMINANT_READS = true
--version | false | display the version number for this tool
Optional Common Arguments | |
--ADD_PG_TAG_TO_READS | true | Add PG tag to each read in a SAM or BAM
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
Deprecated Arguments | |
--JUMP_SIZE -JUMP | null | The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead
--PAIRED_RUN -PE | true | DEPRECATED. This argument is ignored and will be removed.
Argument details
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ADD_MATE_CIGAR / -MC
Adds the mate CIGAR tag (MC) if true, does not if false.
Boolean true
--ADD_PG_TAG_TO_READS / NA
Add PG tag to each read in a SAM or BAM
Boolean true
--ALIGNED_BAM / -ALIGNED
SAM or BAM file(s) with alignment data.
Exclusion: This argument cannot be used at the same time as READ1_ALIGNED_BAM, READ2_ALIGNED_BAM, R1_ALIGNED, R2_ALIGNED.
List[File] []
--ALIGNED_READS_ONLY / NA
Whether to output only aligned reads.
boolean false
--ALIGNER_PROPER_PAIR_FLAGS / NA
Use the aligner's idea of what a proper pair is rather than computing in this program.
boolean false
--arguments_file / NA
read one or more arguments files and add them to the command line
List[File] []
--ATTRIBUTES_TO_REMOVE / NA
Attributes from the alignment record that should be removed when merging. This overrides ATTRIBUTES_TO_RETAIN if they share common tags.
List[String] []
--ATTRIBUTES_TO_RETAIN / NA
Reserved alignment attributes (tags starting with X, Y, or Z) that should be brought over from the alignment data when merging.
List[String] []
--ATTRIBUTES_TO_REVERSE / -RV
Attributes on negative strand reads that need to be reversed.
Set[String] [OQ, U2]
--ATTRIBUTES_TO_REVERSE_COMPLEMENT / -RC
Attributes on negative strand reads that need to be reverse complemented.
Set[String] [E2, SQ]
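Aligners report negative-strand reads reverse complemented relative to the original sequence, so per-base tags copied from the unmapped BAM have to be flipped to stay in step with the bases. The Python sketch below illustrates the two operations on string-valued tags. It is a simplified illustration, not Picard's code: it handles only `OQ`, `U2`, and `E2`, and the function names are hypothetical.

```python
# Complement table for nucleotide strings; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse(value):
    """Reverse a per-base string such as original qualities (OQ)."""
    return value[::-1]


def reverse_complement(bases):
    """Reverse complement a base string such as second-best calls (E2)."""
    return bases.translate(_COMPLEMENT)[::-1]


def fix_negative_strand_tags(tags, to_reverse=("OQ", "U2"),
                             to_reverse_complement=("E2",)):
    """Return a copy of tags with per-base values flipped for a negative-strand read."""
    fixed = dict(tags)
    for tag in to_reverse:
        if tag in fixed:
            fixed[tag] = reverse(fixed[tag])
    for tag in to_reverse_complement:
        if tag in fixed:
            fixed[tag] = reverse_complement(fixed[tag])
    return fixed


fixed = fix_negative_strand_tags({"OQ": "IIH#", "E2": "AACG", "RG": "lane1"})
```

Tags that are not per-base, such as `RG` in this example, are left untouched.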
--CLIP_ADAPTERS / NA
Whether to clip adapters where identified.
boolean true
--CLIP_OVERLAPPING_READS / NA
For paired reads, soft clip the 3' end of each read if necessary so that it does not extend past the 5' end of its mate.
boolean true
--COMPRESSION_LEVEL / NA
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX / NA
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false
--CREATE_MD5_FILE / NA
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--EXPECTED_ORIENTATIONS / -ORIENTATIONS
The expected orientation of proper read pairs. Replaces JUMP_SIZE
Exclusion: This argument cannot be used at the same time as JUMP_SIZE.
List[PairOrientation] []
--GA4GH_CLIENT_SECRETS / NA
Google Genomics API client_secrets.json file path.
String client_secrets.json
--help / -h
display the help message
boolean false
--INCLUDE_SECONDARY_ALIGNMENTS / NA
If false, do not write secondary alignments to output.
boolean true
--IS_BISULFITE_SEQUENCE / NA
Whether the lane is bisulfite sequence (used when calculating the NM tag).
boolean false
--JUMP_SIZE / -JUMP
The expected jump size (required if this is a jumping library). Deprecated. Use EXPECTED_ORIENTATIONS instead
Exclusion: This argument cannot be used at the same time as EXPECTED_ORIENTATIONS, ORIENTATIONS.
Integer null
--MATCHING_DICTIONARY_TAGS / NA
List of Sequence Records tags that must be equal (if present) in the reference dictionary and in the aligned file. Mismatching tags will cause an error if in this list, and a warning otherwise.
List[String] [M5, LN]
--MAX_INSERTIONS_OR_DELETIONS / -MAX_GAPS
The maximum number of insertions or deletions permitted for an alignment to be included. Alignments with more than this many insertions or deletions will be ignored. Set to -1 to allow any number of insertions or deletions.
int 1 [ [ -∞ ∞ ] ]
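As a rough illustration of this filter, the Python sketch below counts insertion and deletion operations in a CIGAR string and compares the count with the limit. It assumes that each `I` or `D` operation counts as one gap regardless of its length, which is an assumption about the tool's behavior rather than something stated here; the function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operator character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_gaps(cigar):
    """Count insertion (I) and deletion (D) operations in a CIGAR string."""
    return sum(1 for _, op in _CIGAR_OP.findall(cigar) if op in "ID")


def is_ignored(cigar, max_gaps=1):
    """Return True if the alignment has more gaps than allowed.

    A max_gaps of -1 disables the filter, mirroring the documented setting.
    """
    if max_gaps == -1:
        return False
    return count_gaps(cigar) > max_gaps
```

With the default limit of 1, `is_ignored("50M2I48M")` is false because the alignment has a single insertion, while `is_ignored("20M1I20M3D20M")` is true because it has one insertion and one deletion.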
--MAX_RECORDS_IN_RAM / NA
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--MIN_UNCLIPPED_BASES / NA
If UNMAP_CONTAMINANT_READS is set, require this many unclipped bases or else the read will be marked as contaminant.
int 32 [ [ -∞ ∞ ] ]
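The kind of check this threshold implies can be sketched in Python as follows. The sketch rests on two assumptions that are not confirmed by this page: that "unclipped bases" means bases in `M`, `=`, or `X` CIGAR operations, and that only reads clipped on both ends are candidates (as the Caveats list suggests). The function name is hypothetical and read pairs are not modeled.

```python
import re

# One CIGAR operation: a length followed by an operator character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def looks_like_contaminant(cigar, min_unclipped_bases=32):
    """Flag a very short alignment that is clipped on both sides."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Need at least clip + aligned segment + clip.
    if len(ops) < 3:
        return False
    clipped_both_ends = ops[0][1] in "SH" and ops[-1][1] in "SH"
    # Assumption: only M, =, and X operations count as unclipped bases.
    unclipped = sum(length for length, op in ops if op in "M=X")
    return clipped_both_ends and unclipped < min_unclipped_bases
```

For example, a read with CIGAR `60S25M65S` would be flagged at the default threshold of 32, whereas `10S130M10S` (long aligned segment) and `25M125S` (clipped on one side only) would not.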
--OUTPUT / -O
Merged SAM or BAM file to write to.
R File null
--PAIRED_RUN / -PE
DEPRECATED. This argument is ignored and will be removed.
Boolean true
--PRIMARY_ALIGNMENT_STRATEGY / NA
Strategy for selecting primary alignment when the aligner has provided more than one alignment for a pair or fragment, and none are marked as primary, more than one is marked as primary, or the primary alignment is filtered out for some reason. For all strategies, ties are resolved arbitrarily.
The --PRIMARY_ALIGNMENT_STRATEGY argument is an enumerated type (PrimaryAlignmentStrategy), which can have one of the following values:
- BestMapq
- EarliestFragment
- BestEndMapq
- MostDistant
PrimaryAlignmentStrategy BestMapq
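As a minimal sketch of what a "best MAPQ" selection means for a single fragment, the Python snippet below picks the candidate alignment with the highest mapping quality. Alignments are modeled as dictionaries with a hypothetical `mapq` field, pairs are not modeled, and this is not a description of how Picard implements the strategy.

```python
def pick_primary_best_mapq(alignments):
    """Return the index of the alignment with the highest MAPQ.

    Ties go to the first such alignment here, which is one arbitrary choice.
    """
    if not alignments:
        raise ValueError("no alignments to choose from")
    return max(range(len(alignments)), key=lambda i: alignments[i]["mapq"])


candidates = [
    {"pos": 1042, "mapq": 10},
    {"pos": 88210, "mapq": 60},
    {"pos": 5150, "mapq": 0},
]
primary_index = pick_primary_best_mapq(candidates)
```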
--PROGRAM_GROUP_COMMAND_LINE / -PG_COMMAND
The command line of the program group (if not supplied by the aligned file).
String null
--PROGRAM_GROUP_NAME / -PG_NAME
The name of the program group (if not supplied by the aligned file).
String null
--PROGRAM_GROUP_VERSION / -PG_VERSION
The version of the program group (if not supplied by the aligned file).
String null
--PROGRAM_RECORD_ID / -PG
The program group ID of the aligner (if not supplied by the aligned file).
String null
--QUIET / NA
Whether to suppress job-summary info on System.err.
Boolean false
--READ1_ALIGNED_BAM / -R1_ALIGNED
SAM or BAM file(s) with alignment data from the first read of a pair.
Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.
List[File] []
--READ1_TRIM / -R1_TRIM
The number of bases trimmed from the beginning of read 1 prior to alignment
int 0 [ [ -∞ ∞ ] ]
--READ2_ALIGNED_BAM / -R2_ALIGNED
SAM or BAM file(s) with alignment data from the second read of a pair.
Exclusion: This argument cannot be used at the same time as ALIGNED_BAM.
List[File] []
--READ2_TRIM / -R2_TRIM
The number of bases trimmed from the beginning of read 2 prior to alignment
int 0 [ [ -∞ ∞ ] ]
--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null
--showHidden / -showHidden
display hidden arguments
boolean false
--SORT_ORDER / -SO
The order in which the merged reads should be output.
The --SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values:
- unsorted
- queryname
- coordinate
- duplicate
- unknown
SortOrder coordinate
--TMP_DIR / NA
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--UNMAP_CONTAMINANT_READS / -UNMAP_CONTAM
Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial sample), and unmap + label those reads accordingly.
boolean false
--UNMAPPED_BAM / -UNMAPPED
Original SAM or BAM file of unmapped reads, which must be in queryname order.
R File null
--UNMAPPED_READ_STRATEGY / NA
How to deal with alignment information in reads that are being unmapped (e.g. due to cross-species contamination). Currently ignored unless UNMAP_CONTAMINANT_READS is true.
The --UNMAPPED_READ_STRATEGY argument is an enumerated type (UnmappingReadStrategy), which can have one of the following values:
- COPY_TO_TAG
- DO_NOT_CHANGE
- MOVE_TO_TAG
UnmappingReadStrategy DO_NOT_CHANGE
--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY / NA
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
- STRICT
- LENIENT
- SILENT
ValidationStringency STRICT
--VERBOSITY / NA
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
- ERROR
- WARNING
- INFO
- DEBUG
LogLevel INFO
--version / NA
display the version number for this tool
boolean false
GATK version 4.0.1.2 built at 02-02-2019 02:02:55.