IlluminaBasecallsToSam: How do I specify one unique RGID per output uBAM?
Dear GATK Team,
When running IlluminaBasecallsToSam, I submit a LIBRARY_PARAMS file with the following headers, with the aim of a different ID being inserted into the RG header and tag per read group and therefore, per uBAM generated:
SAMPLE_ALIAS LIBRARY_NAME BARCODE_1 OUTPUT ID
However, although the ID I have specified is inserted into the header of each uBAM file generated per read group, this is in addition to an automated ID generated by IlluminaBasecallsToSam, leading to two ID fields per uBAM header.
The automated RGID is inserted into every uBAM header and tag and is the same in each uBAM. However, the ID I have specified in LIBRARY_PARAMS is unique to each uBAM but only inserted in the header. This leads to an error with ValidateSamFile and downstream processes.
ValidateSamFile output:
## HISTOGRAM java.lang.String
Error Type Count
ERROR:HEADER_TAG_MULTIPLY_DEFINED 1
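For readers who want to see which tag is duplicated, here is a minimal sketch in plain Python that scans one tab-separated SAM header line for tags that occur more than once. The function name and the example header values are hypothetical and are not taken from the actual uBAM. The sketch only illustrates the condition described above; it is not how ValidateSamFile is implemented.

```python
from collections import Counter


def multiply_defined_tags(header_line):
    """Return the tags that appear more than once in one SAM header line."""
    fields = header_line.rstrip("\n").split("\t")
    # fields[0] is the record type (e.g. "@RG"); the rest are TAG:VALUE pairs.
    tags = [field.split(":", 1)[0] for field in fields[1:]]
    counts = Counter(tags)
    return sorted(tag for tag, n in counts.items() if n > 1)


# Hypothetical header resembling the situation described above:
# one automatically generated ID plus one taken from LIBRARY_PARAMS.
example = "@RG\tID:auto_id\tSM:sample1\tLB:lib1\tID:my_unique_id"
print(multiply_defined_tags(example))  # prints ['ID']
```

The header text of a real file could be dumped with a tool such as `samtools view -H` and each `@RG` line passed to this function.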
How do I replace the automated ID set by IlluminaBasecallsToSam to ensure that only one RGID is inserted per read group but also ensuring it is the RGID set in the LIBRARY_PARAMS file, unique per read group?
I am using Picard version 2.23.8. The IlluminaBasecallsToSam option READ_GROUP_ID is currently set at ‘null’ and therefore, I suspect this could be contributing to the issue. However, due to having more than one read group, I do not know how to instruct READ_GROUP_ID to refer to the LIBRARY_PARAMS file and set one unique ID per read group.
Thank you for your time and help.
Kind regards.
-
ISmolicz have you considered using the BARCODES_DIR file that is created with ExtractIlluminaBarcodes?
-
Thank you for your reply Genevieve Brandt.
Please could you clarify what you mean by using the BARCODES_DIR file? I understand ExtractIlluminaBarcodes creates a metrics file and multiple _barcode.txt files per lane, which are within BARCODES_DIR, but I do not understand how these could assist with appropriately defining the RGID in each uBAM.
Thank you again.
-
ISmolicz You are right, I believe I misread your post before. I will follow up when I have more information.
-
Thank you Genevieve Brandt. This is important for all downstream processes as it is the read groups set in IlluminaBasecallsToSam that are carried forward and considered with MergeBamAlignment and beyond.
-
The documentation needs updating... you cannot specify the read group ID in the LIBRARY_PARAMS file.
You can specify (one) read group ID via the `READ_GROUP_ID` parameter, but it will be the same ID for all the resulting uBAMs. I am not aware of a way to specify different RGIDs for the different uBAMs without running the tool multiple times.
But I have to ask, why do you care what the RGID is?
You shouldn't store important information or index your data with that identifier, since it isn't built for that purpose. Picard (and other) tools will modify the RGID at will and will not honor it. If you need to identify your read group, I suggest you use the PU field instead; however, make sure that you pick a value that is truly, globally unique.
The SAM spec (https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf) specifically says that read group IDs may be modified when merging SAM files.
-
Thank you for your input Yossi Farjoun.
I do have unique PU fields set per read group. However, in the Read groups documentation, it states that 'each read group's ID must be unique' and this was also confirmed by Genevieve Brandt in a separate post (https://gatk.broadinstitute.org/hc/en-us/community/posts/360074345431/comments/360013413052).
The documentation also states the 'tag identifies which read group each read belongs to' but if RGID is not unique, the `RG:Z` tag will not identify the read group for each read and will not differ across uBAMs and downstream files. Please could you clarify RGID's purpose if it is not unique? Can one still be confident that the correct reads per read group are assigned to the correct uBAM per read group if all other data is appropriately set (PU, LB, SM, etc.) but RGID is not unique?
Thank you again.
-
I realize that this is a documentation issue and will work to fix this in the spec.
RGID should indeed be unique, but only within each file. Since the IBCTS program emits multiple files, they will all have the same RGID, but each file will also contain only a single read group.
The flipside of this weak requirement is that RGID should not be considered immutable: If a downstream tool needs to merge two files that contain the same RGID (with different PU, presumably) the tool is allowed to modify one or both of the RGIDs to maintain uniqueness (within the file). It shouldn't modify the PU fields.
In short, the tool chain is set up so that you do not need to specifically set the RGID. Each uBAM will get an ID, and when you merge (for example via MarkDuplicates) the output RGIDs will have been "magically" modified so that you still have different read groups with different RGIDs.
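To make the merging behaviour concrete, here is a rough sketch in plain Python of how a merge step could keep RGIDs unique within the output file while leaving PU untouched. It is an illustration only: the function name, the `.1`, `.2` suffix scheme, and the example tag values are all assumptions, and Picard's actual renaming logic may differ.

```python
def merge_read_groups(files):
    """Merge per-file read-group lists, renaming IDs that collide.

    `files` is a list of lists of dicts, each dict holding the tags of one
    @RG line. Returns the merged list plus, per input file, a mapping from
    old ID to new ID so the RG tags on reads could be rewritten to match.
    """
    merged = []
    used_ids = set()
    id_maps = []
    for read_groups in files:
        id_map = {}
        for rg in read_groups:
            new_id = rg["ID"]
            suffix = 0
            # Hypothetical renaming scheme: append ".1", ".2", ... until free.
            while new_id in used_ids:
                suffix += 1
                new_id = f"{rg['ID']}.{suffix}"
            used_ids.add(new_id)
            id_map[rg["ID"]] = new_id
            # Only ID changes; PU and the other tags are carried over as-is.
            merged.append({**rg, "ID": new_id})
        id_maps.append(id_map)
    return merged, id_maps


# Two hypothetical uBAM headers that share the same RGID but differ in PU.
ubam_a = [{"ID": "A", "PU": "FLOWCELL.1.AAAA", "SM": "sample1"}]
ubam_b = [{"ID": "A", "PU": "FLOWCELL.1.CCCC", "SM": "sample2"}]
merged, id_maps = merge_read_groups([ubam_a, ubam_b])
for rg in merged:
    print(rg["ID"], rg["PU"])
```

In this sketch the first file keeps ID `A`, the second becomes `A.1`, and IDs that are already unique across the inputs pass through unchanged. The PU values are never modified, which is why PU is the safer field to rely on for identification.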
Does this clarify?
-
Thank you for your reply Yossi Farjoun - your explanation is helpful.
Just to confirm, following IlluminaBasecallsToSam (Picard), the uBAMs and downstream files per read group prior to merging should result in a unique RGID and RG header per file? However, when files are merged, the output will include multiple RG headers for the different read groups and either: 1) modified RGIDs to maintain uniqueness per read group, if all inputs had the same RGID, or 2) unmodified RGIDs, if already unique across input files?
Thank you again for your help.