Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

IlluminaBasecallsToSam: How do I specify one unique RGID per output uBAM?

  • Genevieve Brandt (she/her)

    ISmolicz have you considered using the BARCODES_DIR file that is created with ExtractIlluminaBarcodes?

  • ISmolicz

    Thank you for your reply Genevieve Brandt.

    Please could you clarify what you mean by using the BARCODES_DIR file? I understand ExtractIlluminaBarcodes creates a metrics file and multiple _barcode.txt files per lane, which are within BARCODES_DIR, but I do not understand how these could assist with appropriately defining the RGID in each uBAM.

    Thank you again.

  • Genevieve Brandt (she/her)

    ISmolicz You are right, I believe I misread your post before. I will follow up when I have more information.

  • ISmolicz

    Thank you Genevieve Brandt. This is important for all downstream processes as it is the read groups set in IlluminaBasecallsToSam that are carried forward and considered with MergeBamAlignment and beyond.

  • Yossi Farjoun

    The documentation needs updating: you cannot specify the read group ID in the LIBRARY_PARAMS file.

    You can specify (one) read group ID via the `READ_GROUP_ID` parameter, but it will be the same ID for all the resulting uBAMs. I am not aware of a way to specify different RGIDs for the different uBAMs without running the tool multiple times.

    But I have to ask, why do you care what the RGID is?

    You shouldn't store important information in that identifier or index your data with it, since it isn't built for that purpose. Picard (and other) tools will modify the RGID at will and will not honor it. If you need to identify your read group, I suggest you use the PU field instead; however, make sure that you pick a value that is truly, globally unique.

    The SAM spec (https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf) specifically says that read group IDs may be modified when merging SAM files.
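
The suggestion to identify read groups by PU rather than by RGID can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `parse_rg_line` and `index_by_platform_unit` are hypothetical, the header lines and their ID, PU, SM, and LB values are made up, and the code works on plain `@RG` header text rather than on real uBAM files.

```python
def parse_rg_line(line):
    """Parse one tab-separated @RG header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line: " + line)
    # Each remaining field looks like "ID:foo"; split on the first colon only,
    # because values such as PU may themselves contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


def index_by_platform_unit(header_lines):
    """Map PU -> read group record, rejecting missing or duplicate PU values."""
    by_pu = {}
    for line in header_lines:
        if not line.startswith("@RG\t"):
            continue  # skip @HD, @SQ, @PG and other header lines
        rg = parse_rg_line(line)
        pu = rg.get("PU")
        if pu is None:
            raise ValueError("read group %s has no PU field" % rg.get("ID"))
        if pu in by_pu:
            raise ValueError("duplicate PU: " + pu)
        by_pu[pu] = rg
    return by_pu


# Hypothetical @RG lines from two uBAMs that share an RGID but differ in PU.
headers = [
    "@RG\tID:FLOWC.2\tPU:FLOWCELL1.2.ACGTACGT\tSM:sampleA\tLB:libA",
    "@RG\tID:FLOWC.2\tPU:FLOWCELL1.2.TGCATGCA\tSM:sampleB\tLB:libB",
]
read_groups = index_by_platform_unit(headers)
```

Here both records carry the same ID, yet each can still be looked up unambiguously through its PU key, which is the property the advice above relies on.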

  • ISmolicz

    Thank you for your input Yossi Farjoun.

    I do have unique PU fields set per read group. However, in the Read groups documentation, it states that 'each read group's ID must be unique' and this was also confirmed by Genevieve Brandt in a separate post (https://gatk.broadinstitute.org/hc/en-us/community/posts/360074345431/comments/360013413052).

    The documentation also states the 'tag identifies which read group each read belongs to' but if RGID is not unique, the RG:Z tag will not identify the read group for each read and will not differ across uBAMs and downstream files.

    Please could you clarify RGID's purpose if it is not unique? Can one still be confident that the correct reads per read group are assigned to the correct uBAM per read group if all other data is appropriately set (PU, LB, SM, etc.) but RGID is not unique?

    Thank you again.

  • Yossi Farjoun

    I realize that this is a documentation issue and will work to fix this in the spec.

    RGID should indeed be unique, but only within each file. Since the IlluminaBasecallsToSam (IBCTS) program emits multiple files, they will all have the same RGID, but each of them will also contain only a single read group.

    The flip side of this weak requirement is that RGID should not be considered immutable: if a downstream tool needs to merge two files that contain the same RGID (with different PU, presumably), the tool is allowed to modify one or both of the RGIDs to maintain uniqueness (within the file). It shouldn't modify the PU fields.

    In short, the tool chain is set up so that you do not need to specifically set the RGID. Each uBAM will get an ID, and when you merge (for example via MarkDuplicates) the output RGIDs will have been "magically" modified so that you still have different read groups with different RGIDs.
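
The merge behaviour described above can be pictured with a small Python sketch. It is not Picard's actual implementation: the function name `merge_read_groups`, the `.N` suffix used for renaming, and the example read groups are all assumptions chosen for illustration. The only rules taken from the explanation are that IDs must end up unique within the merged file and that PU is left untouched.

```python
def merge_read_groups(per_file_read_groups):
    """Combine read groups from several files, renaming RGIDs only on collision.

    per_file_read_groups: one list per input file, each holding dicts of
    @RG tags such as "ID", "PU", and "SM".
    Returns (merged, id_maps), where id_maps[i] maps old ID -> new ID for
    file i; a real tool would also use it to rewrite each read's RG:Z tag.
    """
    used_ids = set()
    merged = []
    id_maps = []
    for read_groups in per_file_read_groups:
        id_map = {}
        for rg in read_groups:
            old_id = rg["ID"]
            new_id = old_id
            suffix = 0
            # Hypothetical renaming scheme: append ".1", ".2", ... until free.
            while new_id in used_ids:
                suffix += 1
                new_id = "%s.%d" % (old_id, suffix)
            used_ids.add(new_id)
            id_map[old_id] = new_id
            renamed = dict(rg)      # copy, so PU and other tags are unchanged
            renamed["ID"] = new_id
            merged.append(renamed)
        id_maps.append(id_map)
    return merged, id_maps


# Hypothetical inputs: two files share an RGID, a third already differs.
file_a = [{"ID": "FLOWC.2", "PU": "FLOWCELL1.2.ACGTACGT", "SM": "sampleA"}]
file_b = [{"ID": "FLOWC.2", "PU": "FLOWCELL1.2.TGCATGCA", "SM": "sampleA"}]
file_c = [{"ID": "other", "PU": "FLOWCELL2.1.GGCCTTAA", "SM": "sampleA"}]
merged, id_maps = merge_read_groups([file_a, file_b, file_c])
```

In this sketch the colliding ID from the second file becomes `FLOWC.2.1`, while the already-unique ID `other` and every PU value pass through unchanged.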

     

    Does this clarify?

  • ISmolicz

    Thank you for your reply Yossi Farjoun - your explanation is helpful.

    Just to confirm, following IlluminaBasecallsToSam (Picard), the uBAMs and downstream files per read group prior to merging should result in a unique RGID and RG header per file? However, when files are merged, the output will include multiple RG headers for the different read groups and either: 1) modified RGIDs to maintain uniqueness per read group, if all inputs had the same RGID, or 2) unmodified RGIDs, if already unique across input files?

    Thank you again for your help.
