Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Adapter trimming: how can permanent adapter trimming, or hard clipping, be achieved when following GATK Best Practices and commencing data preprocessing with uBAM files?

4 comments

  • SkyWarrior

    This was one of my earlier questions back in the old forum, and my short takeaway from the response was that it depends on your personal preference regarding data integrity. I used to mark adapter sequences with a base quality of 2 to keep them from causing any interference; however, my later experience led me to remove all adapters completely and make sure that the FastQC adapter content graph is a flat line.

    If you wish to continue using the MarkIlluminaAdapters and SamToFastq practice, make sure that you either remove the adapters or mark them and lower their base quality scores, and that you merge with the uBAM generated from the modified FASTQ files, not with the original unmodified uBAM from before the marking stage.

    In general, adapter cleanup is a necessary step no matter how you handle your data, so don't hesitate to modify the Best Practices workflows when something does not fit your needs.
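
As a rough illustration of the marking workflow discussed above, here is a minimal Python sketch that only assembles the Picard command lines (it does not run them). The file names, the `picard.jar` path and the helper `picard_cmd` are placeholders invented for this example; the alignment step between SamToFastq and MergeBamAlignment is omitted, and real runs will usually need further options.

```python
from typing import List


def picard_cmd(tool: str, **args: str) -> List[str]:
    """Build a Picard command line using the legacy KEY=VALUE argument syntax."""
    return ["java", "-jar", "picard.jar", tool] + [f"{k}={v}" for k, v in args.items()]


# All file names below are hypothetical placeholders.
# Step 1: tag adapter start positions with XT.
mark = picard_cmd(
    "MarkIlluminaAdapters",
    I="sample.unmapped.bam",
    O="sample.markadapters.bam",
    M="sample.markadapters.metrics.txt",
)

# Step 2: write FASTQ, setting base qualities to 2 from the XT position onward.
to_fastq = picard_cmd(
    "SamToFastq",
    I="sample.markadapters.bam",
    FASTQ="sample.interleaved.fq",
    CLIPPING_ATTRIBUTE="XT",
    CLIPPING_ACTION="2",
    INTERLEAVE="true",
)

# Step 3 (after alignment, not shown): merge alignments with the unmapped BAM.
# Following the advice above, UNMAPPED_BAM should be the uBAM that matches
# the FASTQ that was actually aligned.
merge = picard_cmd(
    "MergeBamAlignment",
    ALIGNED_BAM="sample.aligned.bam",
    UNMAPPED_BAM="sample.unmapped.bam",
    OUTPUT="sample.merged.bam",
    R="reference.fasta",
)

for cmd in (mark, to_fastq, merge):
    print(" ".join(cmd))
```

Changing `CLIPPING_ACTION` from `2` to `X` in the SamToFastq step would trim the adapter bases in the FASTQ output instead of lowering their qualities.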

  • ISmolicz

    Thank you for your reply and advice SkyWarrior. I have read information in multiple resources and legacy forum posts and understand there is not a straightforward answer. Could I ask what it was in your later practices that led you to completely remove the adapters rather than using a base quality of 2?

  • SkyWarrior

    Sure. When our data came entirely from a single source, marking was a workable option. However, after we acquired our own sequencer and started accumulating data from multiple sources, each with different practices for producing FASTQ files, I changed my approach and started removing all traces of adapters from the data.

  • ISmolicz

    Thank you for the additional information SkyWarrior.

    I have previously questioned why permanent hard clipping is not a supported option within GATK workflows as there are many external recommendations and resources for quality trimming and adapter removal. I understand there is the tool ClipReads but hard clipping is not supported. 

    As it may help others in the future, I thought I would collate the reasoning for this based on my current understanding:

    1) Quality trimming is discouraged because GATK tools are quality-aware: hard clipping based on quality scores could affect BQSR (CollectBaseDistributionByCycle (Picard)), which is part of the GATK Best Practices (Data pre-processing for variant discovery), and other quality control assessments.

    2) For adapter trimming, only soft clipping is available after MergeBamAlignment (Picard). Current and legacy documentation describes how the base quality scores of bases marked by the XT tag can be set to 2 to prevent adapter sequences from contributing to alignment (awaiting confirmation - see separate post), how BQSR will ignore those bases, and how they will not contribute to HaplotypeCaller variant calls. Looking at the current default settings (GATK version 4.2.0.0) for BQSR, HaplotypeCaller and Mutect2, this continues to be the case as long as the base qualities of the adapter bases remain at 2 after MergeBamAlignment (Picard) (awaiting confirmation - see separate post).
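
To make the difference between marking and hard clipping concrete, the following is a small Python sketch of the three SamToFastq-style clipping actions applied to a single read. It is a simplified model, not Picard's implementation: it assumes the XT value is the 1-based position of the first adapter base and that qualities are Phred+33 encoded. The function name `apply_clipping` and the example read are made up for illustration.

```python
from typing import Tuple


def apply_clipping(seq: str, qual: str, xt: int, action: str) -> Tuple[str, str]:
    """Apply a clipping action from 1-based position xt to the end of the read.

    "X" trims bases and qualities, "N" masks the bases as N, and an integer
    string sets the base qualities to that Phred value (Phred+33 encoding).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if not 1 <= xt <= len(seq):
        raise ValueError("xt must be a 1-based position within the read")
    keep = xt - 1            # bases before the adapter start
    tail = len(seq) - keep   # bases from the adapter start onward
    if action == "X":
        return seq[:keep], qual[:keep]
    if action == "N":
        return seq[:keep] + "N" * tail, qual
    phred = int(action)      # e.g. "2" gives '#' in Phred+33
    return seq, qual[:keep] + chr(phred + 33) * tail


read_seq, read_qual = "ACGTACGTAG", "IIIIIIIIII"
print(apply_clipping(read_seq, read_qual, 7, "2"))  # qualities lowered, bases kept
print(apply_clipping(read_seq, read_qual, 7, "X"))  # adapter bases removed
print(apply_clipping(read_seq, read_qual, 7, "N"))  # adapter bases masked
```

With action `2` the read keeps its full length and only the quality string changes, which is why a tool that scans sequence content can still report the adapter.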

    A couple of cautions:

    1) If one is analysing their data with FastQC, the FastQC Report will not necessarily reflect the above clipping settings and considerations. Therefore, it will most likely show the presence of adapters in the Adapter Content section even though these are accounted for.

    2) As SkyWarrior has mentioned, a new uBAM could be generated from FASTQ files with adapters hard clipped or base quality scores permanently changed. However, during the original uBAM -> FASTQ -> new uBAM process, certain meta information and tags will be lost and not easily added later to the new uBAM (e.g. RX tags for molecular indexes). Therefore, this must be considered depending on planned downstream analyses.
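
The second caution can be illustrated with a toy Python model of the uBAM -> FASTQ -> new uBAM round trip. It assumes, as a simplification, that a FASTQ record carries only the read name, bases and qualities, so any tag not re-added explicitly is gone afterwards. The record layout and the functions `fastq_round_trip` and `lost_tags` are hypothetical and exist only for this example.

```python
from typing import Dict, List


def fastq_round_trip(record: Dict) -> Dict:
    """Model uBAM -> FASTQ -> uBAM: only name, bases and qualities survive."""
    return {
        "name": record["name"],
        "seq": record["seq"],
        "qual": record["qual"],
        "tags": {},  # assumption: no tags are carried through the FASTQ
    }


def lost_tags(before: Dict, after: Dict) -> List[str]:
    """Return the tags present before the round trip but missing afterwards."""
    return sorted(set(before["tags"]) - set(after["tags"]))


# Hypothetical read with a molecular index (RX) and an adapter mark (XT).
original = {
    "name": "read1",
    "seq": "ACGTACGTAG",
    "qual": "IIIIIIIIII",
    "tags": {"RX": "ACGTTGCA", "XT": 7},
}

rebuilt = fastq_round_trip(original)
print(lost_tags(original, rebuilt))
```

A check along these lines, run against the tags that planned downstream tools require, can flag a problem before the original uBAM is discarded.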

    I hope this is helpful and I look forward to hearing any further thoughts from the GATK Community.
