Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

(How to) Map and clean up short read sequence data efficiently Follow

14 comments

  • Avatar
    Sumudu

    Dear Team,

    I have a WES illumina data set for variant calling. And both forward and reverse files have adapter read-through at the 3' end as attached fastqc report. In that case, if I do not trim those adapter parts (e.g. using a trimmer like trimmomatic), except run MarkIlluminaAdapters would be enough since this step will mark those parts with XT and later disregard in the alignment. Am I correct??

    I have fastq files(R1 & R2) and what I did was run FastqToSam with RG information to create uBAM -> MarkIlluminaAdapters -> SamToFastq -> BWA mem -> MergeBamAlignment 

    Thank you

    Best Regards

    Sumudu

     

    R1 file

    R2 file 

     

     

     

     

    2
    Comment actions Permalink
  • Avatar
    Mike Keehan

    Dear GATK team

    I wonder if you could help me. I am following these best practices to produce alignments intended for structural variant calling. The recommended workflow suggests to change the PRIMARY_ALIGNMENT_STRATEGY from the default of BestMapQ to MostDistant.   For SV calling we are essentially trying to work out the most probable insert size and surely we would want to use the pair that gave the best alignments i.e. BestMapQ(?) . I wonder if you could furthur advise on the merits of PRIMARY_ALIGNMENT_STRATEGY in regards to SV calling as many users might follow the tutorial to claim "best practise" adherence.  

    Thanks

    Mike

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Hello,

    I am having a curious issue when aligning my reads to the newer reference of the threespine stickleback, which has a chrY added to it. After following the steps here, I get zero reads aligned to the chrY. When I do the mapping with bowtie2, I get reads aligned.

    What is noticeable with these alignments is that whenever the chrY is present, I get a lower coverage than average for the chrX. Could BWA be removing these duplicate reads from the chrY and therefore giving me a zero coverage? I've tried checking my file before/after the MarkDuplicates step, but I get zero for both. So, I was wondering if it would have something to do with the -M option in BWA. Any help here would be appreciated.

    I am using BWA v.0.7.17.

    Thank you,

    Rachel

    0
    Comment actions Permalink
  • Avatar
    Mike Keehan

    Hi Rachel,

    Are you aligning male fish or female fish? 

    Do X and Y in three spine stickleback  behave the same as mammals (cow,humans)?

    MarkDuplicates remove optical duplicates ie. remove multiple distinct reads that uniquely map to the same position in the genome.

    With a highly repetitive Y (in mammals) the problem is that a single read can map to multiple positions in the genome.

    If you have male fish (and they behave like cows or humans) you might see a coverage change at your pseudo-autosomal regions.

    Mike

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Do you mean like XY and XX? I'm almost sure this is the case. So yeah, it makes sense that coverage of X would drop by half in males, duh. And they do recombine, but crossing over is suppressed in the majority of its length, so I don't really understand why I would not get any alignments to chrY.

    I'm doing alignments in both male and female.

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Ah, no! Nevermind… I've realized the problem was with the annotation file that did not contain the new chromosome. Sorry for this, but hopefully this will help some other lost soul like me in the future!

    0
    Comment actions Permalink
  • Avatar
    Amin

    I had to spend a lot of time reading multiple pages of GATK (and finding the proper page with broken links) to make a workable/simple BASH script that process from lanes FASTQs to deduplicated BAM files. This is useful for understanding the entire procedure.

    I leave the resulting BASH script below as it might be useful for somebody like me who wants to know exactly how this should be done (following GATK recommendations):

    SAMPLE_ID=XXX
    OUTPUT_DIR=./
    N_THREADS=6
    for LANE in {1..4}; do
    READ1=../fastqs/XXX_HLYC7DSXY_S2_L00${LANE}_R1_001.fastq.gz
    READ2=../fastqs/XXX_HLYC7DSXY_S2_L00${LANE}_R2_001.fastq.gz
    OUT_FNAME=${OUTPUT_DIR}/${SAMPLE_ID}.L${LANE}

    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL FastqToSam --FASTQ ${READ1} --FASTQ2 ${READ2} --OUTPUT ${OUT_FNAME}.unaligned.bam --SAMPLE_NAME ${SAMPLE_ID} --LIBRARY_NAME ${SAMPLE_ID}.lib --READ_GROUP_NAME ${SAMPLE_ID}.L${LANE} --SORT_ORDER queryname --TMP_DIR ${TMP_DIR}
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL MarkIlluminaAdapters --INPUT ${OUT_FNAME}.unaligned.bam --OUTPUT ${OUT_FNAME}.adaptMarked.bam --METRICS ${OUT_FNAME}.adaptMarked.metrics.txt --TMP_DIR ${TMP_DIR}
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL SamToFastq --INPUT ${OUT_FNAME}.adaptMarked.bam --FASTQ ${OUT_FNAME}.interleaved.fastq.gz --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 --INTERLEAVE true --INCLUDE_NON_PF_READS true --TMP_DIR ${TMP_DIR}

    bwa mem -K 100000000 -v 3 -t ${N_THREADS} -Y -p ./reference_genomes/hg38/chrAll.fa.gz ${OUT_FNAME}.interleaved.fastq.gz | samtools view -h -b - > ${OUT_FNAME}.aligned.bam

    # Make a dictionary of information about the reference genome (only need to be done once):
    # gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR}" --spark-runner LOCAL CreateSequenceDictionary --REFERENCE ./reference_genomes/hg38/chrAll.fa.gz --OUTPUT ./reference_genomes/hg38/chrAll.fa.gz.dict
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL MergeBamAlignment --REFERENCE_SEQUENCE ./reference_genomes/hg38/chrAll.fa.gz --UNMAPPED_BAM ${OUT_FNAME}.unaligned.bam --ALIGNED_BAM ${OUT_FNAME}.aligned.bam --OUTPUT ${OUT_FNAME}.alnMerged.bam --CREATE_INDEX false --ADD_MATE_CIGAR true --CLIP_ADAPTERS true --CLIP_OVERLAPPING_READS true --INCLUDE_SECONDARY_ALIGNMENTS true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY BestMapq --ATTRIBUTES_TO_RETAIN XS

    # or: samtools sort -n -m3G --threads ${N_THREADS} [input] > [output]
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL SortSamSpark --input ${OUT_FNAME}.alnMerged.bam --output ${OUT_FNAME}.bam --sort-order queryname --conf "spark.local.dir=${TMPDIR}" --spark-master local[${N_THREADS}] --spark-verbosity WARN --verbosity WARNING
    done

    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=5" --spark-runner LOCAL MarkDuplicatesSpark --input ${OUTPUT_DIR}/${SAMPLE_ID}.L1.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L2.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L3.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L4.bam --output ${SAMPLE_ID}.deduped.bam --metrics-file ${SAMPLE_ID}.deduped.metrics.txt --spark-master local[${N_THREADS}]
    4
    Comment actions Permalink
  • Avatar
    Jonathon

    The second bullet point in step 1 appears to contain a self referential link. Any chance a link to documentation on generating a ubam from a fastq can be posted?

    4
    Comment actions Permalink
  • Avatar
    John-Hanson Machado

    I agree with Jonathon, the link appeard to direct back to this same page rather than a tutorial. I would also benefit from posting the updated link to the correct tutorial for:

    • To convert FASTQ or revert aligned BAM files, follow directions in this tutorial. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
    2
    Comment actions Permalink
  • Avatar
    Peter Waltman

    For those confused by this tutorial, specifically, the self-referential link to generate a uBam from fastq (or aligned Bam), you can find a backup of the original tutorial here:

    https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/tutorials/6484-how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam

    1
    Comment actions Permalink
  • Avatar
    Ed Ryder

    The path for the tutorial reference files at ftp.broadinstitute.org/bundle/2.8/b37/ doesn't seem to exist anymore. Could you point me in the right direction please? 

    Thanks!

    1
    Comment actions Permalink
  • Avatar
    David Levy-Booth

    There appears to be a small typo in the "Piped command": 

    java -Xmx8G -jar /path/picard.jar SamToFastq \
    I=6483_snippet_markilluminaadapters.bam \
    FASTQ=/dev/stdout \
    CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
    TMPDIR=/path/shlee | \ /path/bwa mem -M -t 7 -p /path/Homosapiens_assembly19.fasta /dev/stdin | \

    Where TMPDIR= should be TMP_DIR=

    1
    Comment actions Permalink
  • Avatar
    Augustin Chen

    Dear GATK team,

    Thank you very much for this very complete tutorial!
    I am trying to implement a pipeline with the latest GATK best practices for pre-processing. For that, I downloaded the latest version of several preprocessing pipelines released on the "gatk-workflows" and "broadinstitute/warp" GitHub repo (see list below). However, I notice some differences with this tutorial like the absence of the command "MarkIlluminaAdapters" in all of them.

    1. Could you please confirm that this step is not implemented in those WDL releases and if possible why so?
    2. I am building a pre-processing pipeline to call somatic variants from WES and WGS using Mutect2 and other tools. In this context, would you recommend me to include MarkIlluminaAdapters in the pre-processing pipeline or not?
    3. Would you have a latest, GATK best-practices pipeline for data pre-processing for somatic variant calling to suggest? I have been looking for several weeks but keep finding discrepancies between the latest GATK pre-processing guidelines like the one I describe above.

    Thank you in advance!

    List of GATK preprocessing pipelines:
    [1] https://github.com/gatk-workflows/gatk4-data-processing/releases/tag/2.1.1 (Apr 1, 2021)
    [2] https://github.com/broadinstitute/warp/releases/tag/ExomeGermlineSingleSample_v3.1.8 (Nov 30, 2022)
    [3] https://github.com/broadinstitute/warp/releases/tag/GDCWholeGenomeSomaticSingleSample_v1.0.1 (Mar 31, 2021) # For this one I considered that it is actually the GDC pipeline and might differ from GATK best practices but still had a look to compare their approach with your pipelines

    1
    Comment actions Permalink
  • Avatar
    Augustin Chen

    Dear Derek Caetano-Anolles,

    I am a bit confused by the recommendation for multiplexed samples to run MarkDuplicates twice, before and after merging:

    "For multiplexed samples, first perform the workflow steps on a file representing one sample and one lane. Then mark duplicates. Later, after some steps in the GATK's variant discovery workflow, and after aggregating files from the same sample from across lanes into a single file, mark duplicates again. These two marking steps ensure you find both optical and PCR duplicates."

    In fact, this GATK tutorial on pre-processing multiplexed data (last edited by you on January 10, 2023, 11:35) recommends merging the BAM files by running them through MarkDuplicates and thus using MarkDuplicates only once. Also, from my understanding of MarkDuplicates, it should be able to distinguish PCR and optical duplicates from the lane information contained in the RG or PU tags (i.e. duplicate reads that were not sequenced on the same lane cannot be optical duplicates).

    Could you please clarify those discrepancies?
    Thank you in advance!

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk