Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

(How to) Map and clean up short read sequence data efficiently Follow

12 comments

  • Avatar
    Sumudu

    Dear Team,

    I have a WES illumina data set for variant calling. And both forward and reverse files have adapter read-through at the 3' end as attached fastqc report. In that case, if I do not trim those adapter parts (e.g. using a trimmer like trimmomatic), except run MarkIlluminaAdapters would be enough since this step will mark those parts with XT and later disregard in the alignment. Am I correct??

    I have fastq files(R1 & R2) and what I did was run FastqToSam with RG information to create uBAM -> MarkIlluminaAdapters -> SamToFastq -> BWA mem -> MergeBamAlignment 

    Thank you

    Best Regards

    Sumudu

     

    R1 file

    R2 file 

     

     

     

     

    2
    Comment actions Permalink
  • Avatar
    Mike Keehan

    Dear GATK team

    I wonder if you could help me. I am following these best practices to produce alignments intended for structural variant calling. The recommended workflow suggests to change the PRIMARY_ALIGNMENT_STRATEGY from the default of BestMapQ to MostDistant.   For SV calling we are essentially trying to work out the most probable insert size and surely we would want to use the pair that gave the best alignments i.e. BestMapQ(?) . I wonder if you could furthur advise on the merits of PRIMARY_ALIGNMENT_STRATEGY in regards to SV calling as many users might follow the tutorial to claim "best practise" adherence.  

    Thanks

    Mike

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Hello,

    I am having a curious issue when aligning my reads to the newer reference of the threespine stickleback, which has a chrY added to it. After following the steps here, I get zero reads aligned to the chrY. When I do the mapping with bowtie2, I get reads aligned.

    What is noticeable with these alignments is that whenever the chrY is present, I get a lower coverage than average for the chrX. Could BWA be removing these duplicate reads from the chrY and therefore giving me a zero coverage? I've tried checking my file before/after the MarkDuplicates step, but I get zero for both. So, I was wondering if it would have something to do with the -M option in BWA. Any help here would be appreciated.

    I am using BWA v.0.7.17.

    Thank you,

    Rachel

    0
    Comment actions Permalink
  • Avatar
    Mike Keehan

    Hi Rachel,

    Are you aligning male fish or female fish? 

    Do X and Y in three spine stickleback  behave the same as mammals (cow,humans)?

    MarkDuplicates remove optical duplicates ie. remove multiple distinct reads that uniquely map to the same position in the genome.

    With a highly repetitive Y (in mammals) the problem is that a single read can map to multiple positions in the genome.

    If you have male fish (and they behave like cows or humans) you might see a coverage change at your pseudo-autosomal regions.

    Mike

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Do you mean like XY and XX? I'm almost sure this is the case. So yeah, it makes sense that coverage of X would drop by half in males, duh. And they do recombine, but crossing over is suppressed in the majority of its length, so I don't really understand why I would not get any alignments to chrY.

    I'm doing alignments in both male and female.

    0
    Comment actions Permalink
  • Avatar
    Rachel Turba

    Ah, no! Nevermind… I've realized the problem was with the annotation file that did not contain the new chromosome. Sorry for this, but hopefully this will help some other lost soul like me in the future!

    0
    Comment actions Permalink
  • Avatar
    Amin

    I had to spend a lot of time reading multiple pages of GATK (and finding the proper page with broken links) to make a workable/simple BASH script that process from lanes FASTQs to deduplicated BAM files. This is useful for understanding the entire procedure.

    I leave the resulting BASH script below as it might be useful for somebody like me who wants to know exactly how this should be done (following GATK recommendations):

    SAMPLE_ID=XXX
    OUTPUT_DIR=./
    N_THREADS=6
    for LANE in {1..4}; do
    READ1=../fastqs/XXX_HLYC7DSXY_S2_L00${LANE}_R1_001.fastq.gz
    READ2=../fastqs/XXX_HLYC7DSXY_S2_L00${LANE}_R2_001.fastq.gz
    OUT_FNAME=${OUTPUT_DIR}/${SAMPLE_ID}.L${LANE}

    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL FastqToSam --FASTQ ${READ1} --FASTQ2 ${READ2} --OUTPUT ${OUT_FNAME}.unaligned.bam --SAMPLE_NAME ${SAMPLE_ID} --LIBRARY_NAME ${SAMPLE_ID}.lib --READ_GROUP_NAME ${SAMPLE_ID}.L${LANE} --SORT_ORDER queryname --TMP_DIR ${TMP_DIR}
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL MarkIlluminaAdapters --INPUT ${OUT_FNAME}.unaligned.bam --OUTPUT ${OUT_FNAME}.adaptMarked.bam --METRICS ${OUT_FNAME}.adaptMarked.metrics.txt --TMP_DIR ${TMP_DIR}
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL SamToFastq --INPUT ${OUT_FNAME}.adaptMarked.bam --FASTQ ${OUT_FNAME}.interleaved.fastq.gz --CLIPPING_ATTRIBUTE XT --CLIPPING_ACTION 2 --INTERLEAVE true --INCLUDE_NON_PF_READS true --TMP_DIR ${TMP_DIR}

    bwa mem -K 100000000 -v 3 -t ${N_THREADS} -Y -p ./reference_genomes/hg38/chrAll.fa.gz ${OUT_FNAME}.interleaved.fastq.gz | samtools view -h -b - > ${OUT_FNAME}.aligned.bam

    # Make a dictionary of information about the reference genome (only need to be done once):
    # gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR}" --spark-runner LOCAL CreateSequenceDictionary --REFERENCE ./reference_genomes/hg38/chrAll.fa.gz --OUTPUT ./reference_genomes/hg38/chrAll.fa.gz.dict
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL MergeBamAlignment --REFERENCE_SEQUENCE ./reference_genomes/hg38/chrAll.fa.gz --UNMAPPED_BAM ${OUT_FNAME}.unaligned.bam --ALIGNED_BAM ${OUT_FNAME}.aligned.bam --OUTPUT ${OUT_FNAME}.alnMerged.bam --CREATE_INDEX false --ADD_MATE_CIGAR true --CLIP_ADAPTERS true --CLIP_OVERLAPPING_READS true --INCLUDE_SECONDARY_ALIGNMENTS true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY BestMapq --ATTRIBUTES_TO_RETAIN XS

    # or: samtools sort -n -m3G --threads ${N_THREADS} [input] > [output]
    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=1" --spark-runner LOCAL SortSamSpark --input ${OUT_FNAME}.alnMerged.bam --output ${OUT_FNAME}.bam --sort-order queryname --conf "spark.local.dir=${TMPDIR}" --spark-master local[${N_THREADS}] --spark-verbosity WARN --verbosity WARNING
    done

    gatk --java-options "-Xmx8g -Xms4g -Djava.io.tmpdir=${TMP_DIR} -Dsamjdk.compression_level=5" --spark-runner LOCAL MarkDuplicatesSpark --input ${OUTPUT_DIR}/${SAMPLE_ID}.L1.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L2.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L3.bam --input ${OUTPUT_DIR}/${SAMPLE_ID}.L4.bam --output ${SAMPLE_ID}.deduped.bam --metrics-file ${SAMPLE_ID}.deduped.metrics.txt --spark-master local[${N_THREADS}]
    2
    Comment actions Permalink
  • Avatar
    Jonathon

    The second bullet point in step 1 appears to contain a self referential link. Any chance a link to documentation on generating a ubam from a fastq can be posted?

    2
    Comment actions Permalink
  • Avatar
    John-Hanson Machado

    I agree with Jonathon, the link appeard to direct back to this same page rather than a tutorial. I would also benefit from posting the updated link to the correct tutorial for:

    • To convert FASTQ or revert aligned BAM files, follow directions in this tutorial. The resulting uBAM needs to have its adapter sequences marked as outlined in the next step (step 2).
    1
    Comment actions Permalink
  • Avatar
    Peter Waltman

    For those confused by this tutorial, specifically, the self-referential link to generate a uBam from fastq (or aligned Bam), you can find a backup of the original tutorial here:

    https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/tutorials/6484-how-to-generate-an-unmapped-bam-from-fastq-or-aligned-bam

    2
    Comment actions Permalink
  • Avatar
    Ed Ryder

    The path for the tutorial reference files at ftp.broadinstitute.org/bundle/2.8/b37/ doesn't seem to exist anymore. Could you point me in the right direction please? 

    Thanks!

    0
    Comment actions Permalink
  • Avatar
    David Levy-Booth

    There appears to be a small typo in the "Piped command": 

    java -Xmx8G -jar /path/picard.jar SamToFastq \
    I=6483_snippet_markilluminaadapters.bam \
    FASTQ=/dev/stdout \
    CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true \
    TMPDIR=/path/shlee | \ /path/bwa mem -M -t 7 -p /path/Homosapiens_assembly19.fasta /dev/stdin | \

    Where TMPDIR= should be TMP_DIR=

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk