Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

How should I pre-process data from multiplexed sequencing and multi-library designs?


  • Ury Alon

    Thanks for the great explanation.

    Regarding the following line:

    "Note that we used to do a first round of marking duplicates here for QC purposes but tool improvements have rendered this obsolete"

    If I am interested in per-lane statistics (namely, how many duplicates there are per lane), how can they be extracted if MarkDuplicates is executed only once, when merging the lanes into a single BAM?

    Looking at the metrics file (I'm using gatk v4.1.7.0), I see that the results are reported per library. Does that mean that if I want per-lane statistics, I should modify the read groups so that each lane has a distinct library (currently they all have the same library)?
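
One possible way to get per-lane counts without touching the library field is to tally the duplicate flag per read group directly from the marked reads. The following is only a minimal sketch, assuming each lane has its own read group ID and that the merged BAM still carries an RG tag on every read; the function name is hypothetical and not part of GATK. It counts SAM records, so the numbers will not match the pair-based figures in the MarkDuplicates metrics file.

```python
from collections import Counter

DUPLICATE_FLAG = 0x400  # SAM FLAG bit that marks a read as a PCR or optical duplicate


def duplicates_per_read_group(sam_lines):
    """Return {read_group_id: (duplicate_records, total_records)} from SAM text lines."""
    totals, dups = Counter(), Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (@HD, @SQ, @RG, ...)
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Optional tags start at column 12; RG:Z:<id> names the read group
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), "unassigned")
        totals[rg] += 1
        if flag & DUPLICATE_FLAG:
            dups[rg] += 1
    return {rg: (dups[rg], totals[rg]) for rg in totals}
```

The function accepts any iterable of SAM-formatted lines, for example `sys.stdin` when the output of `samtools view` on the merged BAM is piped into a script.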


  • Fred Zhou

    Finally I found this thread.

    "Once you have pre-processed each read group individually,
    you merge read groups belonging to the same sample into a single BAM file.
    You can do this as a standalone step,
    but for the sake of efficiency we combine this with the per-sample duplicate marking step
    (it's simply a matter of passing the multiple inputs to MarkDuplicates in a single command)."

    This is exactly what confused me... I tried this, but MarkDuplicates discarded the RG info and I cannot proceed...
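
For reference, "passing the multiple inputs to MarkDuplicates in a single command" can be sketched as below. This is an illustration rather than the article's exact invocation: the helper name and the file names are hypothetical, and it assumes the `gatk` launcher is on the PATH. The tool takes one `-I` per input BAM, one `-O` for the merged output, and `-M` for the metrics file.

```python
def mark_duplicates_command(lane_bams, merged_bam, metrics_file):
    """Build the argument list for one MarkDuplicates call over several per-lane BAMs."""
    if not lane_bams:
        raise ValueError("at least one input BAM is required")
    cmd = ["gatk", "MarkDuplicates"]
    for bam in lane_bams:
        cmd += ["-I", bam]  # repeat -I once per read-group (lane) BAM
    cmd += ["-O", merged_bam, "-M", metrics_file]
    return cmd


# Hypothetical file names for two lanes of the same sample
command = mark_duplicates_command(
    ["sampleA_lane1.bam", "sampleA_lane2.bam"],
    "sampleA.markdup.bam",
    "sampleA.markdup_metrics.txt",
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to execute it. Checking the output with `samtools view -H` for `@RG` lines is a quick way to see whether the read group information survived the merge.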

  • Loren Cassin Sackett

    What is the recommended workflow if we study non-model organisms? I don't have a set of known SNPs I can use in BQSR, so I was going to proceed straight to HaplotypeCaller... but then I don't understand what comes next. Would I use CombineGVCFs twice, once to combine lanes within a sample, and then again to combine samples? How does this influence downstream analyses? For example, to calculate depth of coverage per individual at a specific locus, can I access that information easily, or are the lanes still treated individually in the final dataset?
