Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Comparison between ReadsPipelineSpark and MarkDuplicates

0

7 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hello AMN,

    Could you be more specific about where you are seeing these results? Which version is marking more or less reads?

    Here is a document with the information we need to look into an abnormal result request: https://gatk.broadinstitute.org/hc/en-us/articles/360053424571-How-to-Write-a-Post

    Sorry that we did not answer the question from the other post, we only provide active support in our forum.

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    AMN

    Hi Genevieve, 

    Thank you for your reply. I have generated a small table comparing the results of each (one using GATK 4.1.9.0 MarkDuplicates, the other using GATK 4.1.9.0 ReadsPipelineSpark). here's a readout of the results:

    sample	totalreads	totalreads_ReadsPipelineSpark	duplicates	duplicates_ReadsPipelineSpark
    sampleA-N 324279 324279 10571 10622
    sampleA-T 382715 382715 72638 73268
    sampleB-N 309270 309270 10589 10660
    sampleB-T 315570 315570 72305 72825

    it seems that in this set of results, there are consistently more duplicates in the bam produced by ReadsPipelineSpark. The java version is

    $ java -version
    openjdk version "1.8.0_242"
    OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
    OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

    i have already shared my command-line parameters. The bams are just small bams extracted from larger WES, but we do not necessarily expect them to be representative of WES. I have looked at the documentation for `ReadsPipelineSpark` to see if there's any duplicate-related parameters, and it looks like the only one is `--duplicates-scoring-strategy`. The default for this parameters in both tools is `SUM_OF_BASE_QUALITIES`, so unless the documentation is incorrect, i cannot improve it. 

    I have not had a chance to look at whether the duplicate reads of MarkDuplicates is a perfect subset of those marked by ReadsPipelineSpark. I also have not looked at all the other ways in which the bams are different yet. I'm hoping to hear whether GATK can reproduce a similar difference with other bams.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi AMN,

    Thank you for this information! Could you do the test with MarkDuplicatesSpark, since it is not a BETA tool, and also is supposed to match the output of MarkDuplicates?

    Thanks,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    AMN

    Hi Genevieve, 

    Thanks for your reply. I tried MarkDuplicatesSpark and the results seem to match ReadsPipelineSpark rather than our implementation of MarkDuplicates. 

    sample totalreads_MarkDuplicatesSpark duplicates_MarkDuplicatesSpark
    sampleA-N 324279 10622
    sampleA-T 382715 73268
    sampleB-N 309270 10660
    sampleB-T 315570 72825

    As a side-note, I confirmed that my input bams were coordinate-sorted, but they are sorted through `samtools merge`. So to test whether slight differences in sort order were the reason for different results, I ran picard's SortSam on SampleA-N, then ran (the non-spark version of) MarkDuplicates, and still got 10571 duplicates. I confirmed that the bam produced by SortSam had the same read order as the bam produced by MarkDuplicatesSpark, so I do not believe that sorting order is the reason for the difference in duplicate count. Removing the `ASSUME_SORT_ORDER` parameter of MarkDuplicates did not change the number of called duplicates from that tool either. Any news on your end? 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi AMN,

    Could you confirm what sort order you had as output when you ran SortSam and then Picard MarkDuplicates?

    I checked with my team and found that the sort order does make a difference. When the bam is coordinate sorted, Picard MarkDuplicates does not mark unmapped mates and secondary/supplementary aligned reads with their parents. MarkDuplicatesSpark sorts the bam to be queryname sorted, so will only have identical results to Picard MarkDuplicates when the inputs are queryname sorted.

    Could you re-run the trial with queryname sorted inputs?

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    AMN

    Sure, I took a single bam and produced two sorted bams with SortSam, one queryname-sorted and the other coordinate-sorted. I ran MarkDuplicates on each one the same way except for the `--ASSUME_SORT_ORDER` parameter. When running it on the coordinate-sorted bam I matched the previous output for MarkDuplicates (10571 for sampleA-N) and when running with the queryname-sorted bam I matched the previous output for MarkDuplicatesSpark (10622 for sampleA-N).

    I think this solves the mystery of the difference for me. I made a mistake in understanding that MarkDuplicatesSpark was the same as MarkDuplicates, just different hardware. Perhaps the documentation for ReadsPipelineSpark should point to MarkDuplicatesSpark rather than MarkDuplicates, but that's just my two cents. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the update! That makes sense, glad we were able to figure out why you were seeing differences.

    I think that some of the issues in the documentation can be attributed to the BETA status of the pipeline. Once it is tested and documented, we will write a more clear description.

    1
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk