Comparison between ReadsPipelineSpark and MarkDuplicates
GATK version: 4.1.0.0 and 4.1.9.0
Comparing two processes:
```
gatk MarkDuplicates \
--TMP_DIR ./ \
--MAX_RECORDS_IN_RAM 50000 \
--INPUT ${inputBam} \
--METRICS_FILE ${idSample}.metrics \
--ASSUME_SORT_ORDER coordinate \
--CREATE_INDEX true \
--OUTPUT ${idSample}.markduplicates.bam
```
and the following:
```
gatk ReadsPipelineSpark \
-I ${inputBam} \
-R ${genomeFile} \
--known-sites ${dbsnp} \
${known_sites} \
--verbosity INFO \
--output-bam ${idSample}.bam \
-O output.vcf \
--do-not-run-physical-phasing
```
when i compare the bams of each process, i see that there are the same amount of reads but a different amount of reads marked as duplicate (assessed with `samtools view -F 1024`). ReadsPipelineSpark is presumably a wrapper around MarkDuplicates and downstream tools, so why am I getting different results?
I previously asked this question here and did not get a response.
-
Hello AMN,
Could you be more specific about where you are seeing these results? Which version is marking more or less reads?
Here is a document with the information we need to look into an abnormal result request: https://gatk.broadinstitute.org/hc/en-us/articles/360053424571-How-to-Write-a-Post
Sorry that we did not answer the question from the other post, we only provide active support in our forum.
Thank you,
Genevieve
-
Hi Genevieve,
Thank you for your reply. I have generated a small table comparing the results of each (one using GATK 4.1.9.0 MarkDuplicates, the other using GATK 4.1.9.0 ReadsPipelineSpark). here's a readout of the results:
sample totalreads totalreads_ReadsPipelineSpark duplicates duplicates_ReadsPipelineSpark
sampleA-N 324279 324279 10571 10622
sampleA-T 382715 382715 72638 73268
sampleB-N 309270 309270 10589 10660
sampleB-T 315570 315570 72305 72825it seems that in this set of results, there are consistently more duplicates in the bam produced by ReadsPipelineSpark. The java version is
$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)i have already shared my command-line parameters. The bams are just small bams extracted from larger WES, but we do not necessarily expect them to be representative of WES. I have looked at the documentation for `ReadsPipelineSpark` to see if there's any duplicate-related parameters, and it looks like the only one is `--duplicates-scoring-strategy`. The default for this parameters in both tools is `SUM_OF_BASE_QUALITIES`, so unless the documentation is incorrect, i cannot improve it.
I have not had a chance to look at whether the duplicate reads of MarkDuplicates is a perfect subset of those marked by ReadsPipelineSpark. I also have not looked at all the other ways in which the bams are different yet. I'm hoping to hear whether GATK can reproduce a similar difference with other bams. -
Hi AMN,
Thank you for this information! Could you do the test with MarkDuplicatesSpark, since it is not a BETA tool, and also is supposed to match the output of MarkDuplicates?
Thanks,
Genevieve
-
Hi Genevieve,
Thanks for your reply. I tried MarkDuplicatesSpark and the results seem to match ReadsPipelineSpark rather than our implementation of MarkDuplicates.
sample totalreads_MarkDuplicatesSpark duplicates_MarkDuplicatesSpark
sampleA-N 324279 10622
sampleA-T 382715 73268
sampleB-N 309270 10660
sampleB-T 315570 72825As a side-note, I confirmed that my input bams were coordinate-sorted, but they are sorted through `samtools merge`. So to test whether slight differences in sort order were the reason for different results, I ran picard's SortSam on SampleA-N, then ran (the non-spark version of) MarkDuplicates, and still got 10571 duplicates. I confirmed that the bam produced by SortSam had the same read order as the bam produced by MarkDuplicatesSpark, so I do not believe that sorting order is the reason for the difference in duplicate count. Removing the `ASSUME_SORT_ORDER` parameter of MarkDuplicates did not change the number of called duplicates from that tool either. Any news on your end?
-
Hi AMN,
Could you confirm what sort order you had as output when you ran SortSam and then Picard MarkDuplicates?
I checked with my team and found that the sort order does make a difference. When the bam is coordinate sorted, Picard MarkDuplicates does not mark unmapped mates and secondary/supplementary aligned reads with their parents. MarkDuplicatesSpark sorts the bam to be queryname sorted, so will only have identical results to Picard MarkDuplicates when the inputs are queryname sorted.
Could you re-run the trial with queryname sorted inputs?
Genevieve
-
Sure, I took a single bam and produced two sorted bams with SortSam, one queryname-sorted and the other coordinate-sorted. I ran MarkDuplicates on each one the same way except for the `--ASSUME_SORT_ORDER` parameter. When running it on the coordinate-sorted bam I matched the previous output for MarkDuplicates (10571 for sampleA-N) and when running with the queryname-sorted bam I matched the previous output for MarkDuplicatesSpark (10622 for sampleA-N).
I think this solves the mystery of the difference for me. I made a mistake in understanding that MarkDuplicatesSpark was the same as MarkDuplicates, just different hardware. Perhaps the documentation for ReadsPipelineSpark should point to MarkDuplicatesSpark rather than MarkDuplicates, but that's just my two cents.
-
Thanks for the update! That makes sense, glad we were able to figure out why you were seeing differences.
I think that some of the issues in the documentation can be attributed to the BETA status of the pipeline. Once it is tested and documented, we will write a more clear description.
Please sign in to leave a comment.
7 comments