MarkDuplicatesSparks not removing some duplicates?
I'm attempting to remove duplicates for some TST170 sequences but end up with what appear to be identical read sequences. For this sequence run, the PCR duplicates were quite high (21K) with MarkDuplicates reducing down to a few hundred coverage. However, some of the resulting reads have identical 5-prime ends. Re running markDuplicates on the already deduplicated data did not alter the results.
For demonstration purposes, here are the fastq for two of several identical reads that were not filtered out:
@NB551226:81:H273NBGXB:2:22111:10308:4079 2:N:0:CATCGAGG+ACGTCCTG
AGGTGTCTTCTAACGCTGATGCAGCTTGACAGCTGAATAATTTTGTGGGAGCTAGCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTG
+
AAA6A/EEEEEEEE6EEEEEEEEEEEEE6E/EEEE/EEEEEEE/EEE/EEEEA/EEEAEAEE6EEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEE<EEEEE
@NB551226:81:H273NBGXB:3:22406:1443:19094 2:N:0:CATCGAGG+ACGTCCTG
AGGTGTCTTCTAACGCTGATGCAGCTTGACAGCTGAATAATTTTGTGGGAGCTAGCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTG
+
/A///EE/6E6AEA//E/EEEEEE/EE6EEA/EE6/EE//E/EEEEEE6EE/EE/A//EEEE/E/E/EEEA/EEE//AEAEE/<EA///6AEEEAE/A/E/
@NB551226:81:H273NBGXB:2:22111:10308:4079 1:N:0:CATCGAGG+ACGTCCTG
TATAAGTTGTATAAAATACTTCTCCATTTTGTTGATTCATAAGCACCTGTAATCCAAAAAATTAAAATGTAATTTTTATGTATGTACTCTGTTTACACTGC
+
AAAAAAE/EEEEEEE6E6EEEEAEEE//EAE/EEEEEEEEEEEEEAAE/E/EEA<EEAAAAEAAEEEEEEAAAEEEEAA<A/EEE6EAEE/EE/EEAEEAA
@NB551226:81:H273NBGXB:3:22406:1443:19094 1:N:0:CATCGAGG+ACGTCCTG
TATAAGTTGTATAAAATACTTCTCCATTTTGTTGATTCATAAGCACCTGTAATCCAAAAAATTAAAATGTAATTTTTATGTATGTACTCTGTTTACACTGC
+
AAAAAE/EAEEAEEEEEEEAAAAA/E/EEEEEAAEAEEEEEEEEA/EE6EEEA/EE//EEEE/EEEE/EEEEE/EEEEE6A/AE<EAAAEEEEAE///EE/
Here are the resulting MarkDuplicate BAM information:
NB551226:81:H273NBGXB:3:22406:1443:19094 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA /EE///EAEEEEAAAE<EA/A6EEEEE/EEEEE/EEEE/EEEE//EE/AEEE6EE/AEEEEEEEEAEAAEEEEE/E/AAAAAEEEEEEEAEEAE/EAAAAA XA:Z:ATM,-281,3S98M,2; MC:Z:101M MD:Z:101 RG:Z:BR1 NM:i:0 AS:i:101 XS:i:88
NB551226:81:H273NBGXB:2:22111:10308:4079 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA AAEEAEE/EE/EEAE6EEE/A<AAEEEEAAAEEEEEEAAEAAAAEE<AEE/E/EAAEEEEEEEEEEEEE/EAE//EEEAEEEE6E6EEEEEEE/EAAAAAA XA:Z:ATM,-281,3S98M,2; MC:Z:101M MD:Z:101 RG:Z:BR1 NM:i:0 AS:i:101 XS:i:88
Here is an IGV view with the two duplicates highlighted:
Can you please provide
a) GATK version used
gatk-4.1.7.0
b) Exact GATK commands used
$gatk4 MarkDuplicatesSpark \
-I "${aligned}/${fp}.bam" \
-O "${deduped}/${fp}_MD.bam" \
-M "${deduped}/${fp}_marked_dup_metrics.txt" \
--optical-duplicate-pixel-distance 2500 \
--remove-sequencing-duplicates \
--conf 'spark.executor.cores=50'
c) The entire error log if applicable.
NA
This result also happens with the regular MarkDuplicates tool.
-
Hi Tom Morrison,
Just curious, does it get removed if you use "remove all duplicates" ?
--remove-all-duplicates - If true do not write duplicates to the output file instead of writing them with appropriate flags set. --remove-sequencing-duplicates - If true do not write optical/sequencing duplicates to the output file instead of writing them with appropriate flags set. -
Hi Tiffany,
Unfortunately, neither of those options eliminated the duplicates.
-
Hey Tom! Ok, I will need to ask a few teammates and will report back.
-
Can you extract all the reads with these two querynames NB551226:81:H273NBGXB:3:22406:1443:19094 and NB551226:81:H273NBGXB:2:22111:10308:4079 share it with us?
-
Hi Tiffany,
I've pasted the pre mark duplicate bam reads for the requested querynames below. The FASTA and markduplicate reads for these querynames was in my original post.
NB551226:81:H273NBGXB:3:22406:1443:19094 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA /EE///EAEEEEAAAE<EA/A6EEEEE/EEEEE/EEEE/EEEE//EE/AEEE6EE/AEEEEEEEEAEAAEEEEE/E/AAAAAEEEEEEEAEEAE/EAAAAA NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:88 RG:Z:BR1 XA:Z:ATM,-281,3S98M,2;
NB551226:81:H273NBGXB:2:22111:10308:4079 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA AAEEAEE/EE/EEAE6EEE/A<AAEEEEAAAEEEEEEAAEAAAAEE<AEE/E/EAAEEEEEEEEEEEEE/EAE//EEEAEEEE6E6EEEEEEE/EAAAAAA NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:88 RG:Z:BR1 XA:Z:ATM,-281,3S98M,2;
-
Can you show us the mates of these reads?
-
I've never outputted separate reads from bwa, I'd have to look into this if this is what you want. The FASTA rows in my original post do have read 1 and 2 for these querynames.
-
Hi Tiffany,
Above, I kept writing FASTA when I meant FASTQ. Also, I did not see a way to configure BWA to output the align reads independently. Do you know what setting I would use? Alternatively, I can do a BMA to FASTQ conversion and provide you the query name results.
-
Hi Tom, We weren't able to replicate your results (aligned fastq with bwa and ran MD) - everything looked as expected. Could you share your pre- Mark Duplicates bam by following these instructions?
-
Hi Tiffany,
I uploaded the zipped file: MD_Issue_200701.zip, the contents are:
Files created from markDuplicates; AgilentMale-BL-R-1-B-190508_S1_MD.bam, AgilentMale-BL-R-1-B-190508_S1_marked_dup_metrics.txt , out.txt, AgilentMale-BL-R-1-B-190508_S1_MD.bam.bai, AgilentMale-BL-R-1-B-190508_S1_MD.bam.sbi
Modified reference genome: hg19_mix4.fasta.dict , hg19_mix4.fasta.fai and hg19_mix4.fasta
Input BAM file: AgilentMale-BL-R-1-B-190508_S1.bam
Command line used: command_line.txt
-
Hi Tom Morrison
I have a couple of suggestions for you to try,
- Can you please try to run it with MarkDuplicates instead of MarkDuplicatesSpark and see if that makes any difference?
- Try to run the data without --optical-duplicate-pixel-distance 2500 argument. I wonder if this value is too high. And lets use all default values and the remove duplicates arguments, and so your command would look like:
$gatk4 MarkDuplicatesSpark \
-I "${aligned}/${fp}.bam" \
-O "${deduped}/${fp}_MD.bam" \
-M "${deduped}/${fp}_marked_dup_metrics.txt" \
--REMOVE-DUPLICATES TRUE\
--REMOVE_SEQUENCING_DUPLICATES TRUE\- The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. So to determine if these reads qualify as duplicates we will need to look at the alignment of the mate pairs. Can you please provide the mate pairs alignments of the reads you think should be filtered out. You can use samtools to extract mate pairs, please look up samtools documentation for more info. Note: QNAME IDs will be identical for mates. Please look up SAM file format specifications for more information.
-
Hi Bahnu,
using non spark MarkDuplicates did not alter the results.
Removing the --optical-duplicate-pixel-distance 2500 argument did not alter the result.
Have you attempted to reproduce the results with my uploaded files?
Also, FYI, --REMOVE-DUPLICATES TRUE and --REMOVE-DUPLICATES TRUE need to be in lower case and the two suggested options can not be run at same time.
-
Hi Tom Morrison
We reproduce results on our end once we have exhausted all other options. Can you please provide the mate pairs alignments of the reads you think should be filtered out? As mentioned above, MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. The reason I asked for more information on the mate pairs is because maybe they don't match and that is why those reads are not getting filtered out.
Please refer to do the tool docs for the exact arguments:
https://gatk.broadinstitute.org/hc/en-us/articles/360045681971-MarkDuplicates-Picard-#--REMOVE_DUPLICATES
https://gatk.broadinstitute.org/hc/en-us/articles/360045681971-MarkDuplicates-Picard-#--REMOVE_SEQUENCING_DUPLICATES -
Hi Bhanu,
Two example read pairs are in my first post. Both of their fastq R1 & R2, their bam output from markduplicates, and their appearance in IGV. Their 5-prime sequences are identical (their entire sequences are identical) Did you want some additional information?
-
Hi Tom Morrison
My sincere apologies for the delay in getting back to you. This question fell off my radar. I will however prioritize it now.
I took a look at the data you shared with us. I am looking for the sequence in the "AgilentMale-BL-R-1-B-190508_S1.bam" file in the zip folder. I tried to search for the QNAME and the Sequence from your first post in the bam file and couldn't find either. If you are still facing this issue, can you please share the bam records of the sequences with me? You can do this by using the grep command for QNAME or SEQ on the output of "samtools view <bam>" command.
Please sign in to leave a comment.
15 comments