Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

MarkDuplicatesSparks not removing some duplicates?

0

15 comments

  • Avatar
    Tiffany Miller

    Hi Tom Morrison,

    Just curious, does it get removed if you use "remove all duplicates" ?

    --remove-all-duplicates - If true do not write duplicates to the output file instead of writing them with appropriate flags set.
    --remove-sequencing-duplicates - If true do not write optical/sequencing duplicates to the output file instead of writing them with appropriate flags set.
    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Tiffany,

    Unfortunately, neither of those options eliminated the duplicates.

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Hey Tom! Ok, I will need to ask a few teammates and will report back. 

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Can you extract all the reads with these two querynames  NB551226:81:H273NBGXB:3:22406:1443:19094 and NB551226:81:H273NBGXB:2:22111:10308:4079 share it with us?

    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Tiffany,

    I've pasted the pre mark duplicate bam reads for the requested querynames below.  The FASTA and markduplicate reads for these querynames was in my original post.

    NB551226:81:H273NBGXB:3:22406:1443:19094 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA /EE///EAEEEEAAAE<EA/A6EEEEE/EEEEE/EEEE/EEEE//EE/AEEE6EE/AEEEEEEEEAEAAEEEEE/E/AAAAAEEEEEEEAEEAE/EAAAAA NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:88 RG:Z:BR1 XA:Z:ATM,-281,3S98M,2;

    NB551226:81:H273NBGXB:2:22111:10308:4079 83 chr11 108117636 60 101M = 108117582 -155 GCAGTGTAAACAGAGTACATACATAAAAATTACATTTTAATTTTTTGGATTACAGGTGCTTATGAATCAACAAAATGGAGAAGTATTTTATACAACTTATA AAEEAEE/EE/EEAE6EEE/A<AAEEEEAAAEEEEEEAAEAAAAEE<AEE/E/EAAEEEEEEEEEEEEE/EAE//EEEAEEEE6E6EEEEEEE/EAAAAAA NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:88 RG:Z:BR1 XA:Z:ATM,-281,3S98M,2;

     

     

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Can you show us the mates of these reads?

    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    I've never outputted separate reads from bwa, I'd have to look into this if this is what you want.  The FASTA rows in my original post do have read 1 and 2 for these querynames.

    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Tiffany,

    Above, I kept writing FASTA when I meant FASTQ.  Also, I did not see a way to configure BWA to output the align reads independently.  Do you know what setting I would use?  Alternatively, I can do a BMA to FASTQ conversion and provide you the query name results.

    0
    Comment actions Permalink
  • Avatar
    Tiffany Miller

    Hi Tom, We weren't able to replicate your results (aligned fastq with bwa and ran MD) - everything looked as expected. Could you share your pre- Mark Duplicates bam by following these instructions?

    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Tiffany,

    I uploaded the zipped file: MD_Issue_200701.zip, the contents are:

    Files created from markDuplicates; AgilentMale-BL-R-1-B-190508_S1_MD.bam,  AgilentMale-BL-R-1-B-190508_S1_marked_dup_metrics.txt , out.txt, AgilentMale-BL-R-1-B-190508_S1_MD.bam.bai, AgilentMale-BL-R-1-B-190508_S1_MD.bam.sbi 

    Modified reference genome: hg19_mix4.fasta.dict , hg19_mix4.fasta.fai and hg19_mix4.fasta 

    Input BAM file: AgilentMale-BL-R-1-B-190508_S1.bam

    Command line used: command_line.txt 

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi Tom Morrison

    I have a couple of suggestions for you to try, 

    • Can you please try to run it with MarkDuplicates instead of MarkDuplicatesSpark and see if that makes any difference?
    • Try to run the data without --optical-duplicate-pixel-distance 2500 argument. I wonder if this value is too high. And lets use all default values and the remove duplicates arguments, and so your command would look like:
    $gatk4 MarkDuplicatesSpark \
    -I "${aligned}/${fp}.bam" \
    -O "${deduped}/${fp}_MD.bam" \
    -M "${deduped}/${fp}_marked_dup_metrics.txt" \
    --REMOVE-DUPLICATES TRUE\
    --REMOVE_SEQUENCING_DUPLICATES TRUE\
    • The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. So to determine if these reads qualify as duplicates we will need to look at the alignment of the mate pairs. Can you please provide the mate pairs alignments of the reads you think should be filtered out. You can use samtools to extract mate pairs, please look up samtools documentation for more info. Note: QNAME IDs will be identical for mates. Please look up SAM file format specifications for more information. 
    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Bahnu,

    using non spark MarkDuplicates did not alter the results.

    Removing the  --optical-duplicate-pixel-distance 2500 argument did not alter the result.

    Have you attempted to reproduce the results with my uploaded files?

     

    Also, FYI, --REMOVE-DUPLICATES TRUE and  --REMOVE-DUPLICATES TRUE need to be in lower case and the two suggested options can not be run at same time.

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi Tom Morrison

     

    We reproduce results on our end once we have exhausted all other options. Can you please provide the mate pairs alignments of the reads you think should be filtered out? As mentioned above, MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. The reason I asked for more information on the mate pairs is because maybe they don't match and that is why those reads are not getting filtered out. 

    Please refer to do the tool docs for the exact arguments:
    https://gatk.broadinstitute.org/hc/en-us/articles/360045681971-MarkDuplicates-Picard-#--REMOVE_DUPLICATES
    https://gatk.broadinstitute.org/hc/en-us/articles/360045681971-MarkDuplicates-Picard-#--REMOVE_SEQUENCING_DUPLICATES

    0
    Comment actions Permalink
  • Avatar
    Tom Morrison

    Hi Bhanu,

    Two example read pairs are in my first post.  Both of their fastq R1 & R2, their bam output from markduplicates, and their appearance in IGV.  Their 5-prime sequences are identical (their entire sequences are identical)  Did you want some additional information?

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi Tom Morrison

    My sincere apologies for the delay in getting back to you. This question fell off my radar. I will however prioritize it now.

     

    I took a look at the data you shared with us. I am looking for the sequence in the "AgilentMale-BL-R-1-B-190508_S1.bam" file in the zip folder. I tried to search for the QNAME and the Sequence from your first post in the bam file and couldn't find either. If you are still facing this issue, can you please share the bam records of the sequences with me? You can do this by using the grep command for QNAME or SEQ on the output of "samtools view <bam>" command.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk