Mark duplicates in merged bam files sorted by coordinates
Hello
As mentioned in the documentation (https://gatk.broadinstitute.org/hc/en-us/articles/360037224932?page=1#comment_4406762304155), it is ideal to submit queryname-sorted BAM files. Will it be a computationally intensive process if I submit coordinate-sorted BAM files instead?
First, I sorted the unmapped and mapped BAM files by queryname, merged them, and then sorted the merged file by coordinates. Can these coordinate-sorted merged BAM files be used to mark duplicates with MarkDuplicatesSpark? Also, can I subsequently run SetNmMdAndUqTags before running BQSR? Please advise.
Thanks
-
Hi Priyadarshini Thirunavukkarasu,
Coordinate sorting of the merged BAM file is part of the MarkDuplicatesSpark step. The tool marks duplicates and also sorts the BAM file so that it can be used in the rest of the data pre-processing pipeline. I would assume that if you have already sorted the file by coordinates, then running MarkDuplicatesSpark will simply mark the duplicates. Here is the full data pre-processing pipeline for variant discovery, for reference: https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery.
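As a rough illustration of the two steps discussed in this thread, here is a minimal Python sketch that assembles the MarkDuplicatesSpark and SetNmMdAndUqTags command lines and prints them as a dry run. The file names (merged.bam, marked_duplicates.bam, tagged.bam, metrics.txt, reference.fasta) and the helper functions are hypothetical placeholders, not something taken from the original post, and the order of the steps simply follows the question above. As far as I know, the MarkDuplicatesSpark documentation says coordinate-sorted input is accepted but is first re-sorted by queryname internally, which can add run time, so please check the tool page for your GATK version before relying on this.

```python
import shlex
import subprocess


def mark_duplicates_cmd(input_bam, output_bam, metrics=None):
    """Build a MarkDuplicatesSpark command line (arguments as a list)."""
    cmd = ["gatk", "MarkDuplicatesSpark", "-I", input_bam, "-O", output_bam]
    if metrics:
        # Optional duplication metrics file.
        cmd += ["-M", metrics]
    return cmd


def set_tags_cmd(input_bam, output_bam, reference):
    """Build a SetNmMdAndUqTags command line; it expects coordinate-sorted input."""
    return ["gatk", "SetNmMdAndUqTags",
            "-R", reference, "-I", input_bam, "-O", output_bam]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
steps = [
    mark_duplicates_cmd("merged.bam", "marked_duplicates.bam", "metrics.txt"),
    set_tags_cmd("marked_duplicates.bam", "tagged.bam", "reference.fasta"),
]

for step in steps:
    run(step)  # dry run: only prints the commands
```

Calling run(step, dry_run=False) would actually launch GATK, assuming the gatk wrapper is on your PATH. Keeping the commands as lists avoids shell quoting problems with unusual file names.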
Kind regards,
Pamela
-
Hi Priyadarshini Thirunavukkarasu,
I am going to move your post into our Community Discussions -> Documentation Questions topic, as the Germline topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Genevieve