GATK Handling of Paralogs
I work on an organism that is highly paralogous (salmon). It generally functions as a diploid with occasional tetraploidy and numerous paralogs.
I am trying to understand how GATK, particularly the germline short variant pipeline, handles paralogs and interprets them for variant calling. When I ran through the pipeline I treated my organism as diploid.
I am using whole genome sequencing (~10x coverage) and aligning to a species-specific (pink salmon) assembly. As with many non-model organisms, the pink salmon genome is assembled to chromosomes but with tens of thousands of additional scaffolds.
For my alignment I used bwa mem with the -M flag. My understanding is that, although a read may map to multiple sites, bwa mem marks the lower-quality mapping as a secondary alignment, which is then ignored by default later in the GATK pipeline. Multiple mapping should therefore show up as a lower MAPQ score, but each read should count at only one site in the subsequent pipeline. I am not certain of this, though, and was hoping for some confirmation.
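To make the secondary-alignment question concrete, here is a minimal sketch of how the SAM FLAG field distinguishes primary, secondary, and supplementary records. The bit values (0x4, 0x100, 0x800) come from the SAM specification; the helper name `classify_alignment` is hypothetical and is not part of bwa or GATK. It only shows how a record's flag can be inspected, not how GATK itself filters reads.

```python
# Bit values defined in the SAM specification's FLAG field.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_alignment(flag: int) -> str:
    """Label a SAM record from its FLAG value (hypothetical helper)."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SECONDARY:
        return "secondary"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    return "primary"


# Example flag values: 99 is a typical properly paired first-in-pair read,
# 355 is the same read with the secondary bit (0x100) also set.
for flag in (99, 355, 2147, 4):
    print(flag, classify_alignment(flag))
```

Tallying these labels over a BAM (for instance from the second column of `samtools view` output) would show how many records are secondary before any GATK step runs.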
Additionally, I was wondering how this translates into hard filtering, specifically the MQ and MQRankSum filters applied after GenotypeGVCFs. I know MQ is a root mean square of the mapping qualities, so it reflects the variation of the scores and not just the mean, but I am not sure how paralogous sites might affect MQ and MQRankSum.
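As a rough illustration of why MQ differs from a plain mean, the sketch below computes the root mean square of a set of mapping qualities. The MAPQ values are made up for illustration, and the function name `rms_mapq` is hypothetical; this is only the arithmetic behind the annotation, not GATK's implementation, and it ignores any read filtering that happens before the annotation is computed.

```python
import math
from statistics import mean


def rms_mapq(mapqs):
    """Root mean square of mapping qualities (hypothetical helper)."""
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))


# Hypothetical site where every read maps uniquely.
unique_site = [60] * 10
# Hypothetical paralogous site: half the reads map ambiguously.
paralog_site = [60] * 5 + [20] * 5

for name, mapqs in (("unique", unique_site), ("paralog", paralog_site)):
    print(name, "mean:", mean(mapqs), "RMS:", round(rms_mapq(mapqs), 2))
```

Because squaring weights high values more heavily, the RMS stays above the mean when MAPQ values are mixed, so a site with a sizeable share of ambiguous reads can still clear a threshold such as the commonly recommended MQ < 40.0 hard filter.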
While I plan to interpret paralogs later, what I am hoping to know is whether they are affecting my standard SNP calling after applying GATK hard filters and the subsequent standard filtering procedures (genotype and individual missingness, MAF, etc.).
I know this post starts to address some of these same questions, but not all of them.
Thanks!
Thank you for your post, Morgan Sparks! I want to let you know we have received your question. We'll get back to you if we have any updates or follow up questions.
Please see our Support Policy for more details about how we prioritize responding to questions.