ApplyBQSR for somatic mutation detection in HLA regions
Hello, I'm a new GATK user and am working on detecting somatic mutations both in non-HLA and HLA regions using Mutect2. For non-HLA mutation calls, I've followed the recommended pipeline both for preprocessing and for somatic mutation calling (in particular genome build b37 as for somatic mutation this is still the supported build).
I already have the HLA types for my samples. Thus, for the HLA regions, I'm following an approach in the spirit of the Polysolver shell_call_hla_mutations_from_type but am trying to also account for the newer GATK software. I've aligned with Novoalign the original fastq files (which I also used for the non-HLA mutation calling pipeline) to the specific HLA-type sequences and have removed dups and used the Polysolver filters and "changing flags and mapping quality" script. At this point Polysolver would run mutect. I plan to run Mutect2 instead. But before running it, I wanted to run ApplyBQSR, to proceed in a manner analogous to what I did for the non-HLA regions. My question is: for this ApplyBQSR step on the HLA alignments, is it appropriate to use the recalibration table previously computed for these samples in the standard pipeline for non-HLA mutation calling? It seems to me that it would be, but I would appreciate feedback. Thanks.
I am going to move your post into our Community Discussions -> New User Advice topic, as the somatic topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Thank you Pamela and sorry for originally posting under the incorrect topic.
Thanks for writing into our forum! Yes, it would be a good idea to use your recalibration tables from the non-HLA regions for your HLA analysis. We wouldn't recommend building a BQSR recalibration model on the HLA chromosomes because the BQSR algorithm assumes mismatches are errors. Since the HLA regions are variant dense, the BQSR model will not be appropriately built.
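As a rough illustration of reusing an existing table, the sketch below only builds (it does not run) an ApplyBQSR command line. The file names and the helper apply_bqsr_cmd are hypothetical placeholders, not anything from this thread. It also assumes that the read group IDs in the HLA-aligned BAM match those in the BAM the table was computed from, since the recalibration table is organized by read group.

```python
import shlex


def apply_bqsr_cmd(hla_bam, reference, recal_table, out_bam):
    """Build (but do not run) a GATK ApplyBQSR command that reuses an existing table.

    All arguments are paths chosen by the caller; nothing here is specific
    to any particular sample or pipeline.
    """
    return [
        "gatk", "ApplyBQSR",
        "-R", reference,                   # FASTA the HLA reads were aligned to
        "-I", hla_bam,                     # HLA-aligned, duplicate-marked BAM
        "--bqsr-recal-file", recal_table,  # table from the standard (non-HLA) run
        "-O", out_bam,
    ]


# Hypothetical file names, for illustration only.
cmd = apply_bqsr_cmd(
    hla_bam="sample1.hla.bam",
    reference="sample1.hla_alleles.fasta",
    recal_table="sample1.b37.recal.table",
    out_bam="sample1.hla.recal.bam",
)
print(shlex.join(cmd))
```

The printed line can be reviewed and pasted into a shell, or the list can be passed to subprocess.run once the paths are real.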
In terms of your overall analysis, we would recommend if possible to upgrade to hg38 or something newer than b37 because b37 does not have good HLA alternate contigs.
Please let us know if you have further questions.
Thank you so much for your response Genevieve, which confirms my initial intuition. If I may tap into your experience with this type of analysis, I do have a couple of related questions.
First, I should clarify that the reason I'm using b37 is because that is the build recommended at https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle for exome data, which is the type of data I'm working on. As mentioned, for somatic mutation in non-HLA regions I've followed the Mutect2 pipeline recommendations. For the HLA regions, I therefore wanted to use the same build. HLA-typing for the individuals in my data was previously done elsewhere, thus, for each individual, I'm aligning to the specific HLA sequences of those individuals. I'm also trying to roughly follow the mutation calling approach of Polysolver (in particular including its pre and post mutation calling processing), but I want to use Mutect2 pipeline as close as possible to that used for the non-HLA regions hence augmenting the polysolver pre- and post-processing to align to the latter. My 2 additional questions:
1. For the post Mutect2 filtering with FilterMutectCalls, it seems to me that (similarly to BQSR) I would use the same contamination table previously computed for these samples in the standard pipeline for non-HLA mutation calling. Is this correct?
2. Instead, for the --ob-priors in FilterMutectCalls on the HLA, I imagine that I have to learn the specific model, by first running Mutect2 on the HLA regions with the --f1r2-tar-gz option, and then using LearnReadOrientationModel on the resulting f1r2.tar.gz. Is this correct?
In short, for BQSR and contamination I would use what was previously computed for these samples in the standard pipeline for non-HLA mutation calling, but for --ob-priors I have to specifically learn the model on the HLA.
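To make the plan concrete, here is a minimal sketch of the three steps as I understand them: Mutect2 with --f1r2-tar-gz on the HLA alignments, LearnReadOrientationModel on the resulting archive, and FilterMutectCalls with the reused contamination table plus the newly learned priors. The commands are only built, not executed. The tumor/normal setup, the file names, and the helper hla_mutect2_cmds are my own assumptions for illustration.

```python
import shlex


def hla_mutect2_cmds(tumor_bam, normal_bam, normal_sample, reference,
                     contamination_table, prefix):
    """Build (but do not run) the three GATK commands for HLA somatic calling.

    The contamination table is the one from the standard non-HLA run;
    the orientation-bias priors are learned from the HLA alignments.
    """
    f1r2 = f"{prefix}.f1r2.tar.gz"
    priors = f"{prefix}.read-orientation-model.tar.gz"
    unfiltered = f"{prefix}.unfiltered.vcf.gz"
    filtered = f"{prefix}.filtered.vcf.gz"

    mutect2 = [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", tumor_bam,
        "-I", normal_bam,
        "-normal", normal_sample,   # SM tag of the normal sample
        "--f1r2-tar-gz", f1r2,      # collect F1R2 counts on the HLA alleles
        "-O", unfiltered,
    ]
    learn = [
        "gatk", "LearnReadOrientationModel",
        "-I", f1r2,
        "-O", priors,
    ]
    filt = [
        "gatk", "FilterMutectCalls",
        "-R", reference,
        "-V", unfiltered,
        "--contamination-table", contamination_table,  # reused from non-HLA run
        "--ob-priors", priors,                         # learned on HLA
        "-O", filtered,
    ]
    return [mutect2, learn, filt]


# Hypothetical file and sample names, for illustration only.
cmds = hla_mutect2_cmds(
    tumor_bam="sample1.tumor.hla.recal.bam",
    normal_bam="sample1.normal.hla.recal.bam",
    normal_sample="sample1_normal",
    reference="sample1.hla_alleles.fasta",
    contamination_table="sample1.b37.contamination.table",
    prefix="sample1.hla",
)
for c in cmds:
    print(shlex.join(c))
```

The point of the sketch is the data flow: the f1r2 archive written by Mutect2 is the input to LearnReadOrientationModel, whose output is what FilterMutectCalls receives as --ob-priors, while the contamination table comes from outside the HLA run.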
I'd appreciate your feedback on this.
Yes, for both of your questions (1 and 2), that sounds like it would get the best results. Just a caveat though - I am not an expert, so don't take my recommendations as more important than what you see in your own research! I just know a lot about how GATK works and runs from working with our many users.
For your hg38 vs b37 - I see why you are running b37. However, if you have hg38 interval lists available, I think it would be best to use hg38. Especially if the HLA regions are important for your analysis. b37 has much worse HLA contigs available and you will miss variants on the HLA chromosomes.
Let me know if you have any further questions.
Thanks Genevieve, I really appreciate your feedback.
Regarding the build, just to clarify, I'm only aligning to b37 for the non-HLA analyses. For HLA I'm aligning to the type-specific allele sequences from Polysolver (I believe derived from IMGT-HLA), not to the contigs from b37.
Oh I see! That makes sense :)