Mutect2 somatic mutation filtering
Hello, I am a student still in the process of learning. I have a question regarding selecting somatic mutations using Mutect2, so I would like to ask for some guidance.
After running Mutect2, I found approximately 240,000 variants in the unfiltered VCF file. Then, I performed FilterMutectCalls and filtered only the variants with 'PASS', resulting in about 32,000 variants. However, I feel that this number is still too high. I believe I need to further refine the selection of higher-quality mutations using additional criteria such as 'DP', 'ECNT', 'GERMQ', 'MBQ', 'MMQ', but I am not sure about the generally accepted thresholds for these filters.
Since I am comparing tumor samples taken before and after treatment, it is even harder to decide on the appropriate criteria.
I apologize if this is a naive question, but I would greatly appreciate any insights you could provide.
-
Hi microbiome
FilterMutectCalls already uses the exact annotations you propose when it filters Mutect2 calls. Mutect2 itself is quite permissive and will report almost any variation as a candidate variant, but it attaches a detailed set of annotations to each one. Given proper modeling, a PoN, and normals, FilterMutectCalls then tags the variants with various filters that indicate why they were not called as PASS. Depending on your study, it may be normal to have that many variants called as PASS.
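To see which filters account for most of the non-PASS calls, you can tally the FILTER column of the filtered VCF. The following is a minimal sketch in plain Python, not a GATK utility. The function name `tally_filters` and the example records are made up for illustration, and it assumes an uncompressed, tab-delimited VCF read as lines of text.

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count how often each FILTER value appears in the VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) is FILTER; several filters are joined with ';'.
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts

# Made-up records, only to show the input shape.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\tDP=80",
    "chr1\t200\t.\tG\tC\t.\tgermline;weak_evidence\tDP=12",
    "chr2\t300\t.\tC\tT\t.\tweak_evidence\tDP=9",
]

filter_counts = tally_filters(example)
for name, n in filter_counts.most_common():
    print(name, n)
```

For a real file you would pass an open file handle instead of the list, for example `tally_filters(open("filtered.vcf"))`, where the file name is a placeholder.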
Can you provide more details about your samples? Are they whole genome sequencing data or panel data?
Regards.
-
Hello Gökalp Çelik,
Thank you so much for your response and clarification! To answer your question, my data is whole genome sequencing (WGS). The samples I am working with are from tumor tissues, and I am comparing variants before and after treatment. Since the sample size is large and the number of variants that passed the filters is still quite high, I wanted to make sure that I am using the right criteria for filtering.
Would you recommend any specific thresholds for parameters like DP, ECNT, GERMQ, MBQ, and MMQ in the context of WGS? Or should I consider any other post-filtering strategies specific to WGS data to reduce the number of false positives?
I appreciate any further insights or recommendations you might have!
Best regards.
-
Hi microbiome
Since your samples are whole genome sequencing data, it is normal to have this many calls. You could restrict the calls to regions in or near coding segments, which reduces the number of variants you have to deal with and keeps mostly those that can be functionally annotated.
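As a rough illustration of restricting calls to coding regions plus some flanking sequence, here is a small Python sketch. The helper names, the 50 bp padding, and the coordinates are assumptions for the example, not recommended values. With GATK tools the same restriction is normally applied with an interval list via `-L`, optionally with `--interval-padding`.

```python
def pad_intervals(intervals, padding):
    """Expand each 1-based inclusive (chrom, start, end) interval by `padding` bases."""
    return [(chrom, max(1, start - padding), end + padding)
            for chrom, start, end in intervals]

def in_any_interval(chrom, pos, intervals):
    """Return True if the position falls inside at least one interval."""
    return any(c == chrom and s <= pos <= e for c, s, e in intervals)

def restrict_to_intervals(variants, intervals):
    """Keep only variants (chrom, pos, ref, alt) located inside the intervals."""
    return [v for v in variants if in_any_interval(v[0], v[1], intervals)]

# Hypothetical coding intervals and variant calls.
exons = [("chr1", 1000, 1200), ("chr2", 500, 650)]
calls = [
    ("chr1", 980, "A", "T"),   # 20 bp upstream of the first exon
    ("chr1", 5000, "G", "C"),  # far from any exon
    ("chr2", 700, "C", "T"),   # 50 bp downstream of the second exon
    ("chr3", 1000, "T", "G"),  # contig without any interval
]

padded = pad_intervals(exons, padding=50)
kept = restrict_to_intervals(calls, padded)
print(kept)
```

The linear scan over intervals is fine for a sketch. For a whole-genome call set and a full exon list, an interval tree or a dedicated tool would be the more practical choice.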
Other than that, we don't have definitive thresholds for the annotations you listed. Since your samples are pre- and post-treatment, you could use the pre-treatment and the post-treatment data as pseudo-normals in separate runs and then look at how the calls differ between the two states. Completely delineating all variants is not a simple task, but this may be a good start.
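A simpler, post hoc way to look at the differences between the two states is to compare the PASS call sets from the pre- and post-treatment samples directly. The sketch below does this with plain Python sets. It is not the pseudo-normal approach itself, and the function name `variant_keys` and the records are invented for illustration.

```python
def variant_keys(vcf_lines, pass_only=True):
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt, _qual, flt = line.rstrip("\n").split("\t")[:7]
        if pass_only and flt != "PASS":
            continue
        # Multi-allelic sites list several ALT alleles separated by ','.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys

# Made-up records for the two time points.
pre_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tPASS\t.",
    "chr2\t300\t.\tC\tT\t.\tweak_evidence\t.",
]
post_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr2\t300\t.\tC\tT\t.\tPASS\t.",
    "chr3\t400\t.\tT\tG,A\t.\tPASS\t.",
]

pre_keys = variant_keys(pre_vcf)
post_keys = variant_keys(post_vcf)

shared = pre_keys & post_keys   # PASS at both time points
lost = pre_keys - post_keys     # PASS only before treatment
gained = post_keys - pre_keys   # PASS only after treatment
print(len(shared), len(lost), len(gained))
```

Note that a variant missing from one PASS set is not necessarily absent from that sample. As with chr2:300 in the made-up data, it may be present but filtered, or simply below the detection limit at that time point, so the "lost" and "gained" sets are candidates to inspect rather than final answers.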
I hope this helps.
Regards.
-
Hi Gökalp Çelik,
Thank you for the explanation! I’ll try focusing on coding regions and using the pre-treatment data as pseudo-normals to narrow down the variants. I appreciate the advice!
Best regards.