Discrepancy between GATK joint vcf and recaled.bam files
AnsweredWe did 30x coverage whole genome sequencing of 8 inbred DBA/2J mice that showed phenotypic differences between Group 82 (4 mice) and Group 87 (4 mice), in order to discover if there are any genotypic differences that could explain the phenotypic differences. After going through the Sentieon GATK pipeline, we used the vcf files to select some variants of interest to look at more closely.
However, when we looked at our first region of interest in IGV using the recaled.bam files (loaded recaled.bam file into IGV → zoomed in to 41bp window at region of interest → exported alignments → searched for sequence of interest), we saw that there was no difference in the ratios of REF:ALT between the Group 82 and Group 87 mice (Table 1), contradicting the vcf file which called the Group 82 mice as 0/1 and the Group 87 mice as 0/0 (all with GQ>20).
A separate discrepancy also occurred when we looked at our second region of interest (Fig. 1): the top left small windows show the joint vcf file calling Mouse B82_1’s AD as “26,3” and Mouse B87_1’s AD as “24,4”. However, the bottom right small windows show the recaled.bam files showing different allele depths -Mouse B82_1 is “26,9” and Mouse B87_1 is “24,5”. The same can be seen when we look at our third region of interest (Fig. 2). From what I understand of the GATK pipeline, recaled.bam files contain analysis ready reads, so I am not sure why this discrepancy is occurring -is there some other filtering step between recaled.bam and generating the joint vcf file that removes some of the reads, resulting in different allele depths being called? If yes, which raw reads should we look at to validate which of our variants of interests could be true variants to proceed with further Sanger sequencing validation?
Finally, based on these discrepancies, we have some questions:
- Are these problems caused by using the less well-studied DBA/2J genome? Would it be better if we used a cleaned up reference genome like C57/BL6? However, if we use a different reference genome, we are worried that that will cause some reads to not align well, because our mice are DBA/2J, not C57/BL6.
- Is it common to have mRNA contamination interfere with read alignment and variant calling?
We would appreciate any other suggestions or comments, thank you!
-
Here are the tables and figures:
Fig. 2
Fig. 1
-
Hi Xiao Ran Luo,
I am going to move your post into our Community Discussions -> General Discussion topic, as the Non-Human topic is for reporting bugs and issues with GATK.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Best,
Genevieve
-
Hi Xiao Ran Luo,
Thanks for writing into the forum! I have a few key suggestions that might help you to make sense of what is going on with your samples.
- We have a troubleshooting document that goes over why certain variants are called. When HaplotypeCaller and Mutect2 do not call an expected variant. It was written by our developers with steps they follow when troubleshooting these types of cases.
- I would recommend looking at the -bamout file from HaplotypeCaller to see what these variants look like after all of the read realignment and other steps in the HaplotypeCaller algorithm. HaplotypeCaller is not a simple pileup caller, see this article for more details.
- There are certain read filters that are used by HaplotypeCaller, listed in the tool documentation under Read Filters.
- I wasn't able to understand your first example, the table didn't really make it clear for me. Can you show the actual VCF lines and bamout view?
- For the second example, the HOM_REF sample has a GQ of 0, which indicates that the tool identified something happening in the region, but could not call it correctly.
- Could you share the bamout example in IGV for the third example and the VCF lines?
- We have an article about when allele depth is lower than expected, you can take a look here.
I'm not sure about the genomes you mentioned in your post, or about specific mRNA contamination. Potentially other users who have insight into these topics can chime in. Let me know what you find and if you have further questions.
Best,
Genevieve
-
Hi Genevieve,
Thank you so much for your reply!
Re: 4, here are the VCF lines for the 1st example, bmp2k, with some columns that only contained ‘NA’ deleted:
Re: 6, here are the VCF lines for the 3rd example, spin2c, with some columns that only contained ‘NA’ deleted:
Please let me know if you have any additional questions, and thanks again for your help!
Best,
Luna
-
Hi Luna,
It's too difficult to see this info in screenshots, could you paste the text VCF file into a comment?
Thanks,
Genevieve
Please sign in to leave a comment.
5 comments