Different read statistics for a common control sample in two Mutect2 runs
Hi, I used Mutect2 (GATK-4.1.4.0) to identify somatic variants in two distinct samples against a common control. However, the read statistics in the two output VCF files show different values for the control sample, which is problematic for our subsequent analysis (clonal evolution). Below is an example (before filtering):
##normal_sample=G4_C
##source=Mutect2
##tumor_sample=G4_P1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G4_C G4_P1
chr1 229654549 . T G . . DP=66;ECNT=11;MBQ=20,20;MFRL=174,103;MMQ=60,60;MPOS=13;NALOD=1.40;NLOD=7.22;POPAF=6.00;TLOD=6.06 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|0:28,0:0.038:28:13,0:14,0:0|1:229654515_C_A:229654515:16,12,0,0 0|1:36,2:0.065:38:13,0:21,2:0|1:229654515_C_A:229654515:19,17,1,1
##normal_sample=G4_C
##source=Mutect2
##tumor_sample=G4_L1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G4_C G4_L1
chr1 229654549 . T G . . DP=122;ECNT=2;MBQ=20,20;MFRL=182,151;MMQ=60,60;MPOS=4;NALOD=1.36;NLOD=6.61;POPAF=6.00;TLOD=6.85 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|0:33,0:0.042:33:19,0:12,0:0|1:229654549_T_G:229654549:16,17,0,0 0|1:80,3:0.051:83:54,1:25,2:0|1:229654549_T_G:229654549:35,45,1,2
The commands that I used for both sample pairs are:
gatk Mutect2 -R Homo_sapiens_assembly38.fasta --germline-resource af-only-gnomad.hg38.vcf.gz -I G4_L1.dedup.recal.bam -tumor G4_L1 -I G4_C.dedup.recal.bam -normal G4_C -L G4_ INTERVALS/0000-scattered.interval_list -O G4_L1_Mutect2nf_0000.vcf --f1r2-tar-gz G4_L1_Mutect2_f1r2_0000.tar.gz
gatk Mutect2 -R Homo_sapiens_assembly38.fasta --germline-resource af-only-gnomad.hg38.vcf.gz -I G4_P1.dedup.recal.bam -tumor G4_P1 -I G4_C.dedup.recal.bam -normal G4_C -L G4_ INTERVALS/0000-scattered.interval_list -O G4_P1_Mutect2nf_0000.vcf --f1r2-tar-gz G4_P1_Mutect2_f1r2_0000.tar.gz
This is scattered across 20 cores using different regions passed with -L. The regions are identical for both sample pairs. This is just a test run on 3 samples to evaluate material quality, which is why I don't use a PoN. I didn't get any errors, and the results look reasonable except for the control sample statistics, which differ at 233 out of 260 common sites.
Does the read filtering in the control depend on the tumor sample? I understand that the active regions differ between the two pairs, which can affect realignment, but why would that lead to such significant differences?
-
Roman Jaksik, you're right that active regions can theoretically affect this, but you shouldn't expect such a big difference from them alone. I think the bigger factor here is that the different tumors are yielding different assembled haplotypes. The annotations that look like 0|1:229654515_C_A in your VCF (the PGT and PID fields) are phasing tags, and you can see that they differ from one run to the other.
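To make that comparison concrete, here is a minimal Python sketch that splits the control (G4_C) column of the two records you posted into its FORMAT fields and lists the ones that differ. The helper name sample_fields is just something made up for this illustration, and it assumes whitespace-separated columns as in your excerpt.

```python
# The two records from the question, copied as posted (whitespace-separated).
RECORD_P1_RUN = (
    "chr1 229654549 . T G . . "
    "DP=66;ECNT=11;MBQ=20,20;MFRL=174,103;MMQ=60,60;MPOS=13;NALOD=1.40;NLOD=7.22;POPAF=6.00;TLOD=6.06 "
    "GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB "
    "0|0:28,0:0.038:28:13,0:14,0:0|1:229654515_C_A:229654515:16,12,0,0 "
    "0|1:36,2:0.065:38:13,0:21,2:0|1:229654515_C_A:229654515:19,17,1,1"
)
RECORD_L1_RUN = (
    "chr1 229654549 . T G . . "
    "DP=122;ECNT=2;MBQ=20,20;MFRL=182,151;MMQ=60,60;MPOS=4;NALOD=1.36;NLOD=6.61;POPAF=6.00;TLOD=6.85 "
    "GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB "
    "0|0:33,0:0.042:33:19,0:12,0:0|1:229654549_T_G:229654549:16,17,0,0 "
    "0|1:80,3:0.051:83:54,1:25,2:0|1:229654549_T_G:229654549:35,45,1,2"
)


def sample_fields(record_line, sample_index=0):
    """Map FORMAT keys to values for one sample column (hypothetical helper).

    Column 9 (1-based) of a VCF record is FORMAT; sample columns follow it.
    sample_index=0 selects the first sample, which is G4_C in both files.
    """
    cols = record_line.split()
    format_keys = cols[8].split(":")
    sample_values = cols[9 + sample_index].split(":")
    return dict(zip(format_keys, sample_values))


normal_p1 = sample_fields(RECORD_P1_RUN)  # G4_C in the G4_P1 run
normal_l1 = sample_fields(RECORD_L1_RUN)  # G4_C in the G4_L1 run

# FORMAT keys whose values for the control differ between the two runs.
differing = [key for key in normal_p1 if normal_p1[key] != normal_l1[key]]

for key in differing:
    print(f"{key}: {normal_p1[key]} vs {normal_l1[key]}")
```

For this site the phase-set fields (PID, PS) differ along with the read counts (AD, DP, F1R2, F2R1, SB), which is consistent with the two runs having assembled different haplotypes.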
If your pipeline demands consistency, you can generate a single assembly for the normal and both tumors in Mutect2's multi-sample mode, where in a single command you specify -I normal.bam -I tumor1.bam -I tumor2.bam -normal G4_C --f1r2-tar-gz joint-f1r2.tar.gz etc. You can do this for an arbitrary number of tumor and normal samples from the same individual.
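As a rough sketch of how such a joint command could be assembled for any number of tumors, the Python snippet below builds the argument list and prints it. The function name build_joint_mutect2_command and the output name joint_Mutect2.vcf are placeholders invented for this example; the reference file and the BAM names are taken from your commands, and the interval and germline-resource options are left out for brevity.

```python
import shlex


def build_joint_mutect2_command(reference, normal_bam, normal_name,
                                tumor_bams, output_vcf, f1r2_tar_gz):
    """Build a Mutect2 argument list with one normal and several tumors.

    Hypothetical helper: it only assembles the arguments, it does not run GATK.
    """
    cmd = ["gatk", "Mutect2", "-R", reference, "-I", normal_bam]
    for bam in tumor_bams:
        cmd += ["-I", bam]  # one -I per tumor BAM from the same individual
    cmd += ["-normal", normal_name,
            "-O", output_vcf,
            "--f1r2-tar-gz", f1r2_tar_gz]
    return cmd


command = build_joint_mutect2_command(
    reference="Homo_sapiens_assembly38.fasta",
    normal_bam="G4_C.dedup.recal.bam",
    normal_name="G4_C",
    tumor_bams=["G4_P1.dedup.recal.bam", "G4_L1.dedup.recal.bam"],
    output_vcf="joint_Mutect2.vcf",  # placeholder output name
    f1r2_tar_gz="joint-f1r2.tar.gz",
)

# Print a shell-quoted version that can be inspected before running it.
print(shlex.join(command))
```

Only the normal sample is named with -normal; every other sample in the input BAMs is treated as a tumor.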
-
Thank you David. I will try this out.