Huge difference in number of mutations when switching GATK versions
GATK versions used: 4.6.1.0 and 4.0.10.0
I recently upgraded GATK versions after a break from doing variant calling. I followed this guide. As a test I ran some samples through both versions to see what the difference would be. To my surprise, across 10 different samples the new version called 4 to 14 times as many mutations as the earlier version. I am only counting mutations that have passed the filter. Is this to be expected? I see there is now an --f-score-beta option for FilterMutectCalls, which I did not set. I'm not sure what a sensible value for this would be, or how I could go about finding one. Any help that could be offered would be very much appreciated. Thanks!
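For reference, a minimal sketch of how the passing calls could be counted, assuming the filtered VCF is plain text or gzipped. The helper name count_filters is hypothetical and not part of GATK. It tallies every value in the FILTER column, so the PASS count can be read next to the counts for each filter that rejected calls.

```shell
# Tally records per FILTER value in a VCF (plain text or gzipped).
# count_filters is a hypothetical helper, not a GATK tool.
count_filters() {
    vcf="$1"
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;   # decompress gzipped or bgzipped input
        *)    cat "$vcf" ;;
    esac |
    # Skip header lines, count each distinct value of column 7 (FILTER).
    awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print n[f] "\t" f }' |
    # Largest count first, ties broken by filter name.
    sort -k1,1nr -k2,2
}

# Example call on the filtered output of one run:
# count_filters GATK_runs/S077_2_2EX_OCT/Filter/S077_2_2EX_OCT.vcf
```

Running this on the output of both GATK versions shows whether the extra calls come from more candidates overall or from fewer candidates being rejected by a particular filter.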
-
There have been quite a few changes under the hood of the Mutect2 engine between version 4.0 and now, and Mutect2 has become more sensitive and more intelligent along the way. Using Mutect2's filters will alleviate the problem by removing many of those false positives, so our suggestions would be to:
1- Use a matched normal for tumor calling
2- Use a valid PoN (specifically ours) to remove sequencing artifacts
3- Use contamination filters to avoid potential cross-sample contamination
If you wish to know more about Mutect2's filtering strategy, I recommend reading the following documentation from us.
https://github.com/broadinstitute/gatk/blob/master/docs/mutect/mutect.pdf
I hope this helps.
Regards.
-
Thanks for your reply. This is with a normal control and with FilterMutectCalls, as described in the guide I linked. If it would be helpful I can post the exact commands used for one run. I didn't use a PoN. I thought the PoN samples had to be sequenced the same way as our data, i.e. that we basically had to make our own, but I'll retry it with one.
Edit: Actually I'll just throw in the way I'm doing it with the new version regardless. I'll have the old version up soon.
gatk --java-options -Xmx4000m GetPileupSummaries \
-I GATK_runs/S77_2_1EX_BUC/ApplyBQSR/S77_2_1EX_BUC_recal.bam \
--interval-set-rule INTERSECTION \
-L /home/dylan/ref_data/hg38/remapped_S07604624_Padded_primaryOnly.bed \
-V ~/ref_data/hg38//small_exac_common_3.hg38.vcf.gz \
-L ~/ref_data/hg38//small_exac_common_3.hg38.vcf.gz \
-O GATK_runs/S077_2_2EX_OCT/CalculateContamination/normal_pileups.table &> GATK_runs/S077_2_2EX_OCT/CalculateContamination/out.log
NORMAL_CMD="-matched GATK_runs/S077_2_2EX_OCT/CalculateContamination/normal_pileups.table"
gatk --java-options -Xmx4000m GetPileupSummaries \
-R ~/ref_data/hg38//Homo_sapiens_assembly38.fasta \
-I GATK_runs/S077_2_2EX_OCT/ApplyBQSR/S077_2_2EX_OCT_recal.bam \
--interval-set-rule INTERSECTION \
-L /home/dylan/ref_data/hg38/remapped_S07604624_Padded_primaryOnly.bed \
-V ~/ref_data/hg38//small_exac_common_3.hg38.vcf.gz \
-L ~/ref_data/hg38//small_exac_common_3.hg38.vcf.gz \
-O GATK_runs/S077_2_2EX_OCT/CalculateContamination/pileups.table &>> GATK_runs/S077_2_2EX_OCT/CalculateContamination/out.log
gatk CalculateContamination \
-I GATK_runs/S077_2_2EX_OCT/CalculateContamination/pileups.table \
-O GATK_runs/S077_2_2EX_OCT/CalculateContamination/con_tab.table \
--tumor-segmentation GATK_runs/S077_2_2EX_OCT/CalculateContamination/seg_tab.table \
-matched GATK_runs/S077_2_2EX_OCT/CalculateContamination/normal_pileups.table &>> GATK_runs/S077_2_2EX_OCT/CalculateContamination/out.log
gatk --java-options -Xmx4000m GetSampleName \
-R ~/ref_data/hg38//Homo_sapiens_assembly38.fasta \
-I GATK_runs/S077_2_2EX_OCT/ApplyBQSR/S077_2_2EX_OCT_recal.bam \
-O GATK_runs/S077_2_2EX_OCT/M2/tumor_name.txt -encode
tumor_command_line="-I GATK_runs/S077_2_2EX_OCT/ApplyBQSR/S077_2_2EX_OCT_recal.bam -tumor `cat GATK_runs/S077_2_2EX_OCT/M2/tumor_name.txt`"
gatk --java-options -Xmx4000m GetSampleName \
-R ~/ref_data/hg38//Homo_sapiens_assembly38.fasta \
-I GATK_runs/S77_2_1EX_BUC/ApplyBQSR/S77_2_1EX_BUC_recal.bam \
-O GATK_runs/S077_2_2EX_OCT/M2/normal_name.txt -encode
normal_command_line="-I GATK_runs/S77_2_1EX_BUC/ApplyBQSR/S77_2_1EX_BUC_recal.bam -normal `cat GATK_runs/S077_2_2EX_OCT/M2/normal_name.txt`"
gatk --java-options -Xmx4000m Mutect2 \
-R ~/ref_data/hg38//Homo_sapiens_assembly38.fasta \
${tumor_command_line} \
${normal_command_line} \
--germline-resource ~/ref_data/hg38//af-only-gnomad.hg38.vcf.gz \
-L /home/dylan/ref_data/hg38/remapped_S07604624_Padded_primaryOnly.bed \
-O "GATK_runs/S077_2_2EX_OCT/M2/out.vcf" \
--bam-output GATK_runs/S077_2_2EX_OCT/M2/out.bam \
--f1r2-tar-gz GATK_runs/S077_2_2EX_OCT/M2/f1r2.tar.gz &> GATK_runs/S077_2_2EX_OCT/M2/out.log
gatk --java-options -Xmx4000m LearnReadOrientationModel \
-I "GATK_runs/S077_2_2EX_OCT/M2/f1r2.tar.gz" \
-O "GATK_runs/S077_2_2EX_OCT/LearnReadOrientationModel/art_tab.tsv.tar.gz" &> GATK_runs/S077_2_2EX_OCT/LearnReadOrientationModel/log.log
gatk --java-options -Xmx4000m FilterMutectCalls \
-V GATK_runs/S077_2_2EX_OCT/M2/out.vcf \
-O GATK_runs/S077_2_2EX_OCT/Filter/S077_2_2EX_OCT.vcf \
-R ~/ref_data/hg38//Homo_sapiens_assembly38.fasta \
--contamination-table GATK_runs/S077_2_2EX_OCT/CalculateContamination/con_tab.table \
--ob-priors GATK_runs/S077_2_2EX_OCT/LearnReadOrientationModel/art_tab.tsv.tar.gz \
--tumor-segmentation GATK_runs/S077_2_2EX_OCT/CalculateContamination/seg_tab.table &> GATK_runs/S077_2_2EX_OCT/Filter/out.log
-
Hi again.
Your workflow looks totally fine. As I said, there have been quite a few changes under the hood that made Mutect2 more sensitive, so observing more variant calls is fine.
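To see how the two versions' call sets relate, the PASS sites can be compared directly. The following is a minimal sketch, assuming both filtered VCFs are uncompressed and that a call is identified by CHROM, POS, REF and ALT. The helper names pass_sites and compare_pass, and the file names in the example call, are hypothetical.

```shell
# pass_sites VCF: print CHROM:POS:REF:ALT for every PASS record, sorted.
# Hypothetical helper; expects an uncompressed VCF.
pass_sites() {
    awk -F '\t' '!/^#/ && $7 == "PASS" { print $1 ":" $2 ":" $4 ":" $5 }' "$1" |
    LC_ALL=C sort -u
}

# compare_pass OLD.vcf NEW.vcf: count shared and version-specific PASS calls.
compare_pass() {
    old=$(mktemp)
    new=$(mktemp)
    pass_sites "$1" > "$old"
    pass_sites "$2" > "$new"
    # comm needs both inputs sorted in the same collation, hence LC_ALL=C.
    printf 'shared\t%s\n'   "$(LC_ALL=C comm -12 "$old" "$new" | wc -l | tr -d ' ')"
    printf 'old_only\t%s\n' "$(LC_ALL=C comm -23 "$old" "$new" | wc -l | tr -d ' ')"
    printf 'new_only\t%s\n' "$(LC_ALL=C comm -13 "$old" "$new" | wc -l | tr -d ' ')"
    rm -f "$old" "$new"
}

# Example call (file names are placeholders):
# compare_pass old_version_filtered.vcf new_version_filtered.vcf
```

If most calls from the old version are also present in the new one, the increase is largely additional calls rather than a different call set.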