Called Somatic Mutations High Frequency of C>A SNVs
AnsweredHi Gatk Team,
c. Why do I see? (gatk/4.1.9.0)
Why do I see an overabundance of C>A SNVs called in my output from mutect2 (MAF file is a combination of 10 normal>tumor). It is not immediately apparent to me why this might be the case. Is there a filtering step I am not implementing correctly? I attempted to port the .wdl workflows for RNA snv/indel calling and Mutect2 best practice on a local cluster. I have 43mill/reads average per sample and they are RNAseq on NovaSeqS2/SP.
It seems I am calling many SNVs and the culprit is overabundance of C>A SNVs (top right hand graph. The output graph is graphed in MAFtools in R and was generated when I merged MAF outputs from vcf2maf package which were generated by Mutect2 outputs.
Note: in both of these pictures, the samples were not annotated in the VCF to MAF conversion (quick and dirty) and the output is just the 'sum' of them.
Below is output from Mutect2 (Downstream variant calling 15% known variants, 85% novel variants)
Below is the output from VCF2MAF for outputs of HaplotypeCaller. (Downstream variant calling 85% known variants, 15% novel variants)
To make my analysis-ready reads:
echo "STEP1_Beginning.................."
# STEP1 begin call to make unmapped
input_bam=${AddOrReplace_output_bam}
base_name=${sampleName}.reverted
sort_order="queryname"
picard RevertSam \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam \
--VALIDATION_STRINGENCY SILENT \
--ATTRIBUTE_TO_CLEAR FT \
--ATTRIBUTE_TO_CLEAR CO \
--SORT_ORDER ${sort_order}
revertSAM_output_bam=${base_name}.bam
echo "STEP2_Beginning................."
# STEP2 MergeBAM
unaligned_bam=${revertSAM_output_bam}
star_bam=${bam}
base_name=${sampleName}.merged
picard MergeBamAlignment \
--REFERENCE_SEQUENCE ${ref_fasta} \
--UNMAPPED_BAM ${unaligned_bam} \
--ALIGNED_BAM ${star_bam} \
--OUTPUT ${base_name}.bam \
--INCLUDE_SECONDARY_ALIGNMENTS false \
--VALIDATION_STRINGENCY SILENT
MergeBamAlignment_output_bam=${base_name}.bam
echo "STEP3_Beginning.................."
# S3 - call to MarkDuplicates (picard)
input_bam=${MergeBamAlignment_output_bam}
base_name=${sampleName}.dedupped
picard MarkDuplicates \
--INPUT ${input_bam} \
--OUTPUT ${base_name}.bam \
--CREATE_INDEX true \
--VALIDATION_STRINGENCY SILENT \
--METRICS_FILE ${base_name}.metrics
MarkDuplicates_output_bam=${base_name}.bam
MarkDuplicates_output_bam_index=${base_name}.bai
MarkDuplicates_metrics_file=${base_name}.metrics
echo "***************** STEP4_Beginning **********************"
# S4 - call to SplitNCigarReads (gatk) to recreate BAM and .bai.
input_bam=${MarkDuplicates_output_bam}
base_name=${sampleName}.split
gatk SplitNCigarReads \
-R ${ref_fasta} \
-I ${input_bam} \
-O ${base_name}.bam
SplitNCigarReads_output_bam=${base_name}.bam
SplitNCigarReads_output_bam_index=${base_name}.bam
echo "***************** STEP5_Beginning **********************"
# S5 - call to BaseRecalibrator (gatk)
input_bam=${SplitNCigarReads_output_bam}
input_bam_index=${SplitNCigarReads_output_bam_index}
recal_output_file=${sampleName}.recal_data.csv
dbSNP_vcf=${dbSNPvcf}
gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal \
-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-Xloggc:gc_log.log -Xms4000m" \
BaseRecalibrator \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${recal_output_file} \
-known-sites ${dbSNP_vcf} #\
# -known-sites ${sep=" --known-sites " known_indels_sites_VCFs}
BaseRecalibrator_recalibration_report=${recal_output_file}
echo "***************** STEP6_Beginning **********************"
# S6 - call to ApplyBQSR
input_bam=${SplitNCigarReads_output_bam}
input_bam_index=${SplitNCigarReads_output_bam_index}
base_name=${sampleName}.aligned.duplicates_marked.recalibrated
recalibration_report=${BaseRecalibrator_recalibration_report}
gatk --java-options "-XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
-XX:+PrintGCDetails -Xloggc:gc_log.log \
-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xms3000m" \
ApplyBQSR \
--add-output-sam-program-record \
-R ${ref_fasta} \
-I ${input_bam} \
--use-original-qualities \
-O ${base_name}.bam \
--bqsr-recal-file ${recalibration_report}
ApplyBQSR_output_bam=${base_name}.bam
ApplyBQSR_output_bam_index=${base_name}.bai
-
Hi Jeremy Huang,
Thanks for reaching out. You mentioned that you attempted to port the WDL workflows for RNA SNV/indel calling, would you be able to confirm you have done so successfully?
The results you obtained could be explained by a missed filtering step, I would recommend double-checking this and seeing if you receive the same results.
Kind regards,
Emil
-
Thank you for your response: This is my call to our slurm. The overabundance of the C>A or G>T i detect could be an ob-prior problem. It is supposed to auto-detect the C>A, G>T oxog artifacts but it looks like it is not doing that for my samples. Please assist.
sbatch -p short -t 0-2:00 -c 1 --job-name Mutect2 --mem 20G --wrap="
gatk --java-options '-Xmx18g -Xms18g' \
LearnReadOrientationModel \
-I ${mut[2]}_Eu_f1r2.tar.gz \
-O ${mut[2]}_Eu_read-orientation-model.tar.gzgatk --java-options '-Xmx18g -Xms18g' \
GetPileupSummaries \
-I ../${mut[1]} \
-V ${dbSNPvcf} \
-L ${dbSNPvcf} \
-O ${mut[2]}_Eu_tumor_pileup.tablegatk --java-options '-Xmx18g -Xms18g' \
GetPileupSummaries \
-I ../${mut[0]} \
-V ${dbSNPvcf} \
-L ${dbSNPvcf} \
-O ${mut[2]}_Eu_normal_pileup.tablegatk --java-options '-Xmx18g -Xms18g' \
CalculateContamination \
-I ${mut[2]}_Eu_tumor_pileup.table \
-matched ${mut[2]}_Eu_normal_pileup.table \
-O ${mut[2]}_Eu_calculatecontamination.tablegatk --java-options '-Xmx18g -Xms18g' \
FilterMutectCalls \
-R ${ref_fasta} \
-V ${mut[2]}_Eu_somatic_unfiltered.vcf.gz \
--contamination-table ${mut[2]}_Eu_calculatecontamination.table \
--ob-priors ${mut[2]}_Eu_read-orientation-model.tar.gz \
--filtering-stats ${mut[2]}_Eu_filtering.stats \
-O ./filtered_vcfs/${mut[2]}_Eu_somatic_filtered.vcf.gz
"Is there anything noticeably missing?
-
Since these were run on Novaseq, it could be Oxog oxidative artifact but shouldn't it be filtered with the ob-priors argument? G/T artifact
Output of filtering.stats.
#<METADATA>Ln prior of deletion of length 10=-20.72326583694641
#<METADATA>Ln prior of deletion of length 9=-20.72326583694641
#<METADATA>Ln prior of deletion of length 8=-20.72326583694641
#<METADATA>Ln prior of deletion of length 7=-20.72326583694641
#<METADATA>Ln prior of deletion of length 6=-20.72326583694641
#<METADATA>Ln prior of deletion of length 5=-20.72326583694641
#<METADATA>Ln prior of deletion of length 4=-20.72326583694641
#<METADATA>Ln prior of deletion of length 3=-14.696775185028239
#<METADATA>Ln prior of deletion of length 2=-13.072599317975056
#<METADATA>Ln prior of deletion of length 1=-13.02606932918164
#<METADATA>Ln prior of SNV=-9.361020197743665
#<METADATA>Ln prior of insertion of length 1=-15.652227233404307
#<METADATA>Ln prior of insertion of length 2=-20.72326583694641
#<METADATA>Ln prior of insertion of length 3=-20.72326583694641
#<METADATA>Ln prior of insertion of length 4=-20.72326583694641
#<METADATA>Ln prior of insertion of length 5=-20.72326583694641
#<METADATA>Ln prior of insertion of length 6=-20.72326583694641
#<METADATA>Ln prior of insertion of length 7=-20.72326583694641
#<METADATA>Ln prior of insertion of length 8=-20.72326583694641
#<METADATA>Ln prior of insertion of length 9=-20.72326583694641
#<METADATA>Ln prior of insertion of length 10=-20.72326583694641
#<METADATA>Background beta-binomial cluster=weight = 0.3206, alpha = 5.63, beta = 10.33
#<METADATA>High-AF beta-binomial cluster=weight = 0.0018, alpha = 9.97, beta = 1.26
#<METADATA>Binomial cluster=weight = 0.6775, mean = 0.278
#<METADATA>threshold=0.655
#<METADATA>fdr=0.275
#<METADATA>sensitivity=0.659
filter FP FDR FN FNR
weak_evidence 474.42 0.1 1505.4 0.29
strand_bias 16.51 0.0 0.2 0.0
contamination 195.42 0.04 25.84 0.0
normal_artifact 33.39 0.01 51.74 0.01
orientation 561.58 0.12 45.92 0.01
slippage 0.0 0.0 0.0 0.0
haplotype 276.01 0.06 29.58 0.01
germline 2.1 0.0 0.56 0.0 -
Hi Jeremy Huang,
Thanks for providing us with that information!
I talked to a member of our engineering team, and it looks like you are running the orientation bias filter correctly.
Would you be able to send the Mutect2 command and not just the pipeline starting from LearnReadOrientationModel?
The filtering stats seem to be saying that FilterMutectCalls is aware of a lot of false positives from orientation bias artifacts but is willing to let a lot of questionable calls pass because the sensitivity would be too low otherwise. This can be overridden in FilterMutectCalls using-threshold-strategy FALSE_DISCOVERY_RATE -false-discovery-rate _____
where____
is some low maximum acceptable false discovery rate.Our engineers were also curious why FilterMutectCalls is so pessimistic about the sensitivity. They mentioned that this could happen if sequencing depth or tumor variant allele fractions are low.Kind regards,Emil
-
Dear Emil,
Thank you for the reply. I will definitely try the FilterMutectCalls using the FDR approach. I was wondering also if I could tune the --f-score-beta to being 0.5 or something like that to further weed out false positives? Below is my mutect2 call script. Pons were taken from broad for debugging purposes. I have previously run the pipeline with pons created from the 'normal' samples (17 in total) and the proportion of C>A was approximately the same.
#!/bin/bash
# This script iterates through and runs a script which calls mutect2 on prepared inputs.module load java/jdk-11.0.11
module load gcc/6.2.0
module load gatk/4.1.9.0# RUN DEFINED with Broad references
ref_fasta=${project_folder}/references/broad_vcf/Homo_sapiens_assembly38.fasta
dbSNPvcf=${project_folder}/references/broad_vcf/somatic-hg38_af-only-gnomad.hg38.vcf.gz
Eu_pon=${project_folder}/references/broad_vcf/1000g_pon.hg38.vcf.gz
interval_list=${project_folder}/references/broad_vcf/Homo_sapiens_assembly38.interval_listbuild_folder=${project_folder}/pipeline/STEP3_Mutect2_genPons
mkdir ${build_folder}/bams
cd ${build_folder}echo "######################## STEP1_PON_Eu ######################"
input="${project_folder}/references/pons_ref/Somatic_Eu_Dyads.txt"
while IFS= read -r line
do
IFS=',' read -ra mut <<< "$line"normal=${mut[0]}
tumor=${mut[1]}
normal_id=${mut[2]}
tumor_id=${mut[3]}echo "${normal}"
echo "${tumor}"sbatch -p medium -t 0-20:00 -c 1 --job-name Mutect2 --mem 15G --wrap="
gatk --java-options '-Xmx13g -Xms13g' \
Mutect2 -R ${ref_fasta} \
-L ${interval_list} \
-I ${bams_folder}/${tumor} \
-I ${bams_folder}/${normal} \
-normal ${normal_id} \
-tumor ${tumor_id} \
--germline-resource ${dbSNPvcf} \
-pon ${Eu_pon} \
--f1r2-tar-gz ${build_folder}/${normal_id}_Eu_f1r2.tar.gz \
-O ${build_folder}/${normal_id}_Eu_somatic_unfiltered.vcf.gz \
-bamout ${build_folder}/bams/${tumor_id}_tumor_normal_${normal_id}_Eu.bam
"
sleep 10
done < "$input" -
Have you had any luck using the FDR approach on FilterMutectCalls?
Using f-score-beta to weed out false positives sounds reasonable for this use case.
You mentioned you were previously using a pons that you created: we recommend that you always use the pons provided by the Broad, even when not debugging.
We are still curious to know why FilterMutectCalls is so pessimistic about the sensitivity - would you happen to know why this is?
Kind regards,
Emil
Please sign in to leave a comment.
6 comments