Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Called Somatic Mutations High Frequency of C>A SNVs

Answered
0

6 comments

  • Avatar
    Emil Furat

    Hi Jeremy Huang,

     

    Thanks for reaching out. You mentioned that you attempted to port the WDL workflows for RNA SNV/indel calling, would you be able to confirm you have done so successfully? 

    The results you obtained could be explained by a missed filtering step, I would recommend double-checking this and seeing if you receive the same results. 

     

    Kind regards,

    Emil

    0
    Comment actions Permalink
  • Avatar
    Jeremy Huang

    Thank you for your response: This is my call to our slurm. The overabundance of the C>A or G>T i detect could be an ob-prior problem. It is supposed to auto-detect the C>A, G>T oxog artifacts but it looks like it is not doing that for my samples. Please assist.

     

    sbatch -p short -t 0-2:00 -c 1 --job-name Mutect2 --mem 20G --wrap="
    gatk --java-options '-Xmx18g -Xms18g' \
    LearnReadOrientationModel \
    -I ${mut[2]}_Eu_f1r2.tar.gz \
    -O ${mut[2]}_Eu_read-orientation-model.tar.gz

    gatk --java-options '-Xmx18g -Xms18g' \
    GetPileupSummaries \
    -I ../${mut[1]} \
    -V ${dbSNPvcf} \
    -L ${dbSNPvcf} \
    -O ${mut[2]}_Eu_tumor_pileup.table

    gatk --java-options '-Xmx18g -Xms18g' \
    GetPileupSummaries \
    -I ../${mut[0]} \
    -V ${dbSNPvcf} \
    -L ${dbSNPvcf} \
    -O ${mut[2]}_Eu_normal_pileup.table

    gatk --java-options '-Xmx18g -Xms18g' \
    CalculateContamination \
    -I ${mut[2]}_Eu_tumor_pileup.table \
    -matched ${mut[2]}_Eu_normal_pileup.table \
    -O ${mut[2]}_Eu_calculatecontamination.table

    gatk --java-options '-Xmx18g -Xms18g' \
    FilterMutectCalls \
    -R ${ref_fasta} \
    -V ${mut[2]}_Eu_somatic_unfiltered.vcf.gz \
    --contamination-table ${mut[2]}_Eu_calculatecontamination.table \
    --ob-priors ${mut[2]}_Eu_read-orientation-model.tar.gz \
    --filtering-stats ${mut[2]}_Eu_filtering.stats \
    -O ./filtered_vcfs/${mut[2]}_Eu_somatic_filtered.vcf.gz
    "

     

    Is there anything noticeably missing? 

    0
    Comment actions Permalink
  • Avatar
    Jeremy Huang

    Since these were run on Novaseq, it could be Oxog oxidative artifact but shouldn't it be filtered with the ob-priors argument? G/T artifact

    Output of filtering.stats.

     

    #<METADATA>Ln prior of deletion of length 10=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 9=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 8=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 7=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 6=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 5=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 4=-20.72326583694641
    #<METADATA>Ln prior of deletion of length 3=-14.696775185028239
    #<METADATA>Ln prior of deletion of length 2=-13.072599317975056
    #<METADATA>Ln prior of deletion of length 1=-13.02606932918164
    #<METADATA>Ln prior of SNV=-9.361020197743665
    #<METADATA>Ln prior of insertion of length 1=-15.652227233404307
    #<METADATA>Ln prior of insertion of length 2=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 3=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 4=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 5=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 6=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 7=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 8=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 9=-20.72326583694641
    #<METADATA>Ln prior of insertion of length 10=-20.72326583694641
    #<METADATA>Background beta-binomial cluster=weight = 0.3206, alpha = 5.63, beta = 10.33
    #<METADATA>High-AF beta-binomial cluster=weight = 0.0018, alpha = 9.97, beta = 1.26
    #<METADATA>Binomial cluster=weight = 0.6775, mean = 0.278
    #<METADATA>threshold=0.655
    #<METADATA>fdr=0.275
    #<METADATA>sensitivity=0.659
    filter FP FDR FN FNR
    weak_evidence 474.42 0.1 1505.4 0.29
    strand_bias 16.51 0.0 0.2 0.0
    contamination 195.42 0.04 25.84 0.0
    normal_artifact 33.39 0.01 51.74 0.01
    orientation 561.58 0.12 45.92 0.01
    slippage 0.0 0.0 0.0 0.0
    haplotype 276.01 0.06 29.58 0.01
    germline 2.1 0.0 0.56 0.0

    0
    Comment actions Permalink
  • Avatar
    Emil Furat

    Hi Jeremy Huang,

     

    Thanks for providing us with that information!

    I talked to a member of our engineering team, and it looks like you are running the orientation bias filter correctly.

    Would you be able to send the Mutect2 command and not just the pipeline starting from LearnReadOrientationModel?

     

    The filtering stats seem to be saying that FilterMutectCalls is aware of a lot of false positives from orientation bias artifacts but is willing to let a lot of questionable calls pass because the sensitivity would be too low otherwise. This can be overridden in FilterMutectCalls using -threshold-strategy FALSE_DISCOVERY_RATE -false-discovery-rate _____ where ____ is some low maximum acceptable false discovery rate.
     
    Our engineers were also curious why FilterMutectCalls is so pessimistic about the sensitivity.  They mentioned that this could happen if sequencing depth or tumor variant allele fractions are low.
     
     
    Kind regards,
    Emil
     


     

    0
    Comment actions Permalink
  • Avatar
    Jeremy Huang

    Dear Emil,

    Thank you for the reply. I will definitely try the FilterMutectCalls using the FDR approach. I was wondering also if I could tune the --f-score-beta to being 0.5 or something like that to further weed out false positives? Below is my mutect2 call script. Pons were taken from broad for debugging purposes. I have previously run the pipeline with pons created from the 'normal' samples (17 in total) and the proportion of C>A was approximately the same. 

    #!/bin/bash
    # This script iterates through and runs a script which calls mutect2 on prepared inputs.

    module load java/jdk-11.0.11
    module load gcc/6.2.0
    module load gatk/4.1.9.0

    # RUN DEFINED with Broad references
    ref_fasta=${project_folder}/references/broad_vcf/Homo_sapiens_assembly38.fasta
    dbSNPvcf=${project_folder}/references/broad_vcf/somatic-hg38_af-only-gnomad.hg38.vcf.gz
    Eu_pon=${project_folder}/references/broad_vcf/1000g_pon.hg38.vcf.gz
    interval_list=${project_folder}/references/broad_vcf/Homo_sapiens_assembly38.interval_list

    build_folder=${project_folder}/pipeline/STEP3_Mutect2_genPons
    mkdir ${build_folder}/bams
    cd ${build_folder}

    echo "######################## STEP1_PON_Eu ######################"

    input="${project_folder}/references/pons_ref/Somatic_Eu_Dyads.txt"
    while IFS= read -r line
    do
    IFS=',' read -ra mut <<< "$line"

    normal=${mut[0]}
    tumor=${mut[1]}
    normal_id=${mut[2]}
    tumor_id=${mut[3]}

    echo "${normal}"
    echo "${tumor}"

    sbatch -p medium -t 0-20:00 -c 1 --job-name Mutect2 --mem 15G --wrap="
    gatk --java-options '-Xmx13g -Xms13g' \
    Mutect2 -R ${ref_fasta} \
    -L ${interval_list} \
    -I ${bams_folder}/${tumor} \
    -I ${bams_folder}/${normal} \
    -normal ${normal_id} \
    -tumor ${tumor_id} \
    --germline-resource ${dbSNPvcf} \
    -pon ${Eu_pon} \
    --f1r2-tar-gz ${build_folder}/${normal_id}_Eu_f1r2.tar.gz \
    -O ${build_folder}/${normal_id}_Eu_somatic_unfiltered.vcf.gz \
    -bamout ${build_folder}/bams/${tumor_id}_tumor_normal_${normal_id}_Eu.bam
    "
    sleep 10
    done < "$input"

    0
    Comment actions Permalink
  • Avatar
    Emil Furat

    Hi Jeremy Huang,

     

    Have you had any luck using the FDR approach on FilterMutectCalls?

    Using f-score-beta to weed out false positives sounds reasonable for this use case.

    You mentioned you were previously using a pons that you created: we recommend that you always use the pons provided by the Broad, even when not debugging.

     

    We are still curious to know why FilterMutectCalls is so pessimistic about the sensitivity - would you happen to know why this is?

     

    Kind regards,

    Emil

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk