Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

too many somatic multi-hit mutations

0

11 comments

  • Avatar
    Gökalp Çelik

    Hi Eon

    In order to help us better we need more specifics about the type of data you are using and the type of experiment you are performing such as 

    - Are you using Illumina short reads data or some other platform?

    - Are you using matched-normal somatic calling?

    - Are you using the wdls provided for the best practices directly or do you have your own command lines for this purpose? Can you post your command line parameters?

    - What kind of data are you using, amplicon, whole genome sequencing, hybrid-capture, long reads, cfDNA etc...

    These kind of questions may need answers before we can further provide help on issues. 

    Regards. 

    0
    Comment actions Permalink
  • Avatar
    Eon

    Hi Gökalp:

    thanks for your help here, i'm so sorry that the information i provided was so insufficient. 

    1.my reads were generated form illumina platform

    2.yes, i'm using matched-normal somatic calling.

    3.i run the workflow on my own command line. i didn't change the default parameters 

    1) i first run the following 3 lines to generate normal and tumor bams

    sm=${pt}_n; 
    gatk --java-options "-Xmx4G -Djava.io.tmpdir=./" MarkDuplicates -I ${sm}.bam -O ${sm}_mk.bam -M ${sm}.metrics --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true
    gatk BaseRecalibrator --java-options "-Xmx4G -Djava.io.tmpdir=./" -I ${sm}_mk.bam -R /mnt/d/gatk/hg19_bwa/hg19.fa --known-sites /mnt/d/gatk/VQSR/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf --known-sites /mnt/d/gatk/VQSR/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf  --known-sites /mnt/d/gatk/VQSR/hg19/dbsnp_138.hg19.vcf -L /mnt/d/gatk/VQSR/hg19/S04380110_Padded.bed -O ${sm}_recal.table
    gatk --java-options "-Xmx4G -Djava.io.tmpdir=./" ApplyBQSR -I ${sm}_mk.bam -R /mnt/d/gatk/hg19_bwa/hg19.fa --bqsr-recal-file ${sm}_recal.table -L /mnt/d/gatk/VQSR/hg19/S04380110_Padded.bed -O ${sm}_bqsr.bam --create-output-bam-index true

    2) then i run the following codes to generate vcf

    gatk Mutect2 -R /mnt/d/gatk/hg19_bwa/hg19.fa -L /mnt/d/gatk/VQSR/hg19/S04380110_Padded.bed  -I ../align/${st}_bqsr.bam -I ../align/${sn}_bqsr.bam -normal ${sn} --germline-resource /mnt/d/gatk/af-only-hg19.vcf.gz -pon /mnt/d/fasta/pon/pon.vcf.gz --f1r2-tar-gz ${pt}_f1r2.tar.gz -O ${pt}_unfiltered.vcf 
    gatk LearnReadOrientationModel -I ${pt}_f1r2.tar.gz -O ${pt}_read-orientation-model.tar.gz
    gatk GetPileupSummaries -I ../align/${st}_bqsr.bam -V /mnt/d/gatk/small_exac_common_3_hg19.vcf.gz -L /mnt/d/gatk/VQSR/hg19/S04380110_Padded.bed -O ${pt}_getpileupsummaries.table
    gatk CalculateContamination -I ${pt}_getpileupsummaries.table -tumor-segmentation ${pt}_segments.table -O ${pt}_contamination.table
    gatk FilterMutectCalls -R /mnt/d/gatk/hg19_bwa/hg19.fa -V ${pt}_unfiltered.vcf --tumor-segmentation ${pt}_segments.table --contamination-table ${pt}_contamination.table --ob-priors ${pt}_read-orientation-model.tar.gz -O ${pt}_filtered.vcf

    3) the pon was generated by folloing codes

    gatk Mutect2 -R /mnt/d/gatk/hg19_bwa/hg19.fa -I ${sm}_n_bqsr.bam --max-mnp-distance 0 -O /mnt/d/fasta/pon/${sm}_n.vcf.gz
    gatk --java-options "-Xmx7g -Xms7g" GenomicsDBImport -R /mnt/d/gatk/hg19_bwa/hg19.fa -L /mnt/d/gatk/VQSR/hg19/S04380110_Padded.bed --genomicsdb-workspace-path /mnt/d/fasta/pon/genomicsdb3/ --sample-name-map /mnt/d/fasta/pon/pon_map_noindex.txt --tmp-dir /mnt/d/fasta/pon/tmp/ --bypass-feature-reader true --merge-input-intervals true
    gatk CreateSomaticPanelOfNormals -R /mnt/d/gatk/hg19_bwa/hg19.fa --germline-resource /mnt/d/gatk/af-only-hg19.vcf.gz -V gendb://genomicsdb3 -O pon.vcf.gz

    4.i'm using the whole exome sequencing data.

    Once again, thank you for your support, and I look forward to resolving this issue with your guidance.

    Many thanks

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Eon

    From the looks of it your workflow is pretty much set. There are a few things I may ask

    1- Does that chart you generated from your MAF file only contain PASS variants or all variants regardless of the filtering status?

    2- I have not seen any --min-allele-fraction filter applied in the last step of filtering. It is possible that you might have wanted to contain all allele fractions as a part of the output however not all of them would be meaningful depending on the depth and the way data is collected. Unless you have a very high coverage data allele fractions around 0.00..s may not be all relevant to your sample and may possibly be another sequencing artifact. 

    Can you also elaborate on these?

    0
    Comment actions Permalink
  • Avatar
    Eon

    Hi Gökalp:

    Thanks for helping me. I wanted to express my heartfelt gratitude for your  advice. Thanks to your insightful suggestions, I was able to successfully resolve my issue. It seemed that i involved all the variants regardless their filter results, and after i preserved those PASS variants, the multiple multi-hits were gone.

    I wasn't aware about the --min-allele-fraction filter in the FilterMutectCalls function, i just ran the codes as default. I'd like to re-run the codes setting this filter. is there any value you recommend to set?

    Thank you once again for your assistance, and I am incredibly grateful for your support.

    Many Thanks

     
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi Eon

    Thank you as well returning your feedback. The allele fraction you are seeking may depend on your experimentation as well as the way sample is collected. In certain practices the mixture of tumor vs normal cells were determined before the library preparation by microscopy or other visual methods therefore there may be an estimate of how much of the mass is composed of tumor cells vs normal tissue. Others may prefer using arbitrary limits of their liking based on actionability of the variants. I would say try a few  different levels such as 0.01 0.05 0.1 and so on. Depending on the tumor/cancer type you may find some significant changes in the collection of your variants determining the underlying cause of the malignancy. 

    Of course these are not set in stone so the best solution will always come from your own analysis.

    I hope this helps. 

    0
    Comment actions Permalink
  • Avatar
    Eon

    Hi Gökalp:

    Thanks for your kind suggestions. i'd like to try different threshold and observe the differences.

    Thank you once again for your assistance.

    Best regards

     
    0
    Comment actions Permalink
  • Avatar
    daytimemouse

    Hi Eon,

    I'm experiencing a similar issue as you, but I have already used a filter VCF file to generate the MAF file.

    I greatly appreciate your assistance.

    1. I use WES data to generate a filtered VCF file (germline pipeline), and then utilize VEP for annotation to obtain an MAF file.
    2. The MAF file contains an excessive number of missense mutations, with up to 10 missense variants occurring in a single gene.
    3. Furthermore, the plot generated by maftools displays an overwhelming amount of multi_hit annotations.
    4. All genes exhibit a 100% occurrence rate.
    5. I'm uncertain as to which step may have led to this issue.
    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi daytimemouse

    Looking at your charts I see that most of your mutations are around T>C or C>T. How was this data generated? Looks like bisulfite sequencing. Also you said you performed a germline analysis did you pick up a group of genes from the WES panel or is this whole exome germline analysis? What is the average depth of your samples? How many samples do you have?

    Can you provide more details about your analysis?

     

    0
    Comment actions Permalink
  • Avatar
    daytimemouse

    Hi Gökalp Çelik,

    This issue has been bothering me for nearly two months.

    After reviewing the annotated MAF file, I observed that each sample of the top genes exhibits more than two mutation types. As a result, maftools, an R package, annotates these mutations as "multi_hit" due to the presence of numerous mutations.

    I appreciate you can help me. I hope everything is well with you.

    I utilized WES data to generate the VCF file. Additionally, I used the parameter -L hg38.exon.bed.

    According to the formula "sequencing depth = reads length × mapped reads count / reference sequence length," the sequencing depth I calculated is 3 to 5.

    The following is the specific code I analyzed using snakemake.

    rule fastp:
        input:
            "fastq/{sample}_1.fastq.gz",
            "fastq/{sample}_2.fastq.gz"
        output:
            "2_fastp/{sample}_clean.1.fq.gz",
            "2_fastp/{sample}_clean.2.fq.gz",
            "2_fastp/{sample}_fastp.html",
            "2_fastp/{sample}.json"
        log:
            "log/{sample}_fastp.log"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/rnaseq
            fastp -i {input[0]} -I {input[1]} -o {output[0]} -O {output[1]} -j {output[3]} \
            -w 12 -z 4 -q 20 -u 30 -n 10 -h {output[2]} 2> {log}
            """
     
    rule bwa:
        input:
            "2_fastp/{sample}_clean.1.fq.gz",
            "2_fastp/{sample}_clean.2.fq.gz"
        output:
            "3_bwa/{sample}.sorted.bam"
        params:
            rg=r"@RG\tID:{sample}\tSM:{sample}\tPL:ILLUMINA"
        log:
            "log/{sample}_bwa_align.log"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/rnaseq
            bwa mem -t 5 -M -R '{params.rg}' {fasta} {input[0]} {input[1]} | samtools sort -@ 8 -m 20G -O bam -o {output} - 2> {log}
            """
     
    rule MarkDuplicates:
        input:
           "3_bwa/{sample}.sorted.bam"
        output:
           "4_markdup/{sample}.sorted.markdup.bam",
           "4_markdup/{sample}.markdup_metrics.txt",
           "4_markdup/{sample}.sorted.markdup.bam.bai"
        log:
            "log/{sample}_markdup.log"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/gatk4
            gatk MarkDuplicates -I {input} \
            -M {output[1]}  \
            -O {output[0]} 2> {log}
            samtools index {output[0]}
            """
     
    rule BaseRecalibrator:
        input:
            "4_markdup/{sample}.sorted.markdup.bam",
            "4_markdup/{sample}.sorted.markdup.bam.bai"
        output:
            "5_bqsr/{sample}.recal_data.table"
        log:
            "log/{sample}_bqsr.log"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/gatk4
            gatk --java-options '-Xmx30g' BaseRecalibrator \
            -I {input[0]} -R {fasta} \
            --known-sites {hg38_vcf}/1000G_phase1.snps.high_confidence.hg38.vcf.gz\
            --known-sites {hg38_vcf}/dbsnp_146.hg38.vcf.gz \
            --known-sites {hg38_vcf}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
            -L {bed} \
            -O {output} 2> {log}
            """
     
    rule ApplyBQSR:
        input:
            "4_markdup/{sample}.sorted.markdup.bam",
            "5_bqsr/{sample}.recal_data.table"
        output:
            "5_bqsr/{sample}.sorted.markdup.BQSR.bam"
        log:
            "log/{sample}_ApplyBQSR.log"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/gatk4
            gatk --java-options '-Xmx30g' ApplyBQSR \
            --bqsr-recal-file {input[1]} \
            -R {fasta} \
            -L {bed} \
            -I {input[0]} -O {output} 2> {log}
            """
    rule HaplotypeCaller:
        input:
            "5_bqsr/{sample}.sorted.markdup.BQSR.bam"
        output:
            "6_gvcf/{sample}.g.vcf.gz"
        log:
            "log/{sample}_HaplotypeCaller.log"
        shell:
            """
            gatk --java-options "-Xmx40g" HaplotypeCaller  \
            -R {fasta} \
            --native-pair-hmm-threads 10 \
            -D {hg38_vcf}/dbsnp_146.hg38.vcf.gz \
            -I {input} \
            -O {output} \
            -L {bed} \
            -ERC GVCF \
            2> {log}
            """
     
    rule GenotypeGVCFs:
        input:
            "6_gvcf/{sample}.g.vcf.gz"
        output:
            "7_genotype/{sample}.vcf.gz"
        log:
            "log/{sample}_GenotypeGVCFs.log"
        shell:
            """
            gatk --java-options "-Xmx40g" GenotypeGVCFs \
            -R {fasta} \
            -V {input} \
            -O {output} \
            2> {log}
            """
     
    rule VQSR_snp:
        input:
            "7_genotype/{sample}.vcf.gz"
        output:
            "8_vqsr/{sample}.snp.recal",
            "8_vqsr/{sample}.snp.tranches",
            "8_vqsr/{sample}.snp.plots.R",
            "8_vqsr/{sample}.snp.VQSR.vcf.gz"
        log:
            "log/{sample}_VQSR_snp.log",
            "log/{sample}_ApplyVQSR_snp.log"
        shell:
            """
            gatk VariantRecalibrator \
            -R {fasta} \
            -V {input} \
            --resource:hapmap,known=false,training=true,truth=true,prior=15.0 {hg38_vcf}/hapmap_3.3.hg38.vcf.gz \
            --resource:omni,known=false,training=true,truth=false,prior=12.0 {hg38_vcf}/1000G_omni2.5.hg38.vcf.gz \
            --resource:1000G,known=false,training=true,truth=false,prior=10.0 {hg38_vcf}/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
            --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {hg38_vcf}/dbsnp_146.hg38.vcf.gz \
            -an QD -an DP -an FS -an SOR -an MQ -an ReadPosRankSum -an MQRankSum \
            -mode SNP \
            -O {output[0]} \
            --tranches-file {output[1]} \
            --rscript-file {output[2]} 2> {log[0]}
            gatk --java-options '-Xmx30g' ApplyVQSR \
            -V {input[0]} \
            -R {fasta} \
            --truth-sensitivity-filter-level 99.5 \
            --tranches-file {output[1]} \
            --recal-file {output[0]} \
            -O {output[3]}\
            --mode SNP 2> {log[1]}
            """
     
    rule VQSR_indel:
        input:
            "7_genotype/{sample}.vcf.gz"
        output:
             "8_vqsr/{sample}.indel.recal",
             "8_vqsr/{sample}.indel.tranches",
             "8_vqsr/{sample}.indel.plots.R",
             "8_vqsr/{sample}.indel.VQSR.vcf.gz"
        log:
            "log/{sample}_VQSR_indel.log",
            "log/{sample}_ApplyVQSR_indel.log"
        shell:
            """
            gatk VariantRecalibrator \
            -R {fasta} -V {input} \
            --max-gaussians 4 \
            --resource:mills,known=false,training=true,truth=true,prior=12.0 {hg38_vcf}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
            --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {hg38_vcf}/dbsnp_146.hg38.vcf.gz \
            -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
            --mode INDEL \
            --output {output[0]} \
            --tranches-file {output[1]} \
            --rscript-file {output[2]} 2> {log[0]}
            gatk ApplyVQSR \
            -R {fasta} -V {input} -O {output[3]} \
            --truth-sensitivity-filter-level 99.0 \
            --tranches-file {output[1]} \
            --recal-file {output[0]} \
            --mode INDEL 2> {log[1]}
            """
     
    rule MergeVcfs:
        input:
             "8_vqsr/{sample}.snp.VQSR.vcf.gz",
             "8_vqsr/{sample}.indel.VQSR.vcf.gz"
        output:
             "8_vqsr/{sample}.filter.VQSR.vcf.gz"
        log:
            "log/{sample}_MergeVcfs.log"
        shell:
            """
            gatk MergeVcfs \
            -I {input[0]} \
            -I {input[1]} \
            -O {output[0]} 2> {log}
            """
     
    rule gunzip_vcf:
        input:
             "8_vqsr/{sample}.filter.VQSR.vcf.gz"
        output:
             "8_vqsr/{sample}.filter.VQSR.vcf"
        shell:
            """
            source activate /home/ljz/miniconda3/envs/gatk4
            gunzip {input}
            """
     
    rule pass_vcf:
        input:
            "8_vqsr/{sample}.filter.VQSR.vcf"
        output:
            "8_vqsr/{sample}.filter.PASS.VQSR.vcf"
        shell:
            """
            awk '{{if($7 == "PASS" || /^#/)print}}' {input} > {output}
            """
            
    rule vep:
        input:
            "8_vqsr/{sample}.filter.PASS.VQSR.vcf"
        output:
            "9_vep/{sample}.vep.maf"
        log:
            "log/{sample}_vep.log"
        shell:
            """
            vcf2maf.pl --input-vcf {input} \
            --output-maf {output} \
            --ref-fasta {fasta} \
            --vep-data /home/GATK/vep/grch38 \
            --vep-path ~/miniconda3/envs/vep/bin \
            --tumor-id {wildcards.sample} \
            --ncbi-build GRCh38 2> {log}
            """
    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    @Eon this is kind of late but one thing that jumps out at me is you are making your own panel of normals from a small number of samples.  We almost always recommend using the GATK best practices panels, available in this public google bucket: gs://gatk-best-practices.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    daytimemouse 3-5 is an extremely low average sequencing depth.  I don't think it's possible to obtain reliable variant calls in such a case.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk