Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

(How to) Call somatic mutations using GATK4 Mutect2 Follow

19 comments

  • Avatar
    Brian Haas

    Can I please confirm that this is the logic being used in the current PoN creation step:

    ====

    Any candidate PoN site must occur in at least 2 samples (as per the default for --min-sample-count).

    If the site is not in gnomad (or at negligible frequency), it's automatically treated as a candidate site for PoN inclusion.

    If a site IS in gnomad at a non-negligible frequency, then it computes the probability of it being a germline variant.  If the p(germline) < 0.5, then it's a candidate. (this 0.5 is manually tunable via cmd line parameter: --max-germline-probability

    ====

    and if this is true, can we update the code documentation at:
    https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/CreateSomaticPanelOfNormals.java

    to reflect that germline variants are not intended to be captured now during the PoN creation step?

     

    1
    Comment actions Permalink
  • Avatar
    Brian Haas

    Also, can you please confirm that the panel of normals is excluding variants based on chromosome position and not requiring allele-specific matching?   ie.   a G->A variant in the tumor sample may be filtered if there's a G->T entry at that chromosomal position in the panel of normals.  I've seen examples of this and just want to confirm that this is the expected behavior.  Thx in advance.

    0
    Comment actions Permalink
  • Avatar
    Nickier

    Hi, I just ran the GenomicsDBImport to my exome samples (17 Germline samples). it has run 3 days and create some tmp file occupying 1.5 Tb and its still on chr4! Is there any solution

     

    0
    Comment actions Permalink
  • Avatar
    Nickier

    Hi, when I run GetPileupSummaries , I get this error: 

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.BitSet.initWords(BitSet.java:166)
    at java.util.BitSet.<init>(BitSet.java:161)
    at htsjdk.samtools.GenomicIndexUtil.regionToBins(GenomicIndexUtil.java:164)
    at htsjdk.samtools.BinningIndexContent.getChunksOverlapping(BinningIndexContent.java:121)
    at htsjdk.samtools.CachingBAMFileIndex.getSpanOverlapping(CachingBAMFileIndex.java:75)
    at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
    at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
    at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
    at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:405)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.loadNextIterator(SamReaderQueryingIterator.java:125)
    at org.broadinstitute.hellbender.utils.iterators.SamReaderQueryingIterator.<init>(SamReaderQueryingIterator.java:66)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.prepareIteratorsForTraversal(ReadsDataSource.java:416)
    at org.broadinstitute.hellbender.engine.ReadsDataSource.iterator(ReadsDataSource.java:342)
    at java.lang.Iterable.spliterator(Iterable.java:101)

    How to solve this problem

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi Nickier, I had the same problem and figured out it's basically related to how contigs are passed from the BED file. If the BED (or interval_list as they call it) has a lot of different little intervals on every contig/chr you are analyzing, you'll see in the first lines of GenomicsDBImport output a warning about that. The solution is to insert --merge-input-interval in the GenomicsDBImport command

    2
    Comment actions Permalink
  • Avatar
    Mateo Kee

    Hello, I have a question regarding the `-tumor-segmentation` option in CalculateContamination:

        gatk CalculateContamination \
            -I getpileupsummaries.table \
            -tumor-segmentation segments.table \
            -O calculatecontamination.table
    

    - gatk website states it "output table containing segmentation of the tumor by minor allele fraction"

    - does this mean modeling possible mixture of 2 samples?

    - this wasn't done in the tutorial for the last version (https://gatk.broadinstitute.org/hc/en-us/articles/360035889791?id=11136), where below is the command:

    gatk CalculateContamination \
        -I 7_tumor_getpileupsummaries.table \
        -O 8_tumor_calculatecontamination.table

    - is `-tumor-segmentation` a recommended option now? how does it help in calculating contamination?

     

    another related question, what does the [ ] (square brackets) mean here? are they optional?

        gatk FilterMutectCalls -V unfiltered.vcf \
            [--tumor-segmentation segments.table] \
            [--contamination-table contamination.table] \
            --ob-priors read-orientation-model.tar.gz \
            -O filtered.vcf

    - if I run the above FilterMutectCalls, does it take care of both *cross-sample contamination* filtering and *orientation bias* filtering?

    3
    Comment actions Permalink
  • Avatar
    Nickier

    Hi, Enrico Cocchi , you are right. It makes effect when I add the argument --merge-input-interval . Thanks a lot.

    1
    Comment actions Permalink
  • Avatar
    that girl

    when will gatk support calling complex variants, for example variants in EGFR,I do not know after so many versions of updating of gatk, this requirement is still not satisfied

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Panel-of-Normals file generation, GenomicsDBImport file merge, I got the following error message several times:

    [E::bcf_get_info_values] Trying to get 32-bit int data from a field which contains 64 bit values

    Do you have any clue why?

     

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    Hi, can you please explain what's exactly the role of -tumor-segmentation segments.table in CalculateContamination and how we're supposed to generate such table?

    0
    Comment actions Permalink
  • Avatar
    Yuna Son

    Hi, I got this error when I add more than 2 vcf files for GenomicsDBImport.

    This is my code and I am not sure why it is recognized as duplicate files.

    I added my vcf files this way:

    -V HDF1_TA1_S4_normal.vcf.gz \

    -V HDF1_TA2_S5_normal.vcf.gz \

    -V HDF1_TA3_S6_normal.vcf.gz \

    -V HDF1_UT1_S1_normal.vcf.gz

    And, this is the error I got:

    A USER ERROR has occurred: Duplicate sample: 1. Sample was found in both file:///gpfs/research/medicine/sequencer/NovaSeq/Outputs_fastq/2020_Outputs/Akash_Gunjan_05-19-2020_Yuna-samples/GATK_Bamfiles/Preprocessed_Bam/PON/HDF1_TA2_S5_normal.vcf.gz and HDF1_TA1_S4_normal.vcf.gz

     

    0
    Comment actions Permalink
  • Avatar
    Vincent Appiah

    The FilterMutectCalls requires a reference argument. But this is omitted in the above commands. Is it deliberate or it can be added.

    0
    Comment actions Permalink
  • Avatar
    Ashi

    I could not understand this new pipeline of Mutect2 (4.1.1.0, or newer).

    My samples are tumor with matched normal tissue, WGS. Ref is hg38. Tissue type is FFPE.

    Could anyone tell me how to call somatic event? 

    My understanding is like this (let's say i have 3 paired samples) Looks wired:

    gatk Mutect2 -R ref.fasta \
            -L intervals.interval_list \
            -I tumor1.bam \
    -I tumor2.bam \
    -I tumor3.bam \
    -I normal1.bam \
    -I normal2.bam \
    -I normal3.bam \
    -normal normal_sample_name1
    -normal normal_sample_name2
    -normal normal_sample_name3
    -germline-resource af-only-gnomad.vcf \ -pon panel_of_normals.vcf \ --f1r2-tar-gz f1r2.tar.gz \ -O unfiltered.vcf

     

     

    0
    Comment actions Permalink
  • Avatar
    林林达

    I use this I use this command to generate pon.vcf.gz,

    "gatk CreateSomaticPanelOfNormals -R /home/dlin/reference/bwa-ref-hg19/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --germline-resource af-only-gnomad.vcf.gz -V gendb://pon_db -O pon.vcf.gz"

    but i can not find "af-only-gnomad.vcf.gz" file for hg19, can anybody tell me how to dowload this file?

    0
    Comment actions Permalink
  • Avatar
    Enrico Cocchi

    I have a problem with PoN generation Step 2: GenomicsDBImport step creates EMPTY databases from Mutect2 called VCFs: here the link to the Biostar post I created about this. Please comment if you have any idea about this!

    0
    Comment actions Permalink
  • Avatar
    杜鹏

    Hello, I use the 'Scatter Gather' mode, and run mutect2 in parallel on separate chromosomes, but there is a consistency problem between the results obtained by using the 'Scatter Gather' mode and the results obtained by running without this mode. Excuse me, how to ensure consistent use of the 'Scatter Gather' mode.
    https://github.com/broadinstitute/gatk/issues/8152

    -1
    Comment actions Permalink
  • Avatar
    Zeeshan Fazal

    Hi GATK Team,

    I am working on somatic variant calling for my WES data. We have created our own panel of normals and I followed all the above listed steps to create panels of normals , Mutect2 calling and Filtermutectcall.

    In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run with our orientation model, contamination table, 2: and second filtering with contamination table, orientation model and I called this Advanced Filtering etc.

    I have few questions: 

    1: Why mutect2 is calling somatic variants which as VAF close to 1? I generated a distribution of VAF for my samples and I saw two spikes: 1 spike between 0 to 0.25 VAF and another spike between 0.9-1VAF. Why mutect2 is calling somatic variants at 0.9-1 VAF range? Here is the example distribution graph.

    2: With Advanced filtering (Filtermutectcall with contamination table, orientation model etc), I found few samples have a spike at 0.5 VAF. Why mutetc2 calling somatic variants with 0.5VAF and all those variants are on CHR6 near HLA region. I used GRCH38 ref and it has alt coatings in it.

    I will really appreciate your response.

    Thanks

    Zeeshan

     

     

     

    1
    Comment actions Permalink
  • Avatar
    Zeeshan Fazal

    There is a typo in above comment: "In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run with our orientation model, contamination table" 

    correction: 

    In filtering stage, I did two types of filtering: 1: one with basic filtering where Filtermutectcall was run without orientation model, contamination table" 

     

    0
    Comment actions Permalink
  • Avatar
    昨日旧梦

    I have a question about how to remove germline mutations from somatic vcf files caused by Mutect2 tool, is it based on AF information? How to choose the threshold?The command I used is "gatk Mutect2 -R Homo_sapiens_assembly38.fasta -I xx.bam --germline-resource somatic_hg38_af-only-gnomad.hg38.vcf.gz --panel-of-normals somatic_hg38_1000g_pon.hg38.vcf.gz -O xx.vcf". These  somatic_hg38_af-only-gnomad.hg38.vcf.gz and  somatic_hg38_1000g_pon.hg38.vcf.gz I downloaded from the google cloud provided by gatk. Has Mutect2 considered the AF frequency to annotate when performing the inclusion of germline parameters? Will there be any overlap between this annotation for germline mutations and actual somatic mutations. Can I assume that this is a germline mutation if it is labeled germline after mutect2, and can I just filter out the loci labeled as germline mutations and consider the remaining loci as possible somatic mutations?

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk