Genome Analysis Toolkit

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

mutect2 multi-sample

Answered

33 comments

  • David Benjamin

    zhao shilin

     

    1. If I understand it correctly, the multi-sample feature is designed for multiple samples from the same patient, so the input tumor samples are called against all input normal samples. There is no "matched normal" for each tumor sample.

    That's correct.  Tumor samples are assumed to be from the same patient, all normal samples are pooled into a single matched normal (it's as if all normal samples were merged into a single read group), and each tumor is called against this pooled normal.  The effect of joint calling is to combine the local assembly of all tumors and to increase statistical power to find variants with low allele fraction.

    2. I tried running Mutect2 separately (Tumor 1 vs Normal 1, Tumor 2 vs Normal 2) and in multi-sample mode (Tumor 1 and Tumor 2 vs Normal 1 and Normal 2 in one run). The results (VCF before filtering) are different. Some variants were identified only in the multi-sample results, not in any of the separate results. May I know why?

    Greater statistical power.  Suppose a variant shows up at 5% AF in all your tumor samples.  In single-sample mode it might not be possible to distinguish it from sequencing error.

    3. If Point 1 is true, could the multi-sample feature be used for samples from different patients? It seems possible. The only issue is that there is no "matched normal" for each patient's tumor. But if we assume that any variant identified in the normal samples should be excluded as a somatic mutation, I think it is OK to use it for samples from different patients?

    This can't be done because the normal samples of all patients will be combined, leading to a loss of sensitivity.  Also, it will do bad things to runtime (and probably accuracy as well) because Mutect2 will have to assemble every tumor sample's variation so, for example, tumor 1's reads would end up being aligned to variant haplotypes only present in tumor 2.  This would scale quadratically with the number of samples.

    Also, the somatic likelihoods model and the somatic clustering model of FilterMutectCalls assume a single patient.  They will misbehave very badly if this assumption is wrong.

    4. If Point 3 is not right, what is the best way to combine Mutect2 results from different patients that were run separately? CombineVariants in GATK3 seems like a solution, but it is not available in GATK4?

    Honestly, I have never done this.

    5. Related question. The Mutect2 WDL at (https://github.com/broadinstitute/gatk/blob/master/scripts/mutect2_wdl/) is not quite correct. (1) gatk_jar is defined as /root/gatk.jar, but in the GATK docker image it is actually /gatk/gatk.jar. (2) mutect2_multi_sample.wdl is not multi-sample in the sense discussed in Point 1. It just runs multiple separate Mutect2 jobs (one for each tumor/normal pair) rather than one multi-sample Mutect2 run for all samples.

    1) I'm not so good with docker, but I do know that this WDL has been working for me.  What GATK docker are you using?

    2) That's true, which is why we don't publish that WDL on Terra.  On Terra one can run the regular Mutect2 workflow over a pair set.

  • zhao shilin

    Thanks for your very clear reply. 

    For Q4, may I ask how you would analyze a group of samples? For example, if you have a few tumor samples from different patients, you would use Mutect2 to call somatic mutations individually. But in the end you still need to combine them, for example to report hotspots or the mutation signature of the group. How would you do this kind of work? Convert the individual VCFs to MAF and then combine the MAF files? But a MAF file loses some of the information in the VCF.

    For Q5, I tested a lot of GATK docker versions. But if you are not the author of that WDL file, never mind.

    Thanks!

     

  • David Benjamin

    For Q4, I didn't just mean that I have never combined Mutect2 VCFs with CombineVariants.  I meant that I have never done any downstream analysis of somatic variants from multiple patients.  None of us on the Mutect2 team are biologists by training.  Although we are devoted to giving biologists the best variant calls possible, and learn as much biology as we can in order to do that, we do not personally perform biological analyses.  You probably know the answer to your question better than we do!

    For Q5, here is a link to our featured workspace on Terra: https://app.terra.bio/#workspaces/help-gatk/Somatic-SNVs-Indels-GATK4.  You can verify that it has both a current GATK docker and a WDL with /root/gatk.jar.  We run this constantly for evaluations, along with a lot of users.  Have you tried that docker yet?

  • zhao shilin

    Thanks! 

    For Q5, I found it is related to a permission issue in Singularity (I am using a Singularity image on an HPC cluster, which was converted from the docker image). There is a root folder in the Singularity image, but users can't read it because of permissions. So never mind.

  • Felix

    Hi,

    I am wondering whether there is a performance benefit to using:

    A) MuTect2 in multi-sample mode

    vs.

    B) Manually joining all normal and all tumor samples for a patient and then running MuTect2 on the resulting tumor-normal pair.

    I planned to do B), but after reading your comments A) seems to be the nicer approach, if the performance is comparable.

    Have a nice day,

    Felix

  • David Benjamin

    @Felix As far as the normals are concerned, running with multiple normals is equivalent in terms of both results and performance to manually joining the normal bam files, so there's no point doing so.

    For the tumors, A) is almost always superior provided that the tumor samples are independent, i.e., taken at different times in the progression of the tumor, taken from different sites, etc.  If, however, they are just multiple biopsies of the same primary tumor site at the same time, you are better off merging them into a single bam.  This would be no different from combining different read groups from multiplexed sequencing.

    The performance of A) will be worse than B), but not by too much in most cases.

  • Felix

    Thank you, David. This was very helpful!

    If I have samples where I cannot find out whether they are independent or not, would it be better to go with A)?

    Have a nice day,

    Felix

  • David Benjamin

    As long as the samples are from the same person, A) should be your default.

  • Felix

    OK. Thank you, David!

  • Dandan Zheng

    Hi David,

    I also want to use Mutect2 joint calling with multiple samples. As you said, "all normal samples are pooled into a single matched normal (it's as if all normal samples were merged into a single read group), and each tumor is called against this pooled normal." So how about the tumor samples? Are they merged into one single BAM file and called against the merged normal BAM? Or are somatic mutations called against the merged normal BAM one tumor at a time?

    For example, I have three tumor samples (Mock_1T1, Mock_1T2, Mock_1T3) and three normal samples (Mock_1N1, Mock_1N2, Mock_1N3). When I use Mutect2, will Mock_1T1, Mock_1T2, and Mock_1T3 be merged into one tumor VCF file?

    $gatk Mutect2 -R mm10.fa -L SureSelect_Mouse_All_Exon_V1_MM10liftover.bed -I Mock_1T1.sorted.rmdup.bam -I Mock_1T2.sorted.rmdup.bam -I Mock_1T3.sorted.rmdup.bam -I Mock_1N1.sorted.rmdup.bam -I Mock_1N2.sorted.rmdup.bam -I Mock_1N3.sorted.rmdup.bam -normal Mock_1N1 -normal Mock_1N2 -normal Mock_1N3 --panel-of-normals 1N_pon.vcf.gz -O 1N_1T_somatic.vcf

    Could you please tell me how to interpret the result? For each row, is the somatic mutation in only one of the tumor samples, or must it be in all three tumor samples?

    I am very confused by this question. I would appreciate it a lot if you could help me.

    Have a nice day. 

    Dana.

  • David Benjamin

    Hi Dandan Zheng,

    Here's an overview of how multi-sample mode works:

    Normal reads are pooled in-memory.  The bams are not merged, but as far as Mutect2's variant calling is concerned it is as if they were merged.  The only sign that they came from different bams is that they have distinct genotype fields in the output VCF.

    Neither tumor bams nor tumor reads are merged, but Mutect2 uses all reads at once in its local assembly.

    The output is a single VCF with one genotype field for each sample.

    One crucial point is that Mutect2 and FilterMutectCalls only make a single call for each variant.  That is, a variant reported as PASS by FilterMutectCalls means that the evidence of all samples taken together suggests that it is a real somatic variant.  The tools do not report a variant as being present in some samples and absent in others because this is often impossible.  Even if a variant allele appears with no reads in some sample, it is completely possible that it was a new mutation with very low allele fraction and was later, in another sample, amplified in a subclonal expansion.  One can come up with more examples like this.  Therefore, we don't feel confident saying anything beyond "this variant represents a somatic mutation that occurred some time in the history of this cancer."
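
    The overview above can be sketched as a dry run for a single patient. This is only an illustration under stated assumptions: the reference, BAM, and sample names are hypothetical placeholders, and the function prints the Mutect2 and FilterMutectCalls commands instead of executing them, so it can be inspected without GATK installed. Resources such as a panel of normals or a germline resource are left out for brevity.

```shell
# Dry-run sketch: print (do not execute) the two commands for ONE patient.
# Reference, BAM and sample names below are hypothetical placeholders.

# Usage: print_multisample_cmds REF PREFIX "BAM1 BAM2 ..." "NORMAL_SM1 ..."
print_multisample_cmds() (
  ref="$1"; prefix="$2"; bams="$3"; normals="$4"
  cmd="gatk Mutect2 -R ${ref}"
  # Every BAM, tumor or normal, is passed with -I.
  for bam in ${bams}; do cmd="${cmd} -I ${bam}"; done
  # Only normal sample names are flagged; all other samples are treated as tumors.
  for sm in ${normals}; do cmd="${cmd} -normal ${sm}"; done
  echo "${cmd} -O ${prefix}.unfiltered.vcf.gz"
  # Filtering then makes one decision per variant, using all samples together.
  echo "gatk FilterMutectCalls -R ${ref} -V ${prefix}.unfiltered.vcf.gz -O ${prefix}.filtered.vcf.gz"
)

print_multisample_cmds ref.fasta patient1 \
  "primary.bam relapse.bam normal.bam" "patient1_normal"
```

    The result is one unfiltered and one filtered VCF per patient, each with a genotype column per sample, in line with the single-call-per-variant behavior described above.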

  • Vincent Appiah

    Hi all, I have 14 whole exome samples (each from a different patient). At the moment I don't have any normal sample files. Based on the discussion above and some forums, I would be grateful if you could advise whether the approach below is suitable:

    1. Call the individual samples separately with Mutect2 using the 1000g_pon.hg38.vcf.gz and af-only-gnomad.hg38.vcf.gz files (available at https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38;tab=objects?prefix)

    2. Perform downstream analysis (e.g., annotation)

     

    Thank you gatk team

     

  • David Benjamin

    Vincent Appiah This is all correct.  Samples from different patients must be called separately, as you are doing.

  • TAYYABA ALVI

    Hello everybody, I am a beginner and this is my first time doing variant calling. I have been confused and searching for days. This thread is helping me understand many things, and I have the following questions.

    I have normal and tumor samples from different patients (unmatched controls for each sample), and I can see two ways of performing variant analysis. Can you help me choose one, or suggest a better way?

    A) (Using tumor-only mode) I call each sample separately, both tumor and normal, using a PoN, and then filter further to select variants present only in the tumor samples.

    B) I merge all normal samples and call each tumor sample separately against the merged normal file (I expect very low sensitivity and accuracy in this case).

    Also, can I create a PoN from my unmatched normal samples? I have only 8 normal samples.

    I am very new to this, so I apologize if my questions seem very stupid.

    Thank you

     

  • David Benjamin

    TAYYABA ALVI The best thing to do is run tumor-only mode independently for each sample using one of our public PoNs (gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz, gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf, or gs://gatk-best-practices/somatic-b37/Mutect2-WGS-panel-b37.vcf).  The unmatched normals would only be useful if you had many more of them.  For the germline-resource argument you should use gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf or gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
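
    A minimal dry-run sketch of this per-sample, tumor-only approach, assuming the hg38 resource files named above have been downloaded to the working directory. The reference and BAM names are hypothetical, and the commands are printed rather than executed.

```shell
# Dry-run sketch: print one tumor-only Mutect2 command per sample.
# Leaving out -normal is what makes each run tumor-only.
# Resource file names are the hg38 ones mentioned above, assumed to be local copies;
# the reference and BAM names are hypothetical.
PON="1000g_pon.hg38.vcf.gz"
GERMLINE="af-only-gnomad.hg38.vcf.gz"

# Usage: print_tumor_only_cmd REF BAM
print_tumor_only_cmd() (
  ref="$1"; bam="$2"
  sample="$(basename "${bam}" .bam)"
  echo "gatk Mutect2 -R ${ref} -I ${bam} --germline-resource ${GERMLINE} --panel-of-normals ${PON} -O ${sample}.unfiltered.vcf.gz"
)

# One independent run per patient sample.
for bam in patientA.bam patientB.bam; do
  print_tumor_only_cmd hg38.fasta "${bam}"
done
```

    Each printed command yields its own VCF, so samples from different patients never share a Mutect2 run.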

  • TAYYABA ALVI

    Thank You David.

    I tried it, but I am getting an error about contigs not matching the reference. I am using the GRCh38 assembly, and instead of chr1, chr2, chr3 as identifiers there are only numbers (1, 2, 3) for chromosomes. Is there any other PoN available that matches my reference genome contigs? Or is there another way I can sort this out?

    Thank you so much for your help.  

  • David Benjamin

    TAYYABA ALVI If the hg38 PoN doesn't work with an hg38 bam it's likely that the different naming conventions are the only issue, although I'm surprised because we usually only see this with hg19/b37.  If this is the problem then you can unzip the PoN and manually change the contig names.

    For example, sed 's/chr//g' pon.vcf > new_pon.vcf turns chr1 into 1, etc.  To turn 1 into chr1 you could manually edit the contig names in the header, then insert chr at the beginning of every record line.

    After this, you would need to re-index the PoN VCF via "gatk IndexFeatureFile -I new_pon.vcf".

    But before you do this it would make sense to double-check that you are using our hg38 pon with an hg38 bam.
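
    A slightly narrower variant of the sed approach, offered as a sketch rather than a tested recipe for any particular PoN: the patterns are anchored so that "chr" is only touched in ##contig header lines and in the CHROM column, whereas a global 's/chr//g' would also change any other "chr" in the file. The file names and the tiny demonstration VCF are hypothetical, the input is assumed to be uncompressed, and naming differences such as chrM versus MT are not handled.

```shell
# Sketch: rename contigs in an uncompressed VCF with anchored sed patterns.
# Only ##contig header lines and the first column of record lines are changed.

# chr1 -> 1
strip_chr() { sed -e 's/^##contig=<ID=chr/##contig=<ID=/' -e 's/^chr//' "$1"; }
# 1 -> chr1 (header lines start with '#', so they are skipped by the second pattern)
add_chr()   { sed -e 's/^##contig=<ID=/##contig=<ID=chr/' -e '/^#/!s/^/chr/' "$1"; }

# Tiny demonstration VCF (tab-separated, hypothetical content).
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1,length=248956422>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\t.\tA\tT\t.\t.\t.\n'
} > demo_pon.vcf

strip_chr demo_pon.vcf > demo_pon.nochr.vcf
add_chr demo_pon.nochr.vcf > demo_pon.roundtrip.vcf
```

    As noted above, the renamed file still needs to be re-indexed with gatk IndexFeatureFile before Mutect2 can use it.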

  • Vincent Appiah

    TAYYABA ALVI You can use LiftoverVcf to rename the contigs in the VCF file.

  • David Benjamin

    Yes, Vincent Appiah's solution is much more professional than mine!

  • TAYYABA ALVI

    Thank you so much! Both methods worked.

  • GE

    In a multi-sample analysis with 2 tumor samples vs 1 normal, if a somatic mutation is high quality in one of the tumor samples, but absent from the other tumor sample, will it be called?

    In other words, is the multi-sample calling strongly biased to call variants found in all tumor samples? If so, then I would expect a significant loss rather than gain in sensitivity relative to single-tumor vs normal calling followed by pooling calls.

    The output of multi-sample Mutect2 seems to only call genotypes of 0/1 in ALL tumor samples, suggesting it is not going to call variants that are present in one tumor sample, but absent in the other.

  • David Benjamin

    GE

    In a multi-sample analysis with 2 tumor samples vs 1 normal, if a somatic mutation is high quality in one of the tumor samples, but absent from the other tumor sample, will it be called?

    Yes.

    In other words, is the multi-sample calling strongly biased to call variants found in all tumor samples? 

    No.

    The output of multi-sample Mutect2 seems to only call genotypes of 0/1 in ALL tumor samples,

    Genotypes are not meaningful in the output of Mutect2.  0/1 just means that the variant was found in some sample.
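
    Since the GT field is not informative here, one way to look at per-sample evidence is to read the AF values in each sample column. The following is a sketch only: it assumes an uncompressed VCF whose FORMAT column includes an AF key, and the demonstration record, sample names, and numbers are invented purely to show the parsing. As discussed earlier in the thread, a low AF in one sample does not prove the variant is absent from it.

```shell
# Sketch: print CHROM, POS, FILTER and each sample's AF from a Mutect2-style VCF.
# Assumes an uncompressed VCF whose FORMAT column contains an AF key.
per_sample_af() {
  awk -F'\t' '
    /^##/     { next }                                      # skip meta lines
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
    {
      # Locate AF within the colon-separated FORMAT keys of this record.
      n = split($9, keys, ":"); af = 0
      for (k = 1; k <= n; k++) if (keys[k] == "AF") af = k
      if (af == 0) next
      line = $1 "\t" $2 "\t" $7
      for (i = 10; i <= NF; i++) {
        split($i, vals, ":")
        line = line "\t" name[i] "=" vals[af]
      }
      print line
    }
  ' "$1"
}

# Invented demonstration input (not real calls).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR1\tTUMOR2\n'
  printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT:AD:AF:DP\t0/0:30,0:0.03:30\t0/1:20,20:0.5:40\t0/1:40,0:0.02:40\n'
} > demo_calls.vcf

per_sample_af demo_calls.vcf
```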

  • Alessio Locallo

    Hello,

    I'd like to use Mutect2 in multi-sample mode, as I have data for primary tumors and matched relapses for some cancer patients. I created a PON using almost 200 samples of normal tissues (from the same patients for which I have cancer data), which I'm using for running Mutect2 in single-sample mode. My question is: should I use the PON when I run Mutect2 in multi-sample as well? Or is the PON not necessary when running Mutect2 in multi-sample mode?

  • Pamela Bretscher

    Hi Alessio Locallo,

    Yes, you should still use the normal samples in multi-sample mode, but in this mode, all normal samples are pooled into one matched normal. I hope this helps.

    Best, 

    Pamela

  • Alessio Locallo

    Hi Pamela Bretscher,

    thank you very much for your reply.

  • D

    If I have 1 tumor sample and 2 normal samples, how will Mutect2 know how to differentiate which files are for the tumor and which are for the normal sample? An example call is this:

     

    ./gatk Mutect2 \
            -R reference.fasta \
            -I /file-path/tumor1/calibrated.bam \
            -I /file-path/normal1/calibrated.bam \
            -I /file-path/normal2/calibrated.bam \
            -normal normal1 \
            -normal normal2 \
            --germline-resource af-only-gnomad.raw.sites.vcf \
            --panel-of-normals pon.vcf.gz \
            -O somatic.vcf.gz

     

  • Pamela Bretscher

    Hi D,

    This document may help clarify: https://gatk.broadinstitute.org/hc/en-us/articles/360035889791--How-to-Call-somatic-mutations-using-GATK4-Mutect2-Deprecated-

    Please let me know if this does not answer your question.

    Kind regards,

    Pamela

  • David Benjamin

    D the -normal argument specifies the normal samples; all others are presumed to be tumor samples.
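
    The values given to -normal are sample names as recorded in the SM field of the BAM read groups, not file paths. A small sketch for listing those names follows; it parses SAM header text from standard input, and the demonstration header is hypothetical.

```shell
# Sketch: list the distinct SM (sample name) values found in @RG header lines.
# Reads SAM header text on stdin.
sample_names() {
  awk -F'\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' | sort -u
}

# Demonstration with a hypothetical header instead of a real BAM.
printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:rg1\tSM:normal1\tPL:ILLUMINA\n@RG\tID:rg2\tSM:normal1\tPL:ILLUMINA\n' \
  | sample_names
```

    With samtools installed (an assumption, it is not mentioned in this thread), the header of a real file could be piped in the same way, for example: samtools view -H /file-path/normal1/calibrated.bam | sample_names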

  • David Benjamin

    Alessio Locallo  

    My question is: should I use the PON when I run Mutect2 in multi-sample as well? Or is the PON not necessary when running Mutect2 in multi-sample mode?

    You should still use a panel of normals.  Matched normals serve a different purpose for the most part, though there is some overlap in what they do.

  • Alexandre Mondaini

    Hello GATK team,

    Please help me clarify whether I understood this thread correctly.

    Say I have several tumor samples and only one normal for each of several patients.

    For example:

    patient1 normal1.bam tumor1.bam tumor2.bam tumor3.bam

    patient2 normal2.bam tumor4.bam tumor5.bam tumor6.bam

     

    Should I run Mutect2 in multi-sample mode? If I understood correctly, the normals will be pooled, and since I have different patients this will most likely be detrimental to my analysis?

    Would I be better off merging the tumor BAM files and running in tumor/normal pair mode?

    Thanks
