Mutect2 Questions
Hello,
I have a couple naive technical questions.
(1) I understand the basic command for using Mutect2 is this
gatk Mutect2 \ -R reference.fa \ -I MyTumorBAMfile.bam \ -I MyNormalBAMfile.bam \ -normal normal_sample_name -O somatic.vcf.gz
My first question is, how do I find the name of my sample to put under "normal_sample_name"? I did some digging and I think the name has to match the name in the BAM file. Here is the BAM file header; nothing immediately jumps out to me as being the name of my file.
[jong2@crcfe02 ~/Private]$ samtools view -H 0CSa_S288C_Groups_8May2020.bam
@HD VN:1.6 SO:coordinate
@SQ SN:ref|NC_001133| LN:230218
@SQ SN:ref|NC_001134| LN:813184
@SQ SN:ref|NC_001135| LN:316620
@SQ SN:ref|NC_001136| LN:1531933
@SQ SN:ref|NC_001137| LN:576874
@SQ SN:ref|NC_001138| LN:270161
@SQ SN:ref|NC_001139| LN:1090940
@SQ SN:ref|NC_001140| LN:562643
@SQ SN:ref|NC_001141| LN:439888
@SQ SN:ref|NC_001142| LN:745751
@SQ SN:ref|NC_001143| LN:666816
@SQ SN:ref|NC_001144| LN:1078177
@SQ SN:ref|NC_001145| LN:924431
@SQ SN:ref|NC_001146| LN:784333
@SQ SN:ref|NC_001147| LN:1091291
@SQ SN:ref|NC_001148| LN:948066
@SQ SN:ref|NC_001224| LN:85779
@RG ID:4 LB:lib1 PL:ILLUMINA SM:20 PU:unit1
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa COLONYS/0CSa-39909795/0CSa_S1_L001_R1_001.fastq COLONYS/0CSa-39909795/0CSa_S1_L001_R2_001.fastq
(2) I ran the command anyway without the flag for the name for the normal as before
[jong2@crcfe02 ~/Private]$ gatk-4.1.7.0/gatk Mutect2 -R RedoReference/S288C_reference_sequence_R64-2-1_20150113.fa -I 18CHa_S288C_Groups_8May2020.bam -I 0CSa_S288C_Groups_8May2020.bam -O 18CHa_Mutect2_0CSaS288CBase_8May2020.vcf
In this case, I wanted to ID the SNPs unique to the tumor sample starting 18CHa relative to the normal sample starting 0CSa.
The command ran as expected and I got my VCF file with about ~25,000 lines.
I didn't quite understand what exactly was in the VCF file, however. Am I to understand that all ~25,000 lines of the VCF are variants that are found in 18CHa that differ from both 0CSa and the reference genome? [It seems like a lot more than I was expecting.]
So for example, line one of my VCF file looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20 ref|NC_001133| 100 . G C . . AS_SB_TABLE=0,0|0,0;DP=1;ECNT=9;MBQ=0,0;MFRL=0,380;MMQ=60,39;MPOS=50;POPAF=7.30;TLOD=4.20 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:0,1:0.667:1:0,0:0,0:0|1:100_G_C:100:0,0,1,0
How do I understand these rows in my VCF file? Does this mean that 18CHa had a C at this position and neither the Ref (which had a G) nor the normal 0CSa (which had something else) had a C at this position?
Or, in other words, if at Position X the Ref had a G and both normal and tumor had a A -- would this variant be absent from my resulting VCF file?
Thanks in advance,
Joseph
-
Joseph Ong Thank you for organizing your questions so thoroughly and clearly. This is very helpful to us.
There are a few good ways to find the sample name:
(i) In the bam header the sample is contained in the read group line. For example, the sample is "20" below:
@RG ID:4 LB:lib1 PL:ILLUMINA SM:20 PU:unit1
Note that a bam may have multiple read groups all with different sample or with the same sample.
(ii) There is a GATK tool GetSampleName. You can run it as follows:
gatk GetSampleName -I normal.bam -O normal_sample.txt
(iii) If you give Mutect2 a bogus sample name it gives you an error message with the actual sample names. For example,
gatk Mutect2 -R ref.fasta -I tumor.bam -I normal.bam -normal bogus_name -O out.vcf
gives an error like "Sampel name bogus is not in the list of sample names [tumor_sample, normal_sample]
Running Mutect2 without the -normal flag means Mutect2 runs in multi-sample tumor-only mode (any sample not explicitly stated to be normal is assumed to be a tumor sample). Thus the 25,000 variants you found are sites where either 18CHa or 0CSa differ from the reference.
Finally, although the -germline-resource option is technically optional, for all human calling you should use our af-only gnomAD vcf. Depending on your reference these are in the following public google buckets:
gs://gatk-best-practices/somatic-b37/af-only-gnomad.raw.sites.vcf
gs://gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz
-
David Benjamin -- whatever they are paying you, it is not enough. Thank you very much for your help.
- Running Mutect2 without the -normal flag means Mutect2 runs in multi-sample tumor-only mode (any sample not explicitly stated to be normal is assumed to be a tumor sample). Thus the 25,000 variants you found are sites where either 18CHa or 0CSa differ from the reference.
I was wondering for a very long time why the number of somatic variants identified was so large (it was, in fact, the same as when I just did regular variant calling and not somatic variant calling). I knew I had ~25k variants between my genomes and the reference genome, so I figured something was wrong when I was seeing ~25k somatic variants.
---------------
So if I interpreted everything correctly, the name for the BAM file (the normal in this case, 0CSa) was 20. So you are saying I should run the command like this:
[jong2@crcfe02 ~/Private]$ gatk-4.1.7.0/gatk Mutect2 -R RedoReference/S288C_reference_sequence_R64-2-1_20150113.fa -I 18CHa_S288C_Groups_8May2020.bam -I 0CSa_S288C_Groups_8May2020.bam -normal [NAME HERE, which was 20 from the above example] -O 18CHa_Mutect2_0CSaS288CBase_8May2020.vcf
Unfortunately, both my normal and my tumor are both named "20".
Would you suggest that I change the name of my BAM files via something like this or this so they all have unique names?
Please pardon the questions. I'm self-learning which has been rewarding but challenging.
Thanks,
Joseph
-
That's right, you will need the tumor and normal to have different sample names, and samtools reheader is a good way to do this. If something goes wrong, please let us know, and good luck with the self-education.
-
Hi David Benjamin, I got it to work and my data looks reasonable! Thank you very much!! :)
Please sign in to leave a comment.
4 comments