I have a couple of naive technical questions.
(1) I understand the basic command for using Mutect2 is this:
gatk Mutect2 \
  -R reference.fa \
  -I MyTumorBAMfile.bam \
  -I MyNormalBAMfile.bam \
  -normal normal_sample_name \
  -O somatic.vcf.gz
My first question is: how do I find the sample name to supply as "normal_sample_name"? I did some digging and I think it has to match the sample name recorded in the BAM file. Here is the header of my normal BAM file; nothing immediately jumps out to me as being the name of my sample.
[jong2@crcfe02 ~/Private]$ samtools view -H 0CSa_S288C_Groups_8May2020.bam
@HD VN:1.6 SO:coordinate
@SQ SN:ref|NC_001133| LN:230218
@SQ SN:ref|NC_001134| LN:813184
@SQ SN:ref|NC_001135| LN:316620
@SQ SN:ref|NC_001136| LN:1531933
@SQ SN:ref|NC_001137| LN:576874
@SQ SN:ref|NC_001138| LN:270161
@SQ SN:ref|NC_001139| LN:1090940
@SQ SN:ref|NC_001140| LN:562643
@SQ SN:ref|NC_001141| LN:439888
@SQ SN:ref|NC_001142| LN:745751
@SQ SN:ref|NC_001143| LN:666816
@SQ SN:ref|NC_001144| LN:1078177
@SQ SN:ref|NC_001145| LN:924431
@SQ SN:ref|NC_001146| LN:784333
@SQ SN:ref|NC_001147| LN:1091291
@SQ SN:ref|NC_001148| LN:948066
@SQ SN:ref|NC_001224| LN:85779
@RG ID:4 LB:lib1 PL:ILLUMINA SM:20 PU:unit1
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa COLONYS/0CSa-39909795/0CSa_S1_L001_R1_001.fastq COLONYS/0CSa-39909795/0CSa_S1_L001_R2_001.fastq
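As far as I can tell, the sample name is the SM tag on the @RG line. Here is a minimal sketch of how I tried to pull it out, assuming the header is tab-separated as in a standard SAM header. The function name rg_sample_names is my own helper, not part of samtools or GATK.

```shell
# Print the unique SM (sample name) tags found on @RG lines of a SAM header
# read from stdin. rg_sample_names is a hypothetical helper name.
rg_sample_names() {
  awk -F'\t' '/^@RG/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SM:/) print substr($i, 4)   # strip the leading "SM:"
  }' | sort -u
}

# Usage (assumes samtools is on PATH):
#   samtools view -H 0CSa_S288C_Groups_8May2020.bam | rg_sample_names
```

Running this on both BAM files would also show whether the tumor and normal carry different SM values.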
(2) I ran the command anyway, without the flag naming the normal sample:
[jong2@crcfe02 ~/Private]$ gatk-188.8.131.52/gatk Mutect2 -R RedoReference/S288C_reference_sequence_R64-2-1_20150113.fa -I 18CHa_S288C_Groups_8May2020.bam -I 0CSa_S288C_Groups_8May2020.bam -O 18CHa_Mutect2_0CSaS288CBase_8May2020.vcf
In this case, I wanted to identify the SNPs unique to the tumor sample (the file starting with 18CHa) relative to the normal sample (the file starting with 0CSa).
The command ran as expected and I got my VCF file with about 25,000 lines.
I didn't quite understand what exactly was in the VCF file, however. Am I to understand that all ~25,000 lines of the VCF are variants that are found in 18CHa that differ from both 0CSa and the reference genome? [It seems like a lot more than I was expecting.]
So for example, the column header and first record of my VCF file look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20
ref|NC_001133| 100 . G C . . AS_SB_TABLE=0,0|0,0;DP=1;ECNT=9;MBQ=0,0;MFRL=0,380;MMQ=60,39;MPOS=50;POPAF=7.30;TLOD=4.20 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:0,1:0.667:1:0,0:0,0:0|1:100_G_C:100:0,0,1,0
How do I understand these rows in my VCF file? Does this mean that 18CHa had a C at this position and neither the Ref (which had a G) nor the normal 0CSa (which had something else) had a C at this position?
Or, in other words, if at Position X the Ref had a G and both normal and tumor had an A, would this variant be absent from my resulting VCF file?
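To make the record easier to read, here is a rough sketch that pairs each FORMAT key with its value in the first sample column. It assumes the VCF is tab-separated with FORMAT in column 9 and a sample in column 10; format_pairs is my own helper name, not a GATK or bcftools command.

```shell
# For every non-header VCF record on stdin, print KEY=VALUE for each FORMAT
# field of the first sample column. format_pairs is a hypothetical helper name.
format_pairs() {
  awk -F'\t' '!/^#/ {
    n = split($9, key, ":")    # FORMAT keys, e.g. GT:AD:AF
    split($10, val, ":")       # matching values for the first sample
    for (i = 1; i <= n; i++) print key[i] "=" val[i]
  }'
}

# Usage:
#   format_pairs < 18CHa_Mutect2_0CSaS288CBase_8May2020.vcf | head
```

For the record above this prints GT=0|1, AD=0,1, AF=0.667 and so on, one pair per line, all for the single sample column named 20.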
Thanks in advance,