Confusing Mutect2 F1R2 (_f1r2.tar.gz) outputs
Hello,
I am running a somatic variant calling pipeline for the first time and had some questions regarding the F1R2 outputs from calling Mutect2 with --f1r2-tar-gz to feed into LearnReadOrientationModel.
a) GATK version used: v4.4.0.0. I've been following GATK Best Practices tutorials (How to) Call somatic mutations using GATK4 Mutect2 and Somatic short variant discovery (SNVs + Indels), which have been very useful.
b) Exact command used: I'm running Mutect2 in matched tumor-normal mode, scattering by chromosome for efficiency. Here's the relevant excerpt from my batch script:
gatk Mutect2 \
-R $ref -I $bam1 -I $bam2 \
-normal $(samtools head $bam2 | grep -o SM:[^\ ]* -m1 | sed 's/^SM://') \
-O ${tmp}/${chr}.vcf.gz \
--f1r2-tar-gz ${tmp}/${chr}_f1r2.tar.gz \
-L $chr \
--max-mnp-distance 0 \
--germline-resource $gnomAD \
--panel-of-normals $pon \
--java-options "-Xmx2g" \
--tmp-dir $tmp
c) Entire program log: Everything seems to run successfully; I get usable VCF files for each chromosome (which I have been able to merge successfully) and the tool returns SUCCESS.
Since I'm scattering by chromosome, I get a separate ${chr}_f1r2.tar.gz for each. These all seem to have parsed the BAM files appropriately, but the histogram tables are entirely zeros. Here's a representative excerpt (for sample #001):
./001.alt_table0100644 0012426 0014051 00000000116 14627666571 016560 0ustar00[usr][usr]0000000 0000000 14627655624 14627666571 #<METADATA>SAMPLE=001
context ref_count alt_count ref_f1r2 alt_f1r2 depth alt
./001.ref_histogram0100644 0012426 0014051 00000064012 14627666571 017472 0ustar00[usr][usr]0000000 0000000 14627666571 14627666571 ## htsjdk.samtools.metrics.StringHeader
# 001
## HISTOGRAM java.lang.Integer
depth ATT CTT GTT TAT AAA CAA AAC CAC GAA AAG CAG GAC GAG TGA TGC TCA AAT TCC TGG CAT TCG GAT TGT TTA TTC TCT TTG TTT AGA CGA AGC CGC ACA CCA GGA ACC AGG CCC CGG GGC GCA ACG CCG GCC GGG GCG AGT ATA ATC CGT CTA ACT CTC ATG CCT GGT GTA TAA CTG GTC TAC GCT GTG TAG
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[…]
200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
./001.alt_histogram0100644 0012426 0014051 00000465612 14627666571 017511 0ustar00[usr][usr]0000000 0000000 14627666571 14627666571 ## htsjdk.samtools.metrics.StringHeader
# 001
## HISTOGRAM java.lang.Integer
depth ATT_A_F2R1 ATT_C_F1R2 ATT_G_F2R1 ATT_A_F1R2 ATT_C_F2R1 ATT_G_F1R2 CTT_A_F2R1 CTT_C_F1R2 CTT_G_F2R1 CTT_A_F1R2 CTT_C_F2R1 CTT_G_F1R2 GTT_A_F2R1 GTT_C_F1R2 GTT_G_F2R1 GTT_A_F1R2 GTT_C_F2R1 GTT_G_F1R2 TAT_C_F1R2 TAT_G_F2R1 TAT_T_F2R1 TAT_C_F2R1 TAT_G_F1R2 TAT_T_F1R2 AAA_C_F1R2 AAA_G_F2R1 AAA_T_F2R1 AAA_C_F2R1 AAA_G_F1R2 AAA_T_F1R2 CAA_C_F1R2 CAA_G_F2R1 CAA_T_F2R1 CAA_C_F2R1 CAA_G_F1R2 CAA_T_F1R2 AAC_C_F1R2 AAC_G_F2R1 AAC_T_F2R1 AAC_C_F2R1 AAC_G_F1R2 AAC_T_F1R2 CAC_C_F1R2 CAC_G_F2R1 CAC_T_F2R1 CAC_C_F2R1 CAC_G_F1R2 CAC_T_F1R2 GAA_C_F1R2 GAA_G_F2R1 GAA_T_F2R1 GAA_C_F2R1 GAA_G_F1R2 GAA_T_F1R2 AAG_C_F1R2 AAG_G_F2R1 AAG_T_F2R1 AAG_C_F2R1 AAG_G_F1R2 AAG_T_F1R2 CAG_C_F1R2 CAG_G_F2R1 CAG_T_F2R1 CAG_C_F2R1 CAG_G_F1R2 CAG_T_F1R2 GAC_C_F1R2 GAC_G_F2R1 GAC_T_F2R1 GAC_C_F2R1 GAC_G_F1R2 GAC_T_F1R2 GAG_C_F1R2 GAG_G_F2R1 GAG_T_F2R1 GAG_C_F2R1 GAG_G_F1R2 GAG_T_F1R2 TGA_A_F2R1 TGA_C_F1R2 TGA_T_F2R1 TGA_A_F1R2 TGA_C_F2R1 TGA_T_F1R2 TGC_A_F2R1 TGC_C_F1R2 TGC_T_F2R1 TGC_A_F1R2 TGC_C_F2R1 TGC_T_F1R2 TCA_A_F2R1 TCA_G_F2R1 TCA_T_F2R1 TCA_A_F1R2 TCA_G_F1R2 TCA_T_F1R2 AAT_C_F1R2 AAT_G_F2R1 AAT_T_F2R1 AAT_C_F2R1 AAT_G_F1R2 AAT_T_F1R2 TCC_A_F2R1 TCC_G_F2R1 TCC_T_F2R1 TCC_A_F1R2 TCC_G_F1R2 TCC_T_F1R2 TGG_A_F2R1 TGG_C_F1R2 TGG_T_F2R1 TGG_A_F1R2 TGG_C_F2R1 TGG_T_F1R2 CAT_C_F1R2 CAT_G_F2R1 CAT_T_F2R1 CAT_C_F2R1 CAT_G_F1R2 CAT_T_F1R2 TCG_A_F2R1 TCG_G_F2R1 TCG_T_F2R1 TCG_A_F1R2 TCG_G_F1R2 TCG_T_F1R2 GAT_C_F1R2 GAT_G_F2R1 GAT_T_F2R1 GAT_C_F2R1 GAT_G_F1R2 GAT_T_F1R2 TGT_A_F2R1 TGT_C_F1R2 TGT_T_F2R1 TGT_A_F1R2 TGT_C_F2R1 TGT_T_F1R2 TTA_A_F2R1 TTA_C_F1R2 TTA_G_F2R1 TTA_A_F1R2 TTA_C_F2R1 TTA_G_F1R2 TTC_A_F2R1 TTC_C_F1R2 TTC_G_F2R1 TTC_A_F1R2 TTC_C_F2R1 TTC_G_F1R2 TCT_A_F2R1 TCT_G_F2R1 TCT_T_F2R1 TCT_A_F1R2 TCT_G_F1R2 TCT_T_F1R2 TTG_A_F2R1 TTG_C_F1R2 TTG_G_F2R1 TTG_A_F1R2 TTG_C_F2R1 TTG_G_F1R2 TTT_A_F2R1 TTT_C_F1R2 TTT_G_F2R1 TTT_A_F1R2 TTT_C_F2R1 TTT_G_F1R2 AGA_A_F2R1 AGA_C_F1R2 AGA_T_F2R1 AGA_A_F1R2 AGA_C_F2R1 AGA_T_F1R2 CGA_A_F2R1 CGA_C_F1R2 CGA_T_F2R1 CGA_A_F1R2 CGA_C_F2R1 CGA_T_F1R2 AGC_A_F2R1 AGC_C_F1R2 AGC_T_F2R1 AGC_A_F1R2 AGC_C_F2R1 AGC_T_F1R2 CGC_A_F2R1 CGC_C_F1R2 CGC_T_F2R1 CGC_A_F1R2 CGC_C_F2R1 CGC_T_F1R2 ACA_A_F2R1 ACA_G_F2R1 ACA_T_F2R1 ACA_A_F1R2 ACA_G_F1R2 ACA_T_F1R2 CCA_A_F2R1 CCA_G_F2R1 CCA_T_F2R1 CCA_A_F1R2 CCA_G_F1R2 CCA_T_F1R2 GGA_A_F2R1 GGA_C_F1R2 GGA_T_F2R1 GGA_A_F1R2 GGA_C_F2R1 GGA_T_F1R2 ACC_A_F2R1 ACC_G_F2R1 ACC_T_F2R1 ACC_A_F1R2 ACC_G_F1R2 ACC_T_F1R2 AGG_A_F2R1 AGG_C_F1R2 AGG_T_F2R1 AGG_A_F1R2 AGG_C_F2R1 AGG_T_F1R2 CCC_A_F2R1 CCC_G_F2R1 CCC_T_F2R1 CCC_A_F1R2 CCC_G_F1R2 CCC_T_F1R2 CGG_A_F2R1 CGG_C_F1R2 CGG_T_F2R1 CGG_A_F1R2 CGG_C_F2R1 CGG_T_F1R2 GGC_A_F2R1 GGC_C_F1R2 GGC_T_F2R1 GGC_A_F1R2 GGC_C_F2R1 GGC_T_F1R2 GCA_A_F2R1 GCA_G_F2R1 GCA_T_F2R1 GCA_A_F1R2 GCA_G_F1R2 GCA_T_F1R2 ACG_A_F2R1 ACG_G_F2R1 ACG_T_F2R1 ACG_A_F1R2 ACG_G_F1R2 ACG_T_F1R2 CCG_A_F2R1 CCG_G_F2R1 CCG_T_F2R1 CCG_A_F1R2 CCG_G_F1R2 CCG_T_F1R2 GCC_A_F2R1 GCC_G_F2R1 GCC_T_F2R1 GCC_A_F1R2 GCC_G_F1R2 GCC_T_F1R2 GGG_A_F2R1 GGG_C_F1R2 GGG_T_F2R1 GGG_A_F1R2 GGG_C_F2R1 GGG_T_F1R2 GCG_A_F2R1 GCG_G_F2R1 GCG_T_F2R1 GCG_A_F1R2 GCG_G_F1R2 GCG_T_F1R2 AGT_A_F2R1 AGT_C_F1R2 AGT_T_F2R1 AGT_A_F1R2 AGT_C_F2R1 AGT_T_F1R2 ATA_A_F2R1 ATA_C_F1R2 ATA_G_F2R1 ATA_A_F1R2 ATA_C_F2R1 ATA_G_F1R2 ATC_A_F2R1 ATC_C_F1R2 ATC_G_F2R1 ATC_A_F1R2 ATC_C_F2R1 ATC_G_F1R2 CGT_A_F2R1 CGT_C_F1R2 CGT_T_F2R1 CGT_A_F1R2 CGT_C_F2R1 CGT_T_F1R2 CTA_A_F2R1 CTA_C_F1R2 CTA_G_F2R1 CTA_A_F1R2 CTA_C_F2R1 CTA_G_F1R2 ACT_A_F2R1 ACT_G_F2R1 ACT_T_F2R1 ACT_A_F1R2 ACT_G_F1R2 ACT_T_F1R2 CTC_A_F2R1 CTC_C_F1R2 CTC_G_F2R1 CTC_A_F1R2 CTC_C_F2R1 CTC_G_F1R2 ATG_A_F2R1 ATG_C_F1R2 ATG_G_F2R1 ATG_A_F1R2 ATG_C_F2R1 ATG_G_F1R2 CCT_A_F2R1 CCT_G_F2R1 CCT_T_F2R1 CCT_A_F1R2 CCT_G_F1R2 CCT_T_F1R2 GGT_A_F2R1 GGT_C_F1R2 GGT_T_F2R1 GGT_A_F1R2 GGT_C_F2R1 GGT_T_F1R2 GTA_A_F2R1 GTA_C_F1R2 GTA_G_F2R1 GTA_A_F1R2 GTA_C_F2R1 GTA_G_F1R2 TAA_C_F1R2 TAA_G_F2R1 TAA_T_F2R1 TAA_C_F2R1 TAA_G_F1R2 TAA_T_F1R2 CTG_A_F2R1 CTG_C_F1R2 CTG_G_F2R1 CTG_A_F1R2 CTG_C_F2R1 CTG_G_F1R2 GTC_A_F2R1 GTC_C_F1R2 GTC_G_F2R1 GTC_A_F1R2 GTC_C_F2R1 GTC_G_F1R2 TAC_C_F1R2 TAC_G_F2R1 TAC_T_F2R1 TAC_C_F2R1 TAC_G_F1R2 TAC_T_F1R2 GCT_A_F2R1 GCT_G_F2R1 GCT_T_F2R1 GCT_A_F1R2 GCT_G_F1R2 GCT_T_F1R2 GTG_A_F2R1 GTG_C_F1R2 GTG_G_F2R1 GTG_A_F1R2 GTG_C_F2R1 GTG_G_F1R2 TAG_C_F1R2 TAG_G_F2R1 TAG_T_F2R1 TAG_C_F2R1 TAG_G_F1R2 TAG_T_F1R2
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[…]
200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As a result, when I run LearnReadOrientationModel on all the scattered ${chr}_f1r2.tar.gz files together, the "f[12]r[21]_[acgt]" column entries seem to be assigned randomly to either 0.0 or 0.1, but the "hom_ref", "germline_het", "somatic_het", and "hom_var" column entries are uniformly set to 0.1, and the "num_examples" and "num_alt_examples" column entries are uniformly set to 0. I suspect this is not the expected output.
I assume that the histogram in the ${chr}_f1r2.tar.gz files should not be full of only 0s, so is there some issue with reading/collecting counts from the BAM files that I need to fix? I don't quite understand how to parse these output files (I've consulted the documentation for LearnReadOrientationModel, but that did not seem to provide specifics).
Any insights would be greatly appreciated! Thank you for your time.
-
As a follow-up, I'm also getting similar results when I run GetPileupSummaries (for input into CalculateContamination); here is the relevant command:
gatk GetPileupSummaries \
-I $bam1 \
-V $sites \
-L $sites \
-O $out.table
Here, $sites refers to "small_exac_common_3.hg38.vcf.gz", which I downloaded from the GATK Google bucket as instructed in this forum post. I'm using the "af-only-gnomAD.vcf.gz" reference for input into Mutect2 itself (referenced as $gnomAD in my original post above), if that matters.
Representative output:
#<METADATA>SAMPLE=001
contig position ref_count alt_count other_alt_count allele_frequency
1 17365 0 0 0 0.136
1 17385 0 0 0 0.122
1 69761 0 0 0 0.113
1 187887 0 0 0 0.136
1 187907 0 0 0 0.122
1 962358 0 0 0 0.08
1 963087 0 0 0 0.052
1 973034 0 0 0 0.058
The "ref_count", "alt_count", and "other_alt_count" column entries are all 0s. This happens for every sample.
I should note that because I aligned these BAMs to the GRCh38 reference, I had to modify the hg38-derived sites files to strip the "chr" prefixes from each contig (and reindex the files accordingly). I'm not sure if that could be the cause of these discrepancies. I am, however, still getting a large number of variant calls per sample, so that at least seems to be fine.
If anyone has any ideas why these outputs are full of zeros, please let me know! Thanks.
-
Hi lreska3m
Where did you get the hg38 reference genome? Are you using the one from our resource bundle?
-
Hello Gökalp Çelik,
Thanks for reaching out. Let me try to clarify:
- I aligned my data to the GRCh38 reference genome, using the annotation GTF from Ensembl release 111.
- I downloaded both "af-only-gnomad.hg38.vcf.gz" and "small_exac_common_3.hg38.vcf.gz" from https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38 (following advice in this post and this post).
- When I tried to run Mutect2 and Calculate Contamination as above, I received errors due to incompatible contig names. I then modified these files to match my reference genome by stripping the "chr" prefixes from each contig and reindexing the files with GRCh38-compliant chromosome names. Both tools ran successfully after this modification, but they produced the 0-filled output tables shown above.
I have two questions, essentially:
- Is there some basic reason why these tools would produce output tables full of 0s? I realize I may have overlooked some compatibility issue when I modified the hg38-derived files.
- Alternatively, is there a GRCh38-compliant version of "af-only-gnomAD.vcf.gz" and "small_exac_common_3.vcf.gz" in the GATK resource bundle? I would prefer to keep aligning to the GRCh38 reference as that is my current pipeline standard, if possible.
Any insights would be greatly appreciated! Thanks.
-
Hi again.
Our resource bundle reference genomes and other accessory files are set to be compatible 100% for all tasks that we publish. On the other hand other reference genomes may have incompatible reference nucleotides, masked/unmasked duplicate regions where some of your reads could easily be siphoned into. Therefore we recommend using our own reference genomes or if you are using your preferred reference genome make sure that you convert our resource files to become compatible with your reference genome.
To understand better we may need to check the sequence dictionary of your reference genome. If you can post it here it would be great.
Especially for those 0 scoring pileup sites can you share a couple of them from IGV views so that we can see what is going on.
One more thing. Is this a RNAseq experiment? Which aligner are you using? There are certain things related to using different aligners other than BWA MEM or dragmap. STAR aligner uses 255 MAPQ score for unique mapping reads but in reality that score means no map therefore STAR aligned bam files must be converted to an usable form by our tool named SplitNCigarReads otherwise you will not get any results out of your bam files. If you are using bowtie2 for alignment GetPileupSummaries uses MAPQ 50 as minimum therefore you will only get 0s since bowtie2 has a maximum MAPQ of 44. Mutect2 also does not report F1R2 summaries for sites with median MAPQ less than 50.
We may need answers for all of these questions but I will focus on the last one since that feels much likely to be the cause of it.
I hope this helps.
-
Hi again,
Thank you for explaining! This makes perfect sense—I used bowtie2 to align my WES data, so that is probably why both the F1R2 summaries and GetPileupSummaries results are filled with 0s. I didn't realize the MAPQ cutoffs were set that high by default; I appreciate you bringing this to my attention.
A few last clarifying questions (to make sure I am interpreting the documentation correctly):
-
For Mutect2: is it correct to say that --f1r2-median-mq (default 50) excludes all reads mapping to a given site if their median MAPQ as a group is < 50, while --f1r2-min-bq (default 20) excludes only reads mapping to a given site if the Phred score of the individual base at that locus for that particular read is < 20?
- For GetPileupSummaries: is it correct to say that --min-mapping-quality (default 50) would exclude only reads whose MAPQ is < 50?
Many thanks!
-
-
For both questions the answer is yes you are absolutely right. Both f1r2-median-mq and --min-mapping-quality parameters require eligible sites to have median MAPQ 50 or above and eligible reads to have MAPQ 50 or above. Anything below 50 is filtered out.
So your problem can be solved by switching to BWA MEM if it is fine for your case. If you wish to stick with bowtie2 you may need to give up on those parameters used for filtering Mutect2 calls.
I hope this helps.
-
Okay, I understand. Thanks again for the explanation!
Please sign in to leave a comment.
7 comments