Artefactual calls using Mutect2 on RNAseq data

Answered

February 14, 2020 10:46

Hello all,

I am using Mutect2 (GATK 4.1.4.1) to call somatic mutations on RNAseq samples using germline samples from whole exome sequencing. I have run the RNAseq BAMs through the recommended pre-processing and ran Mutect2 and the filtering of the resulting VCF with default parameters (code below).

From the resulting VCF I obtain many variants that look artefactual:

They are almost invariably at the intronic edges of exons
They sometimes appear in regions with little or even no coverage
If I look through them manually in IGV I can't see any base changes between the germline, the RNAseq sample and the reference.

Is this an known artefact from RNAseq data when using DNA as germline in Mutect2? I have seen publications where this same method was used, but no artefacts where reported, so I am assuming I am doing something wrong, but I can't put my finger on what it is. Also, why would Mutect2 not filter out variants in regions with little to no coverage? It sometimes outputs "variants" with DP as low as 2.

Code used for the pre-processing:

java -jar $EBROOTPICARD/picard.jar MarkDuplicates\
      I=${BAM} \
      O=tmp/${SAMPLE_ID}_md.bam \
      M=tmp/${SAMPLE_ID}_md_metrics.txt

gatk SplitNCigarReads \
  -R input/genome.fa\
  -I tmp/${SAMPLE_ID}_md.bam \
  -O tmp/${SAMPLE_ID}_md_split.bam

gatk BaseRecalibrator \
      -I tmp/${SAMPLE_ID}_md_split.bam \
      -R input/genome.fa \
      --known-sites input/dbsnp_138.hg19.vcf \
      -O tmp/${SAMPLE_ID}.table

gatk ApplyBQSR \
  -R input/genome.fa \
  -I tmp/${SAMPLE_ID}_md_split.bam \
  --bqsr-recal-file tmp/${SAMPLE_ID}.table \
  -O tmp/pre_processed_${SAMPLE_ID}.bam

Code used for the Mutect2 call and filtering:

gatk Mutect2\
                  --reference input/ucsc.hg19.fasta\
                  --intervals tmp/exome_regions.list\
                  -I ${GERMLINE[@]}\
                  -I ${SAMPLES[@]}\
                  -normal ${INPUT_PATIENT}_BS_GL\
                  --tumor-lod-to-emit 2.0\
                  --f1r2-tar-gz tmp/${INPUT_PATIENT}_f1r2.tar.gz\
                  --output tmp/'${INPUT_PATIENT}'.vcf

gatk FilterMutectCalls \
   -R input/ucsc.hg19.fasta \
   -V tmp/${PATIENT}.vcf \
   --ob-priors tmp/${PATIENT}_f1r2.tar.gz \
   -O tmp/${PATIENT}_rna_mutect2_filtered.vcf

25 comments

David Benjamin

February 14, 2020 13:56
cmartinezruiz I don't see anything wrong with your pipeline. I will need a few things for hopefully begin guessing what the issue is.

First, could you post about 10 VCF lines of representative false positives (more if you want, but only false positives) from the output of FilterMutectCalls?

Second, FilterMutectCalls automatically generates a small plain-text filtering stats file. Could you post the contents of that file here?

Finally, for at least one false positive could you re-run Mutect2 in a 1000-base interval surrounding the variant with the `-bamout out.bam` added to your command, and then show an IGV screenshot of out.bam along with the RNA-seq sample, but not the normal?

If you want, you can start with the first two things. Maybe some pattern will stand out before we even need to make a bamout.
1

Comment actions Permalink

cmartinezruiz

February 14, 2020 15:45

Thanks for you quick response David Benjamin! Here are 10 artefactual variants (I kept only biallelic SNVs):

chr1	900339	.	C	G	.	PASS	CONTQ=93;DP=353;ECNT=1;GERMQ=93;MBQ=37,32;MFRL=196,624;MMQ=60,60;MPOS=2;NALOD=0.814;NLOD=89.05;POPAF=6;ROQ=21;SEQQ=44;STRANDQ=84;TLOD=9.89	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:2,4:0.625:6:0,0:2,2:2,0,2,2	0/0:325,1:0.00628:326:101,0:213,1:273,52,1,0
chr1	8923293	.	C	T	.	PASS	CONTQ=93;DP=873;ECNT=1;GERMQ=93;MBQ=37,32;MFRL=180,774;MMQ=60,60;MPOS=3;NALOD=2.86;NLOD=212.9;POPAF=6;ROQ=9;SEQQ=81;STRANDQ=50;TLOD=14.09	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:8,5:0.408:13:6,5:0,0:6,2,1,4	0/0:834,0:0.001373:834:176,0:643,0:452,382,0,0
chr1	12371983	.	G	A	.	PASS	CONTQ=93;DP=360;ECNT=2;GERMQ=93;MBQ=37,32;MFRL=220,2537;MMQ=60,60;MPOS=7;NALOD=2.5;NLOD=94.4;POPAF=6;ROQ=17;SEQQ=93;STRANDQ=93;TLOD=140.24	GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB	0|1:1,33:0.943:34:0,0:1,33:0|1:12371983_G_A:12371983:1,0,10,23	0|0:326,0:0.003121:326:56,0:256,0:0|1:12371983_G_A:12371983:28,298,0,0
chr1	12383761	.	A	G	.	PASS	CONTQ=93;DP=423;ECNT=1;GERMQ=93;MBQ=37,33;MFRL=191,4977;MMQ=60,60;MPOS=1;NALOD=2.56;NLOD=107.69;POPAF=6;ROQ=24;SEQQ=93;STRANDQ=50;TLOD=14.38	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:0,4:0.833:4:0,0:0,4:0,0,4,0	0/0:406,1:0.002753:407:63,0:336,0:349,57,0,1
chr1	15959932	.	A	T	.	PASS	CONTQ=93;DP=437;ECNT=1;GERMQ=93;MBQ=37,28;MFRL=190,9393;MMQ=60,60;MPOS=1;NALOD=2.58;NLOD=113.62;POPAF=6;ROQ=38;SEQQ=9;STRANDQ=44;TLOD=5.95	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:0,2:0.782:2:0,0:0,2:0,0,2,0	0/0:420,0:0.002597:420:63,0:347,0:341,79,0,0
chr1	16054854	.	T	G	.	PASS	CONTQ=93;DP=383;ECNT=1;GERMQ=93;MBQ=37,22;MFRL=227,840;MMQ=60,60;MPOS=3;NALOD=2.51;NLOD=96.99;POPAF=6;ROQ=9;SEQQ=93;STRANDQ=93;TLOD=59.62	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:10,15:0.633:25:0,0:9,15:5,5,4,11	0/0:358,0:0.003061:358:70,0:257,0:76,282,0,0
chr1	17250975	.	G	C	.	PASS	CONTQ=93;DP=286;ECNT=1;GERMQ=93;MBQ=37,26;MFRL=171,6399;MMQ=60,60;MPOS=4;NALOD=2.38;NLOD=70.13;POPAF=6;ROQ=30;SEQQ=5;STRANDQ=65;TLOD=5.46	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:0,2:0.75:2:0,0:0,2:0,0,1,1	0/0:272,0:0.004165:272:55,0:214,0:134,138,0,0
chr1	19523631	.	T	A	.	PASS	CONTQ=93;DP=925;ECNT=1;GERMQ=93;MBQ=37,38;MFRL=186,4579;MMQ=60,60;MPOS=1;NALOD=2.89;NLOD=232.81;POPAF=6;ROQ=20;SEQQ=16;STRANDQ=59;TLOD=7.05	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:1,2:0.6:3:1,2:0,0:1,0,1,1	0/0:892,0:0.001273:892:147,0:731,0:599,293,0,0
chr1	20987373	.	C	T	.	PASS	CONTQ=93;DP=289;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=181,5029;MMQ=60,60;MPOS=1;NALOD=2.38;NLOD=70.94;POPAF=6;ROQ=11;SEQQ=67;STRANDQ=36;TLOD=12.44	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:4,4:0.555:8:4,4:0,0:1,3,0,4	0/0:272,0:0.004152:272:61,0:207,0:170,102,0,0
chr1	22150612	.	A	C	.	PASS	CONTQ=93;DP=176;ECNT=1;GERMQ=93;MBQ=37,25;MFRL=193,591;MMQ=60,60;MPOS=3;NALOD=2.16;NLOD=41.6;POPAF=6;ROQ=16;SEQQ=93;STRANDQ=47;TLOD=17.58	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:10,8:0.453:18:8,8:1,0:3,7,7,1	0/0:154,0:0.006865:154:34,0:114,0:120,34,0,0

And here are the contents of the .stats file:

statistic	value
callable	5.2029427E7

In the meantime I will generate the out.bam file. Thanks again!

David Benjamin

February 14, 2020 15:55
cmartinezruiz I should have been more clear -- I need the .filteringStats.tsv file produced as an output by FilterMutectCalls, not the Mutect stats that it requires as an input. But in the meantime those VCF records will give me something to study.
0

Comment actions Permalink
David Benjamin

February 14, 2020 16:11
Two things jump out to me: First, the MPOS (median distance from the end of the read) is very low for all the false positives. We are usually very conservative about filtering based on that but for RNA seq it may be appropriate to be more aggressive than the DNA seq default. Although we don't like to involve ourselves much with arbitrary hard filters, it may be correct here to use a larger value of `--min-median-read-position`, like 5.

The other big signal is that the MFRL (median fragment length) is around 200 for reference alleles and much larger for alt alleles. This could just be due to different splicing, but it's suspicious.

It can't hurt to send the filtering stats, but based on these signatures IGV screenshots, with the original bams and the bamouts, and sorted by the bamout base, will be most helpful. It may be useful to show it zoomed in to 100 bases, 300 bases, and 1000 bases. The loci I would most like to see are:
```
chr1	12371983	.	G	A
```
```
chr1	15959932	.	A	T
```
```
chr1	900339	.	C	G
```
1

Comment actions Permalink

cmartinezruiz

February 14, 2020 16:15

Sorry, here are the contents of the stats file:

#<METADATA>Ln prior of deletion of length 10=-16.668707731660547
#<METADATA>Ln prior of deletion of length 9=-16.668707731660547
#<METADATA>Ln prior of deletion of length 8=-15.821409871273344
#<METADATA>Ln prior of deletion of length 7=-16.668707731660547
#<METADATA>Ln prior of deletion of length 6=-16.668707731660547
#<METADATA>Ln prior of deletion of length 5=-15.570095442992438
#<METADATA>Ln prior of deletion of length 4=-15.975560551100603
#<METADATA>Ln prior of deletion of length 3=-15.128262690713399
#<METADATA>Ln prior of deletion of length 2=-15.570095442992438
#<METADATA>Ln prior of deletion of length 1=-15.128262690713399
#<METADATA>Ln prior of SNV=-11.425198601607507
#<METADATA>Ln prior of insertion of length 1=-13.797028106776535
#<METADATA>Ln prior of insertion of length 2=-14.771587746774667
#<METADATA>Ln prior of insertion of length 3=-14.240959495712497
#<METADATA>Ln prior of insertion of length 4=-14.471483154324329
#<METADATA>Ln prior of insertion of length 5=-14.631825804399508
#<METADATA>Ln prior of insertion of length 6=-14.994731298088876
#<METADATA>Ln prior of insertion of length 7=-15.128262690713399
#<METADATA>Ln prior of insertion of length 8=-14.876948262432494
#<METADATA>Ln prior of insertion of length 9=-15.202370662867121
#<METADATA>Ln prior of insertion of length 10=-15.128262690713399
#<METADATA>High-AF beta-binomial cluster=weight = 0.0285, alpha = 10.21, beta = 0.87
#<METADATA>Background beta-binomial cluster=weight = 0.0354, alpha = 1.38, beta = 0.69
#<METADATA>Binomial cluster 1=weight = 0.2941, mean = 0.874
#<METADATA>Binomial cluster 1=weight = 0.2606, mean = 0.967
#<METADATA>Binomial cluster 1=weight = 0.1747, mean = 0.755
#<METADATA>Binomial cluster 1=weight = 0.1062, mean = 0.928
#<METADATA>Binomial cluster 1=weight = 0.0719, mean = 0.597
#<METADATA>Binomial cluster 1=weight = 0.0357, mean = 0.472
#<METADATA>Binomial cluster 1=weight = 0.0198, mean = 0.318
#<METADATA>Binomial cluster 1=weight = 0.0162, mean = 0.151
#<METADATA>Binomial cluster 1=weight = 0.0115, mean = 0.634
#<METADATA>Binomial cluster 1=weight = 0.0055, mean = 0.425
#<METADATA>Binomial cluster 1=weight = 0.0038, mean = 0.052
#<METADATA>threshold=0.507
#<METADATA>fdr=0.018
#<METADATA>sensitivity=0.984
filter	FP	FDR	FN	FNR
weak_evidence	44.26	0.01	51.86	0.01
strand_bias	0.32	0.0	0.04	0.0
normal_artifact	0.01	0.0	0.0	0.0
orientation	24.69	0.01	6.76	0.0
slippage	0.0	0.0	0.0	0.0
haplotype	3.59	0.0	0.32	0.0
germline	0.0	0.0	0.0	0.0

cmartinezruiz

February 14, 2020 16:29
Here are a couple screen shots of the RNAseq bam used (top track) and the out.bam file for that region. The SNV in the screenshot is the first of the list of 10 SNVs I sent in my earlier message

Please let me know if you need more info. Thanks again!
0

Comment actions Permalink
Bhanu Gandham

March 02, 2020 19:34
Hi cmartinezruiz

David Benjamin is looking into this issue for you. Would you please provide a subset of the bam with the problematic variants to help David debug this issue you are facing. Also please subset the bam with 1000bps up and down stream of the problematic variants.

You can find info on how to send the data to us here: https://gatk.zendesk.com/hc/en-us/articles/360035889671
0

Comment actions Permalink
cmartinezruiz

March 05, 2020 13:59
Hello David Benjamin and Bhanu Gandham,

I have uploaded a file to the ftp server called 'to_gatk_rna_mtect2_artefact.zip'

This zip file contains the following documents:

gatk_subsample_focal_1000bp_rna.bam: The requested subsampled BAM file to 1000bp of each variant. This BAM file is for the RNAseq focal sample. Please note that the positions I used for sub-sampling are for all variants, not just the artefactual ones (if I knew how to separate artefactual variants from real ones I would not have this problem), but the vast majority of variants seem to be artefactual.

gatk_subsample_gl_1000bp_dna.bam: The subsampled BAM file to 1000bp of each variant for the control DNA germline.

mutect2_gatk_subsample.sh: The commands I used to get the variants.

Due to unrelated issues I had to re-install GATK with the latest version, which changed somewhat the results I posted in this thread initially (which I guess already suggests there is something wrong with these variants, as I wouldn't think changing GATK versions should alter the results too much). I re-ran everything on GATK v4.1.5.0, and in the zip file I also included 10 snvs with artefactual patterns and the stats.tsv file from the run that filtered them.

I hope this helps! Please let me know if you need more information and thank you again!
0

Comment actions Permalink
Bhanu Gandham

March 10, 2020 21:10
cmartinezruiz

David Benjamin is looking into this and will get back to you shortly.
0

Comment actions Permalink
cmartinezruiz

April 29, 2020 10:33
Hello David Benjamin Bhanu Gandham,

Has there been any progress on this front by any chance? I did manage a way around which consist in restricting the variant call exclusively to exons and removing any variants with no actual coverage in the BAM file itself. While I manage to get some variants this way, it is still not ideal, and I am a bit worried that some artefacts might have escaped the filtering.Please let me know if any progress has been done here, it would be great to be able to apply Mutect2 to RNAseq samples!
0

Comment actions Permalink
cmartinezruiz

May 01, 2020 15:37
David Benjamin I also noticed that a very large proportion of false negatives were flagged as "orientation". I checked by looking at mutations that we know are expressed and that therefore should be picked up by Mutect2. So I imagine that the orientation model is not fit for RNAseq?
0

Comment actions Permalink
Bhanu Gandham

May 04, 2020 14:03
HI cmartinezruiz

Due to the coronovirus situation our team is short staffed. It might take us sometime to get back to you. I apologize for the inconvenience.

Thank you for your patience.
0

Comment actions Permalink
cmartinezruiz

May 05, 2020 07:59
Yes, of course! No worries
0

Comment actions Permalink
David Benjamin

May 06, 2020 06:16
The last two months have been. . . interesting. . . but I have finally started looking at your data. I hesitate to make any promises.
0

Comment actions Permalink
David Benjamin

May 11, 2020 05:05
cmartinezruiz Of your 10 examples on chr1, all but one go away if you turn on the supplementary alignment read filter with "-read-filter NotSupplementaryAlignmentReadFilter". The one that remains, chr1:10189509, has a primary alignment soft-clipped almost exactly in the middle of its read pairs. That is, the primary alignments supporting the artifact look a lot like supplementary alignments.

Since we're trying to develop RNA best practices and since we're not RNA specialists ourselves, I have a question for you. Would you expect filtering all supplementary alignments to cause a loss of sensitivity, or can we simply turn on the filter and go about our business?
0

Comment actions Permalink
cmartinezruiz

May 11, 2020 14:11
David Benjamin Unfortunately I am not a specialist in using RNA for detecting variants either. I have been testing it on some of the samples I am running. By adding the filter "-read-filter NotSupplementaryAlignmentReadFilter" I do lose about half of the raw variants, but after applying additional filtering the number of variants I trust is similar to that I had before. It might be worth losing a few true positives if it means removing most of the noise.

Relating to a previous comment I made, I have seen that many true positives often get filtered out under the tags "orientation", as well as "haplotype" (which I guess is a consequence of those being in the same haplotype of those filtered out by "orientation"), and sometimes also "fragment". I am using those tags to recover some false negatives, but I wonder whether this information could help you understand what is going on at a deeper level.
0

Comment actions Permalink
David Benjamin

May 14, 2020 06:24
It sounds as if turning on that read filter is very much worthwhile and won't hurt sensitivity much, if at all.

"fragment" gets triggered when the evidence for the alt allele occurs primarily or exclusively in reads with very distant mates, by default 10,000 bases away. While this can happen in the presence of genuine structural variation, it's usually a sign of mapping error. If this is causing false negatives I suspect that filtering supplementary alignment reads will fix it. But more likely, I think, is that it is filtering correctly.

The orientation filtering is more puzzling. The pipeline you mentioned above goes straight from Mutect2 to FilterMutectCalls. Are you running LearnReadOrientationModel in between?
0

Comment actions Permalink
cmartinezruiz

May 14, 2020 09:05
Ah yes, sorry I should have mentioned that. I did add the LearnReadOrientationModel to the pipeline, although given the false negatives I am getting I was considering skipping that step.
0

Comment actions Permalink
David Benjamin

May 14, 2020 13:14
You can usually get away with skipping the orientation bias model if your sample is not FFPE, which of course for RNA it wouldn't be. Nonetheless, I would be interested in investigating. Do you still get the orientation false negative when filter supplementary alignments? Could you share the ob-priors file that went into FilterMutectCalls? And could you post a random sampling of VCF lines with the filter eg "grep orientation filtered.vcf | head -n 20"?
0

Comment actions Permalink
cmartinezruiz

May 15, 2020 16:26
I still get the false negative orientation, even when using "-read-filter NotSupplementaryAlignmentReadFilter". So I think I will just skip the orientation bias model as you suggest. I have uploaded 20 variants with the tag orientation only and the ob-priors model to the ftp server under the name to_gatk_rna_mutect2_orientation_artefact.zip
0

Comment actions Permalink
David Benjamin

June 01, 2020 05:38
cmartinezruiz I finally had the time to look at your filtered variants. In every case every read, both alt and ref, had the same orientation. Some variants were all F1R2, some were all F2R1. (F1R2, by the way, means a read that comes from a fragment in which read 1 comes from the forward strand and read 2 from the reverse strand). Normally such orientation bias is an extremely compelling signature of an artifact, but in your case the ref allele has the same signature.

Can you think of a reason why your data has this pattern? Otherwise, could you briefly describe your sequencing protocol or point us to documentation? The most relevant aspects are fragmentation, ligating adapters, and the sequencing platform used.

Regardless of why, given that your data looks this way I think it is safe to say that the F1R2/F2R1 signal the filter looks for is meaningless here, and so skipping this filter is correct.
0

Comment actions Permalink
cmartinezruiz

June 11, 2020 15:41
Thank you very much for the explanation David Benjamin, it is very informative.

The library preparation we used for RNAseq is stranded. This means that each strand is amplified separately during the cDNA generation step. I suspect those separate strands are then sequenced separately to keep strandedness information.

Could that be causing the observed pattern? If so, this will be a common pattern in many RNA seq datasets.
0

Comment actions Permalink
David Benjamin

June 15, 2020 01:55
Aha! I think we are ready to understand the problem. The orientation bias model detects errors that appeared on one strand only of double-stranded DNA at the time of adapter ligation, before amplification. That is, it assumes double-stranded DNA fragments with adapters where one strand has a substitution error and the other strand doesn't. If strands are amplified separately, without forked adapters to join their fates, the model doesn't apply, and furthermore no model could detect an error. The signature is two strands that don't base-pair properly; a single-strand can't have this signature!

Therefore, it's safe to say: do not run the read orientation filter unless you have double-stranded DNA at the time PCR adapters are attached and the two strands are amplified together.

As you correctly understood, this does not mean that the model never applies to RNA, because the model is suitable for detecting substitution errors that occur on a single strand of a cDNA library. However, it does mean that part of RNA best practices is to not turn on this model blindly.

Thank you very much cmartinezruiz for working this out with me. Bhanu Gandham you may be interested in this thread's conclusions.
1

Comment actions Permalink
WangZiwei

August 08, 2020 17:56

Edited
Hi David Benjamin

Could you please help explain the filteringStats.tsv file, the output of 'FilterMutectCalls' tool?

Or is there something like a detailed description of this output?

Currently I have no idea how to take advantge of this file to improve mutation filtering.

Many thanks.

Wang Ziwei
0

Comment actions Permalink
Genevieve Brandt (she/her)

August 10, 2020 14:53
Hi WangZiwei,

The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.

For context, check out our support policy.
0

Comment actions Permalink