Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Artefactual calls using Mutect2 on RNAseq data

Answered
0

25 comments

  • Avatar
    David Benjamin

    cmartinezruiz I don't see anything wrong with your pipeline.  I will need a few things for hopefully begin guessing what the issue is.

    First, could you post about 10 VCF lines of representative false positives (more if you want, but only false positives) from the output of FilterMutectCalls?

    Second, FilterMutectCalls automatically generates a small plain-text filtering stats file.  Could you post the contents of that file here?

    Finally, for at least one false positive could you re-run Mutect2 in a 1000-base interval surrounding the variant with the `-bamout out.bam` added to your command, and then show an IGV screenshot of out.bam along with the RNA-seq sample, but not the normal?

    If you want, you can start with the first two things.  Maybe some pattern will stand out before we even need to make a bamout.

    1
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Thanks for you quick response David Benjamin! Here are 10 artefactual variants (I kept only biallelic SNVs):

    chr1	900339	.	C	G	.	PASS	CONTQ=93;DP=353;ECNT=1;GERMQ=93;MBQ=37,32;MFRL=196,624;MMQ=60,60;MPOS=2;NALOD=0.814;NLOD=89.05;POPAF=6;ROQ=21;SEQQ=44;STRANDQ=84;TLOD=9.89	GT:AD:AF:DP:F1R2:F2R1:SB	0/1:2,4:0.625:6:0,0:2,2:2,0,2,2	0/0:325,1:0.00628:326:101,0:213,1:273,52,1,0
    chr1 8923293 . C T . PASS CONTQ=93;DP=873;ECNT=1;GERMQ=93;MBQ=37,32;MFRL=180,774;MMQ=60,60;MPOS=3;NALOD=2.86;NLOD=212.9;POPAF=6;ROQ=9;SEQQ=81;STRANDQ=50;TLOD=14.09 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:8,5:0.408:13:6,5:0,0:6,2,1,4 0/0:834,0:0.001373:834:176,0:643,0:452,382,0,0
    chr1 12371983 . G A . PASS CONTQ=93;DP=360;ECNT=2;GERMQ=93;MBQ=37,32;MFRL=220,2537;MMQ=60,60;MPOS=7;NALOD=2.5;NLOD=94.4;POPAF=6;ROQ=17;SEQQ=93;STRANDQ=93;TLOD=140.24 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB 0|1:1,33:0.943:34:0,0:1,33:0|1:12371983_G_A:12371983:1,0,10,23 0|0:326,0:0.003121:326:56,0:256,0:0|1:12371983_G_A:12371983:28,298,0,0
    chr1 12383761 . A G . PASS CONTQ=93;DP=423;ECNT=1;GERMQ=93;MBQ=37,33;MFRL=191,4977;MMQ=60,60;MPOS=1;NALOD=2.56;NLOD=107.69;POPAF=6;ROQ=24;SEQQ=93;STRANDQ=50;TLOD=14.38 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:0,4:0.833:4:0,0:0,4:0,0,4,0 0/0:406,1:0.002753:407:63,0:336,0:349,57,0,1
    chr1 15959932 . A T . PASS CONTQ=93;DP=437;ECNT=1;GERMQ=93;MBQ=37,28;MFRL=190,9393;MMQ=60,60;MPOS=1;NALOD=2.58;NLOD=113.62;POPAF=6;ROQ=38;SEQQ=9;STRANDQ=44;TLOD=5.95 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:0,2:0.782:2:0,0:0,2:0,0,2,0 0/0:420,0:0.002597:420:63,0:347,0:341,79,0,0
    chr1 16054854 . T G . PASS CONTQ=93;DP=383;ECNT=1;GERMQ=93;MBQ=37,22;MFRL=227,840;MMQ=60,60;MPOS=3;NALOD=2.51;NLOD=96.99;POPAF=6;ROQ=9;SEQQ=93;STRANDQ=93;TLOD=59.62 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:10,15:0.633:25:0,0:9,15:5,5,4,11 0/0:358,0:0.003061:358:70,0:257,0:76,282,0,0
    chr1 17250975 . G C . PASS CONTQ=93;DP=286;ECNT=1;GERMQ=93;MBQ=37,26;MFRL=171,6399;MMQ=60,60;MPOS=4;NALOD=2.38;NLOD=70.13;POPAF=6;ROQ=30;SEQQ=5;STRANDQ=65;TLOD=5.46 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:0,2:0.75:2:0,0:0,2:0,0,1,1 0/0:272,0:0.004165:272:55,0:214,0:134,138,0,0
    chr1 19523631 . T A . PASS CONTQ=93;DP=925;ECNT=1;GERMQ=93;MBQ=37,38;MFRL=186,4579;MMQ=60,60;MPOS=1;NALOD=2.89;NLOD=232.81;POPAF=6;ROQ=20;SEQQ=16;STRANDQ=59;TLOD=7.05 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:1,2:0.6:3:1,2:0,0:1,0,1,1 0/0:892,0:0.001273:892:147,0:731,0:599,293,0,0
    chr1 20987373 . C T . PASS CONTQ=93;DP=289;ECNT=1;GERMQ=93;MBQ=37,37;MFRL=181,5029;MMQ=60,60;MPOS=1;NALOD=2.38;NLOD=70.94;POPAF=6;ROQ=11;SEQQ=67;STRANDQ=36;TLOD=12.44 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:4,4:0.555:8:4,4:0,0:1,3,0,4 0/0:272,0:0.004152:272:61,0:207,0:170,102,0,0
    chr1 22150612 . A C . PASS CONTQ=93;DP=176;ECNT=1;GERMQ=93;MBQ=37,25;MFRL=193,591;MMQ=60,60;MPOS=3;NALOD=2.16;NLOD=41.6;POPAF=6;ROQ=16;SEQQ=93;STRANDQ=47;TLOD=17.58 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:10,8:0.453:18:8,8:1,0:3,7,7,1 0/0:154,0:0.006865:154:34,0:114,0:120,34,0,0

    And here are the contents of the .stats file:

    statistic	value
    callable 5.2029427E7

    In the meantime I will generate the out.bam file. Thanks again!

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    cmartinezruiz I should have been more clear -- I need the .filteringStats.tsv file produced as an output by FilterMutectCalls, not the Mutect stats that it requires as an input.  But in the meantime those VCF records will give me something to study.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Two things jump out to me:  First, the MPOS (median distance from the end of the read) is very low for all the false positives.  We are usually very conservative about filtering based on that but for RNA seq it may be appropriate to be more aggressive than the DNA seq default.  Although we don't like to involve ourselves much with arbitrary hard filters, it may be correct here to use a larger value of `--min-median-read-position`, like 5.

    The other big signal is that the MFRL (median fragment length) is around 200 for reference alleles and much larger for alt alleles.  This could just be due to different splicing, but it's suspicious.

    It can't hurt to send the filtering stats, but based on these signatures IGV screenshots, with the original bams and the bamouts, and sorted by the bamout base, will be most helpful.  It may be useful to show it zoomed in to 100 bases, 300 bases, and 1000 bases.  The loci I would most like to see are:

    chr1	12371983	.	G	A
    chr1	15959932	.	A	T
    chr1	900339	.	C	G


    1
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Sorry, here are the contents of the stats file:

    #<METADATA>Ln prior of deletion of length 10=-16.668707731660547
    #<METADATA>Ln prior of deletion of length 9=-16.668707731660547
    #<METADATA>Ln prior of deletion of length 8=-15.821409871273344
    #<METADATA>Ln prior of deletion of length 7=-16.668707731660547
    #<METADATA>Ln prior of deletion of length 6=-16.668707731660547
    #<METADATA>Ln prior of deletion of length 5=-15.570095442992438
    #<METADATA>Ln prior of deletion of length 4=-15.975560551100603
    #<METADATA>Ln prior of deletion of length 3=-15.128262690713399
    #<METADATA>Ln prior of deletion of length 2=-15.570095442992438
    #<METADATA>Ln prior of deletion of length 1=-15.128262690713399
    #<METADATA>Ln prior of SNV=-11.425198601607507
    #<METADATA>Ln prior of insertion of length 1=-13.797028106776535
    #<METADATA>Ln prior of insertion of length 2=-14.771587746774667
    #<METADATA>Ln prior of insertion of length 3=-14.240959495712497
    #<METADATA>Ln prior of insertion of length 4=-14.471483154324329
    #<METADATA>Ln prior of insertion of length 5=-14.631825804399508
    #<METADATA>Ln prior of insertion of length 6=-14.994731298088876
    #<METADATA>Ln prior of insertion of length 7=-15.128262690713399
    #<METADATA>Ln prior of insertion of length 8=-14.876948262432494
    #<METADATA>Ln prior of insertion of length 9=-15.202370662867121
    #<METADATA>Ln prior of insertion of length 10=-15.128262690713399
    #<METADATA>High-AF beta-binomial cluster=weight = 0.0285, alpha = 10.21, beta = 0.87
    #<METADATA>Background beta-binomial cluster=weight = 0.0354, alpha = 1.38, beta = 0.69
    #<METADATA>Binomial cluster 1=weight = 0.2941, mean = 0.874
    #<METADATA>Binomial cluster 1=weight = 0.2606, mean = 0.967
    #<METADATA>Binomial cluster 1=weight = 0.1747, mean = 0.755
    #<METADATA>Binomial cluster 1=weight = 0.1062, mean = 0.928
    #<METADATA>Binomial cluster 1=weight = 0.0719, mean = 0.597
    #<METADATA>Binomial cluster 1=weight = 0.0357, mean = 0.472
    #<METADATA>Binomial cluster 1=weight = 0.0198, mean = 0.318
    #<METADATA>Binomial cluster 1=weight = 0.0162, mean = 0.151
    #<METADATA>Binomial cluster 1=weight = 0.0115, mean = 0.634
    #<METADATA>Binomial cluster 1=weight = 0.0055, mean = 0.425
    #<METADATA>Binomial cluster 1=weight = 0.0038, mean = 0.052
    #<METADATA>threshold=0.507
    #<METADATA>fdr=0.018
    #<METADATA>sensitivity=0.984
    filter FP FDR FN FNR
    weak_evidence 44.26 0.01 51.86 0.01
    strand_bias 0.32 0.0 0.04 0.0
    normal_artifact 0.01 0.0 0.0 0.0
    orientation 24.69 0.01 6.76 0.0
    slippage 0.0 0.0 0.0 0.0
    haplotype 3.59 0.0 0.32 0.0
    germline 0.0 0.0 0.0 0.0
    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Here are a couple screen shots of the RNAseq bam used (top track) and the out.bam file for that region. The SNV in the screenshot is the first of the list of 10 SNVs I sent in my earlier message

     

    Please let me know if you need more info. Thanks again!

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    Hi cmartinezruiz

    David Benjamin is looking into this issue for you. Would you please provide a subset of the bam with the problematic variants to help David debug this issue you are facing. Also please subset the bam with 1000bps up and down stream of the problematic variants.

    You can find info on how to send the data to us here: https://gatk.zendesk.com/hc/en-us/articles/360035889671

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Hello David Benjamin and Bhanu Gandham,

    I have uploaded a file to the ftp server called 'to_gatk_rna_mtect2_artefact.zip'

    This zip file contains the following documents:

    gatk_subsample_focal_1000bp_rna.bam: The requested subsampled BAM file to 1000bp of each variant. This BAM file is for the RNAseq focal sample. Please note that the positions I used for sub-sampling are for all variants, not just the artefactual ones (if I knew how to separate artefactual variants from real ones I would not have this problem), but the vast majority of variants seem to be artefactual.

    gatk_subsample_gl_1000bp_dna.bam: The subsampled BAM file to 1000bp of each variant for the control DNA germline.

    mutect2_gatk_subsample.sh: The commands I used to get the variants.

    Due to unrelated issues I had to re-install GATK with the latest version, which changed somewhat the results I posted in this thread initially (which I guess already suggests there is something wrong with these variants, as I wouldn't think changing GATK versions should alter the results too much). I re-ran everything on GATK v4.1.5.0, and in the zip file I also included 10 snvs with artefactual patterns and the stats.tsv file from the run that filtered them.

    I hope this helps! Please let me know if you need more information and thank you again!

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    cmartinezruiz

     

    David Benjamin is looking into this and will get back to you shortly.

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Hello David Benjamin Bhanu Gandham,

    Has there been any progress on this front by any chance? I did manage a way around which consist in restricting the variant call exclusively to exons and removing any variants with no actual coverage in the BAM file itself. While I manage to get some variants this way, it is still not ideal, and I am a bit worried that some artefacts might have escaped the filtering.Please let me know if any progress has been done here, it would be great to be able to apply Mutect2 to RNAseq samples!

     

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    David Benjamin  I also noticed that a very large proportion of false negatives were flagged as "orientation". I checked by looking at mutations that we know are expressed and that therefore should be picked up by Mutect2. So I imagine that the orientation model is not fit for RNAseq?

    0
    Comment actions Permalink
  • Avatar
    Bhanu Gandham

    HI cmartinezruiz

     

    Due to the coronovirus situation our team is short staffed. It might take us sometime to get back to you. I apologize for the inconvenience.

    Thank you for your patience. 

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Yes, of course! No worries

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    The last two months have been. . . interesting. . . but I have finally started looking at your data.  I hesitate to make any promises.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    cmartinezruiz Of your 10 examples on chr1, all but one go away if you turn on the supplementary alignment read filter with "-read-filter NotSupplementaryAlignmentReadFilter".  The one that remains, chr1:10189509, has a primary alignment soft-clipped almost exactly in the middle of its read pairs.  That is, the primary alignments supporting the artifact look a lot like supplementary alignments.

    Since we're trying to develop RNA best practices and since we're not RNA specialists ourselves, I have a question for you.  Would you expect filtering all supplementary alignments to cause a loss of sensitivity, or can we simply turn on the filter and go about our business?

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    David Benjamin Unfortunately I am not a specialist in using RNA for detecting variants either. I have been testing it on some of the samples I am running. By adding the filter "-read-filter NotSupplementaryAlignmentReadFilter" I do lose about half of the raw variants, but after applying additional filtering the number of variants I trust is similar to that I had before. It might be worth losing a few true positives if it means removing most of the noise.

    Relating to a previous comment I made, I have seen that many true positives often get filtered out under the tags "orientation", as well as "haplotype" (which I guess is a consequence of those being in the same haplotype of those filtered out by "orientation"), and sometimes also "fragment". I am using those tags to recover some false negatives, but I wonder whether this information could help you understand what is going on at a deeper level.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    It sounds as if turning on that read filter is very much worthwhile and won't hurt sensitivity much, if at all.

    "fragment" gets triggered when the evidence for the alt allele occurs primarily or exclusively in reads with very distant mates, by default 10,000 bases away.  While this can happen in the presence of genuine structural variation, it's usually a sign of mapping error.  If this is causing false negatives I suspect that filtering supplementary alignment reads will fix it.  But more likely, I think, is that it is filtering correctly.

    The orientation filtering is more puzzling.  The pipeline you mentioned above goes straight from Mutect2 to FilterMutectCalls.  Are you running LearnReadOrientationModel in between?

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Ah yes, sorry I should have mentioned that. I did add the LearnReadOrientationModel to the pipeline, although given the false negatives I am getting I was considering skipping that step.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    You can usually get away with skipping the orientation bias model if your sample is not FFPE, which of course for RNA it wouldn't be.  Nonetheless, I would be interested in investigating.  Do you still get the orientation false negative when filter supplementary alignments?  Could you share the ob-priors file that went into FilterMutectCalls?  And could you post a random sampling of VCF lines with the filter eg "grep orientation filtered.vcf | head -n 20"?

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    I still get the false negative orientation, even when using "-read-filter NotSupplementaryAlignmentReadFilter". So I think I will just skip the orientation bias model as you suggest. I have uploaded 20 variants with the tag orientation only and the ob-priors model to the ftp server under the name to_gatk_rna_mutect2_orientation_artefact.zip

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    cmartinezruiz I finally had the time to look at your filtered variants.  In every case every read, both alt and ref, had the same orientation.  Some variants were all F1R2, some were all F2R1.  (F1R2, by the way, means a read that comes from a fragment in which read 1 comes from the forward strand and read 2 from the reverse strand).  Normally such orientation bias is an extremely compelling signature of an artifact, but in your case the ref allele has the same signature.

    Can you think of a reason why your data has this pattern?  Otherwise, could you briefly describe your sequencing protocol or point us to documentation?  The most relevant aspects are fragmentation, ligating adapters, and the sequencing platform used.

    Regardless of why, given that your data looks this way I think it is safe to say that the F1R2/F2R1 signal the filter looks for is meaningless here, and so skipping this filter is correct.

    0
    Comment actions Permalink
  • Avatar
    cmartinezruiz

    Thank you very much for the explanation David Benjamin, it is very informative.

    The library preparation we used for RNAseq is stranded. This means that each strand is amplified separately during the cDNA generation step. I suspect those separate strands are then sequenced separately to keep strandedness information.

    Could that be causing the observed pattern? If so, this will be a common pattern in many RNA seq datasets.

    0
    Comment actions Permalink
  • Avatar
    David Benjamin

    Aha!  I think we are ready to understand the problem.  The orientation bias model detects errors that appeared on one strand only of double-stranded DNA at the time of adapter ligation, before amplification.  That is, it assumes double-stranded DNA fragments with adapters where one strand has a substitution error and the other strand doesn't.  If strands are amplified separately, without forked adapters to join their fates, the model doesn't apply, and furthermore no model could detect an error.  The signature is two strands that don't base-pair properly; a single-strand can't have this signature!

    Therefore, it's safe to say: do not run the read orientation filter unless you have double-stranded DNA at the time PCR adapters are attached and the two strands are amplified together.

    As you correctly understood, this does not mean that the model never applies to RNA, because the model is suitable for detecting substitution errors that occur on a single strand of a cDNA library.  However, it does mean that part of RNA best practices is to not turn on this model blindly.  

    Thank you very much cmartinezruiz for working this out with me.  Bhanu Gandham you may be interested in this thread's conclusions.

    1
    Comment actions Permalink
  • Avatar
    WangZiwei

    Hi David Benjamin

    Could you please help explain the filteringStats.tsv file, the output of 'FilterMutectCalls' tool?

    Or is there something like a detailed description of this output?

    Currently I have no idea how to take advantge of this file to improve mutation filtering.

    Many thanks.

    Wang Ziwei

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi WangZiwei,

    The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.

    Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.

    We cannot guarantee a reply, however, we ask other community members to help out if you know the answer.

    For context, check out our support policy.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk