Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Confusing Mutect2 F1R2 (_f1r2.tar.gz) outputs

0

7 comments

  • Avatar
    lreska3m

    As a follow-up, I'm also getting similar results when I run GetPileupSummaries (for input into CalculateContamination); here is the relevant command:

    gatk GetPileupSummaries \

    -I $bam1 \

    -V $sites \

    -L $sites \

    -O $out.table

    Here, $sites refers to "small_exac_common_3.hg38.vcf.gz", which I downloaded from the GATK Google bucket as instructed in this forum post. I'm using the "af-only-gnomAD.vcf.gz" reference for input into Mutect2 itself (referenced as $gnomAD in my original post above), if that matters.

    Representative output:

    #<METADATA>SAMPLE=001

    contig position ref_count alt_count other_alt_count allele_frequency

    1 17365 0 0 0 0.136

    1 17385 0 0 0 0.122

    1 69761 0 0 0 0.113

    1 187887 0 0 0 0.136

    1 187907 0 0 0 0.122

    1 962358 0 0 0 0.08

    1 963087 0 0 0 0.052

    1 973034 0 0 0 0.058

    The "ref_count", "alt_count", and "other_alt_count" column entries are all 0s. This happens for every sample.

    I should note that because I aligned these BAMs to the GRCh38 reference, I had to modify the hg38-derived sites files to strip the "chr" prefixes from each contig (and reindex the files accordingly). I'm not sure if that could be the cause of these discrepancies. I am, however, still getting a large number of variant calls per sample, so that at least seems to be fine.

    If anyone has any ideas why these outputs are full of zeros, please let me know! Thanks.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi lreska3m

    Where did you get the hg38 reference genome? Are you using the one from our resource bundle?

    0
    Comment actions Permalink
  • Avatar
    lreska3m

    Hello Gökalp Çelik,

    Thanks for reaching out. Let me try to clarify:

    • I aligned my data to the GRCh38 reference genome, using the annotation GTF from Ensembl release 111.
    • I downloaded both "af-only-gnomad.hg38.vcf.gz" and "small_exac_common_3.hg38.vcf.gz" from https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38 (following advice in this post and this post).
    • When I tried to run Mutect2 and Calculate Contamination as above, I received errors due to incompatible contig names. I then modified these files to match my reference genome by stripping the "chr" prefixes from each contig and reindexing the files with GRCh38-compliant chromosome names. Both tools ran successfully after this modification, but they produced the 0-filled output tables shown above.

    I have two questions, essentially:

    1. Is there some basic reason why these tools would produce output tables full of 0s? I realize I may have overlooked some compatibility issue when I modified the hg38-derived files.
    2. Alternatively, is there a GRCh38-compliant version of "af-only-gnomAD.vcf.gz" and "small_exac_common_3.vcf.gz" in the GATK resource bundle? I would prefer to keep aligning to the GRCh38 reference as that is my current pipeline standard, if possible.

    Any insights would be greatly appreciated! Thanks.

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    Hi again. 

    Our resource bundle reference genomes and other accessory files are set to be compatible 100% for all tasks that we publish. On the other hand other reference genomes may have incompatible reference nucleotides, masked/unmasked duplicate regions where some of your reads could easily be siphoned into. Therefore we recommend using our own reference genomes or if you are using your preferred reference genome make sure that you convert our resource files to become compatible with your reference genome. 

    To understand better we may need to check the sequence dictionary of your reference genome. If you can post it here it would be great.

    Especially for those 0 scoring pileup sites can you share a couple of them from IGV views so that we can see what is going on. 

    One more thing. Is this a RNAseq experiment? Which aligner are you using? There are certain things related to using different aligners other than BWA MEM or dragmap. STAR aligner uses 255 MAPQ score for unique mapping reads but in reality that score means no map therefore STAR aligned bam files must be converted to an usable form by our tool named SplitNCigarReads otherwise you will not get any results out of your bam files. If you are using bowtie2 for alignment GetPileupSummaries uses MAPQ 50 as minimum therefore you will only get 0s since bowtie2 has a maximum MAPQ of 44. Mutect2 also does not report F1R2 summaries for sites with median MAPQ less than 50. 

    We may need answers for all of these questions but I will focus on the last one since that feels much likely to be the cause of it. 

    I hope this helps. 

    1
    Comment actions Permalink
  • Avatar
    lreska3m

    Hi again,

    Thank you for explaining! This makes perfect sense—I used bowtie2 to align my WES data, so that is probably why both the F1R2 summaries and GetPileupSummaries results are filled with 0s. I didn't realize the MAPQ cutoffs were set that high by default; I appreciate you bringing this to my attention.

    A few last clarifying questions (to make sure I am interpreting the documentation correctly):

    1. For Mutect2: is it correct to say that --f1r2-median-mq (default 50) excludes all reads mapping to a given site if their median MAPQ as a group is < 50, while --f1r2-min-bq (default 20) excludes only reads mapping to a given site if the Phred score of the individual base at that locus for that particular read is < 20?

    2. For GetPileupSummaries: is it correct to say that --min-mapping-quality (default 50) would exclude only reads whose MAPQ is < 50?

    Many thanks!

    0
    Comment actions Permalink
  • Avatar
    Gökalp Çelik

    For both questions the answer is yes you are absolutely right. Both f1r2-median-mq and --min-mapping-quality parameters require eligible sites to have median MAPQ 50 or above and eligible reads to have MAPQ 50 or above. Anything below 50 is filtered out. 

    So your problem can be solved by switching to BWA MEM if it is fine for your case. If you wish to stick with bowtie2 you may need to give up on those parameters used for filtering Mutect2 calls.

    I hope this helps. 

    2
    Comment actions Permalink
  • Avatar
    lreska3m

    Okay, I understand. Thanks again for the explanation!

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk