I am trying to identify variants of tumour paired-end RNA-seq data using Mutect2. Please ignore more advanced operations such as using tumour and normal matched sample pair or PoN creation. I will use a simple example here.
I used MACS2 to map a pair of fastq files of R1 and R2 on the genome (I used CDS sequences as 'genome', thus MACS2 might able to be used for RNA-seq) and generated a bam file containing paired-end reads. As you can see in the image below it has many variants relative to the genome (e.g. BAM image in IGV: changed from G to A at position 156).
But when I try to use mutant2 to identify variants such as position 156 (covering the entire genome)：
java -Xmx1t -jar gatk-package-18.104.22.168-local.jar Mutect2 -R current.genome.fa -I cancer_T1.sorted.bam --native-pair-hmm-threads 50 --max-mnp-distance 0 -O test_cancer_N1.sorted.vcf.gz --bam-output test.bam
The output is 0 variant found:
$ zcat test_cancer_N1.sorted.vcf.gz | head
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions of alternate alleles in the tumor">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=FAD,Number=R,Type=Integer,Description="Count of fragments supporting each allele.">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
$ zcat test_cancer_N1.sorted.vcf.gz | tail
##filtering_status=Warning: unfiltered Mutect 2 calls. Please run FilterMutectCalls to remove false positives.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT cancer_N1
All other paired-end RNA-seq data, regardless of species, produced almost no variant results like the above example. When splitting R1 and R2 into single-end and calling variants separately, a large number of variant results will be produced for each of them.
I don't understand what I'm doing wrong with paired-end RNA-seq data, or that Mutect2 was not originally designed for paired-end RNA-seq data.
I know it is wrong to split paired-end into two single-ends and run them separately, it will miss a lot of information. Ask for any valid suggestions to get paired-end RNA-seq data working properly in Mutect2.
Please sign in to leave a comment.