SNP analysis returning way too many variants
Hi there
I have been running the GATK pipeline (from docker:latest_release) on a set of WES samples, but I am afraid it is returning way too many variants in the end. The sequencing is performed by AmpliSeq panel from IonTorrent, and I am using the output .bam file to run the subsequent analysis. The pipeline running goes as follows:
1. Running Mutect2
Mutect2 -R hg19.fasta -I sample1.bam -O sample1_unfiltered.vcf --f1r2-tar-gz sample1_f1r2.tar.gz -pon Mutect2_hg19_exome.vcf --germline-resource af-only-gnomad_IT.raw.sites.b37.vcf.gz -L AmpliSeqExome.bed
Here, I opted to use the -pon and --germline-resource options, which were retrieved from a PoN panel devised by Broad and gnomAD, respectively. The BED file is provided by the sequencer.
2. Learn orientation bias
Since I am using [FFPE](https://software.broadinstitute.org/gatk/blog?id=23400) samples, this step seems neccessary.
gatk LearnReadOrientationModel -I sample1_f1r2.tar.gz -O sample1_ROM_model.tar.gz
3. Tabulating pileup metrics for contamination inference
Now, running GetPileupSummaries. Info here.
gatk GetPileupSummaries -I sample1.bam -V ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -L ExAC_hg19_IonTorrent_BiallelicOnly.r1.sites.vep.vcf.gz -O sample1_pileups.table
4. Estimating sample contamination
Here, the fraction of reads coming from cross-sample contamination is calculated, given the output from GetPileupSummaries.
gatk CalculateContamination -I sample1_pileups.table -tumor-segmentation sample1_segments.table -O sample1_calcuContamination.table
5. Filtering somatic SNV's and indels
gatk FilterMutectCalls -R hg19.fasta -V sample1_unfiltered.vcf --tumor-segmentation sample1_segments.table --contamination-table sample1_calcuContamination.table --ob-priors sample1_ROM_model.tar.gz -O sample1_Filtered.vcf
After that, from the resuting output sample1_Filtered.vcf, I use vcftools to retrieve only those PASS variants:
vcftools --vcf sample1_Filtered.vcf --out sample1_FiltPASS --remove-filtered-all --recode
The resulting vcf files shows 72,358 variants.
When trying using the recommended VQSR/VQSLOD functions for filtering SNP's and INDEL's, the result is even greater, retrieving 412,000 variants.
And of course, when running the Funcotator is equally large.
The issue with that is mainly I am sure there might be something wrong with this pipeline I put together (maybe I have added some redundant steps, or filtering for the wrong criteria). And the other problem is, even if using hard filtering of this sort of output is hard to know where to begin with.
If you have experienced something like that before, or see something awfully wrong on this pipeline, I will be very glad to have your feedback either way. (:
Thanks a lot!
-
Hi dodausp,
Thank you for the thorough post regarding your pipeline! Could you please provide more information about your expected results so that we can have a better idea of where the problem is occurring? I would recommend getting the -bamout from Mutect2 and looking at the results in IGV to get some specific examples. We also have some resources that you might find helpful for troubleshooting:
- When HaplotypeCaller and Mutect2 do not call an expected variant
- (How to) Call somatic mutations using GATK4 Mutect2
Best,
Genevieve
-
Make sure that vcf tools is actually applying your filters.
In high depth samples, lots of low quality variants creep in:
Please sign in to leave a comment.
2 comments