In case there is no vcf file from genome project, is it still possible to use the GetPileupSummaries?
I am running a Somatic variant discovery protocol on plant samples.
Currently at the GetPileupSummaries stage.
This stage requires a common germline variant sites VCF file.
However, there is no such data for studied genome.
1) Is there a way to run GetPileupSummaries without this file?
* It seems that result from GetPileupSummaries is required for the next stage of CalculateContamination, and results form the latter is used by FilterMutectCalls. This makes GetPileupSummaries essential in this piepline.
2) What is the best way to generate such vcf?
Would it make sense to use vcf resulting from Mutect2 (although it is not germlined based)?
If so, could not use the vcf produced by Mutect2 for GetPileupSummaries as GetPileupSummaries exits with ERROR.
a) GATK version used: gatk4-v22.214.171.124
b) Commands used:
c) Entire program log:
1. Running Mutect2 to find somatic mutations in a plant.
Used "Tumor-only mode":
gatk Mutect2 -R Ref_genome.fa -I PlantSample1.bam -O PlantSample1.vcf.gz
[This command made also these 2 files: PlantSample1.vcf.gz.stats and PlantSample1.vcf.gz.tbi].
2. Running GetPileupSummaries
gatk GetPileupSummaries -I PlantSample1.bam -V PlantSample1.vcf.gz -L PlantSample1.vcf.gz -O PlantSample1_pileups.table
Exit on Error: "A USER ERROR has occurred: Bad input: Population vcf does not have an allele frequency (AF) info field in its header."
It looks like GetPileupSummaries requires AF input in this format: "AF=0.063" inside the info column of the vcf file.
[based on https://gatk.broadinstitute.org/hc/en-us/articles/360037593451-GetPileupSummaries]
However, inspecting vcf file generated by Mutect2, it seems that it contains AF in a different format
GT:AD:AF:DP:F1R2:F2R1:SB in "FORMAT" coulmn
0/1:0,1:0.667:1:0,0:0,1:0,0,1,0 in the last column
Thus, in this vcf file generated by Mutect2 AF is 0.667.
However there is no explicit "AF=" in the file.
I assume this is the reason GetPileupSummaries exit on this error: "Population vcf does not have an allele frequency (AF) info field in its header.".
Thank you for your help in advance,
Data from gatk protocols:
Hi Arye Harel,
Thank you for writing to the GATK forum! Before I bring this to our developers, can you confirm that what you posted is your entire program log? If not, please include it in its entirety.
I look forward to hearing back from you!
The entire log is very long, I present the important details.
More important- I am not sure it makes sense to use vcf resulting from Mutect2.
Thus the important questions are:
* Is there a way to run GetPileupSummaries without this file?
* What is the best way to generate such vcf?
* Considering we are using "Tumor-only mode" without VCF of common germline variant sites (or a PoN), is it still more correct to use Mutect2 over HaplotypeCaller to identifying somatic mutations?
Thank you for your help,
Hi Arye Harel,
Thank you for your much-appreciated patience while I reviewed your inquiry with our developers. I received some feedback to share with you.
The short answer is no regarding your question on running GetPileupSummaries without the vcf. The vcf is not optional and is required to run the tool.
Unfortunately, your Mutect2 output will not help identify somatic mutations in your case. Running in tumor-only mode will output a ton of germline mutations, not somatic ones.
The workaround our developers suggest is to use HaplotypeCaller instead.
1. You'll need to run the tool with multiple samples (between 20-50).
2. Set your allele frequency cut off near 1/n or 2/n. These specifications will give you some rare variants, but that shouldn't matter.
Here are some additional resources that may prove helpful:
-BQSR Bootstrapping Forum Post
-GATK Known Variants - Bootstrapping Article
I hope this helps! Please let me know if this leads you to success. If any other questions come up in the meantime, please do not hesitate to reach out.
Thank you very much.
>You'll need to run the tool with multiple samples (between 20-50).
Will it work well for smaller experiment of 12 samples?
>Set your allele frequency cut off near 1/n or 2/n.
Could not find "allele frequency" in the HaplotypeCaller manual.
Considering you have also recommended running multiple samples together which is not typical for haplotypecaller is it possible you have ment I should use "UnifiedGenotyper" (an old version of HaplotypeCaller) ?
Please sign in to leave a comment.