CalculateGenotypePosteriors prior - should population prior be supplied to a trio vcf?
AnsweredHi there,
I'm following the genotype refinement workflow (https://gatk.broadinstitute.org/hc/en-us/articles/360035531432-Genotype-Refinement-workflow-for-germline-short-variants) and doing CalculateGenotypePosteriors. I got four questions
1. I have a big vcf file of 109 trio families. To do CalculateGenotypePosteriors, I first split my big vcf into109 trio vcf files by using selectvariant, and then with each trio vcf file, I did CalculateGenotypePosteriors with .ped as the family prior. I'm wondering - for such analysis with only 3 samples in a trio vcf, would you recommend to supply population prior as well? I saw 'By default, priors will be applied to each variant separately, provided each variant features data from at least 10 called samples (no-calls do not count)' in the document (https://gatk.broadinstitute.org/hc/en-us/articles/360037226592-CalculateGenotypePosteriors). I'm confused in my case of only 3 called samples in a vcf, this default condition may not hold and thus may lead to some false positives?
2. This statement - By default, priors will be applied to each variant separately, provided each variant features data from at least 10 called samples - did make me confused... does it mean I CANNOT do CalculateGenotypePosteriors for a trio vcf of 3 samples only? in what condition/scenario does this 'at least 10 called samples' hold?
3. right now I do CalculateGenotypePosteriors for each trio vcf individually, so I have to do CalculateGenotypePosteriors 109 times (I have 109 trio vcf files in total). I'm wondering is it okay to do CalculateGenotypePosteriors for my big vcf with all samples from 109 families and supply a .ped file for all the 109 families without splitting into single trio vcf files? In such case I just need to do CalculateGenotypePosteriors once. Would the posterior be better if I do all the 109 families all together in one big vcf compared to do one family at a time?
4. I know CalculateGenotypePosteriors is for diploid only so I have to excl. regions on chrX & chrY, which I used haploid mode in haplotypecaller. However, I'm wondering are you thinking of developing a similar tool for haploid or mixed ploidy? sex chromosomes also play an important role in many diseases and it would be great if I can still consider them in running my workflow.
Thank you so much for your help!
REQUIRED for all errors and issues:
a) GATK version used: gatk/4.2.1.0
b) Exact command used:
gatk --java-options '-Xmx40g -Xms32g -XX:ParallelGCThreads=2' \
CalculateGenotypePosteriors \
-V $per_fam_vcf \
-ped $per_fam_ped \
-XL chrX \
-XL chrY \
-O ${outdir_per_fam}/Chaklab_posterior_GT_${famID}.vcf.gz \
--skip-population-priors
c) Entire program log:
23:33:41.682 INFO CalculateGenotypePosteriors - ------------------------------------------------------------
23:33:41.683 INFO CalculateGenotypePosteriors - HTSJDK Version: 2.19.0
23:33:41.683 INFO CalculateGenotypePosteriors - Picard Version: 2.19.0
23:33:41.683 INFO CalculateGenotypePosteriors - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:33:41.683 INFO CalculateGenotypePosteriors - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:33:41.683 INFO CalculateGenotypePosteriors - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:33:41.683 INFO CalculateGenotypePosteriors - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:33:41.683 INFO CalculateGenotypePosteriors - Deflater: IntelDeflater
23:33:41.683 INFO CalculateGenotypePosteriors - Inflater: IntelInflater
23:33:41.683 INFO CalculateGenotypePosteriors - GCS max retries/reopens: 20
23:33:41.683 INFO CalculateGenotypePosteriors - Requester pays: disabled
23:33:41.683 INFO CalculateGenotypePosteriors - Initializing engine
23:33:42.000 INFO FeatureManager - Using codec VCFCodec to read file file:///gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG_union/statistical_analysis/family_based/DNM/per_fam_vcf/438/Chaklab_438.vcf.gz
23:33:42.306 INFO IntervalArgumentCollection - Initial include intervals span 3217346917 loci; exclude intervals span 213268310 loci
23:33:42.307 INFO IntervalArgumentCollection - Excluding 213268310 loci from original intervals (6.63% reduction)
23:33:42.308 INFO IntervalArgumentCollection - Processing 3004078607 bp from intervals
23:33:42.342 INFO CalculateGenotypePosteriors - Done initializing engine
23:33:42.379 INFO PedReader - Reading PED file /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG_union/statistical_analysis/family_based/DNM/input_file/per_fam_ped/438.ped with missing fields: []
23:33:42.382 INFO PedReader - Phenotype is other? false
23:33:42.475 INFO ProgressMeter - Starting traversal
23:33:42.475 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
23:33:51.692 INFO ProgressMeter - chr22:38107840 0.2 55532 361497.2
23:33:51.693 INFO ProgressMeter - Traversal complete. Processed 55532 total variants in 0.2 minutes.
23:33:51.761 INFO CalculateGenotypePosteriors - Shutting down engine
[April 14, 2022 11:33:51 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.CalculateGenotypePosteriors done. Elapsed time: 0.20 minutes.
Runtime.totalMemory()=35113664512
Using GATK jar /gpfs/share/apps/gatk/4.1.2.0/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx40g -Xms32g -XX:ParallelGCThreads=2 -jar /gpfs/share/apps/gatk/4.1.2.0/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar CalculateGenotypePosteriors -V /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG_union/statistical_analysis/family_based/DNM/per_fam_vcf/438/Chaklab_438.vcf.gz -ped /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG_union/statistical_analysis/family_based/DNM/input_file/per_fam_ped/438.ped -XL chrX -XL chrY -O /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG_union/statistical_analysis/family_based/DNM/per_fam_posterior_GT/438/Chaklab_posterior_GT_438.vcf.gz --skip-population-priors
_ESTATUS_ [ CalculateGenotypePosteriors]: 0
_END_ Thu Apr 14 23:33:51 EDT 2022
See forum topic details at forum guidelines page: https://gatk.broadinstitute.org/hc/en-us/articles/360053845952-Forum-Guidelines
-
Hi Mingzhou Fu,
Here are the answers to your questions:
- Yes, we would recommend using population priors as well as the family priors. Using these two resources will together make the results much more accurate.
- There are three possible priors used with CalculateGenotypePosteriors. These are the trio priors, the population priors, and internal priors. Internal priors are based on the allele frequencies in the callset and will only be used and accurate if there are 10 samples. The other two sources of information will be fine with less than 10 samples.
- You can run CalculateGenotypePosteriors on the joint callset with the whole PED file. The posterior would probably be better because the tool will be able to calculate the internal priors.
- We don't have any plans to add a haploid mode for this tool, unfortunately.
Thank you for posting these questions!
Best,
Genevieve
Please sign in to leave a comment.
1 comment