Joint genotyping as cohort size grows
Hello,
I am trying to build a workflow for germline variant calling to call SNPs. It starts with alignment (bwa-mem2), followed by variant calling (HaplotypeCaller in GVCF mode) and joint genotyping (GenomicsDBImport & GenotypeGVCFs).
Consider the situation below:
I have 50 samples and I run the workflow, which generates 1 cohort VCF. The next time, when I have another 50 samples for the same study, at which stage of the pipeline do I insert the calls from the previous run? Is it the GenomicsDBImport or the GenotypeGVCFs stage?
-
Hi Asma Riyaz
GenomicsDB accepts incremental updates to its variant storage, so you can add more samples whenever new ones arrive. Once you have updated your GenomicsDB workspace, you can re-genotype to get an updated set of variants covering all of your samples. Keep in mind that as the number of samples increases, the storage, time, and compute resources that GenomicsDB needs will increase as well. To keep resources in check, you may want to use our new ReblockGVCF tool, which reduces the storage needed for your variants by removing less confident calls and merging reference confidence blocks into fewer distinct levels.
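As a rough illustration of the reblocking step, here is a minimal sketch that runs ReblockGVCF on each new GVCF before import. The reference path and the `.rb.g.vcf.gz` output naming are assumptions made for the example, and the small `run` helper is hypothetical: it only prints each command unless RUN=1 is set, so the script can be checked before anything is executed.

```shell
#!/bin/sh
# Sketch only: REFERENCE and the output naming scheme are assumptions.
set -eu

REFERENCE="reference.fasta"   # hypothetical reference path

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Reblock each per-sample GVCF; sample51.g.vcf.gz becomes sample51.rb.g.vcf.gz.
for gvcf in data/gvcfs/sample51.g.vcf.gz data/gvcfs/sample52.g.vcf.gz; do
    reblocked="${gvcf%.g.vcf.gz}.rb.g.vcf.gz"
    run gatk ReblockGVCF -R "$REFERENCE" -V "$gvcf" -O "$reblocked"
done
```

The reblocked files would then be the ones passed to GenomicsDBImport with -V.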
I hope this helps.
-
Hello again,
Does the following command look right to you? This is the second time I am running the pipeline, in order to add samples for joint genotyping. Here previous_db is the database generated the first time the pipeline was run, for samples 1 to 50.
gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V data/gvcfs/sample51.g.vcf.gz \
    -V data/gvcfs/sample52.g.vcf.gz \
    -V previous_db \
    --genomicsdb-workspace-path my_database \
    --tmp-dir=/path/to/large/tmp \
    -L 20
-
Hi again.
GenomicsDBImport has a different parameter for incremental updates. Instead of passing the existing database with -V, use the parameter
--genomicsdb-update-workspace-path
and give it the path of your existing GenomicsDB workspace whenever you perform an incremental update.
One thing to note: each incremental import creates a new subdirectory under the GenomicsDB workspace folder. To keep genotyping fast and convenient, there is another parameter,
--consolidate
which prevents too many fragments from accumulating in the imported collection.
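Putting this together, the second run could look roughly like the sketch below. It reuses the sample paths and previous_db from the question; the reference path and output file name are assumptions, and the `run` helper is hypothetical (it prints each command unless RUN=1 is set). As far as I recall, an incremental import reuses the intervals stored in the existing workspace, so -L is left out here; please check that against the documentation for your GATK version, and consider backing up the workspace before updating it.

```shell
#!/bin/sh
# Sketch only: REFERENCE and OUTPUT are assumptions, not values from this thread.
set -eu

WORKSPACE="previous_db"        # existing GenomicsDB workspace (samples 1 to 50)
TMP_DIR="/path/to/large/tmp"
REFERENCE="reference.fasta"    # hypothetical reference path
OUTPUT="cohort.vcf.gz"         # hypothetical output name

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Step 1: add the new samples to the existing workspace.
# -L is omitted on the assumption that intervals come from the workspace.
run gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
    -V data/gvcfs/sample51.g.vcf.gz \
    -V data/gvcfs/sample52.g.vcf.gz \
    --genomicsdb-update-workspace-path "$WORKSPACE" \
    --tmp-dir "$TMP_DIR"

# Step 2: re-genotype the whole cohort from the updated workspace.
run gatk --java-options "-Xmx4g -Xms4g" GenotypeGVCFs \
    -R "$REFERENCE" \
    -V "gendb://$WORKSPACE" \
    -O "$OUTPUT"
```

If many incremental batches accumulate, --consolidate could be added to the import step to merge the resulting fragments.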
I hope this helps.