Seeking Advice on Optimizing the GenomicsDBImport-GenotypeGVCFs Workflow for Large Datasets
Hi,
I apologize for revisiting a previously discussed topic regarding the optimization of the GenomicsDBImport-GenotypeGVCFs workflow. Despite having gone through earlier discussions on this topic, I still have unresolved questions. I am working with a sizable dataset of approximately 100k human exome sequences, which has become quite challenging at the GenotypeGVCFs step. To address interval-size nonuniformity across chromosomes (e.g., chr1 is significantly larger than chr21), I divided the intervals into 50 parts using the following command:
gatk SplitIntervals \
--reference ${REF} \
--output ${OUTPUT} \
--intervals ${REF_CAPTURE} \
--scatter-count 50 \
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
--interval-merging-rule OVERLAPPING_ONLY
This splits each chromosome's intervals into 50 roughly equal parts. I then import each part into its own GenomicsDB workspace:
gatk GenomicsDBImport \
--genomicsdb-workspace-path ${GENOMICDB} \
--batch-size 50 \
--sample-name-map ${SAMPLE_LIST} \
--tmp-dir ${TMP_DIR} \
--bypass-feature-reader \
--merge-input-intervals \
--reader-threads 5 \
--intervals ${INTERVAL}
Finally, I run GenotypeGVCFs on each workspace:
gatk GenotypeGVCFs \
--reference ${REF_HG38} \
--variant gendb://${GENOMICDB} \
--output ${VCF_OUTPUT} \
--tmp-dir ${TMP_DIR} \
--annotation-group StandardAnnotation \
--annotation-group AS_StandardAnnotation
Despite allocating 100 GB of memory, actual usage peaks at about 40 GB. After 30 minutes of engine initialization and 2.5 hours of processing, the output VCF contains only about 20 records spanning roughly 100 bp. Given these circumstances, my questions are:
- Should I split the intervals into smaller segments? I am hesitant because managing and merging hundreds of VCF parts per chromosome could become cumbersome and time-consuming.
- Is it feasible to divide the 100k samples into chunks of approximately 100 samples each, genotype each chunk separately, and then merge the resulting VCFs? I need the per-site information for later use in VQSR.
- I have not modified settings such as --max-alternate-alleles and --max-genotype-count, given the allelic diversity expected from 100k samples. Should I adjust these parameters?
- Are there any additional command options or settings that I might have overlooked that could optimize my workflow?
For reference, I have access to a machine with approximately 120 cores and 1000 GB of memory. Any suggestions on improving this workflow would be greatly appreciated, as this issue has been a persistent challenge.
Thank you for your time and assistance.
-
Hi chaoyi shan
For this many samples we highly recommend that you follow our "Biggest Practices" to get the job done.
Before doing any imports, our Biggest Practices require reblocking of the GVCF files to:
- Reduce the data size
- Remove low-quality sites
- Add annotations that make the genotyping process faster with a different tool
We highly recommend using the ReblockGVCF tool to reblock your GVCF files.
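For a single sample, a minimal sketch of that step (the paths and heap size are placeholders, not part of this thread) would be:
gatk --java-options "-Xmx8g" ReblockGVCF \
--reference ${REF_HG38} \
--variant ${SAMPLE}.g.vcf.gz \
-do-qual-approx \
--output ${SAMPLE}.reblocked.g.vcf.gz
The -do-qual-approx flag is what adds the raw QUALapprox annotation that GnarlyGenotyper relies on later, assuming your GATK version supports it.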
First, when importing into a GenomicsDB instance, threaded readers only work if you provide a single interval per import, and that interval can cover multiple target regions at once. Once you set up larger intervals that do not overlap and that are cut between variant sites, you can run the imports in parallel, each using threaded readers.
To make the import faster you may need to increase the Java heap size to around 16 GB, to make sure there is enough room for the imports. If not, you may need to reduce the number of samples per batch from 50 to 20-30.
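As a sketch, assuming placeholder paths and a single merged interval (the interval shown is hypothetical), the import would then look like:
gatk --java-options "-Xmx16g" GenomicsDBImport \
--genomicsdb-workspace-path ${GENOMICDB} \
--sample-name-map ${SAMPLE_LIST} \
--batch-size 25 \
--reader-threads 5 \
--intervals chr1:1000000-2500000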
To reduce the number of batch layers for each import we recommend using the --consolidate parameter:
--consolidate <Boolean>  Boolean flag to enable consolidation. If importing data in batches, a new fragment is created for each batch. In case thousands of fragments are created, GenomicsDB feature readers will try to open ~20x as many files. Also, internally GenomicsDB would consume more memory to maintain bookkeeping data from all fragments. Use this flag to merge all fragments into one. Merging can potentially improve read performance, however overall benefit might not be noticeable as the top Java layers have significantly higher overheads. This flag has no effect if only one batch is used. Default value: false. Possible values: {true, false}
Please keep in mind that a GenomicsDB instance is a plain file format that needs to be backed up, both in case you need to add more samples later and to protect your work from possible corruption. Make sure the data is stored on local storage and not on a shared drive, since network outages or lags may slow down the whole process; operating systems also do not behave the same in terms of IO operations on a network location versus a local file system.
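For example, the flag can simply be appended to the import command sketched above:
gatk --java-options "-Xmx16g" GenomicsDBImport \
--genomicsdb-workspace-path ${GENOMICDB} \
--sample-name-map ${SAMPLE_LIST} \
--batch-size 25 \
--consolidate true \
--intervals ${INTERVAL}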
As for the genotyping step, instead of GenotypeGVCFs we recommend GnarlyGenotyper for a very high number of samples such as yours. It is significantly faster and uses less memory to do the work.
The two tools differ in how QUAL is calculated, but GnarlyGenotyper uses the annotations provided by the reblocking step to get the job done faster. If at least 80-85% of your sites are covered at 20X, you should get all the sites you are interested in called, just as with our regular "Best Practices".
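A minimal sketch of that genotyping call, assuming the workspace was built from reblocked GVCFs (paths are placeholders):
gatk --java-options "-Xmx16g" GnarlyGenotyper \
--reference ${REF_HG38} \
--variant gendb://${GENOMICDB} \
--output ${VCF_OUTPUT}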
I hope this helps.
-
@Gökalp Çelik, thank you for your insightful advice. Does this mean I should create a single GenomicsDB workspace encompassing all the sites I need, instead of separate workspaces for each interval or chromosome? Additionally, does GnarlyGenotyper have arguments that enable parallel processing?
-
A single GenomicsDB workspace is not a good idea for this many samples. You need as many scatters as you can get without splitting a variant region in half, and each scatter should have only a single interval that may encompass multiple exome targets. GenomicsDBImport does have threaded readers, but that cannot be considered parallel processing per se.
The actual parallelization comes from running each scattered interval simultaneously: once you obtain your imports, you genotype them in parallel and finally combine all the VCF files into a single one to get your final variant calls.
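A rough sketch of that pattern in bash, assuming 50 scatters with hypothetical workspace names that are numbered in genomic order:
# genotype all scattered workspaces in parallel
for i in $(seq -w 0 49); do
  gatk GnarlyGenotyper \
    --reference ${REF_HG38} \
    --variant gendb://workspace_${i} \
    --output scatter_${i}.vcf.gz &
done
wait
# concatenate the per-scatter VCFs; GatherVcfs expects its inputs in genomic order
gatk GatherVcfs \
  $(for i in $(seq -w 0 49); do echo "--INPUT scatter_${i}.vcf.gz"; done) \
  --OUTPUT final.vcf.gz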
I hope this helps.
-
Could you please clarify whether you are suggesting SplitIntervals with the following argument? And how can I divide my BED file into the maximum possible number of scatters? Thank you very much in advance for your kind assistance.
SplitIntervals --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
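For context, something along these lines is what I have in mind (the scatter count is an arbitrary guess on my part):
gatk SplitIntervals \
--reference ${REF} \
--intervals ${REF_CAPTURE} \
--scatter-count 500 \
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
--output ${OUTPUT_DIR}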
-
This is a good start; however, certain exome capture targets sit very close to each other, so some of those scatters might still split a variant region in half. We recommend splitting your intervals at regions where targets are far enough apart to ensure there is no off-target variant site in between. This may require handcrafting your way out.
I hope this helps.
-
Hi
In the "Reblocking and the GnarlyGenotyper" section of the Biggest Practice, it was noted that for gnomADv3, a highly aggressive strategy was adopted with two quality bands: pass ≥20 and fail <20. It was also mentioned that the same ReblockGVCF tool is used to eliminate any alternate alleles that are either not called or are called with a Genotype Quality (GQ) less than 20.
Could you please clarify how I should adjust the ReblockGVCF arguments to match these criteria? Does this mean I need to set
--standard-min-confidence-threshold-for-calling
to 20 instead of the default 30? Or should
-rgq-threshold 10
be adjusted to 20?
Thank you in advance for your helpful response.
-
Hi again. You don't need to change any of those parameters. Just leave them at their defaults.
Regards.