Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Seeking Advice on Optimizing the GenomicsDBImport-GenotypeGVCFs Workflow for Large Datasets


7 comments

  • Gökalp Çelik

    Hi chaoyi shan

    For this many samples we highly recommend following our "Biggest Practices" to get the job done.

    https://gatk.broadinstitute.org/hc/en-us/articles/16957867036315-Introducing-GATK-Biggest-Practices-for-Joint-Calling-Supersized-Cohorts 

    Before doing any imports, our Biggest Practices require reblocking of the GVCF files to

    - reduce the data size

    - remove low-quality sites

    - add extra annotations that let a different tool make the genotyping step faster.

    We highly recommend using the ReblockGVCF tool to reblock your GVCF files.
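    A minimal invocation sketch for that step, run once per sample (file names here are placeholders, not from the thread; check the tool docs for your GATK version):

```shell
# Reblock a single sample's GVCF before any GenomicsDB import.
gatk ReblockGVCF \
    -R reference.fasta \
    -V sample1.g.vcf.gz \
    -O sample1.reblocked.g.vcf.gz
```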

    First, if you wish to use threaded readers when importing into a GenomicsDB instance, you need to provide a single interval, which can cover multiple target regions at once. Once you set up larger intervals that do not overlap and that are cut between variant sites, you can run them in parallel to import your regions with threaded readers.

    To make the import faster, you may need to increase the Java heap size to around 16G so that you have enough room for the imports. If that is not possible, you may need to reduce the number of samples per batch from 50 to 20-30.

    To reduce the number of batch layers in each import, we recommend the consolidate parameter:

    --consolidate <Boolean>       Boolean flag to enable consolidation. If importing data in batches, a new fragment is
                                  created for each batch. In case thousands of fragments are created, GenomicsDB feature
                                  readers will try to open ~20x as many files. Also, internally GenomicsDB would consume
                                  more memory to maintain bookkeeping data from all fragments. Use this flag to merge all
                                  fragments into one. Merging can potentially improve read performance, however the overall
                                  benefit might not be noticeable as the top Java layers have significantly higher
                                  overheads. This flag has no effect if only one batch is used.
                                  Default value: false. Possible values: {true, false}
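    Putting those pieces together, an import for one scatter might look like the sketch below (the workspace path, sample map, and interval are placeholders; tune --batch-size and the heap to your hardware):

```shell
# Import one scatter's single interval into its own workspace,
# with a 16G heap, threaded readers, and fragment consolidation.
gatk --java-options "-Xmx16g" GenomicsDBImport \
    --genomicsdb-workspace-path scatter01_gdb \
    --sample-name-map cohort.sample_map \
    -L chr1:1000000-5000000 \
    --batch-size 50 \
    --reader-threads 4 \
    --consolidate
```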

    Please keep in mind that a GenomicsDB instance is a plain file format that needs to be backed up, in case you need to add more samples later or to protect your work from possible corruption. Make sure the data is stored on local storage and not on a shared drive, since network outages or lag may slow down the whole process; operating systems also do not behave the same in terms of IO operations on a network location versus a local file system.

    As for the genotyping step, instead of GenotypeGVCFs we recommend GnarlyGenotyper for a very high number of samples such as yours. It is significantly faster and uses less memory to do the work.

    There are differences between the two tools in how QUAL is calculated, but GnarlyGenotyper uses the annotations provided by the reblocking step to get the job done faster. If you have at least 80-85% of sites covered at 20X, you should get all the sites you are interested in called, as with our regular "Best Practices".
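    A minimal sketch of that genotyping call, assuming one workspace per scatter as above (paths are placeholders; the gendb:// prefix points at a GenomicsDBImport output):

```shell
# Genotype one imported workspace with GnarlyGenotyper.
gatk GnarlyGenotyper \
    -R reference.fasta \
    -V gendb://scatter01_gdb \
    -O scatter01.vcf.gz
```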

    I hope this helps. 

     

  • chaoyi shan

    @Gökalp Çelik, thank you for your advice. Does this mean that I should create a single GenomicsDB workspace encompassing all the sites I need, instead of creating separate workspaces for each interval or chromosome? Additionally, are there appropriate arguments in GnarlyGenotyper to enable parallel processing?

  • Gökalp Çelik

    A single GenomicsDB workspace is not a good idea for this many samples. You need as many scatters as you can get without splitting a variant region in half. Each scatter should have only a single interval, which may encompass multiple exome targets. GenomicsDBImport does have threaded readers, but that cannot be considered parallel processing per se.

    The actual parallelization comes from running each scattered interval simultaneously. Once you obtain your imports, genotype them simultaneously in parallel, and finally combine all the VCF files into a single one to get your final variant calls.
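    As a rough sketch of that scatter-and-gather pattern (the scatter count, workspace names, and output paths are placeholders; on a real cluster each loop iteration would typically be its own job, and GatherVcfs needs its inputs in genomic order):

```shell
# Genotype every scatter in parallel, then gather the per-scatter VCFs.
for i in $(seq -w 1 100); do
    gatk GnarlyGenotyper \
        -R reference.fasta \
        -V gendb://scatter_${i}_gdb \
        -O scatter_${i}.vcf.gz &
done
wait

# Combine the ordered per-scatter VCFs into the final callset.
gatk GatherVcfs \
    $(for i in $(seq -w 1 100); do echo "-I scatter_${i}.vcf.gz"; done) \
    -O cohort.vcf.gz
```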

    I hope this helps. 

  • chaoyi shan

    Could you please clarify whether you are suggesting SplitIntervals with the following argument? And how can I divide my BED file into the maximum number of scatters possible? Thank you very much in advance for your kind assistance.

    SplitIntervals --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
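    For reference, a fuller sketch of that call might look like the following (the reference, BED file, scatter count, and output directory are placeholders, not from the thread):

```shell
# Split the exome BED into scatter interval lists without
# subdividing any individual target interval.
gatk SplitIntervals \
    -R reference.fasta \
    -L exome_targets.bed \
    --scatter-count 200 \
    --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
    -O scattered_intervals/
```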
  • Gökalp Çelik

    This is a good start; however, certain exome capture targets are very close to each other, so there is a possibility that some of those scatters might split a variant region in half. We recommend splitting your intervals at points where targets are far enough apart to ensure there is no off-target variant site in between. This may require handcrafting your way out.
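    One way to sketch that handcrafting step: cut the sorted target list only at chromosome changes or at inter-target gaps above some threshold, so no scatter boundary can fall inside a variant region. The function name and the gap threshold below are illustrative, not part of GATK:

```python
def scatter_bed(targets, min_gap):
    """Group sorted (chrom, start, end) targets into scatters, cutting only
    at a chromosome change or where the gap to the previous target on the
    same chromosome is at least min_gap bases."""
    scatters = []
    current = []
    prev = None
    for chrom, start, end in targets:
        if prev is not None:
            prev_chrom, _, prev_end = prev
            # Start a new scatter at a chromosome change or a large gap.
            if chrom != prev_chrom or start - prev_end >= min_gap:
                scatters.append(current)
                current = []
        current.append((chrom, start, end))
        prev = (chrom, start, end)
    if current:
        scatters.append(current)
    return scatters

targets = [
    ("chr1", 100, 200),
    ("chr1", 250, 300),      # 50 bp gap: stays in the same scatter
    ("chr1", 10300, 10400),  # 10000 bp gap: starts a new scatter
    ("chr2", 500, 600),      # chromosome change: starts a new scatter
]
print(scatter_bed(targets, min_gap=5000))
# → [[('chr1', 100, 200), ('chr1', 250, 300)], [('chr1', 10300, 10400)], [('chr2', 500, 600)]]
```

    Each resulting group can then be written out as one interval (from the first target's start to the last target's end) for GenomicsDBImport.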

    I hope this helps. 

  • chaoyi shan

    Hi

    In the "Reblocking and the GnarlyGenotyper" section of the Biggest Practices, it was noted that for gnomAD v3 a highly aggressive strategy was adopted, with two quality bands: pass (≥20) and fail (<20). It was also mentioned that the same ReblockGVCF tool is used to eliminate any alternate alleles that are either not called or called with a Genotype Quality (GQ) less than 20.

    Could you please clarify how I can adjust the arguments of the ReblockGVCF tool to match these criteria? Does this mean that I need to set --standard-min-confidence-threshold-for-calling to 20 instead of the default 30? Or should -rgq-threshold be raised from 10 to 20?

    Thank you in advance for your helpful response.

     

  • Gökalp Çelik

    Hi again. You don't need to change any of those parameters; just leave them at their defaults.

    Regards. 

