Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Joint Call WGS with a Large Cohort

5 comments

  • Saeed Farajzadeh Valilou

    I also have a similar issue. I'm working on a similar project: a cohort of ~5,000 human WGS samples, and I want to perform a joint call on all the CRAM files. Since the inputs are large, joint calling over the whole set becomes computationally hard, so I want to divide the CRAMs into chunks. I was wondering if somebody could tell me whether there is a consensus or established "size" for dividing the CRAMs into multiple smaller chunks and merging the results after calling?

  • Laura Gauthier

    Hi Matt Johnson,

    The provided WGS calling intervals split at Ns in the reference, so there are no boundary effects. However, in hg38 there are fewer and fewer of those. In our production pipeline, we further split the intervals with the GATK SplitIntervals tool (see https://github.com/broadinstitute/warp/blob/develop/tasks/broad/JointGenotypingTasks.wdl for an example command).

    For WGS we use a number of scatters equal to 2.5X the number of samples/input GVCFs. (That's equivalent to setting unbounded_scatter_count = 2.5 in https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/joint_genotyping/JointGenotyping.wdl.) So for 4,000 human samples you'd have about 10K scatters. Give that a shot and see if it helps.

    -Laura

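    In case a concrete command helps, here is a minimal sketch of that SplitIntervals step. The reference and interval-list file names below are placeholders, and --scatter-count 10000 just applies the 2.5X rule to 4,000 input GVCFs (2.5 x 4,000 = 10,000); see the WARP task linked above for the exact production arguments.

        gatk SplitIntervals \
            -R Homo_sapiens_assembly38.fasta \
            -L wgs_calling_regions.hg38.interval_list \
            --scatter-count 10000 \
            -O scattered_intervals/

    SplitIntervals writes one interval list per shard into the output directory; each one can then be passed as -L to its own joint-calling shard (e.g. GenomicsDBImport followed by GenotypeGVCFs), and the per-shard VCFs gathered afterwards.
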
  • kvn95ss

    Hey, since the number of samples is high, have you tried ReblockGVCF on the samples? Do look into it; it reduces memory and I/O requirements by quite a lot.

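    For anyone following along, a minimal per-sample sketch of that step (file paths are placeholders; the reblocked GVCFs are what you would then feed into joint calling):

        gatk ReblockGVCF \
            -R Homo_sapiens_assembly38.fasta \
            -V sample.g.vcf.gz \
            -O sample.reblocked.g.vcf.gz
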
  • Matt Johnson

    Laura Gauthier, thanks! This was immensely helpful, as we were trying to avoid subdividing intervals. It certainly gives us a better option that doesn't take multiple months.

  • Laura Gauthier

    Great! And do look into ReblockGVCF like kvn95ss suggested. If you're not super concerned about hom-ref GQ precision, it can give a 90% reduction in GVCF size, which speeds up execution time because there are fewer ref blocks to deal with. The most aggressive option is a [0, 20) low-confidence band and a [20, 99] high-confidence band, which is the scheme that was used for gnomAD v3. ReblockGVCF will also drop alleles not used in the most likely genotypes, which can dramatically reduce the PL sizes at gross indel sites.

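    As a rough illustration of that aggressive two-band scheme, the command might look like the sketch below. This is not the exact gnomAD v3 production command: it assumes the -GQB bounds act as exclusive upper band boundaries (as in HaplotypeCaller, so 20 and 100 yield [0, 20) and [20, 99]) and that --floor-blocks is desired so each block reports its band floor as the GQ.

        # Placeholders: reference and per-sample GVCF paths.
        # Two ref-confidence bands: [0,20) low confidence, [20,99] high confidence.
        gatk ReblockGVCF \
            -R Homo_sapiens_assembly38.fasta \
            -V sample.g.vcf.gz \
            -GQB 20 -GQB 100 \
            --floor-blocks \
            -O sample.reblocked.g.vcf.gz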
