Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Low coverage WGS data

3 comments

  • Laura Gauthier

    Hi Goran Rakocevic,

    Unfortunately with the low-pass WGS study design, you're going to be limited to common variants, but on the other hand seeing a variant in multiple samples is pretty good evidence. This is what we did for the first phase of the 1000 Genomes project and the final callset was pretty high quality, though definitely enriched for common variants.

    Yes, I would recommend reducing the --min-pruning argument, potentially even to zero. Then you'll rely on the site-level QUAL score to ensure that the likelihood of the site being variant is higher than the (human) SNP rate of 1/1000, i.e. QUAL >= 30. I think that ends up being at least four ALT reads across samples, but that's a very rough estimate, and it will require more evidence for low base quality, low mapping quality, and potentially indels. VQSR should take care of systematic errors like strand bias.

    Single-sample GVCF mode should be fine. I don't think you'll be able to provide HaplotypeCaller with 15,000 input BAMs: there are strange limitations like the UNIX command-line length or the number of file handles the operating system will allow open at the same time. I don't think there's a way to create a multi-sample GVCF with multiple input BAMs, so I would suggest you follow the typical best practices GVCF->GenomicsDB->GenotypeGVCFs workflow with the additional pruning change for HaplotypeCaller.

  • Goran Rakocevic

    Hi Laura Gauthier

    Thanks!

    I didn't mean to shove 15,000 BAMs into a single HC call. I was thinking of maybe doing random batches of 50 or so... 

    That will cause batch effects; I'm just wondering whether it would help or hurt more.

  • Laura Gauthier

    HaplotypeCaller won't do multi-sample GVCFs, and you're going to run into issues combining multiple multi-sample genotyped VCFs. Most importantly, you won't have reference information to get confident allele frequencies across all samples.
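
The workflow recommended in this thread (one single-sample GVCF per BAM with relaxed pruning, then GenomicsDB import, then joint genotyping) could be sketched roughly as below. This is only an illustration, not an official pipeline: the helper function names, file names, interval, and batch size of 50 are hypothetical placeholders, and the sample name is assumed to be the BAM file stem. The helpers only build GATK command lines and never run them. The `phred` helper shows why a prior of 1/1000 corresponds to a QUAL threshold of about 30.

```python
from __future__ import annotations

import math
from pathlib import Path


def phred(probability: float) -> float:
    """Phred-scale a probability: Q = -10 * log10(p)."""
    return -10.0 * math.log10(probability)


def haplotypecaller_cmd(bam: str, ref: str, out_dir: str,
                        min_pruning: int = 0) -> list[str]:
    """Build one single-sample GVCF job with relaxed graph pruning."""
    # Assumption: the BAM file stem doubles as the sample label.
    sample = Path(bam).stem
    gvcf = str(Path(out_dir) / f"{sample}.g.vcf.gz")
    return [
        "gatk", "HaplotypeCaller",
        "-R", ref, "-I", bam, "-O", gvcf,
        "-ERC", "GVCF",
        "--min-pruning", str(min_pruning),
    ]


def sample_map_lines(gvcfs: dict[str, str]) -> list[str]:
    """Tab-separated 'sample<TAB>path' lines for --sample-name-map."""
    return [f"{name}\t{path}" for name, path in sorted(gvcfs.items())]


def genomicsdb_import_cmd(sample_map: str, workspace: str, interval: str,
                          batch_size: int = 50) -> list[str]:
    """Import all per-sample GVCFs; batching limits open file handles."""
    return [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
        "--batch-size", str(batch_size),
    ]


def genotype_gvcfs_cmd(ref: str, workspace: str, out_vcf: str) -> list[str]:
    """Jointly genotype every sample stored in the workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", ref,
        "-V", f"gendb://{workspace}",
        "-O", out_vcf,
    ]


if __name__ == "__main__":
    print("QUAL for a 1/1000 prior:", phred(1 / 1000))
    # Hypothetical inputs, for illustration only.
    print(" ".join(haplotypecaller_cmd("s1.bam", "ref.fasta", "gvcfs")))
    print(" ".join(genomicsdb_import_cmd("samples.map", "cohort_db", "chr20")))
    print(" ".join(genotype_gvcfs_cmd("ref.fasta", "cohort_db", "cohort.vcf.gz")))
```

Because every sample goes through its own HaplotypeCaller job, no single command line has to list thousands of BAMs, and joint genotyping over the one GenomicsDB workspace keeps the reference information needed for cohort-wide allele frequencies.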

