Pipeline Question: where to drop samples that have bad DNA quality
REQUIRED for all errors and issues:
a) GATK version used: 4.4.0.0
b) Exact command used: pipeline of HaplotypeCaller, GenomicsDBImport, GenotypeGVCFs, VariantFiltration, and SelectVariants.
Hello, I am writing a pipeline to discover SNP variants in 700 samples of a non-model organism (all the same species), using a reference genome.
My 700 samples vary in DNA quality; some had limited DNA input and yielded few reads. Typically the low-quality samples can be dropped at the FASTQ level by file size or total read count, but my previous workflow with bcftools and vcftools focused on examining sample-level missingness using the R package vcfR, which reads the ./. genotype as missing and counts the percentage of missing genotypes for each sample across all variants. I believe removing low-input samples from the multisample VCF can improve the variant-level filtration statistics, such as the INFO-level DP and AD, so that more good variants are retained.
My issue with the new GATK pipeline I am writing is that I cannot find a good way to calculate sample-level missingness from a multisample VCF. Complicating matters, a missing genotype can now be denoted as 0/0 with DP=0 in GATK, which makes it hard to feed the VCF into a vcftools pipeline.
My sense is that as long as I can find a GATK built-in statistic that measures sample-level quality, I can specify a threshold, create a list of the sample names that fall below it, and then use SelectVariants to subset the multisample VCF, effectively removing the low-quality samples while automatically recalculating AD and DP in the INFO field.
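For reference, here is a rough sketch of the missingness calculation I have in mind, in Python. The toy two-sample VCF, the DP-below-1 rule, and the 0.5 cutoff are all illustrative, not part of any GATK tool:

```python
# Sketch: per-sample missingness from a multi-sample VCF, counting a
# genotype as missing when GT is ./. OR when FORMAT DP is 0 (the
# GATK "0/0 with DP=0" case described above).
from collections import Counter

def sample_missingness(vcf_lines, dp_floor=1):
    """Return {sample_name: fraction of variant records that are missing}."""
    samples, missing, total = [], Counter(), 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            continue  # meta-information header
        if line.startswith("#CHROM"):
            samples = line.split("\t")[9:]  # sample columns start at index 9
            continue
        fields = line.split("\t")
        fmt = fields[8].split(":")
        gt_i = fmt.index("GT")
        dp_i = fmt.index("DP") if "DP" in fmt else None
        total += 1
        for name, call in zip(samples, fields[9:]):
            parts = call.split(":")
            gt = parts[gt_i]
            dp = None
            if dp_i is not None and dp_i < len(parts) and parts[dp_i] not in (".", ""):
                dp = int(parts[dp_i])
            if gt in ("./.", ".|.", ".") or (dp is not None and dp < dp_floor):
                missing[name] += 1
    return {s: missing[s] / total for s in samples} if total else {}

# Toy VCF: sampleB is 0/0 with DP=0 at both sites, so it should come
# out 100% missing under this definition.
toy_vcf = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:20\t0/0:0
chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:15\t0/0:0
""".splitlines()

rates = sample_missingness(toy_vcf)
exclude = [s for s, r in rates.items() if r > 0.5]  # illustrative cutoff
print(rates)    # {'sampleA': 0.0, 'sampleB': 1.0}
print(exclude)  # ['sampleB']
```

The resulting list of names could then be passed to SelectVariants (e.g. via --exclude-sample-name) to drop those samples from the multisample VCF.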
Can someone here help point out where in the SNP discovery pipeline this fits, and how to integrate it with GenotypeGVCFs, VariantFiltration, and SelectVariants?
Best,
-
Hi Dahn-young Dong,
No-call doesn't really have a strict definition, but I would agree that anything with no depth is worth excluding, and I'd recommend trusting DP over the genotype call. To calculate sample-level missingness, I'd actually suggest hail, though it can be challenging to set up a hail cluster if you're not running on Google Cloud Platform. You may find the gnomAD sample QC repo helpful if you go that route: https://github.com/broadinstitute/gnomad_qc/blob/main/gnomad_qc/v4/sample_qc/README.md

Otherwise, the PCT_GQ0_VARIANTS field in the Picard variant_calling_detail_metrics output might be a good proxy for percent no-calls or percent missingness. You can get it by running CollectVariantCallingMetrics: https://gatk.broadinstitute.org/hc/en-us/articles/21905092095771-CollectVariantCallingMetrics-Picard-
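As a rough illustration, pulling PCT_GQ0_VARIANTS out of the detail metrics file could look like the sketch below. It assumes the standard Picard metrics layout ('#'-prefixed comment lines, then a tab-delimited header row and one data row per sample); the toy rows, the two-column subset, and the 0.5 cutoff are made up:

```python
# Sketch: extract PCT_GQ0_VARIANTS per sample from a Picard-style
# variant_calling_detail_metrics file and flag high-missingness samples.

def parse_detail_metrics(lines):
    """Parse a Picard metrics table into a list of {column: value} dicts."""
    header, rows = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comment/metadata lines and blanks
        if header is None:
            header = line.split("\t")
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows

# Toy metrics file with only the two columns we need here.
toy_metrics = """## htsjdk.samtools.metrics.StringHeader
## METRICS CLASS\tpicard.vcf.CollectVariantCallingMetrics$VariantCallingDetailMetrics
SAMPLE_ALIAS\tPCT_GQ0_VARIANTS
sampleA\t0.01
sampleB\t0.62
""".splitlines()

rows = parse_detail_metrics(toy_metrics)
# Flag samples whose fraction of GQ0 (effectively no-call) variants
# exceeds an illustrative 0.5 cutoff.
flagged = [r["SAMPLE_ALIAS"] for r in rows if float(r["PCT_GQ0_VARIANTS"]) > 0.5]
print(flagged)  # ['sampleB']
```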
If you want to exclude samples before you get to variant calling you could try a couple of approaches. I personally haven't used the Picard EstimateLibraryComplexity tool, but this is the scenario it's trying to address: small amounts of input DNA that may have lots of duplicates. Docs here: https://gatk.broadinstitute.org/hc/en-us/articles/21904965033883-EstimateLibraryComplexity-Picard- Depending on your organism's reference genome quality you might be able to set a coverage threshold. The Picard WGS metrics have a variety of coverage values that might be of use: https://broadinstitute.github.io/picard/picard-metric-definitions.html#CollectWgsMetrics.WgsMetrics. (I'm assuming you're doing whole genome sequencing, but there are also hybrid selection metrics for similar uses.)