Find rate of homozygosity
AnsweredHow might I see some basic statistics on a VCF file, in particular, which proportion of the variants are homozygous across the first 22 chromosomes? It would also be nice to see counts for heterozygous and homozygous variants.
I've been trying to find a way to do this for a couple days, but I'm out of time, I would like to find a way today.
-
Thank you for your post, Joyce Anon! I want to let you know we have received your question. We'll get back to you if we have any updates or follow up questions.
Unfortunately, our team is not able to provide support within one day. Please see our Support Policy for more details about how we prioritize responding to questions.
-
I figured out a way, thank you. 40% of the 4.4 million variants in the first 22 chromosomes (default settings HaplotypeCaller) are homozygous, is this typical? I can't seem to find typical numbers anywhere.
-
After more searching, and a lucky period of better-than-usual brain function, I found my answer, these numbers are typical. The longest run of homozygosity is about 500k bp (from AutoMap), which is also typical.
I got the numbers from a slightly modified a script from another forum that produces a table, then counted it with a regex search. This was the script:
awk -v OFS="\t" '$0 !~ "^#" {hom_ref = 0; hom_alt = 0; het = 0; for(i=10;i<=NF;i++) { if($i ~ /0\|0/ || $i ~ /0\/0/) hom_ref++; else if($i ~ /1\|1/ || $i ~ /1\/1/) hom_alt++; else het++; } print $1, $2, hom_ref, hom_alt, het}' ./output/nebulaTranchFiltered.vcf > homo.tsv
Thank you, my questions are answered now.
-
Thank you for updating this post Joyce Anon! I'm sure this will be helpful for other GATK users in the future.
Please sign in to leave a comment.
4 comments