Case v Control variant comparisons
Dear GATK community,
I have recently performed targeted (deep) sequencing of ~120 gene exons related to a specific phenotype of asthma. I have n=120 target patients and a small cohort of age- and gender-matched controls screened for the absence of the phenotype (n=30). This is my first experience of sequencing and I apologise if some of these questions are naive.
Using SnpSift I have generated case v control comparisons for specific variants but I have many questions regarding the results. I would be extremely grateful if anyone in the community could help me with the following questions.
1. For some variants it appears that both cases and controls are 100% heterozygotes. Both of these groups are north-west Europeans and I wonder if this result (and an apparent 'variant alternate allele call' representing, in effect, normality) is a result of the reference genome being used (gnomAD 2.1)? I have not yet scrutinized the individual VCF files but I assume these results are all germline [HaplotypeCaller - SnpSift]. Should I just ignore results like this?
2. I plan to look at results generated by SnpSift in IGV per gene to identify linkage disequilibrium. I will also use PLINK. What is a reasonable hierarchical approach to reduce the list of results generated by SnpSift and the number of statistical comparisons that require correction? Is it legitimate to ignore all variants that are not likely to affect proteins (e.g., synonymous variants) and those described above that are present in all case and control populations? I was thinking of increasing power by using a much larger biobank population of matched controls which may see some interesting comparisons retained after correction for multiple comparisons. Has this been done before?
Many thanks for your advice!
James
-
Hi,
We typically only support technical questions regarding how to use our GATK software, but I'll try to give a few comments about your questions (other users are of course welcome to provide some scientific advice if they wish).
1. I'm not familiar with the SnpSift software, but seeing 100% heterozygotes at a site across 120+ individuals does sound a little suspect, especially if this happens at many sites in your cohort. You might be interested in investigating statistics like Excess Heterozygosity if your samples are unrelated. This is somewhat similar to using Hardy-Weinberg equilibrium as a heuristic to measure how the genotype distribution differs from a "population expected" one, when it makes scientific sense, i.e. there's not strong selection pressure against the existence of homozygous variant samples. Being a heterozygote at a site is reference independent, so I don't think this would be related to ancestry or particular references used.
2. Questions about "scientific legitimacy" are probably better aimed at your research area's community as they can be application specific, but it sounds like you might be interested in functional annotation tools like our Funcotator, which can leverage databases to annotate variants based on existing criteria. I think gnomAD might have labels (missense, etc.) for common variants that you could check, which might save you some time in IGV.
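To make the Hardy-Weinberg heuristic from point 1 a bit more concrete, here is a minimal Python sketch that flags a site where heterozygotes are over-represented. It is only an illustration: the function name `hwe_summary` is hypothetical, the genotype counts are assumed to have been tallied from your VCF beforehand, and it uses a simple chi-square approximation rather than the exact test behind GATK's ExcessHet annotation, so the numbers will not match that annotation.

```python
import math


def hwe_summary(n_hom_ref, n_het, n_hom_alt):
    """Compare observed genotype counts at one biallelic site with
    Hardy-Weinberg expectations (hypothetical helper, not a GATK API)."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes supplied")

    # Allele frequencies estimated from the observed genotypes.
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele
    q = 1.0 - p                            # alternate allele

    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Pearson chi-square statistic with 1 degree of freedom.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Inbreeding coefficient F = 1 - Hobs / Hexp; negative F means
    # more heterozygotes than expected.
    exp_het = expected[1]
    f = 1.0 - n_het / exp_het if exp_het > 0 else float("nan")

    return {"chi2": chi2, "p_value": p_value, "F": f}


# A site where all 150 samples (120 cases + 30 controls) are heterozygous.
all_het = hwe_summary(0, 150, 0)

# A site whose genotype counts match Hardy-Weinberg proportions exactly.
balanced = hwe_summary(25, 50, 25)
```

For the all-heterozygous site, F comes out at -1 with a vanishingly small p-value, which is the pattern that usually points to a technical artifact (for example reads from a paralogous region mapping to the same place) rather than real biology. With small cohorts or rare alleles the chi-square approximation is unreliable, so an exact test would be the safer choice there.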
If you have further questions about using our tools, please feel free to post. Thanks!