Error in document
Dear GATK team,
Hi, I'm Oh.
There seems to be a problem with the tutorial.
I think the samples used in cohort mode should include the sample used in case mode.
However, the tutorial uses 24 samples for cohort mode and then suddenly uses "cohort-23wgs-20190213-contig-ploidy-model" when running case mode.
Additionally, I have a question.
I have 200 samples in a disease group and 100 samples in a normal group. That is, I have a cohort of 300 samples.
What I ultimately want to do is compare germline CNVs between the disease group and the normal group.
In this case, please tell me which analysis method is more suitable.
Method 1)
Is it better to create a gCNV model in COHORT mode using all 300 samples, and then run CASE mode individually on each of the 200 disease samples and 100 normal samples?
(1) COHORT mode using 300 samples, and
(2) CASE mode, one sample at a time (n=300)
(3) Compare CNVs between the 200 disease samples and the 100 healthy samples
Method 2)
Or is it correct to run COHORT mode gCNV on all 300 samples and use those results directly?
(1) COHORT mode using 300 samples, then compare CNVs between the 200 disease samples and the 100 healthy samples.
Many Thanks, Oh.
-
Official comment
Hi stella,
I don't think there is a mistake in the tutorial. The tutorial itself explains what you are seeing:
The tutorial provides example small WGS data sourced from the 1000 Genomes Project. Cohort mode illustrations use 24 samples, while case mode illustrations analyze one sample against a cohort model made from the remaining 23 samples. The tutorial uses a fraction of the workflow's recommended hundred samples for ease of illustration.
I'll see if I can find recommendations for your use case, but we don't guarantee specific solutions for users. If anyone on the forum has thoughts, please chime in!
Please let me know if you have further questions I can help with.
Best,
Genevieve
-
Hi ,
I am going to move your post into our Community Discussions -> General Discussion topic, as the Germline topic is for reporting bugs and issues with the GATK tools.
You can read more about our forum guidelines and the topics here: Forum Guidelines.
Thanks for the heads-up on the documentation error. We will tag the documentation team and make the necessary change.
Best,
Bhanu
-
Hi Oh,
I was able to look further into your methods and have some recommendations:
The gCNV method is not like the somatic CNV method: cohort mode itself calls CNVs for the samples in the cohort. There is no reason to use method 1, which would call the same samples in both cohort and case mode.
In your case we would recommend running all 300 samples in cohort mode. However, the general maximum we recommend is 200 samples, so 300 might take too long to finish. If that is the case, build a cohort with 200 samples that are half male and half female (it doesn't matter which are diseased and which are normal), and run the rest of your samples in case mode.
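As a rough illustration of the split described above, here is a minimal Python sketch that picks a sex-balanced cohort of 200 and leaves the remaining samples for case mode. It is not part of GATK or its WDLs; the `Sample` record, the `"M"`/`"F"` labels, and the function name are assumptions made for the example, and the random selection is only one possible way to choose the cohort.

```python
import random
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    name: str   # hypothetical sample identifier
    sex: str    # assumed to be "M" or "F"
    group: str  # "disease" or "normal"; deliberately ignored by the split


def split_cohort_and_case(
    samples: List[Sample], cohort_size: int = 200, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Return (cohort, case): a half-male, half-female cohort and the rest."""
    if cohort_size % 2 != 0:
        raise ValueError("cohort_size must be even for a 50/50 sex balance")
    half = cohort_size // 2
    males = [s for s in samples if s.sex == "M"]
    females = [s for s in samples if s.sex == "F"]
    if len(males) < half or len(females) < half:
        raise ValueError("not enough samples of each sex for a balanced cohort")

    # Fixed seed so the same cohort is chosen on every run.
    rng = random.Random(seed)
    cohort = rng.sample(males, half) + rng.sample(females, half)

    # Every sample not chosen for the cohort is run in case mode.
    cohort_names = {s.name for s in cohort}
    case = [s for s in samples if s.name not in cohort_names]
    return cohort, case
```

Disease status plays no role in the selection here, matching the note that it doesn't matter which cohort samples are diseased and which are normal.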
Best,
Genevieve
-
Thanks for your response.
I'd just like to clarify three more points.
Q1. Is this what you meant?
1) Run 200 samples in cohort mode to get the model and VCF files
2) Analyze each of the remaining 100 samples (of the 300) in case mode against the cohort model built from the 200 samples.
3) Is it correct to combine the 100 VCF files obtained by running case mode 100 times with the 200 VCF files obtained in 1), and then compare disease and control?
Q2. Suppose I have spare time and resources.
Can I analyze it like this?
1) Run cohort mode with 300 samples and get the VCFs
2) Merge the 300 VCF files and compare disease and control.
Q3. After reading your tutorial, I had thought the approach was this:
build a model by running cohort mode on all 300 samples, then apply that model to each sample (n=300) in case mode.
Now it's clear.
Only samples that were not used in cohort mode should be run in case mode.
Right?
Thank you.
Oh.
-
Hi Oh,
Q1)
- Yes
- This matches what I meant. We have a WDL that takes the full set of 100 samples and scatters the job to make it much easier.
- Yes. We have WDLs that combine the VCFs in a clever way, because it takes more than just merging the files: the breakpoints from the different files won't necessarily match, and the script also annotates with site frequency counts.
Q2)
- Yes, you can use the joint calling WDL for both of these steps.
Q3)
- We wouldn't recommend this method. Although there is no check that stops you from running case mode on samples that are already in the cohort, it would not be a good idea.
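Since nothing enforces this, a small pre-flight check on the sample lists can catch the overlap before any jobs are submitted. The sketch below is only an illustration in plain Python; the function name and the idea of passing in sample-name lists are assumptions for the example, not something provided by GATK.

```python
from typing import Iterable


def check_case_not_in_cohort(
    cohort_names: Iterable[str], case_names: Iterable[str]
) -> None:
    """Raise ValueError if any case-mode sample was also used in the cohort."""
    overlap = sorted(set(cohort_names) & set(case_names))
    if overlap:
        raise ValueError(
            "samples appear in both the cohort and case lists: "
            + ", ".join(overlap)
        )
```

Calling it with the cohort list and the case list before launching case mode raises an error that names the offending samples, and returns silently when the two lists are disjoint.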
Hope this helps!
Genevieve
-
Thanks for your helpful answer. :)
Oh
-
No problem!