BQSR bootstrapping for multiple-sample dataset with no known variants (non-human)
Hello, I am using GATK v4.1.6.0. I have roughly 275 individual samples (plants) of single-end RAD-seq data, and I am a bit confused about the bootstrapping cycle steps involved in BQSR when there is no database of known variants to use.
I have run HaplotypeCaller once, uncalibrated, as recommended. Because I have so many samples, I ran each of them in GVCF mode, then combined the GVCFs and ran GenotypeGVCFs.
My plan was then to filter this VCF for "high confidence SNPs" (using FilterVcf from Picard, unless there is another way I am unaware of).
Should this resulting filtered VCF then be applied to BaseRecalibrator, one sample at a time?
My understanding is that after each sample has been through one round of BaseRecalibrator, ApplyBQSR should be applied to each sample, and HaplotypeCaller should be run on the resulting files. Should CombineGVCFs/GenotypeGVCFs also be done here, or should I be treating each sample completely independently?
I am also unclear on when the AnalyzeCovariates function should be used. It seems from the documentation that at least two recalibration tables are needed (first pass and second pass), so this process would have to be repeated twice before any charts can be compared. To do this, would I then filter the VCF from the second round of HaplotypeCaller, and again apply it to each original BAM file in BaseRecalibrator?
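For reference, the two-table comparison described above could look roughly like the sketch below. It only builds the command line and does not run GATK. The `-before`, `-after` and `-plots` arguments are the ones AnalyzeCovariates takes in GATK4; the file names and the helper name `analyze_covariates_cmd` are made up for illustration.

```python
from typing import List


def analyze_covariates_cmd(before_table: str, after_table: str, plots_pdf: str) -> List[str]:
    """Build an AnalyzeCovariates command comparing two recalibration tables.

    before_table: table from the first BaseRecalibrator pass (hypothetical file name)
    after_table:  table from the second BaseRecalibrator pass (hypothetical file name)
    plots_pdf:    where the comparison plots should be written
    """
    return [
        "gatk", "AnalyzeCovariates",
        "-before", before_table,
        "-after", after_table,
        "-plots", plots_pdf,
    ]


# One comparison per sample, since each sample has its own pair of tables.
cmd = analyze_covariates_cmd(
    "plant001.round1.table",
    "plant001.round2.table",
    "plant001.covariates.pdf",
)
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the file names match your own data.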
To sum up, here are the steps as I understand them, with things unclear to me in parentheses:
1. Run HaplotypeCaller on each BAM file, followed by CombineGVCFs/GenotypeGVCFs (or should I simply output VCFs for each sample to filter, and wait to produce a final GVCF version until the end?)
2. Filter VCF (again, for each sample independently or the VCF resulting from joint genotyping?) using FilterVcf.
3. Apply filtered VCF to BaseRecalibrator for each BAM file, one at a time.
4. ApplyBQSR for each BAM file, one at a time.
5. Run HaplotypeCaller on the resulting BAM files from 4 (and then CombineGVCFs/GenotypeGVCFs?)
6. Filter VCF from step 5.
7. Use step 6 VCF to repeat step 3 on original BAM files.
8. Analyze covariates with the recalibration tables generated in steps 3 and 7 (pairwise for each sample?)
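To make the steps above concrete, here is a minimal sketch of one round (steps 3 to 5) written as command lines. It is not an official GATK procedure, only the plan in this post spelled out. The tool names and flags (`--known-sites`, `--bqsr-recal-file`, `-ERC GVCF`, `-V`) are standard GATK4 arguments; the file naming scheme, the function names, and the choice to always recalibrate the original BAM (as in step 7) are assumptions taken from this post. Nothing is executed; the functions only return argument lists.

```python
from typing import List

Command = List[str]


def per_sample_round(sample: str, ref: str, known_sites: str, round_no: int) -> List[Command]:
    """Commands for steps 3-5 for one sample in one bootstrap round.

    known_sites is the filtered VCF from the previous round (step 2 or step 6).
    """
    original_bam = f"{sample}.bam"  # assumption: always start from the original BAM
    table = f"{sample}.round{round_no}.table"
    recal_bam = f"{sample}.round{round_no}.recal.bam"
    gvcf = f"{sample}.round{round_no}.g.vcf.gz"
    return [
        # Step 3: model base quality errors, masking the high-confidence sites.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", original_bam,
         "--known-sites", known_sites, "-O", table],
        # Step 4: write a recalibrated BAM using that table.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", original_bam,
         "--bqsr-recal-file", table, "-O", recal_bam],
        # Step 5: call variants again, in GVCF mode, on the recalibrated BAM.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
         "-O", gvcf, "-ERC", "GVCF"],
    ]


def joint_genotype(gvcfs: List[str], ref: str, round_no: int) -> List[Command]:
    """Commands to combine per-sample GVCFs and joint-genotype them."""
    combined = f"cohort.round{round_no}.g.vcf.gz"
    combine: Command = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", combined]
    genotype: Command = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined,
                         "-O", f"cohort.round{round_no}.vcf.gz"]
    return [combine, genotype]


# Example with two hypothetical samples and a hypothetical filtered VCF.
samples = ["plant001", "plant002"]
for s in samples:
    for c in per_sample_round(s, "ref.fasta", "cohort.round0.filtered.vcf.gz", 1):
        print(" ".join(c))
for c in joint_genotype([f"{s}.round1.g.vcf.gz" for s in samples], "ref.fasta", 1):
    print(" ".join(c))
```

Written this way, a further round is just another call with `round_no` increased and `known_sites` pointing at the newly filtered cohort VCF.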
After this point, from which step do I loop to actually improve the base quality? Am I always returning to run a new filtered VCF on the original BAM files? That is, repeating steps 3-8?
Am I missing any steps or tools?
Thank you for your help, I am new to using GATK and would very much appreciate clarification on these points!
Thank you,
Haley
-
It's a pity that nobody answered this question. I also want to create a known-sites VCF file and I'm not really sure how to do it. Did you find out, Haley Arnold?
-
Hi all,
Thanks for the follow up question on this post so that we can address it!
We don't have any current BQSR bootstrapping methods or recommendations for when there is no known sites file.
If you don't have a known sites file, you can still use GATK. Just skip the BQSR step and use hard filtering instead of VQSR. Ideally you would use the BQSR and VQSR machine learning steps, but that is not possible without a known sites file.
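As a rough illustration of what hard filtering can look like, the sketch below builds a VariantFiltration command. `--filter-expression` and `--filter-name` are real GATK4 arguments, but the thresholds are only the generic SNP starting points commonly quoted for GATK hard filtering, not values tuned for RAD-seq or for any particular plant dataset, and the file names are placeholders. The command is only assembled, not run.

```python
from typing import Dict, List

# Assumed starting thresholds for SNPs; inspect your own annotation
# distributions before relying on them.
SNP_FILTERS: Dict[str, str] = {
    "QD2": "QD < 2.0",
    "FS60": "FS > 60.0",
    "MQ40": "MQ < 40.0",
    "SOR3": "SOR > 3.0",
    "MQRankSum-12.5": "MQRankSum < -12.5",
    "ReadPosRankSum-8": "ReadPosRankSum < -8.0",
}


def hard_filter_cmd(ref: str, in_vcf: str, out_vcf: str,
                    filters: Dict[str, str] = SNP_FILTERS) -> List[str]:
    """Build a VariantFiltration command with one named filter per expression."""
    cmd = ["gatk", "VariantFiltration", "-R", ref, "-V", in_vcf, "-O", out_vcf]
    for name, expression in filters.items():
        # Each expression is paired with the name written to the FILTER column.
        cmd += ["--filter-expression", expression, "--filter-name", name]
    return cmd


cmd = hard_filter_cmd("ref.fasta", "cohort.snps.vcf.gz", "cohort.snps.filtered.vcf.gz")
print(" ".join(cmd))
```

Note that VariantFiltration only annotates the FILTER column; failing records stay in the output unless they are removed afterwards, for example with SelectVariants and `--exclude-filtered`.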
Hope this helps!
Genevieve
-
Hi,
The GATK support team is focused on resolving questions about GATK tool-specific errors and abnormal results from the tools. For all other questions, such as this one, we are building a backlog to work through when we have the capacity.
Please continue to post your questions because we will be mining them for improvements to documentation, resources, and tools.
We cannot guarantee a reply; however, we ask other community members to help out if they know the answer.
For context, check out our support policy.
-
I would also greatly appreciate a response to this as there doesn't seem to be an answer for it anywhere on the Internet.