Hello, I am using GATK v 220.127.116.11. I have roughly 275 individual samples (plants) in single-end RAD-seq data, and I am a bit confused about the bootstrapping cycle steps involved in BQSR when there is no database of known variants to call.
I have run HaplotypeCaller once on the uncalibrated data, as recommended. Because I have so many samples, I ran each one in GVCF mode, then combined the GVCFs and ran GenotypeGVCFs.
My plan was then to filter this VCF for "high confidence SNPs" (using FilterVcf from Picard, unless there is another way I am unaware of).
Should this resulting filtered VCF then be applied to BaseRecalibrator, one sample at a time?
My understanding is that after each sample has been through one round of BaseRecalibrator, ApplyBQSR should be applied to each sample, and HaplotypeCaller should be run on the resulting files. Should CombineGVCFs/GenotypeGVCFs also be done here, or should I be treating each sample completely independently?
I am also unclear on when AnalyzeCovariates should be used. From the documentation, it seems that at least two recalibration tables are needed (first pass and second pass), which means the process has to be run twice before any charts can be compared. To do this, would I then filter the VCF from the second round of HaplotypeCaller, and again apply this to each original BAM file in BaseRecalibrator?
To sum up, here are the steps as I understand them, with things unclear to me in parentheses:
1. Run HaplotypeCaller on each BAM file, followed by CombineGVCFs/GenotypeGVCFs (or should I simply output VCFs for each sample to filter, and wait to produce a final GVCF version until the end?)
2. Filter VCF (again, for each sample independently or the VCF resulting from joint genotyping?) using FilterVCF.
3. Apply filtered VCF to BaseRecalibrator for each BAM file, one at a time.
4. ApplyBQSR for each BAM file, one at a time.
5. Run HaplotypeCaller on the resulting BAM files from 4 (and then CombineGVCFs/GenotypeGVCFs?)
6. Filter VCF from step 5.
7. Use step 6 VCF to repeat step 3 on original BAM files.
8. Analyze covariates with the recalibration tables generated in steps 3 and 7 (pairwise for each sample?)
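For concreteness, the per-sample part of this list (steps 3 to 5, plus step 8) could be scripted roughly as below. This is only a sketch of the steps as listed, not a confirmed workflow: the file names (`plant001.bam`, `ref.fasta`, `filtered.round1.vcf.gz`) and the helper functions are hypothetical, and the commands are only built as argument lists, not executed. The flags are the standard GATK4 ones (`--known-sites`, `--bqsr-recal-file`, `-ERC GVCF`, `-before`/`-after`/`-plots`).

```python
from typing import List


def bqsr_round_commands(sample: str, ref: str, known_sites: str,
                        round_no: int) -> List[List[str]]:
    """Build (but do not run) the GATK commands for one bootstrap round.

    Assumption: each round recalibrates the ORIGINAL BAM, as in step 7.
    All file names are hypothetical placeholders.
    """
    bam = f"{sample}.bam"                              # original, uncalibrated BAM
    table = f"{sample}.round{round_no}.table"          # recalibration table
    recal_bam = f"{sample}.round{round_no}.recal.bam"  # recalibrated BAM
    gvcf = f"{sample}.round{round_no}.g.vcf.gz"        # per-sample GVCF

    return [
        # Step 3: model base quality errors, masking the filtered "known" sites.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
         "--known-sites", known_sites, "-O", table],
        # Step 4: apply the recalibration table to the original BAM.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
         "--bqsr-recal-file", table, "-O", recal_bam],
        # Step 5: call variants again on the recalibrated BAM, in GVCF mode.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
         "-O", gvcf, "-ERC", "GVCF"],
    ]


def analyze_covariates_command(sample: str, first_round: int = 1,
                               second_round: int = 2) -> List[str]:
    """Step 8: compare the recalibration tables from two rounds for one sample."""
    return [
        "gatk", "AnalyzeCovariates",
        "-before", f"{sample}.round{first_round}.table",
        "-after", f"{sample}.round{second_round}.table",
        "-plots", f"{sample}.covariates.pdf",
    ]


if __name__ == "__main__":
    # Print the commands for one hypothetical sample so they can be inspected.
    for cmd in bqsr_round_commands("plant001", "ref.fasta",
                                   "filtered.round1.vcf.gz", round_no=1):
        print(" ".join(cmd))
    print(" ".join(analyze_covariates_command("plant001")))
```

Building the commands as lists keeps the sketch easy to inspect before anything is run; each list could later be passed to `subprocess.run(cmd, check=True)` inside a loop over all samples. The joint steps (CombineGVCFs/GenotypeGVCFs and filtering) are left out because they act on all samples together rather than one at a time.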
After this point, from which step do I loop to actually improve the base quality? Am I always returning to run a new filtered VCF on the original BAM files, that is, repeating steps 3-8?
Am I missing any steps or tools?
Thank you for your help, I am new to using GATK and would very much appreciate clarification on these points!