Data pre-processing for variant discovery - BaseRecalibrator, vcf input?Answered
How do I obtain vcf file in the BaseRecalibrator stage of the "Data pre-processing for variant discovery" protocol?
Data: Leave plant (non-human) tissue.
I am using GATK to obtain discovery of short variants for my Illumina HiSeq Xten based reads
for LEave plant cells. A reference genome is available.
I am at the "Data pre-processing for variant discovery" protocol.
Currently I have finished the MarkDuplicate stage (for my bwa mapped reads).
The next recommended stage is: BaseRecalibrator
However, BaseRecalibrator requires an input vcf file which I don't have yet at this stage of the "data pre-processing protocol?
After MarkDuplicate followed by sorting, I only have a sorted sam file with flags for duplications.
What am I missing here?
Thank you for your help,
Hi Arik, which files do you have and which are you missing? Please provide the specific command
Thank you for your fast response.
I am missing a vcf file.
These are the main stages and files I already ran:
1. Task: QA
Input: individualA_f.fq.gz , individualA_r.fq.gz derived from illumina Hiseq
Result: 2 html files showing sequences with duplicates and adapters.
2. Task: remove low QA sequences, and adapters.
input: individualA_f.fq.gz , individualA_r.fq.gz
output: individualA_f_paired.fq.gz , individualA_r_paried.fq.gz
individualA_f_unpaired.fq.gz , individualA_r_unparied.fq.gz
---- From here on protocol is based on GATK ---
3. Mapping (based on GATK)
Command: bwa index -a bwtsw Genome.fasta
output: indexing files for genome
Command: bwa mem -M -t 14 PlantGenome.fasta individualA_f_paired.fq , individualA_r_paried.fq > individualA_paired.sam
input: see in command
4. Task: Mark Duplicates
Command: picard.jar SortSam I=individualA_paired.sam O=individualA_paired_sorted.sam SORT_ORDER=coordinate
input: see in command
Command: picard.jar MarkDuplicates I=individualA_paired_sorted.sam O=individualA_paired_markDup.sam ASSUME_SORT_ORDER=coordinate M=individualA_markDuplic_matrix
input: see in command
output: individualA_paired_markDup.sam, individualA_markDuplic_matrix
4c. sort again
Same command as in 4a:
5. The next staage according to https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery is:
Base (Quality Score) Recalibration done with BaseRecalibrator and Apply Recalibration.
However BaseRecalibrator requires this argument:
However, I still don't have a vcf file at this stage of the protocol.
In other words: how do I make the sites_of_variation.vcf required by the --known-sites argument? In which stage it should have been created?
What am I am missing here?
Thank you very much for your hep,
One more question:
Looking at "https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery", it reads:
"Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)"
Does it mean the whole "Base (Quality Score) Recalibration" stage is an optional stage? Or just the AnalyzeCovariates stage?
If the whole stage is optional when does one removes duplicates? in the MarkDuplicates stage?
Hello Arik, the Base Quality Score Recalibration (BQSR) step is recommended to improve results, but not required to run an analysis using GATK. You need at least 100M bases per read group to get high quality results. Here is an overview in our documentation: Base Quality Score Recalibration (BQSR)
The file: --known-sites sites_of_variation.vcf is not created in your workflow, it is a VCF file containing the variants that have been confirmed for the species you are studying. Here is more information about where you can find those files: Where can I find known variants, training and truth sets, and other resource files
Thank you very much for the helpful response.
Glad we were able to help!
I have a similar situation as Arye; I am following the GATK best practices workflow for my non-human organism's data, and I am now in Step Base Quality Score Recalibration (BQSR), and yet the sites_of_variation.vcf file is not available.
I wonder if I should skip the BQSR step entirely? However, in the BaseRecalibrator description page, it says:
"These Read Filters are automatically applied to the data by the Engine before processing by BaseRecalibrator".
If I need to skip the BQSR step entirely, should I still apply these filters to remove duplicated and unmapped reads from my bam files before the variant calling step?
Hi Peiwen Li,
You don't need to apply those read filters without BQSR, many of those filters are applied with other steps in the best practices workflow.
Please sign in to leave a comment.