Discrepancy in Variants called when split by chromosomes and not split by chromosomes
a) GATK version used: 4.6
I have built a pipeline following GATK best practices workflow to call rare germline variants.
I wanted to I wanted to reduce time and so I split the processing at chromosome level - base recalibration(BQSR) and Haplotype Caller for each chromosome followed by concatenation.
When I compared the final concatenated vcf with the one I had processed as a whole, I was able to see that there was a 0.2 ~ 0.5 % difference in the variants called.
Also for the common variants there was a difference in the genotype for a few of them. I have pasted a handful below:
CHROM_vcf1 | POS_vcf1 | REF_vcf1 | ALT_vcf1 | FILTER_vcf1 | GT_vcf1 | CHROM_vcf2 | POS_vcf2 | REF_vcf2 | ALT_vcf2 | FILTER_vcf2 | GT_vcf2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | |
chr8 | 94857071 | CTT | CT | PASS | 0/1 | chr8 | 94857071 | CTT | CT | PASS | 0/2 | |
chr8 | 94857071 | CTT | C | PASS | 0/1 | chr8 | 94857071 | CTT | C | PASS | 0/2 | |
chr8 | 132907228 | G | A | PASS | 0/1 | chr8 | 132907228 | G | A | PASS | 1/1 | |
chrX | 44334764 | G | A | PASS | 0/1 | chrX | 44334764 | G | A | PASS | 1/1 | |
chrX | 153911252 | CTTTT | C | PASS | 0/2 | chrX | 153911252 | CTTTT | C | PASS | 0/1 | |
chrX | 153911252 | CTTTT | CT | PASS | 0/2 | chrX | 153911252 | CTTTT | CT | PASS | 0/1 |
Is this discrepancy expected?
Does base recalibration and/or Haplotype Caller depend on bases across contigs?
Edit: When I did base recalibration without splitting the chromosomes, but split only at haplotype caller stage, the vcf files are a 100% match. So I can see that base recalibration is not specific to contigs and uses data across chromosomes. Is this right?
-
Hi Ramesh
You can run BaseRecalibrator on individual chromosomes and combine all recalibration tables to generate a single one for best recalibration results. Differences in BQSR may result in changes in basecall values therefore variant calls may show differences among each run.
I hope this helps.
-
Hi Gökalp Çelik,
Thank you for reaching back.
So essentially, I can run
1. gatk BaseRecalibrator - for individual chromosomes
2. combine the recal tables into one
3. Run ApplyBQSR for individual chromosomes using same recal table?
I will test the combining of recal table and get back to you.
Thanks again!
-
Hi again.
Answer is yes. Your results should be equivalent once you have the same recalibration table for all contigs.
Regards.
-
Thank you so much Gökalp Çelik!
I used GatherBQSRReports for combining and the final result is exactly the same now.
Please sign in to leave a comment.
4 comments