output BAM form ApplyBQSR query using --interval option
Hi,
I am dealing with exome germline variant calling. As per https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists, we should always use --intervals (and --interval-padding) for limiting computational time where you can give your exon coordinates bed file as an input. I have used these parameters in BaseRecalibrator program. I have also used the same parameters in ApplyBQSR (--intervals and --interval-padding).
The resulted output BAM file from ApplyBQSR had lesser total reads from the original BAM input file used in BaseRecalibrator.For example, for one sample, total reads in the original input BAM file was 138,107,524. The ouput BAM file after running ApplyBQSR had 93,217,744 total reads. So, there was deficit of 44,889,780 reads, which is a huge number (almost 33% of reads lost).
Does ApplyBQSR remove reads outside the interval regions? Is it an expected behavior? Should we or should we not use the interval parameters in ApplyBQSR for exome variant discovert workflow?
I am asking this because when I didn't use the interval parameters in ApplyBQSR, the resulting BAM file total read numbers where exactly the same as the original input BAM file used in BaseRecalibrator.
Tool versions used: gatk-4.1.9.0, samtools flagstats v1.10 for getting BAM file stats
Commands used:
java -jar $TOOLKIT_PATH/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar BaseRecalibrator -I $BAM_PATH/"$sample"_sorted_markduped_readgroup.bam -R $REFERENCE_GENOME \
--known-sites $GOLD_VCF_PATH/1000G_phase3_biallelic_SNVs_indels.GRCh38_updated_chr.vcf.gz \
--known-sites $GOLD_VCF_PATH/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites $GOLD_VCF_PATH/GCF_000001405.38_dbsnp-154_updated_chr.vcf.gz \
--intervals $INTERVAL_PATH/hg38_exome.bed \
--interval-padding 20 \
-O $BAM_PATH/"$sample"_recal1_data.table
java -jar $TOOLKIT_PATH/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar ApplyBQSR -I $BAM_PATH/"$sample"_sorted_markduped_readgroup.bam -R $REFERENCE_GENOME \
--bqsr-recal-file $BAM_PATH/"$sample"_recal1_data.table \
--create-output-bam-index true \
--intervals $INTERVAL_PATH/hg38_exome.bed \
--interval-padding 20 \
-O $BAM_PATH/"$sample"_sorted_markduped_readgroup_base_recal.bam
Regards,
Prasun
-
Hi,
I think this is what happened. After visualising a very small exonic portion from hg38, I observed that ApplyBQSR removed reads outside the exonic region. Please see the attached picture. However, I would still like to know if we should be using --interval and --interval-padding parameters in ApplyBQSR. Or, we don't use it in ApplyBQSR and directly use the same options in Haplotypecaller. I would just like to know your view on it from a best practice point of view.Regards,
Prasun -
Hi prasundutta87,
You should not use intervals with ApplyBQSR because you want to recalibrate all of the reads. You can introduce artifacts if you run with intervals in ApplyBQSR. There is an in depth discussion on the forum at this link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360074603551-ValidateSamFile-New-error-after-running-BQSR
In the ApplyBQSR docs, the -L argument is not a recommended parameter.
Best,
Genevieve
-
Thanks a lot for this, Genevieve Brandt (she/her)! It was really helpful.
Please sign in to leave a comment.
3 comments