ValidateSamFile: New error after running BQSR
Dear GATK Team,
I am following the guidelines in the following documentation for data preprocessing: Data pre-processing for variant discovery and (How to) Map and clean up short read sequence data efficiently.
Picard version: 2.23.8. GATK version: 4.1.9.0.
When processing the output BAM after running ApplyBQSR through ValidateSamFile, I am receiving the following error:
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_NOT_FOUND 64152
However, I confirm the BAM showed no errors with ValidateSamFile prior to running BaseRecalibrator and ApplyBQSR and both BaseRecalibrator and ApplyBQSR ran successfully. FixMateInformation does not resolve this issue.
Why do I see a new error in the BAM after running BQSR? Could it be due to the default read filters differing between BaseRecalibrator and ApplyBQSR or alternatively, another read filter being required?
Thank you for your time and help.
Kind regards.
-
Hello ISmolicz, did you run ValidateSamFile using MODE=VERBOSE? The mate information may be appearing as warnings, which would not show up as an error. Follow this tutorial to get more information.
-
Thank you for your reply Genevieve Brandt.
Yes, I did run with MODE=VERBOSE and the issue appeared as an error. A general example:
ERROR::MATE_NOT_FOUND:Read name ..., Mate not found for paired read
Would you have any thoughts as to why this error is suddenly appearing post-BQSR?
-
Did you not get this error message with ValidateSamFile before BQSR? Could you also provide your BQSR commands?
-
No, there were no errors with ValidateSamFile before BQSR.
My BQSR commands are the following:
gatk BaseRecalibrator \
--input input.bam \
-R ucsc.hg19.fasta \
-L intervals.bed \
--interval-padding 100 \
--known-sites dbsnp_138.hg19.vcf \
--known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
--known-sites 1000G_phase1.indels.hg19.sites.vcf \
--output sample.recal_data.grp \
--tmp-dir $TMPDIRgatk ApplyBQSR \
--input input.bam \
-R ucsc.hg19.fasta \
-L intervals.bed \
--interval-padding 100 \
--bqsr-recal-file sample.recal_data.grp \
--emit-original-quals true \
--output output.bam \
--tmp-dir $TMPDIRThe reference and --known-sites resources are from the hg19 resource bundle provided by GATK.
Thank you again.
-
ISmolicz could you please also confirm if there were any warnings with ValidateSamFile in MODE=VERBOSE?
-
There are no warnings Genevieve Brandt. Only the error mentioned in my earlier post is shown repeatedly.
-
Hi ISmolicz, thank you for clarifying. I brought this up with the GATK team and we think that the issue is using intervals with BQSR. Reads are only kept if they overlap the selected interval and will result in mates not being carried over (if they are outside of the interval or unmapped). Especially -L should not be used with ApplyBQSR because it needs to happen on everything, even the unmapped reads.
So, unless you are using -L to subset the bam for job scattering purposes, I would recommend not using -L at this point in your analysis in order to keep mates together.
-
Thank you for the explanation Genevieve Brandt and for discussing with the GATK team. I will re-run BQSR without -L and check the output with ValidateSamFile again.
Please may I ask whether it is recommended to exclude -L from BaseRecalibrator as well as ApplyBQSR?
In addition, are there any other notable GATK tools or processes that -L should not be used with e.g. process for creating panel of normals?
Thank you again.
-
ISmolicz you should not use -L while running the BQSR steps.
In terms of -L in other tools, we do not have any documentation regarding specific rules. Here is our intervals documentation providing some use cases where intervals would be helpful.
-
Thank you for your reply Genevieve Brandt. To confirm, no errors were found when running the output BAM through ValidateSamFile after running BQSR without intervals specified.
I would just like to highlight a statement read in the following GATK documentation: When should I restrict my analysis to specific intervals?
In addition, there are some processing steps, such as BQSR, that should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and is a source of noise.
I understand recommendations change but if -L is not being applied, what effect could off-target sequencing data have on BQSR?
Thank you again.
-
Hi ISmolicz the statement you highlighted from our intervals documentation is regarding exomes, are you analyzing exomes or doing targeted sequencing?
-
I am doing targeted sequencing Genevieve Brandt.
-
Thank you for that info, I'll look into this further and let you know what I find.
-
Hi ISmolicz, I re-read the intervals document with my team and we realized it was not completely clear so we are going to make a note to edit it. Thank you for pointing that out!
In general, using intervals (-L) can introduce artifacts if you choose the intervals unwisely. Reads will get discarded that are outside the interval when the reads may have been used in making variant calls (for example, lost mates). When we use intervals in our production pipelines with targeted sequencing, we make sure to give padding around the targeted sites, 100 bp on each side. There are many GATK tools where you have to be careful about interval usage because it can change the result, depending on how you use intervals.
With BQSR, intervals can be used while running BaseRecalibrator but not ApplyBQSR. If you use intervals, you will need to add an extra step after BaseRecalibrator to combine all the recalibration tables. ApplyBQSR is meant to be run on all the reads that contribute to the model so it cannot be run using intervals.
When we run BQSR in our production pipelines, we use intervals to parallelize the analysis, not for subsetting. Any subsetting of your file can introduce artifacts (like the MATE_NOT_FOUND error you are getting).
I hope this provides some clarity on how to use intervals with BQSR. Please let me know if any of it does not make sense.
-
Thank you for looking into this topic further Genevieve Brandt. I confirm that I also give padding of 100bp when using intervals when analysing targeted sequencing data.
Please could you help with clarifying a couple of points following your reply:
1. I am applying BQSR per-sample as recommended in the Data pre-processing for variant discovery workflow; per-read group BAMs for the same sample were merged in the mark duplicates step.
If it were to run BaseRecalibrator once per sample with one interval list, would I need to combine all recalibration tables from different samples or is this required only when intervals are being used for parallelisation (e.g. running BQSR separately on different chromosomes for the same sample)?
2. If the purpose of intervals is for subsetting rather than parallelisation (e.g. by providing one list of intervals covered by a capture panel with targeted sequencing), is it recommended? If yes, how can artefacts be accounted for downstream?
Thank you again!
-
Hi ISmolicz,
1. BQSR should be run per sample which means in this case, do not combine recalibration tables from different samples.
2. Run BQSR on all regions with data. You want to include all of your data in BQSR.
-
Thank you Genevieve Brandt
Unfortunately, I am still not fully clear on the recommendation. I now understand intervals can be used with BaseRecalibrator only and this can eliminate off-target sequencing data. However, it can introduce artefacts. It seems that one must balance off-target sequencing data vs. artefacts e.g. MATE_NOT_FOUND.
- Is off-target sequencing data or artefact considered to be worse and can either ultimately be eliminated?
- From my interpretation of point 2 in your most recent post above, if all data is to be included, then no intervals should be supplied for the purpose of subsetting in either BaseRecalibrator or ApplyBQSR in the context of targeted sequencing. Is this correct?
I also see that an interval list is not supplied in the BaseRecalibrator example in the GATK documentation.
On a related note, you mention that one has “to be careful about interval usage because it can change the result”. As a suggestion, maybe there could be further clarification in GATK documentation in the future on when intervals should be used for parallelisation vs. subsetting with key tools in Best Practices pipelines?
Thank you for all your time and help!
-
Hi ISmolicz, I am not able to provide more insight to these questions because there is no general answer for what is best in every case. We provide general recommendations on our site and try to get more specific when possible, but we do not specialize in every type of data or use case.
Your best bet for more information is to see what we recommend but also look at other researchers who are doing similar analysis to you and decide on a method that works for your research. You can also interact with community members on our forum or other forums.
I will put a note in for our documentation team to look further into interval usage and see if we can provide more specifics into what we recommend.
-
Ok Genevieve Brandt, I understand. Thank you for all your time and help with my questions.
Please sign in to leave a comment.
19 comments