The problem
You get an error like this:
SAM/BAM/CRAM file <filename> appears to be using the wrong encoding for quality scores
Why this happens
According to the SAM specification, quality scores are encoded with an offset of 33, so that Q0 == ASCII 33. However, in some datasets (including older Illumina data), the encoding starts at ASCII 64 instead. This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded on a different scale, every score will be read as 31 points higher than it really is (64 - 33 = 31), our tools will overestimate the quality of your data, and your analysis results will be off.
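To make the offset problem concrete, here is a minimal Python sketch (not GATK code). The function name `decode_quals` and the example quality string are made up for illustration; the only facts taken from the text above are the two offsets, 33 and 64.

```python
def decode_quals(qual_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]

# Hypothetical quality string, as it might appear in older Illumina data.
quals = "hhhgfB"

correct = decode_quals(quals, offset=64)  # scores as the sequencer intended
misread = decode_quals(quals, offset=33)  # scores as a standard reader sees them

print(correct)  # [40, 40, 40, 39, 38, 2]
print(misread)  # [71, 71, 71, 70, 69, 33]
```

Every misread score is exactly 31 higher than the intended one, which is why the values look implausibly high to a tool that expects the standard encoding.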
To prevent this from happening, the GATK engine performs a sanity check on the quality score encodings. If they are not standard, it aborts the run and outputs the error message shown above.
Solution
If this happens to you, you'll need to run again with the flag [--fix_misencoded_quality_scores / -fixMisencodedQuals]. What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
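The correction itself is simple arithmetic. The sketch below is an illustration in Python, not the GATK implementation; `fix_misencoded` is a hypothetical name, and the check for negative results is an assumption added here to show why applying the fix to data that is already correctly encoded would be a mistake.

```python
OFFSET_SHIFT = 64 - 33  # 31, the difference between the two encodings

def fix_misencoded(scores):
    """Subtract 31 from scores that were read with the wrong offset."""
    fixed = [q - OFFSET_SHIFT for q in scores]
    # Assumption for this sketch: a negative result means the input
    # was not actually misencoded, so refuse to continue.
    if any(q < 0 for q in fixed):
        raise ValueError("negative score after correction; data may already be standard")
    return fixed

print(fix_misencoded([71, 70, 33]))  # [40, 39, 2]
```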
Note that the argument names in this article have not yet been updated for GATK4. Let us know if you run into problems and we'll fix them.
Related problems
In some cases the data contains a mix of encodings (which is likely to arise if you're passing in a lot of different files from different sources together), and the GATK can't automatically compensate for that. There is an argument you can use to override this check: [-allowPotentiallyMisencodedQuals / --allow_potentially_misencoded_quality_scores]; but you use it at your own risk. We strongly encourage you to check the encodings of your files rather than use this option.
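One rough way to check the encoding of a file is to look at the range of ASCII codes in its quality strings. The Python sketch below is a heuristic under stated assumptions, not a GATK feature: `guess_offset` is a hypothetical helper, the cutoff of 59 assumes that offset-64 data never uses characters below ';', and a file containing only high-quality standard scores could still be mistaken for offset 64.

```python
def guess_offset(qual_strings):
    """Guess the quality offset (33 or 64) from a collection of quality strings.

    Returns None when there is no data or the range is ambiguous.
    """
    codes = [ord(ch) for q in qual_strings for ch in q]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        return 33  # such low characters are not expected in offset-64 data
    if lowest >= 64:
        return 64  # likely offset 64, but not certain
    return None    # values between 59 and 63: cannot tell from range alone

print(guess_offset(["II5#"]))  # 33
print(guess_offset(["hhgB"]))  # 64
```

Running a check like this on each input file separately, rather than on the combined set, makes it easier to spot which source has the non-standard encoding.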
1 comment
Hi,
I use a program called MTBseq to analyse sequences of mycobacteria. The error I got was this:
ERROR MESSAGE: SAM/BAM/CRAM file htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter@e584b2b appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 66. Please see https://software.broadinstitute.org/gatk/documentation/article?id=6470 for more details and options related to this error.
After looking above, I assume I add those commands to my script as follows? I'm not command-line savvy and very much a novice. I run MTBseq on a cluster and use slurm sbatch to run it. Can I just add the flags above to my script as shown below?
Cheers,
Peter
#!/bin/sh
# SLURM Commands
#SBATCH --partition=ProdQ
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --job-name=C10
#SBATCH --account=ndlif075c
#SBATCH --output=TBfull_Log.txt
##SBATCH --mail-user=xxxxxxxxxxxxxxxx
##SBATCH --mail-type=BEGIN,END
cd $SLURM_SUBMIT_DIR
# load the environment module
module load conda/2
# load the conda environment
source activate MTBseq
# BASH Commands
MTBseq --step TBfull --distance 5 --fix_misencoded_quality_scores -fixMisencodedQuals --threads 8