GATK22.214.171.124 HaplotypeCaller ERROR
Can you please provide
a) GATK version used :126.96.36.199
b) Exact GATK commands used : HaplotypeCaller
c) The entire error log if applicable. : java.lang.IllegalArgumentException: Unexpected base in allele bases 'ATTTTTCTGAATCCCTTTCAAATCAGGACAAGAACTAGAAATGTCTATACAGGTTTAATATGAAGTAAAGAAAATGTTTTTCATTTTCTTGATTTATTTCTGAATTCAGCTTGCTCTTCATTAGCGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAGGGTGCATATTTATTCACTAACTATGTTACAATCATGTGATCTGCTGGATTTTTTCTGATAGTCTACTCTAGATTTGTTCTAAATTAATAAA'
I apologize for the possibly stupid question. I hit the error "java.lang.IllegalArgumentException: Unexpected base in allele bases '...TGMCT...'" when using HaplotypeCaller to identify SNPs and InDels. Apparently there is an 'M' in the BAM file because of degenerate bases. How can I deal with this region, or skip it, so that the calling process can continue?
I also get the same error with the latest version.
If that is the only area with an M, you can use the option -XL to exclude the interval from processing. However, if that is not the only issue, you can check out this documentation to diagnose the problem: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile, then fix your file.
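To judge whether that really is the only affected area, one option is to scan the alignments for non-ACGT/N bases. Below is a minimal Python sketch, assuming the ambiguity codes sit in the read sequences (SAM column 10) rather than in the reference, and that the input is SAM text such as the output of `samtools view`. The function name `find_ambiguous_reads` is hypothetical, and the reported end coordinate is only approximate because it ignores the CIGAR string.

```python
import re

# Anything other than A, C, G, T, N counts as ambiguous here.
# '=' and '.' are also legal in the SAM SEQ field, so they are not flagged.
AMBIGUOUS = re.compile(r"[^ACGTNacgtn=.]")


def find_ambiguous_reads(sam_lines):
    """Yield (contig, start, approx_end, read_name) for reads with IUPAC codes in SEQ."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        name, contig, pos, seq = fields[0], fields[2], int(fields[3]), fields[9]
        if contig == "*" or seq == "*":  # unmapped read or sequence not stored
            continue
        if AMBIGUOUS.search(seq):
            # Approximate span: start plus read length, ignoring the CIGAR.
            yield contig, pos, pos + len(seq) - 1, name
```

Feeding it the lines of `samtools view input.bam` gives a list of candidate regions. If they all fall in one place, an interval of the form `contig:start-end` can be passed to `-XL`.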
Thanks for your suggestion. I followed the instructions in this documentation: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile.
Unfortunately, it returns "No errors found".
I wonder whether it is because I used Novoalign to do the alignment. Have you seen the same error before?
Hi WenyaWang, glad it found no other errors. You can look for non-GATK solutions to remove the M, or use the -XL option in HaplotypeCaller to exclude the region like I wrote above. Here is the documentation link.
Unfortunately we only provide solutions for GATK issues. But if someone in the community has seen this issue with Novoalign, please let us know!
Here is the reply from NovoAlign:
"Hi Wenya. Thanks for your email. Yes it does appear that GATK does not like the ambiguous IUPAC base 'M' that exists and it is a known issue that HaplotypeCaller does not support IUPAC codes. In our recommended workflow with Novoalign you could build an IUPAC reference novoindex and align your reads to that. However when you do variant calling or GATK you would use a regular reference FASTA (not an IUPAC one). If you are still seeing this error after following the workflow above you may need to manipulate your BAM file before running GATK to replace the M's with an A or C in the SAM SEQ column, and then run GATK to see if that works. Also don't forget the other IUPAC codes should also be replaced. Picard validation probably does not check the actual sequence in SAM format for IUPAC codes which is why you're seeing no validation errors."
So the problem is caused by the reference genome.
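For anyone who wants to try the SEQ-column replacement described in that reply, here is a rough Python sketch. It assumes SAM text input (for example from `samtools view -h`). The function name `replace_iupac` and the choice of which concrete base stands in for each code are my own arbitrary assumptions, not something NovoAlign or GATK prescribes. Any MD or NM tags on the reads are left untouched, so they may no longer match the edited sequence.

```python
# One concrete base per IUPAC ambiguity code. The pick is arbitrary: each code
# is mapped to one of the bases it can stand for (e.g. M = A or C, so M -> A).
IUPAC_TO_BASE = {
    "R": "A", "Y": "C", "S": "C", "W": "A", "K": "G", "M": "A",
    "B": "C", "D": "A", "H": "A", "V": "A",
}

# Translation table covering both upper- and lower-case codes.
TABLE = str.maketrans({
    **IUPAC_TO_BASE,
    **{code.lower(): base.lower() for code, base in IUPAC_TO_BASE.items()},
})


def replace_iupac(sam_line):
    """Return the SAM line with IUPAC codes in the SEQ column replaced."""
    if sam_line.startswith("@"):  # header lines pass through unchanged
        return sam_line
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:  # not a complete alignment record
        return sam_line
    fields[9] = fields[9].translate(TABLE)  # column 10 is SEQ
    return "\t".join(fields) + newline
```

Applying `replace_iupac` to every line of the SAM stream and converting the result back to BAM gives a file to retry with GATK. This only helps if the ambiguity codes are in the reads; if they come from the reference FASTA, switching to a regular reference for variant calling is the fix suggested above.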
Thank you for updating this thread, WenyaWang, it will definitely be useful to other users!
I ran into the same problem and developed a quick fix for it using Genozip, in case it is useful for anyone else: https://genozip.readthedocs.io/gatk-unexpected-base.html
Divon Lan, thank you for providing this resource!