GenotypeGVCFs after GenomicsDBImport reports reference file errors on specific intervals where a chr name changes
I am using gatk/4.4.0.0 with the following script:
# script heavily influenced by Josie Paris's github https://github.com/josieparis/gatk-snp-calling
module load StdEnv/2023 gatk/4.4.0.0
module load gcc/12.3 bcftools/1.19
set -e
### Script to combine gVCFs and then call genotypes across all samples
output_vcfs="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/vcfs/intervalVCFs_splitBed"
reference="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/refs/Dfir_twistTargets.fasta"
dict_file="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/refs/Dfir_twistTargets.dict"
interval_dir="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/refs" # formerly bed_files
gvcfs="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/gvcfs"
# Ensure output directory exists
mkdir -p $output_vcfs
# Make sure the reference exists
if [ ! -f "$reference" ]; then
echo "Error: Reference genome $reference not found!"
exit 1
fi
# Dataset name
DATASET=Dfir
# Fetch the intervals of interest for the current array task
bedfile="interval_batch_${SLURM_ARRAY_TASK_ID}.bed"
bed="$interval_dir/$bedfile"
if [ ! -f "$bed" ]; then
echo "Error: Interval BED file $bed not found!"
exit 1
fi
# Ensure GenomicsDB output directory exists
interval_dbs="/home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/vcfs/interval_dbs_splitBed"
mkdir -p $interval_dbs
# Remove previous database if exists
# rm -rf "$interval_dbs/${bedfile}_db"
# Gather gVCF input files
gvcf_in=($gvcfs/*.edit.g.vcf.gz)
if [ ${#gvcf_in[@]} -eq 0 ]; then
echo "Error: No input gVCF files found!"
exit 1
fi
cmd=""
for file in "${gvcf_in[@]}"; do
cmd+="--variant $file "
done
# Debugging info
echo "Running on intervals: $bedfile"
echo "Using reference: $reference"
echo "Intervals DB Path: $interval_dbs/${bedfile}_db"
# Make database - finished
# gatk --java-options "-Xmx20g -Xms12g" GenomicsDBImport \
# $cmd \
# --genomicsdb-workspace-path "$interval_dbs/${bedfile}_db" \
# --reader-threads 2 \
# -L "$bed"
# Genotype
#gatk --java-options "-Xmx20g -Xms12g" GenotypeGVCFs \
# -R $reference \
# -V gendb://$interval_dbs/${bedfile}_db \
# -O "$output_vcfs/${bedfile}.vcf.gz"
# testing genotyping for intervals 42 & 43 with jcf regions
gatk --java-options "-Xmx32g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenotypeGVCFs \
-R $reference \
-V gendb:///home/an-rpeery/scratch/Dfir_xpgwas/SNP_calling/vcfs/interval_dbs_splitBed/interval_batch_43.bed_db \
-O "$output_vcfs/${bedfile}.vcf.gz"
-L "$bed"
# https://github.com/broadinstitute/gatk/issues/8415
My slurm error report from GenomicsDB finishing until the failure for interval 42:
***********************************************************************
A USER ERROR has occurred: Bad input: We encountered a non-standard non-IUPAC base in the provided input sequence: '50'
***********************************************************************
My full error report from GenomicsDB finishing until the failure for interval 43:
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.001866803,Cpu time(s),0.001846652
[March 20, 2025 at 7:00:52 p.m. UTC] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=1157627904
***********************************************************************
A USER ERROR has occurred: Bad input: We encountered a non-standard non-IUPAC base in the provided input sequence: '10'
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$BadInput: Bad input: We encountered a non-standard non-IUPAC base in the provided input sequence: '10'
at org.broadinstitute.hellbender.utils.BaseUtils.convertIUPACtoN(BaseUtils.java:133)
at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.getSubsequenceAt(CachingIndexedFastaSequenceFile.java:344)
at org.broadinstitute.hellbender.engine.ReferenceFileSource.queryAndPrefetch(ReferenceFileSource.java:78)
at org.broadinstitute.hellbender.engine.ReferenceDataSource.queryAndPrefetch(ReferenceDataSource.java:64)
at org.broadinstitute.hellbender.engine.ReferenceContext.getBases(ReferenceContext.java:201)
at org.broadinstitute.hellbender.engine.ReferenceContext.getBase(ReferenceContext.java:411)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:136)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:316)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:601)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
I keep getting format errors and I have tried all the tricks posted in these forums for finding and fixing "-" in the reference (does not occur) and changing line endings. Why would the first 41 intervals work, interval 42 works for fasta files named with this pattern: PSME\d+ but fails at this pattern jcf\d+_\d+_\d+. The jcf pattern is the only pattern in interval 43 which fails entirely.
-
Hi rpeery
This is definitely a reference issue which needs to be fixed on your end.
Also the problem is not quite clear to us. What do you mean by Fasta files named with patterns? Are you using only a single reference file or do you have multiple references?
Regards.
-
Dear Gökalp,
Thank you for your reply. I understand that the error is pointing to the reference. However, I am at a loss as to what the issue may be since I have replaced all line endings (and used dos2unix) and checked them with cat -v as well as other tools and looked for any errant "-" symbols with grep. They do not exist. I also do not have any non-IUPAC codes. I've remade the reference files after searching as well with CreateSequenceDictionary and samtools faidx commands. The gvcf files are fine - why does it break at the genotypeGVCF stage?
My reference is a multifasta file of 4223 regions from exome capture. I am then using the L flag to run 100 regions at a time to speed up processing. This generates 43 intervals to process. The 42nd interval contains a naming convention change following the pattern that I described in my initial post. When that changes the script fails, but the first 95% of the interval (#42) is processed correctly as well as intervals 1-41, and the last interval (#43) is not processed at all. This seems to point to a specific location within the reference to investigate. Therefore, I pulled these last 30 regions as a test and run that separately, meaning I made a fasta file, dict file, fai file, and bed file for the interval and that runs through genotypeGVCFs fine. So, it seems to be something about switching naming conventions in the multifasta file (???), which is very odd. I did not change anything about the formatting of the test files except that they only contain the 30 regions that are named differently using underscores.
The easiest solution is probably to merge this test run with the previous intervals in the gather stage. I guess I'm looking for additional advice beyond what has been previously posted for looking at a reference to see where the error might be popping up and as a new user, I'd like to understand why it only occurs at this stage and not in generating the gvcf files.
This is what one of the interval file looks like - again working fine for intervals 1-41 and 95% of interval 42:
$ cat -vT ../refs/interval_batch_43.bed | head
jcf7190000012653_16019_16332^I0^I313
jcf7190000017161_425650_434705^I0^I9055
jcf7190000020666_546893_547206^I0^I313
jcf7190000022858_2311_2748^I0^I437
jcf7190000025510_493657_518980^I0^I25323
jcf7190000031468_367518_368362^I0^I844
jcf7190000033566_323114_323244^I0^I130
jcf7190000037928_60698_61011^I0^I313
jcf7190000043848_135677_136666^I0^I989
jcf7190000044955_68566_69128^I0^I562If I look for the offending "-" in the reference (error for interval 42) with this code and many other permutations of it including counting characters I do not see any issues:
$ grep -v '>' ../refs/Dfir_twistTargets.fasta | tr -d "\n\r ATGCNatgcn"
If I look at line endings (error for interval 43) everything looks fine:
$ file Dfir_twistTargets.fasta
Dfir_twistTargets.fasta: ASCII text, with very long lines (2952)$ cat -e Dfir_twistTargets.fasta | grep -A 1 jcf
All the line endings are "$" as expected.
Any suggestions and clarifications welcome!
-
Hi again.
Is it possible for you to post your fasta file and interval files in a single archive file using our issue submission ftp? We might want to take a look at that in detail.
https://gatk.broadinstitute.org/hc/en-us/articles/360035889671-How-do-I-submit-a-detailed-bug-report
Regards.
-
Sorry for the delay - firewalls am I right?
The files are uploaded as nameChangeUseCase.tar.gz. Thank you! -
Hi rpeery
Looks like it was a simple fasta index issue. Once you modified your fasta file you should reindex and create new dictionary using samtools faidx/gatk IndexFeatureFile and samtools dict/gatk CreateSequenceDictionary tools. I was able to run your commands with a fresh dictionary and index without any issues.
Regards.
-
Thank you for your time looking into this. I had remade the dict and fai files several times, but the 7th time was the magic one (of course after reaching out). I am still curious why this flags at this stage (GenotypeGVCFs) and not earlier if there is an error in formatting for the reference files. Are the dict and fai files not accessed in the HaplotypeCaller stage?
-
Hi again.
I would certainly check for VCFs for validation as there may be unforeseen issues hanging within. Our tools depend on index and dictionaries however the way they consume them may show differences.
I hope this helps.
Regards.
Please sign in to leave a comment.
7 comments