Issue with GenotypegVCFS after Genomics DB import, building vs updating
We built a pipeline for joint calling of WGS samples and were successfully able to run it for chr 13-22. I’ve been having issues with the Genomics DB import and the genotype gvcfs step for chr 12 and will for the rest of the chromosomes due to the size of the gVCFs .
For chr 13 and above what we did was build a database of all gVCFs by importing at once as there was enough space on the physical node that the job was sent to. We transferred to the node because there are a significant amount of I/O operations thats too intensive for /projects. For chr 12 the data exceeds the physical storage limit of the node, so instead of importing all gVCFs at once, I imported (--genomicsdb-workspace-path) a subset and updated (--genomicsdb-update-workspace-path) the databases in groups in order to manage how many gVCFs were physically stored on the node at once. The genomics db import step worked fine. The issue happened for the next step, genotype gVCFs, where this error came up:
“A USER ERROR has occurred: Bad input: Presence of '-RAW_MQ' annotation is detected. This GATK version expects key RAW_MQandDP with a tuple of sum of squared MQ values and total reads over variant genotypes as the value. This could indicate that the provided input was produced with an older version of GATK. Use the argument '--allow-old-rms-mapping-quality-annotation-data' to override and attempt the deprecated MQ calculation. There may be differences in how newer GATK versions calculate DP and MQ that may result in worse MQ results. Use at your own risk.”
This is true, some gvcfs were created using older versions of GATK, ex: 1000 Genomes. We’ve been using GATK 4.4.0 for the joint calling pipeline. But if this was really a problem, then I would have had issues for chr 13-22 which worked fine for importing all at once (--genomicsdb-workspace-path). I looked around at other related posts on the forum but none quite helped in this situation. As a test, I ran the build and update script on chr 22 and it failed. Building a database for chr 22 worked fine, there wasnt an issue until I tried to update the database. So theres definitely an issue with build and updating databases in regards for genotype gvcfs.
Any ideas as to why this would happen only for updating databases?
b) Exact command used for chr 12:
1) Build and update for genomics DB:
time apptainer exec \
--bind ${TEMP_DIR}:/mnt ${TEMP_DIR}/gatk.sif \
gatk --java-options "-Xmx110g -XX:ParallelGCThreads=64" GenomicsDBImport \
--sample-name-map /mnt/freeze2_sample_map_chr${chr}_study1_1407IDs_scratch.txt \
--genomicsdb-vcf-buffer-size 16384000 \
--genomicsdb-shared-posixfs-optimizations true \
--bypass-feature-reader \
--batch-size 50 \
--reader-threads 5 \
--genomicsdb-workspace-path /mnt/output/GenomicsDBImport/study_samples_${interval} \
--tmp-dir /mnt/output/GenomicsDBImport/freeze2/ -L ${interval}
echo $?
time apptainer exec \
--bind ${TEMP_DIR}:/mnt ${TEMP_DIR}/gatk.sif \
gatk --java-options "-Xmx110g -XX:ParallelGCThreads=64" GenomicsDBImport \
--sample-name-map /mnt/freeze2_sample_map_chr${chr}_study2_909IDs_scratch.txt \
--genomicsdb-vcf-buffer-size 16384000 \
--genomicsdb-shared-posixfs-optimizations true \
--bypass-feature-reader \
--batch-size 50 \
--reader-threads 5 \
--genomicsdb-update-workspace-path /mnt/output/GenomicsDBImport/study_samples_${interval} \
--tmp-dir /mnt/output/GenomicsDBImport/freeze2/ -L ${interval}
echo $?
time apptainer exec \
--env TILEDB_DISABLE_FILE_LOCKING=1\
--bind ${TEMP_DIR}:/mnt ${TEMP_DIR}/gatk.sif\
gatk--java-options "-Xmx1800g "GenotypeGVCFs\
-R /mnt/hs38DH.fa\
-V gendb:///mnt/study_samples_${interval} \
--max-alternate-alleles 40\
--max-genotype-count 2048\
-O /mnt/output/GenotypegVCFs/chr${chr}/freeze2_${interval}.vcf.gz\
--tmp-dir /mnt/output/GenotypegVCFs/temp_step3/
echo $?
Thank you!
-
Can you share your error messages from your logs so we can try to pinpoint the exact issue?
Regards.
-
Hi @Gökalp Çelik ,
Thank you, here was the error message as mentioned earlier in the post:
"A USER ERROR has occurred: Bad input: Presence of '-RAW_MQ' annotation is detected. This GATK version expects key RAW_MQandDP with a tuple of sum of squared MQ values and total reads over variant genotypes as the value. This could indicate that the provided input was produced with an older version of GATK. Use the argument '--allow-old-rms-mapping-quality-annotation-data' to override and attempt the deprecated MQ calculation. There may be differences in how newer GATK versions calculate DP and MQ that may result in worse MQ results. Use at your own risk."
-
Do all each of your samples have all variants within their respective VCF or are they separated per chromosome?
Can you check your VCF headers using
bcftools view -h
to find out which VCFs have the new RAW_MQandDP field and which ones have split RAW_MQ, DP fields?
GATK tools starting version 4.1 and on uses RAW_MQandDP INFO field. If at least one of your sample is called using a different version using either of these fields then GenotypeGVCFs will throw that "User Error" message and the only ways to overcome this issue is to use the
--allow-old-rms-mapping-quality-annotation-data
parameter or recall those samples to match INFO fields properly.
I hope this helps.
-
Hi Gökalp Çelik
Each individual sample has their VCFs split by chromosome. I looked into some of the samples, and yes it is 1000 genomes that has the split RAW_MQ and DP fields. The other samples have the RAW_MQandDP field.
Sure I can use the flag
--allow-old-rms-mapping-quality-annotation-data
But do you know what the risk is, as the original error mentions to ‘use at your own risk’.
Thank you!
-
Older versions of GATK (Prior to 4.1) produces slightly different numbers for RAW_MQ and DP (due to fixes and changes in the local reassembly code) values but nothing substantially different to make a huge change in your actual calculations. In fact looking at the codebase for this particular option there are some samples within gnomAD that also has the old INFO tags therefore I don't think you will be facing any issues unless some of those samples were called with a GATK version too old such as pre 4.0.0.0. Below image shows one such site from a sample VCF.
As you can see call is still the same but numbers are slightly different. Nothing too substantial.
If you still wish not to use this parameter you may need to revisit those samples and call variants again with a recent version of GATK.
Regards.
-
Hi Gökalp Çelik
Its reassuring to hear that there aren't substantial changes in the outputs. Thank you so much again!
Please sign in to leave a comment.
6 comments