GATK v4.2.4.0 VariantAnnotator running too slow
Hi all,
I am running the latest GATK v4.2.4.0 VariantAnnotator but, as in the problem described in this post, it runs very slowly (~5,000 sites per minute). The development team mentioned in that post that a bug had been fixed in v4.1.7.0, so the next version should run VariantAnnotator faster, but that does not seem to be the case. Here is the log file:
Using GATK jar /lustre/Anson/tools/GenomeAnalysisTK-4.2.0/gatk-package-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8G -jar /lustre/Anson/tools/GenomeAnalysisTK-4.2.0/gatk-package- VariantAnnotator -I /lustre/Anson/working/PGPC_0016/gatk4_hg38/02var/diffRG/PGPC0016.RC.clipper.bam -R /lustre/Anson/ref/hg38/GRCh38.primary_assembly.genome.fa -A BaseQualityRankSumTest -A ClippingRankSumTest -AX Coverage -A FisherStrand -A LikelihoodRankSumTest -A MappingQualityRankSumTest -A MappingQualityZero -A QualByDepth -A ReadPosRankSumTest -A TandemRepeat --dbsnp /lustre/Anson/ref/hg38/Homo_sapiens_assembly38.dbsnp138.vcf --tmp-dir /lustre/Anson/working/PGPC_0016/gatk4_hg38/tmp -V /lustre/Anson/working/PGPC_0016/gatk4_hg38/02var/diffRG/HC_split/0016.raw.g.vcf -L /lustre/Anson/ref/hg38/wgs_calling_regions_split/0016-scattered.interval_list -O /lustre/Anson/working/PGPC_0016/gatk4_hg38/02var/diffRG/refine_split_try/0016.ReAnno.g.vcf
Hi Anson Wong,
Thank you for the report! Could you give more details about your VCF? How many samples do you have?
Hi Genevieve Brandt (she/her),
Thank you for your help.
I did HaplotypeCaller for one WGS sample before running VariantAnnotator. I split the wgs_calling_regions.hg38.interval_list (provided by GATK) into 20 lists using GATK SplitIntervals, and ran HaplotypeCaller (HC) in parallel to speed up the process.
It took ~4 hours to complete all 20 runs. The outputs are 20 VCFs, named 0000.raw.g.vcf to 0019.raw.g.vcf. Then I tried to run VariantAnnotator on each VCF.
Here is the log file for one of the 20 HC runs (using 0016-scattered.interval_list). Because of the word limit in comments, I skipped the logs between 19:00 and 22:30:
Using GATK jar /lustre/Anson/tools/GenomeAnalysisTK-4.2.0/gatk-package-
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4G -jar /lustre/Anson/tools/GenomeAnalysisTK-4.2.0/gatk-package- HaplotypeCaller --native-pair-hmm-threads 16 -I /lustre/Anson/working/PGPC_0016/gatk4_hg38/02var/diffRG/PGPC0016.RC.clipper.bam --sample-name PGPC0016 -ERC GVCF -R /lustre/Anson/ref/hg38/GRCh38.primary_assembly.genome.fa --dbsnp /lustre/Anson/ref/hg38/Homo_sapiens_assembly38.dbsnp138.vcf -stand-call-conf 10.0 --min-base-quality-score 20 --tmp-dir /lustre/Anson/working/PGPC_0016/gatk4_hg38/tmp -L /lustre/Anson/ref/hg38/wgs_calling_regions_split/0016-scattered.interval_list -O /lustre/Anson/working/PGPC_0016/gatk4_hg38/02var/diffRG/HC_split/0016.raw.g.vcf
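For readers who want to reproduce the scatter step described above, here is a minimal Python sketch. It is not the poster's actual script. It assumes a `gatk` launcher is on the PATH, and the helper names and any paths you pass in are hypothetical. The `NNNN-scattered.interval_list` naming follows the shard name that appears in the log.

```python
import subprocess
from pathlib import Path


def split_intervals_cmd(reference, intervals, out_dir, scatter_count=20):
    """Build a GATK SplitIntervals command that scatters `intervals` into shards."""
    return [
        "gatk", "SplitIntervals",
        "-R", str(reference),
        "-L", str(intervals),
        "--scatter-count", str(scatter_count),
        "-O", str(out_dir),
    ]


def shard_paths(out_dir, scatter_count=20):
    """Expected shard files, e.g. 0016-scattered.interval_list for shard 16.

    Assumption: SplitIntervals wrote exactly `scatter_count` shards; check the
    output directory, since fewer shards are possible for some inputs.
    """
    return [
        Path(out_dir) / f"{i:04d}-scattered.interval_list"
        for i in range(scatter_count)
    ]


def run(cmd):
    """Run a command list and raise if the tool exits with an error."""
    subprocess.run(cmd, check=True)
```

Calling `run(split_intervals_cmd(...))` once and then looping over `shard_paths(...)` gives one `-L` argument per parallel HaplotypeCaller job.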
Hi Anson Wong, is VariantAnnotator on its own taking 4 hours or is it HaplotypeCaller and VariantAnnotator?
Hi Genevieve Brandt (she/her), HaplotypeCaller alone took only 4 hours. VariantAnnotator had still not finished after running for more than a day, so I stopped it.
Thank you for this clarification. I see why this is an issue: VariantAnnotator should definitely be faster than HaplotypeCaller.
Could you check what is happening while VariantAnnotator is running with jstack and share the jstack output? You can do this by starting a new VariantAnnotator command and once the process slows, run jstack on your machine with the process ID of the VariantAnnotator java process.
One other thing to try would be to double the memory allocation from -Xmx8G to -Xmx16G. Let us know how that goes as well!
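In case it helps anyone following the jstack suggestion, here is a rough Python sketch of capturing a thread dump and pulling out the top frames of one thread. It assumes the JDK's `jstack` is on the PATH and is run as the same user as the GATK process. The function names are made up for this example, and the parser relies only on the usual dump layout (a quoted thread name on the header line, then indented `at ...` frames).

```python
import subprocess


def capture_jstack(pid):
    """Return one thread dump (as text) for the JVM with process ID `pid`."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def thread_frames(dump, thread_name="main", depth=5):
    """Return up to `depth` top stack frames of the named thread in a dump."""
    frames = []
    in_thread = False
    for line in dump.splitlines():
        if line.startswith('"'):
            # Thread header lines begin with the quoted thread name.
            if in_thread:
                break  # reached the next thread, so we are done
            in_thread = line.startswith(f'"{thread_name}"')
        elif in_thread and line.strip().startswith("at "):
            frames.append(line.strip()[3:])  # drop the leading "at "
            if len(frames) == depth:
                break
    return frames
```

Taking a few dumps some seconds apart and comparing `thread_frames(dump)` across them gives a quick impression of where the tool spends its time, for example whether the same read or I/O frames keep showing up.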
Hi Genevieve Brandt (she/her),
Thank you for your suggestions. I restarted the VariantAnnotator command (same as the one provided above) with both -Xmx8G and -Xmx16G, but the speed is the same: both process ~5,000 variants per minute according to the log files.
Here is the jstack output of the VariantAnnotator (-Xmx16G) java process:
2022-02-08 18:05:29
Hi Anson Wong,
Thanks for getting back to us so quickly! It looks like the major slowdown is occurring during the BAM I/O. Could you confirm this by running the same VariantAnnotator command without the BAM input?
We think the object that is used to read in the BAM has no caching, which could cause this slowdown.
Let me know what you find,
Hi Genevieve Brandt (she/her),
You are right! Without the BAM input, processing is much faster now (> 500,000 variants per minute).
According to the VariantAnnotator documentation, "-I sample.bam" is an optional argument. Under what conditions must I specify the BAM input (e.g. when adding a particular type of annotation)? And what is the purpose of this parameter in VariantAnnotator?
Thanks for your help again!
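To put the two rates reported in this thread side by side, here is a small arithmetic sketch in Python. The helper names are hypothetical, and the total record count is not stated anywhere in the thread, so it is left as a parameter you would have to count yourself (note that a GVCF also contains reference-block records).

```python
def speedup(slow_per_min, fast_per_min):
    """How many times faster the second rate is than the first."""
    return fast_per_min / slow_per_min


def estimated_hours(n_records, records_per_min):
    """Rough wall-clock hours to process `n_records` at a steady rate."""
    return n_records / records_per_min / 60.0
```

With the figures above, `speedup(5_000, 500_000)` is 100.0, so a run that would take a full day with the BAM attached would take roughly a quarter of an hour without it, assuming the rates stay steady.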
Some annotations might need the BAM file to calculate the value for each variant. So, if you don't include the BAM file you might miss some annotations. It depends on which annotations you are interested in.
As for using the BAM file as is, we had already identified this as an issue and started working on a fix a few years ago, but we were not able to finish the work because of other projects. Here is the GitHub PR:
A workaround for now would be to remove the reference blocks from your input GVCF with SelectVariants --exclude-non-variants true. Then you should be able to run VariantAnnotator just fine.
Let me know if you have further questions.
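For anyone wanting to script the workaround above, here is a minimal Python sketch that builds the SelectVariants command. It assumes a `gatk` launcher on the PATH. The function names and the file names you pass in are hypothetical, and only the `--exclude-non-variants true` argument comes from the reply above.

```python
import subprocess


def strip_reference_blocks_cmd(gvcf_in, vcf_out, reference=None):
    """Build a SelectVariants command that drops non-variant records."""
    cmd = [
        "gatk", "SelectVariants",
        "-V", str(gvcf_in),
        "--exclude-non-variants", "true",
        "-O", str(vcf_out),
    ]
    if reference is not None:
        # The reference is optional for SelectVariants; pass it if you have it.
        cmd += ["-R", str(reference)]
    return cmd


def run(cmd):
    """Run a command list and raise if the tool exits with an error."""
    subprocess.run(cmd, check=True)
```

The output of this step would then be passed to VariantAnnotator with `-V`, together with the BAM via `-I`, in place of the original GVCF.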
Hi everyone, I am having similar issues while trying to annotate my file with the gnomAD database. I am not using a BAM file as additional input.
It takes around 5 hours for one CHR. I am using GATK
The output is below:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx30G -XX:ParallelGCThreads=10 -jar /run/media/riadh/One Touch/Analysis/gatk- VariantAnnotator -V PE69_chr3.vcf -R /run/media/riadh/One Touch1/Reference_data_b38/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta --resource:gnomad /run/media/riadh/One Touch1/Reference_data_b38/gnomad.genomes.v3.1.2.sites.chr3.vcf.bgz -E gnomad.nhomalt -E gnomad.AC -E gnomad.AF -O Chr3_gnomad.vcf
Hi Riad Hajdarevic,
Thanks for writing in and it does look like VariantAnnotator is running extremely slowly for you (150 sites/minute). How many samples do you have in your VCF? Could you try the suggestion listed above to check what is happening while VariantAnnotator is running with jstack and share the jstack output? Have you tried running the command with a higher memory specification?
Kind regards,
Hi Pamela,
There are only 3 samples (a family). I allocated 30G of memory to it (I have 32 in total). Here is the jstack output. Thank you!
Hi Riad Hajdarevic,
Thank you for providing all of this additional information! Could you actually try running this with less memory (maybe 16GB)? It's possible that you're allocating too much of your available memory to the job.
Kind regards,
Hi Pamela,
Thank you for your quick reply. Now it says I am out of memory, with this error message. The gnomAD annotation files are quite big; could that be the reason?
Hi Riad Hajdarevic,
From this output, it looks like you specified 3016 GB of memory rather than 16 GB. Was this intentional?
Sorry, that was a typo I didn't see. I have set it to 16G now and it is running very slowly again. No error so far, but I can see from the generated output file that it is going very slowly.
Hi Riad Hajdarevic,
Okay, thank you for letting me know. I'm going to talk to some other members of the GATK team to try to figure out what the issue might be and I will get back to you as soon as I can. Please let me know if you see any progress with this current run of VariantAnnotator.
Kind regards,
Hi Pamela,
Thank you for your time and effort. I ran VariantAnnotator with a dbSNP file and it went quite fast. It seems that it only slows down when I use gnomAD VCF files for annotation. I hope this helps.
It's me again, just adding info. I used the older version of the gnomAD data, the gnomADv2 liftover to b38 instead of gnomADv3, and it does around 7,500 variants per minute.
Hi Riad Hajdarevic,
Okay, thank you for providing this additional information. I'm glad you were able to run VariantAnnotator quickly with the dbSNP file and the other version of the gnomAD data. That is helpful because it suggests that the slowdown is due to the inputs rather than the tool. I'm going to continue looking into why this gnomAD file might be running so slowly.
Kind regards,
Hi Riad Hajdarevic,
I just wanted to update you on my discussion with some other members of the GATK team on this issue. It seems that the long runtime is likely just due to the sheer number of variants in the gnomAD files that the tool is attempting to process, which is why the dbSNP file seems to work faster. Additionally, gnomADv3 has approximately 5X more variants than gnomADv2, which seems to be why the older version works faster. The GATK team is looking for a way to reduce the time the tool takes to process all these variants, but the behavior you are seeing is the expected behavior for now. Please let me know if you have any questions.
Kind regards,