CombineGVCFs produces .bcf incompatible with recommended bcftools (v1.19).
Summary:
First, apologies if this is in the wrong place, a quick search did not turn up a reference to this issue, and I thought this might represent a simple oversight.
When the output file type is .bcf, the CombineGVCFs command completes but produces an output file of type .bcf that is not compatible with GATK-recommended bcftools versions. This can be worked around by requesting GATK output as a .vcf.gz format then using the recommended bcftools package (v1.19) to convert from .vcf.gz to a currrent (2024-Jul) version of .bcf .
My apologies in advance if this was corrected in 4.6, or the posting is in the wrong place, I saw no references to this issue and thought it may be latent and perhaps extend to .bcf support beyond the CombineGVCFs command.
REQUIRED for all errors and issues:
a) GATK version used: 4.5.0.0
b) Exact command used:
gatk CombineGVCFs -O combined.g.bcf -V NA12146_father_calls.chr20.g.vcf -V NA12239_mother_calls.chr20.g.vcf -R /mnt/efs/fs1/base/inf/db/refs/38-grch/illumina/GRCh38_full_analysis_set_plus_decoy_hla.fa
c) Entire program log:
Program completes without error but produces a .bcf file incompatible with recommended version of bcftools (v1.19).
$ gatk CombineGVCFs -O combined.g.bcf -V NA12146_father_calls.chr20.g.vcf -V NA12239_mother_calls.chr20.g.vcf -R /mnt/efs/fs1/base/inf/db/refs/38-grch/illumina/GRCh38_full_analysis_set_plus_decoy_hla.fa
Using GATK jar /usr/local/sbin/gatk/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/sbin/gatk/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar CombineGVCFs -O combined.g.bcf -V NA12146_father_calls.chr20.g.vcf -V NA12239_mother_calls.chr20.g.vcf -R /mnt/efs/fs1/base/inf/db/refs/38-grch/illumina/GRCh38_full_analysis_set_plus_decoy_hla.fa
21:39:19.125 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/sbin/gatk/gatk-4.5.0.0/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:39:19.215 INFO CombineGVCFs - ------------------------------------------------------------
21:39:19.217 INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.5.0.0
21:39:19.217 INFO CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
21:39:19.217 INFO CombineGVCFs - Executing as proto@ip-172-31-5-47 on Linux v6.5.0-1022-aws amd64
21:39:19.217 INFO CombineGVCFs - Java runtime: OpenJDK 64-Bit Server VM v17.0.11+9-Ubuntu-122.04.1
21:39:19.218 INFO CombineGVCFs - Start Date/Time: July 20, 2024 at 9:39:19 PM UTC
21:39:19.218 INFO CombineGVCFs - ------------------------------------------------------------
21:39:19.218 INFO CombineGVCFs - ------------------------------------------------------------
21:39:19.218 INFO CombineGVCFs - HTSJDK Version: 4.1.0
21:39:19.218 INFO CombineGVCFs - Picard Version: 3.1.1
21:39:19.219 INFO CombineGVCFs - Built for Spark Version: 3.5.0
21:39:19.219 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:39:19.219 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:39:19.219 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:39:19.219 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:39:19.219 INFO CombineGVCFs - Deflater: IntelDeflater
21:39:19.219 INFO CombineGVCFs - Inflater: IntelInflater
21:39:19.219 INFO CombineGVCFs - GCS max retries/reopens: 20
21:39:19.220 INFO CombineGVCFs - Requester pays: disabled
21:39:19.220 INFO CombineGVCFs - Initializing engine
21:39:19.381 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/efs/fs1/base/inf/runs/0002-mvp0/s02_call_preliminary/wrk/gatk_bug/NA12146_father_calls.chr20.g.vcf
21:39:19.403 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/efs/fs1/base/inf/runs/0002-mvp0/s02_call_preliminary/wrk/gatk_bug/NA12239_mother_calls.chr20.g.vcf
21:39:19.437 INFO CombineGVCFs - Done initializing engine
21:39:19.462 INFO ProgressMeter - Starting traversal
21:39:19.463 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
21:39:19.503 INFO ProgressMeter - unmapped 0.0 26 40000.0
21:39:19.503 INFO ProgressMeter - Traversal complete. Processed 26 total variants in 0.0 minutes.
21:39:19.525 INFO CombineGVCFs - Shutting down engine
[July 20, 2024 at 9:39:19 PM UTC] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=331350016
Note: The simple workaround is to send output to a .vcf.gz file and then use bcftools to convert from .vcf.gz to .bcf file. This works fine.
e.g.
gatk CombineGVCFs -O combined.g.vcf.gz -V NA12146_father_calls.chr20.g.vcf -V NA12239_mother_calls.chr20.g.vcf -R /mnt/efs/fs1/base/inf/db/refs/38-grch/illumina/GRCh38_full_analysis_set_plus_decoy_hla.fa
followed by
bcftools view combined.g.vcf.gz --output combined.2.g.bcf
However, I believe gatk produces a .bcf version 2.1 format file, while bcftools produces a version 2.2 file, which is now the required standard.
File produced by GATK 4.5.0.0 (NOTE "BCF" magic numbers in first 3 bytes)
od -a combined.g.bcf | head
0000000 B C F stx soh i rs nul nul # # f i l e f
0000020 o r m a t = V C F v 4 . 2 nl # #
0000040 A L T = < I D = N O N _ R E F ,
0000060 D e s c r i p t i o n = " R e p
.BCF file produced by bcftools v1.19
proto@ip-172-31-5-47:~/wrk/00_src/runs/0002-mvp0/s02_call_preliminary/wrk/gatk_bug$ od -a combined.2.g.bcf | head
0000000 us vt bs eot nul nul nul nul nul del ack nul B C stx nul
0000020 } ht 5 em [ r esc 9 dc1 ack ] { ' ? D P
0000040 r E < nak stx b ` @ etx 7 t nak - ht 2 6
Output files in .bcf format should be compatible with gunzip, but .bcf file produced by GATK is not compatible with gunzip.
gunzip -c combined.g.bcf | head
gzip: combined.g.bcf: not in gzip format
File produced by bcftools (above)
gunzip -c combined.2.g.bcf | head
BCF##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
...
bcftools v1.13 also can not read the .bcf file produced by GATK
bcftools.1.13 view combined.g.bcf | head
[E::bcf_hdr_read] Invalid BCF2 magic string: only BCFv2.2 is supported
Failed to read from combined.g.bcf: could not parse header
-
Hi bgulko
HTSJDK library which our tools depend on does not support bcf version 2.2
https://github.com/samtools/htsjdk
We recommend producing vcf or vcf.gz files in general which are both compatible with our tools and bcftools.
I hope this helps.
Regards.
-
Hi again.
Yes you are absolutely right that we may need a warning for the incompatibility and we can take a look at this issue in our next point release.
On the other hand, BCF v2.2 spec was not very clear up until recently and since a new VCF format v4.5 is also present our team will focus on getting the latest formats instated for read and write support. However these formats require big changes on GATK end which requires extensive testing and performing merges that may change the way GATK originally operates. BCF v2.2 is a format that only covers part of the VCF spec and looking at the documents it has not been updated like VCF spec, therefore it may not be our absolute priority to integrate that format in near future. In the mean time passing our outputs to latest BCFtools to convert them to BCF could be your only option.
I hope this clarifies.
Regards.
-
Thanks Gokalp!
With your hint, I see that as well (https://github.com/samtools/htsjdk).
I also see an issue posted in htsjdk from 2016 (https://github.com/samtools/htsjdk/issues/596) and a suggestion that as of 2022 (release 2.24.1 https://mvnrepository.com/artifact/com.github.samtools/htsjdk), code was in place to support BCF2.2 and the next major release of HTSJDK was to include a dependency required to support it. Major release 3 was offered around 2022-Jun and Major release 4 of htsjdk was released around 2023-Aug.
Perhaps someone could point me towards the last version of bcftools known to support the bcf version produced by GATK v4? I believe the current version of bcftools is 1.20, and versions as early a 1.13 (the earliest I currently have access to) do not support BCF <v2.2.
It looks like one version of bcftools that would support the older BCF1 format was packaged with samtools 0.1.19 back in 2013 (ref: http://www.htslib.org/doc/1.0/bcftools.html, download: https://sourceforge.net/projects/samtools/files/samtools/0.1.19/).In any case, this certainly clarifies some issues. While GATK examples use a .vcf.gz format, the GATK-recommended version of bcftools is still 1.19 so perhaps a compatibility warning is also warranted.
-
Just out of curiosity, can I ask why you absolutely need bcf format?
-
Seems the submit button on this form may not have taken, let me try again.
Yours is a salient question, though "absolute need" is a pretty high bar. With foreknowledge there are nearly always ways of working around such issues, here are a few reasons why GATK treatment of .bcf (as opposed to .vcf.gz) is important to us.Empirically,
-) GATK 4.5 produces .bcf files that are apparently compatible, but actually incompatible, with the toolchain recommended by GATK (bcftools v1.19). This inconsistency required substantial effort to resolve in our pipeline. Either actual support, or an error/warning message indicating lack of support with BCF 2.2, would have saved a lot of time and effort.Subjectively,
-) I have found bcf to have better compression than .vcf.gz, though there is certainly yet-better alternatives (2021, https://academic.oup.com/bioinformatics/article-pdf/37/19/3358/50338115/btab211.pdf )-) BCF seems to be a bit stricter than raw VCF, requiring require more complete and consistent VCF information to create. Once successfully created, BCF seems more resistant to subtle inconsistencies than VCF.
With compute and storage becoming less expensive, dev/debugging time, compatibility, and resistance to data corruption are my central concerns. Other formats (like .vcf.gz) are certainly feasible so long as we know in advance that we need to use them.
I hope this is helpful....
-
I deeply appreciate your rapid and relevant responses. Hopefully this will save the next programmer to encounter the issue some grief. In my opinion a warning would be fine.
Any request that GATK support BCF 2.2 was predicated on the idea that BCF 2.2 support was scheduled for the next major htsjdk release after 2, so incorporating support into GATK was more a matter finding the correct configuration for htsjdk, rather than a major development effort in GATK. I also wouldn't recommend prioritizing a substantial BCF compatibility effort over VCF 4.5, though a warning would be quite welcome.
Thanks again for looking into this!
--Brad
-
Thank you for the suggestions.
so incorporating support into GATK was more a matter finding the correct configuration for htsjdk, rather than a major development effort in GATK.
Believe us that it is a major development for GATK. It is there in the list but it gets backlogged as there are more major updates ready to go in short to medium term.
Regards.
Please sign in to leave a comment.
7 comments