Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

CombineGVCFs produces .bcf incompatible with recommended bcftools (v1.19).

7 comments

  • Gökalp Çelik

    Hi bgulko

    The HTSJDK library, which our tools depend on, does not support BCF version 2.2:

    https://github.com/samtools/htsjdk

    In general, we recommend producing vcf or vcf.gz files, which are compatible with both our tools and bcftools.

    I hope this helps. 

    Regards. 

     

     

  • Gökalp Çelik

    Hi again. 

    Yes, you are absolutely right that we may need a warning about the incompatibility, and we can take a look at this issue in our next point release.

    On the other hand, the BCF v2.2 spec was not very clear until recently, and since a new VCF format, v4.5, is now available, our team will focus on adding read and write support for the latest formats. However, these formats require big changes on the GATK end, which in turn require extensive testing and merges that may change the way GATK originally operates. BCF v2.2 covers only part of the VCF spec and, judging from the documents, it has not been updated in step with the VCF spec, so integrating it may not be our top priority in the near future. In the meantime, passing our outputs to the latest BCFtools to convert them to BCF could be your only option.
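The workaround described above (write VCF from GATK, then let bcftools produce the BCF) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names and the file names are hypothetical, and it assumes a recent `bcftools` is on the PATH. The flags used are the standard `bcftools view -O b -o` (compressed BCF output) and `bcftools index`.

```python
import subprocess


def build_convert_commands(vcf_path, bcf_path):
    """Return the bcftools commands that convert a VCF to BCF and index it.

    Hypothetical helper: paths are passed through unchanged.
    """
    vcf, bcf = str(vcf_path), str(bcf_path)
    return [
        # -O b selects compressed BCF output; -o names the output file.
        ["bcftools", "view", "-O", "b", "-o", bcf, vcf],
        # Build an index alongside the newly written BCF.
        ["bcftools", "index", bcf],
    ]


def convert_vcf_to_bcf(vcf_path, bcf_path):
    """Run the conversion; raises CalledProcessError if bcftools fails."""
    for command in build_convert_commands(vcf_path, bcf_path):
        subprocess.run(command, check=True)
```

Calling `convert_vcf_to_bcf("cohort.g.vcf.gz", "cohort.g.bcf")` (placeholder file names) would run both steps in order. Because the BCF is then written by bcftools itself, the same bcftools version should be able to read it back.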

    I hope this clarifies. 

    Regards. 

  • bgulko

    Thanks Gokalp!

    With your hint, I see that as well (https://github.com/samtools/htsjdk).

    I also see an issue posted in htsjdk from 2016 (https://github.com/samtools/htsjdk/issues/596) and a suggestion that as of 2022 (release 2.24.1 https://mvnrepository.com/artifact/com.github.samtools/htsjdk), code was in place to support BCF2.2 and the next major release of HTSJDK was to include a dependency required to support it. Major release 3 was offered around 2022-Jun and Major release 4 of htsjdk was released around 2023-Aug.

    Perhaps someone could point me towards the last version of bcftools known to support the BCF version produced by GATK v4? I believe the current version of bcftools is 1.20, and versions as early as 1.13 (the earliest I currently have access to) do not support BCF versions below v2.2.

    It looks like one version of bcftools that would support the older BCF1 format was packaged with samtools 0.1.19 back in 2013 (ref: http://www.htslib.org/doc/1.0/bcftools.html, download: https://sourceforge.net/projects/samtools/files/samtools/0.1.19/).

    In any case, this certainly clarifies some issues. While GATK examples use a .vcf.gz format, the GATK-recommended version of bcftools is still 1.19 so perhaps a compatibility warning is also warranted.

  • Gökalp Çelik

    Just out of curiosity, can I ask why you absolutely need bcf format?

  • bgulko

    Seems the submit button on this form may not have taken, let me try again.

    Yours is a salient question, though "absolute need" is a pretty high bar. With foreknowledge there are nearly always ways of working around such issues, but here are a few reasons why GATK's treatment of .bcf (as opposed to .vcf.gz) is important to us.

    Empirically,
    -) GATK 4.5 produces .bcf files that appear compatible, but are actually incompatible, with the toolchain recommended by GATK (bcftools v1.19). This inconsistency required substantial effort to resolve in our pipeline. Either actual support, or an error/warning message indicating the lack of support for BCF 2.2, would have saved a lot of time and effort.

    Subjectively,
    -) I have found BCF to have better compression than .vcf.gz, though there are certainly better alternatives still (2021, https://academic.oup.com/bioinformatics/article-pdf/37/19/3358/50338115/btab211.pdf )

    -) BCF seems to be a bit stricter than raw VCF, requiring more complete and consistent VCF information to create. Once successfully created, BCF seems more resistant to subtle inconsistencies than VCF.

    With compute and storage becoming less expensive, dev/debugging time, compatibility, and resistance to data corruption are my central concerns. Other formats (like .vcf.gz) are certainly feasible so long as we know in advance that we need to use them.
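Until a warning exists in the tools themselves, a pipeline could approximate one by checking the BCF version before handing a file to bcftools. The sketch below assumes the BCF2 header layout from the VCF/BCF specification (the bytes `BCF` followed by one major and one minor version byte, normally inside a BGZF/gzip container). The function names are hypothetical, and it only inspects the first five bytes rather than validating the file.

```python
import gzip


def read_bcf_version(path):
    """Return (major, minor) from a BCF header, or None if the magic is absent."""
    with open(path, "rb") as handle:
        first_two = handle.read(2)
    # BGZF-compressed BCF starts with the gzip magic bytes 0x1f 0x8b;
    # uncompressed BCF starts directly with the "BCF" magic.
    opener = gzip.open if first_two == b"\x1f\x8b" else open
    with opener(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:3] != b"BCF":
        return None
    return header[3], header[4]


def is_bcf22_or_newer(path):
    """True only for BCF major version 2 with minor version >= 2."""
    version = read_bcf_version(path)
    return version is not None and version[0] == 2 and version[1] >= 2
```

A pipeline step could call `is_bcf22_or_newer(path)` and fail early with a clear message, instead of surfacing the incompatibility later as a confusing downstream error.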

    I hope this is helpful....

  • bgulko

    I deeply appreciate your rapid and relevant responses. Hopefully this will save the next programmer to encounter the issue some grief. In my opinion a warning would be fine.

    Any request that GATK support BCF 2.2 was predicated on the idea that BCF 2.2 support was scheduled for the next major htsjdk release after 2, so incorporating support into GATK was more a matter of finding the correct configuration for htsjdk than a major development effort in GATK. I also wouldn't recommend prioritizing a substantial BCF compatibility effort over VCF 4.5, though a warning would be quite welcome.

    Thanks again for looking into this!

    --Brad

  • Gökalp Çelik

    Thank you for the suggestions.

    "so incorporating support into GATK was more a matter of finding the correct configuration for htsjdk, rather than a major development effort in GATK."

    Believe us, it is a major development effort for GATK. It is on the list, but it keeps getting backlogged because there are more major updates ready to go in the short to medium term.

    Regards. 
