Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GermlineCNVCaller edge case?

8 comments

  • Genevieve Brandt (she/her)

    Hi SkyWarrior, we think this might be caused by the Docker changes made in GATK 4.1.8.0. Could you test with GATK 4.1.7.0 and see whether you get the same issue?

    Also, how many targets and samples were used as input?

  • SkyWarrior

    Thanks for the response. I have not tried 4.1.7.0 yet, and the run now seems to work fine on the Skylake i9 system. I will extend my tests to other systems as well and report back here once I can confidently say that the problem is limited to 4.1.8.1 and above.

    I try to keep the number of targets to at most 500,000, and the number of samples is 14.

  • Genevieve Brandt (she/her)

    Thanks for the update, SkyWarrior. Let us know what those tests confirm.

  • Samuel Lee

    I wonder if this may be an issue with NumPy 1.17.5, which we updated to in the 4.1.8.0 Docker/conda changes. See https://numpy.org/doc/1.17/release.html: "Downstream developers should use Cython >= 0.29.13 for Python 3.8 support and OpenBLAS >= 3.7 to avoid errors on the Skylake architecture."

    Many of the Docker/conda changes were made to ensure that the CNN tool could properly use MKL-enabled NumPy and TensorFlow. Specifically, the relevant packages were conda-installed from channels that compile them against MKL (rather than pip-installed, which resulted in dependency clobbering).

    If this causes issues on more recent architectures, it may be possible to configure a new conda environment against OpenBLAS packages, to run GermlineCNVCaller with Theano flags that disable MKL, or both. From the tool docs: "Advanced users may wish to set the THEANO_FLAGS environment variable to override the GATK theano configuration. For example, by running THEANO_FLAGS="base_compiledir=PATH/TO/BASE_COMPILEDIR" gatk GermlineCNVCaller ..., users can specify the theano compilation directory (which is set to $HOME/.theano by default). See theano documentation at http://deeplearning.net/software/theano/library/config.html."

    I would expect that gCNV runtime might differ with MKL and OpenBLAS, but probably not substantially. Unfortunately, I don't have the bandwidth at the moment to put together such an environment, but let me know if you need further pointers.

    Note that more changes to the conda environment may be on their way in the next few months (to update PyMC3, which is used by gCNV, and to include Pyro), so if rolling back to 4.1.7.0 works, I might just stick with that for now.

    Thanks as always SkyWarrior for the detailed reports. It's especially difficult for us to test across different architectures, so any data points you can provide are extremely helpful!
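
The THEANO_FLAGS override quoted from the tool docs can be sketched as a small helper that builds the environment for a GermlineCNVCaller run. This is only an illustration: the helper names `build_theano_flags` and `gcnv_environment` are hypothetical, the compile directory is the placeholder from the docs, and pointing Theano's `blas.ldflags` option at OpenBLAS is an assumption, not a tested fix for the Skylake problem.

```python
import os


def build_theano_flags(options):
    """Render a dict of Theano config options as a THEANO_FLAGS string.

    Theano reads THEANO_FLAGS as comma-separated key=value pairs.
    """
    return ",".join(f"{key}={value}" for key, value in options.items())


def gcnv_environment(options, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = build_theano_flags(options)
    return env


# base_compiledir comes from the tool docs; the path is a placeholder.
# blas.ldflags is a Theano config option; the OpenBLAS value is an assumption.
options = {
    "base_compiledir": "PATH/TO/BASE_COMPILEDIR",
    "blas.ldflags": "-lopenblas",
}
env = gcnv_environment(options, base_env={})
command = ["gatk", "GermlineCNVCaller"]  # remaining tool arguments omitted

# Show the equivalent shell invocation instead of launching the tool.
flags = env["THEANO_FLAGS"]
print('THEANO_FLAGS="%s" %s ...' % (flags, " ".join(command)))
```

To launch the tool for real, one would call `gcnv_environment(options)` without `base_env` so that `PATH` and the rest of the current environment are kept, and pass the result as `env=` to `subprocess.run` together with the full argument list.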

  • SkyWarrior

    Thanks Samuel Lee

    I can confidently say that the problem is confined to the Docker images for 4.1.8.1 and above, and that the 4.1.7.0 Docker image works fine on all the systems.

    About the use of MKL: I am in no position to judge, but MKL's bias against certain CPU brands (GenuineIntel vs. AuthenticAMD) might affect the performance of CNN and gCNV on the disadvantaged CPUs. I am not sure how many people keep track of the hardware they run on, whether on their own infrastructure or in the cloud, but some cloud providers keep offering more alternatives to Intel CPUs, and these may go unnoticed. I currently don't have any AMD CPUs, but chances are I will procure some in the medium term because of pricing differences.

    Regards. 
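
For anyone unsure which CPU brand a machine or cloud instance actually exposes, the vendor string mentioned above (GenuineIntel or AuthenticAMD) can be read from `/proc/cpuinfo` on Linux. The following is a minimal sketch; the function names are hypothetical and the sample text is made up for illustration.

```python
from pathlib import Path


def cpu_vendors(cpuinfo_text):
    """Return the set of vendor_id values found in /proc/cpuinfo-style text."""
    vendors = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "vendor_id":
            vendors.add(value.strip())
    return vendors


def read_cpu_vendors(path="/proc/cpuinfo"):
    """Read vendor IDs on Linux; return an empty set where the file is missing."""
    try:
        return cpu_vendors(Path(path).read_text())
    except OSError:
        return set()


# Illustrative sample, not taken from a real machine.
sample = "processor\t: 0\nvendor_id\t: AuthenticAMD\nmodel name\t: example\n"
print(cpu_vendors(sample))
print(read_cpu_vendors())  # depends on the machine running this
```

Inside a Docker container this reports the host CPU, so it can be run in the same image as the GATK job to record what hardware a given test actually used.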

  • Samuel Lee

    Thanks. I will take a look at using packages compiled against OpenBLAS for the next iteration of the environment and will update here when I can.
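
When comparing environments, it can help to confirm whether the installed NumPy was built against MKL or OpenBLAS. The sketch below captures the output of `numpy.show_config()` and looks for either name. The helper names are hypothetical, and the substring match is a rough heuristic, since the layout of the config dump differs between NumPy versions.

```python
import contextlib
import io


def blas_backends(config_text):
    """Return which of 'mkl' / 'openblas' are mentioned in a build-config dump."""
    lowered = config_text.lower()
    return {name for name in ("mkl", "openblas") if name in lowered}


def numpy_config_text():
    """Capture the output of numpy.show_config(), or return None without NumPy."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


text = numpy_config_text()
if text is None:
    print("NumPy is not installed in this environment")
else:
    found = sorted(blas_backends(text))
    print("BLAS backends mentioned:", found if found else "none recognized")
```

Running this inside both the 4.1.7.0 and the 4.1.8.x images would give a quick data point on whether the BLAS backend changed between them.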

  • Mehdi

    Hi Everyone,

    As a quick solution, I recommend the following conda environment, which worked well for me. Note that not all of the packages listed are required Python dependencies. Dear SkyWarrior, your comments are always useful. Thanks.

    channels:
    - bioconda
    - anaconda
    - conda-forge
    - BioBuilds
    dependencies:
    - conda-forge::python=3.6.13
    - picard
    - vcftools
    - bcftools
    - samtools
    - bamtools
    - bioconda::gatk4<4.1.8.1
    - gcnvkernel
    - bwa
    - htslib
    - fastqc
    - multiqc
    - trimmomatic
    - tabix
    - moreutils
    - anaconda::numpy=1.15
    - pip
    - mkl
    - mkl-service
    - theano=
    - tensorflow
    - scipy
    - pymc3
    - h5py
    - keras
    - intel-openmp
    - scikit-learn
    - matplotlib
    - pandas
    - conda-forge::qt=5.12.9
    - conda-forge::libglib=2.70.1
    - conda-forge::libffi=3.4.2
    - biopython
    - pyvcf
    - pysam


    Cheers.
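
One entry in the environment above, `theano=`, ends in `=` with no version after it, which conda may reject. Before reusing the list, a quick check for such incomplete pins can be sketched as follows. The function name and the regular expression are illustrative and only cover the simple `channel::name<op>version` forms that appear in this thread.

```python
import re

# Optional "channel::" prefix, package name, optional operator, optional version.
SPEC = re.compile(
    r"^(?:(?P<channel>[\w.-]+)::)?(?P<name>[\w.-]+)(?P<op>[<>=!]+)?(?P<version>.*)$"
)


def incomplete_specs(lines):
    """Return dependency entries that have an operator but no version after it."""
    flagged = []
    for line in lines:
        entry = line.strip()
        if not entry.startswith("- "):
            continue  # skip keys such as 'channels:' and 'dependencies:'
        spec = entry[2:].strip()
        match = SPEC.match(spec)
        if match and match.group("op") and not match.group("version").strip():
            flagged.append(spec)
    return flagged


# A few entries copied from the environment above.
environment = [
    "dependencies:",
    "- conda-forge::python=3.6.13",
    "- bioconda::gatk4<4.1.8.1",
    "- anaconda::numpy=1.15",
    "- theano=",
    "- tensorflow",
]
print(incomplete_specs(environment))
```

The check only reports the incomplete entry; which Theano version to pin is left to whoever builds the environment.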

     

  • Genevieve Brandt (she/her)

    Thanks for posting this, Mehdi.
