Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Haplotype caller taking too long


11 comments

  • Rami Kheireddine

    The code ran for 7 days and did not finish. Any suggestions on how to make it go faster?

     

  • SkyWarrior

    Hi Rami Kheireddine

    Can you share the full log here? HaplotypeCaller runs faster when your system supports recent Intel AVX-family instruction sets (AVX and later). Does your system support AVX? Can you share your system details as well?
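    One quick way to answer the AVX question on Linux is to look at the CPU flags. The sketch below assumes a Linux system that exposes /proc/cpuinfo; the function name avx_flags is made up for illustration. GATK itself also reports this at startup: a log line such as "Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation" indicates the accelerated PairHMM was loaded.

```shell
# List the AVX-family CPU flags found in a cpuinfo-style file.
# $1 = path to the file (assumption: defaults to /proc/cpuinfo, Linux only).
# Prints nothing and returns non-zero when no AVX flag is present.
avx_flags() {
  cpuinfo=${1:-/proc/cpuinfo}
  # Take the first "flags" line, split it into one flag per line,
  # and keep only the flags that start with "avx" (avx, avx2, avx512f, ...).
  grep -m1 '^flags' "$cpuinfo" | tr ' ' '\n' | grep '^avx' | sort -u
}
```

    Running avx_flags with no argument inspects the current machine, and nproc shows how many CPU threads the operating system (or VM) can actually use.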

    If your system permits, you may be able to speed up HaplotypeCaller by splitting your intervals across separate instances and calling them in parallel. You can then merge all your calls into a single file.
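    As a rough sketch of the splitting approach, the helper functions below print one HaplotypeCaller command per contig listed in the reference's .fai index, plus a MergeVcfs command that combines the per-contig results. The function names, the shards output directory, and the one-shard-per-contig layout are assumptions for illustration; only the -L interval argument, the MergeVcfs tool, and the .fai index format are standard. The functions only print commands, so the list can be reviewed before anything runs.

```shell
# Print one HaplotypeCaller command per contig in the reference index.
# $1 = reference FASTA (expects "$1.fai" beside it)
# $2 = input BAM
# $3 = output directory for per-contig VCFs (hypothetical; create it first)
hc_commands() {
  awk -F'\t' -v ref="$1" -v bam="$2" -v out="$3" '
    # Column 1 of a .fai file is the contig name; use it as the -L interval.
    { printf "gatk HaplotypeCaller -R %s -I %s -L %s -O %s/%s.vcf\n", ref, bam, $1, out, $1 }
  ' "$1.fai"
}

# Print the MergeVcfs command that combines the per-contig VCFs.
# $1 = reference FASTA, $2 = per-contig output directory, $3 = merged VCF path
merge_command() {
  awk -F'\t' -v out="$2" -v merged="$3" '
    BEGIN { printf "gatk MergeVcfs" }
    { printf " -I %s/%s.vcf", out, $1 }
    END { printf " -O %s\n", merged }
  ' "$1.fai"
}
```

    For example, after mkdir -p shards, running hc_commands /home/databench/GRCz11_RefData/GRCz11.fa Routput_sorted_dedup_bqsr_reads.bam shards > cmds.txt writes the command list, which can then be run a few at a time (for instance with xargs -P or GNU parallel). When several instances run at once, it may help to lower each one's PairHMM thread count with --native-pair-hmm-threads so they do not compete for the same cores. A reference with many small scaffolds will produce many short jobs, so grouping small contigs into one interval list may be more practical.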

    Regards. 

  • Rami Kheireddine

    The log is very long (it went on for 6 days). I stopped the run in the end. Yes, my system supports AVX.

    Can you help me split the intervals into separate instances? 

  • Rami Kheireddine

     gatk-4.4.0.0/gatk HaplotypeCaller -R /home/databench/GRCz11_RefData/GRCz11.fa -I Routput_sorted_dedup_bqsr_reads.bam -O Routput_raw_variants.vcf

    Using GATK jar /home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar

    Running:

        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar HaplotypeCaller -R /home/databench/GRCz11_RefData/GRCz11.fa -I Routput_sorted_dedup_bqsr_reads.bam -O Routput_raw_variants.vcf

    22:25:04.962 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

    22:25:04.989 INFO  HaplotypeCaller - ------------------------------------------------------------

    22:25:04.992 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.4.0.0

    22:25:04.992 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/

    22:25:04.992 INFO  HaplotypeCaller - Executing as lab_anjohnson@anjohnson-pve-linux on Linux v5.10.0-23-amd64 amd64

    22:25:04.992 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v17.0.6+10-Debian-1deb11u1

    22:25:04.992 INFO  HaplotypeCaller - Start Date/Time: July 10, 2023 at 10:25:04 PM CDT

    22:25:04.992 INFO  HaplotypeCaller - ------------------------------------------------------------

    22:25:04.992 INFO  HaplotypeCaller - ------------------------------------------------------------

    22:25:04.993 INFO  HaplotypeCaller - HTSJDK Version: 3.0.5

    22:25:04.993 INFO  HaplotypeCaller - Picard Version: 3.0.0

    22:25:04.993 INFO  HaplotypeCaller - Built for Spark Version: 3.3.1

    22:25:04.993 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2

    22:25:04.994 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

    22:25:04.994 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

    22:25:04.994 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

    22:25:04.994 INFO  HaplotypeCaller - Deflater: IntelDeflater

    22:25:04.994 INFO  HaplotypeCaller - Inflater: IntelInflater

    22:25:04.994 INFO  HaplotypeCaller - GCS max retries/reopens: 20

    22:25:04.994 INFO  HaplotypeCaller - Requester pays: disabled

    22:25:04.995 INFO  HaplotypeCaller - Initializing engine

    22:25:05.121 INFO  HaplotypeCaller - Done initializing engine

    22:25:05.128 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output

    22:25:05.138 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so

    22:25:05.139 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/databench/gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so

    22:25:05.151 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM

    22:25:05.151 INFO  IntelPairHmm - Available threads: 4

    22:25:05.151 INFO  IntelPairHmm - Requested threads: 4

    22:25:05.151 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation

    22:25:05.170 INFO  ProgressMeter - Starting traversal

    22:25:05.171 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute

    22:25:05.452 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position 1:2036 and possibly subsequent; at least 10 samples must have called genotypes

    22:25:16.054 INFO  ProgressMeter -              1:20704              0.2                   150            827.0

    22:25:27.092 INFO  ProgressMeter -              1:48916              0.4                   370           1012.8

    22:25:38.054 INFO  ProgressMeter -             1:121562              0.5                   850           1551.0

    22:25:48.866 INFO  ProgressMeter -             1:154324              0.7                  1080           1483.0

    22:25:59.420 INFO  ProgressMeter -             1:190904              0.9                  1330           1471.0

    22:26:10.782 INFO  ProgressMeter -             1:225478              1.1                  1560           1426.6

  • SkyWarrior

    Is this running on a virtual machine such as VirtualBox? Those instances may have very limited I/O speeds, and it looks like you have also limited the number of threads available to your VM. I would suggest using a native Linux instance, or running the Docker version of GATK on your main OS without I/O and thread restrictions. You should see an increase in performance.

  • Rami Kheireddine

    Yes, this is a virtual machine. I managed to increase the number of threads on it, and it is currently moving 3 times faster. Will this run take a long time to finish?

     

  • SkyWarrior

    I bet the biggest time loss comes from I/O performance. Increasing the PairHMM thread count may speed up PairHMM, but not the other parts of the run, which are limited by I/O performance. Let's see if this runs to the end.

  • Rami Kheireddine

    Is there a way to tell how long it will take to be done, or when it is close to the end?

     

  • SkyWarrior

    There was a way in the older GATK 3.x releases, but in GATK 4.x you have to guesstimate based on the length of your contigs. Given the size of the zebrafish genome, it should finish sooner than a human genome would, but I cannot be sure about the complexity of the genome, so your mileage may vary.
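    One way to make that guesstimate concrete is to compare the "Current Locus" column of the ProgressMeter output against the contig lengths in the reference's .fai index. The sketch below assumes the traversal follows the contig order of the index and that every base takes about the same time, which is only an approximation; the function name progress_percent is made up for illustration.

```shell
# Estimate how far through the reference a run is, as a percentage of bases.
# $1 = path to the reference .fai index (column 1 = contig, column 2 = length)
# $2 = current locus as shown by ProgressMeter, e.g. 1:225478
progress_percent() {
  awk -F'\t' -v contig="${2%:*}" -v pos="${2##*:}" '
    { total += $2 }                                   # total reference length
    !found && $1 == contig { done_bases = before + pos; found = 1 }
    !found { before += $2 }                           # bases in earlier contigs
    END {
      if (!found) { print "contig not found in index" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", 100 * done_bases / total
    }
  ' "$1"
}
```

    For example, progress_percent /home/databench/GRCz11_RefData/GRCz11.fa.fai 1:225478 prints the share of the reference that lies before that position. Dividing the elapsed minutes by that share gives a rough total runtime, with the caveat that regions differ a lot in how long they take to call.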

  • Rami Kheireddine

    The current locus is at 3 after 238 minutes. I assume there are 26 thousand loci in zebrafish, so I guess I have a long wait! :(

     

  • Rami Kheireddine

    Hello SkyWarrior! Just wanted to let you know the run has finally finished (it took 28 hours).

    Thank you so much for your response, I really appreciate it.

