Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


HaplotypeCaller too many alternative alleles found

Answered

9 comments

  • Genevieve Brandt (she/her)

    Hi Madza Farias-Virgens, are you running GATK version 3.8? We are only supporting GATK4 at this time. I would recommend updating; you may also find better results.

  • madzayasodara

    Hello! 

    Yeah, they have a module for gatk 4.1.6.0 on the cluster, which is great. And I just did a local install of pyspark because I would like to enable multithreading.

    Tried to run 

    gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I RSFV1A_match.bam

    ...

    -I RSFV1Z_match.bam -L NC_044998.1 --output-mode EMIT_ALL_SITES --stand-call-conf 30 -mbq 20 -hets 0.006 --spark-runner SPARK --spark-master local[2] --num-executors 5 --executor-cores 2 --executor-memory 4g --conf spark.executor.memoryOverhead=600 -o raw_variantsZF_NC_044998.1.vcf

    and got the error

     

    20/07/22 15:28:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
    log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

    Appreciate any ideas on what could be wrong here.

    EDIT:

    Got rid of the first warning (20/07/22 15:28:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable)

    by 

    export LD_LIBRARY_PATH=/users/mfariasv/data/mfariasv/install/hadoop-2.7.3/lib/native

    Still getting errors 


    Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
    log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

    EDIT 2:

    Got rid of some of the warnings by installing from https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz

    instead of pip install pyspark
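
    For reference, a rough sketch of what switching from pip to the standalone distribution can look like; the unpack location and the SPARK_HOME/PATH handling below are assumptions on my part, not something I'm claiming is required:

    wget https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
    tar -xzf spark-3.0.0-bin-hadoop2.7.tgz                  # unpack next to the tarball
    export SPARK_HOME="$PWD/spark-3.0.0-bin-hadoop2.7"      # standard Spark convention
    export PATH="$SPARK_HOME/bin:$PATH"                     # so spark-submit is on PATH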

    Errors now are:

    Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
    20/07/22 21:27:56 INFO ShutdownHookManager: Shutdown hook called
    20/07/22 21:27:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-54f18356-fba9-4ff1-9212-6f90b15b385f

  • Genevieve Brandt (she/her)

    madzayasodara, we have a version of HaplotypeCaller that uses Spark, which you may want to use instead: https://gatk.broadinstitute.org/hc/en-us/articles/360046222131-HaplotypeCallerSpark-BETA-
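
    Roughly, a HaplotypeCallerSpark run with local multithreading could look something like the sketch below; the file names are the ones from your command, and the Spark settings after the -- separator are illustrative rather than a recommendation:

    # Note: the LOCAL runner / local[4] settings below are just an example.
    gatk HaplotypeCallerSpark \
        -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa \
        -I RSFV1A_match.bam \
        -L NC_044998.1 \
        -O raw_variantsZF_NC_044998.1.vcf \
        -- \
        --spark-runner LOCAL --spark-master 'local[4]'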

  • madzayasodara

    Hello, yeah, I still get the same error whether I use HaplotypeCaller or HaplotypeCallerSpark:

    Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
    20/07/23 15:09:01 INFO ShutdownHookManager: Shutdown hook called
    20/07/23 15:09:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-6e79f273-39ce-41b1-8319-829820a57827

    I also contacted people at the cluster, who suggested I do the below and use that pyspark installation, but I get the same errors.

    module load python/3.7.4
    virtualenv -p python3 Pyspark
    source Pyspark/bin/activate
    pip install pyspark

    Thank you

  • Genevieve Brandt (she/her)

    This looks like an issue on your end with the machine and with setting up the multithreading.

    The command you gave:

    gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I RSFV1A_match.bam

    ...

    -I RSFV1Z_match.bam -L NC_044998.1 --output-mode EMIT_ALL_SITES --stand-call-conf 30 -mbq 20 -hets 0.006 --spark-runner SPARK --spark-master local[2] --num-executors 5 --executor-cores 2 --executor-memory 4g --conf spark.executor.memoryOverhead=600 -o raw_variantsZF_NC_044998.1.vcf

    This will not work for HaplotypeCaller because you are using Spark options that are not available for the normal HaplotypeCaller tool.
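
    For comparison, a plain (non-Spark) HaplotypeCaller command for the same data would drop the Spark arguments entirely and use -O for the output file, roughly like this sketch (your threshold arguments are left out here just to keep it short):

    # Sketch only: -O replaces the GATK3-style -o; quality/confidence thresholds omitted.
    gatk HaplotypeCaller \
        -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa \
        -I RSFV1A_match.bam \
        -L NC_044998.1 \
        -O raw_variantsZF_NC_044998.1.vcf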

    The GATK support team is focused on issues with the tools or abnormal results. So, please let me know if you find out that it is an issue with the tool. For other issues, we may not be able to provide a solution. Other community members can chime in if they have successfully set up HaplotypeCallerSpark and have advice for this issue!

    For context, check out our support policy.

  • madzayasodara

    Let me see if I understood: GATK4 HaplotypeCaller can't multithread like the GATK3 HaplotypeCaller, since GATK4 HaplotypeCaller accepts neither the -nct nor the -nt option.

    The multithreaded version of the GATK4 HaplotypeCaller is actually HaplotypeCallerSpark, since Spark arguments aren't accepted by GATK4 HaplotypeCaller.

    I'll continue the investigation with people at Brown U CCV. Thanks! 

    The general instructions I was using (see below) from https://github.com/broadinstitute/gatk are not exactly accurate, because it doesn't work like this for all GATK4 tools.

    Running GATK4 Spark tools on a Spark cluster:

    ./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments

    • Examples:

      ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
          -- \
          --spark-runner SPARK --spark-master <master_url>
      
      ./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
        -- \
        --spark-runner SPARK --spark-master <master_url> \
        --num-executors 5 --executor-cores 2 --executor-memory 4g \
        --conf spark.executor.memoryOverhead=600

  • Genevieve Brandt (she/her)

    Thanks for pointing out that clarification we can make; I'll send a note to our documentation team!

  • madzayasodara

    Hello Genevieve Brandt (she/her), I managed to get rid of that error. I forgot to module load the needed Python version and source my env in my Slurm submission:

    module load python/3.7.4
    source ~/Pyspark/bin/activate
    module load gatk/4.1.6.0
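
    For context, the sketch below is roughly how those lines sit inside the Slurm submission script; the #SBATCH values are illustrative placeholders rather than the exact ones I used:

    #!/bin/bash
    #SBATCH --job-name=hc_spark     # placeholder job name
    #SBATCH --cpus-per-task=4       # assumed to match local[4]
    #SBATCH --mem=8G                # illustrative memory request

    module load python/3.7.4
    source ~/Pyspark/bin/activate
    module load gatk/4.1.6.0

    # ...gatk command goes here...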

    However, the cluster personnel informed me that the tool still doesn't leverage the 4 cores I ask for (--spark-runner LOCAL --spark-master local[4] --conf spark.executor.memoryOverhead=600).

    So I decided to go with GATK4 without multithreading, which indeed took care of the initial error in this thread ("too many alternative alleles found"), but it took 5 days to finish my 64 tasks, doing 15 tasks in parallel.
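
    (If it helps anyone else: the "64 tasks, 15 in parallel" pattern is the kind of thing a Slurm job array with a concurrency limit can express, e.g. the line below; purely illustrative, not necessarily how I ran it.)

    #SBATCH --array=1-64%15    # 64 array tasks, at most 15 running at the same time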

     

  • Genevieve Brandt (she/her)

    Hi madzayasodara, what is your complete command? And from that command, what is going wrong?

