HaplotypeCaller too many alternative alleles found
Hi, I'm calling raw variants with the intent of using them for base recalibration, and have noticed that for some sites HaplotypeCaller gives me this warning.
I have whole-genome sequencing data for 23 diploid individuals, 11 from one subspecies and 12 from another.
Genome size is 1.2 Gb, and they are all male songbirds. I do have a reference genome but no gold-standard VCF.
I would like to know what the suggestion is: should I leave the maximum number of alternate alleles at the program default of 6, or change it to accommodate more alternate alleles?
java -jar /users/mfariasv/data/mfariasv/install/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R newzf20/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I RSFV1A_match.bam
...
-I RSFV1Z_match.bam -L NC_044998.1 --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES -stand_call_conf 30 -mbq 20 -hets 0.006 -nct 4 -o raw_variantsZF_NC_044998.1.vcf
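If raising the cap turns out to be the right call, GATK 3.8's HaplotypeCaller exposes it via --max_alternate_alleles (default 6). A minimal sketch reusing the reference and BAM names from the command above; the value 8 is an arbitrary illustration, not a recommendation:

```shell
# Sketch: same style of call as above, but allowing up to 8 alternate alleles
# per site. The value 8 is illustrative only.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
  -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa \
  -I RSFV1A_match.bam \
  -L NC_044998.1 \
  --max_alternate_alleles 8 \
  -o raw_variantsZF_NC_044998.1.vcf
```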
Thank you
-
Hi Madza Farias-Virgens, are you running GATK version 3.8? We are only supporting GATK4 at this time. I would recommend updating; you may also find better results.
-
Hello!
Yeah, they have a module for GATK 4.1.6.0 on the cluster, which is great. And I just did a local install of PySpark because I would like it to multithread.
Tried to run
gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I RSFV1A_match.bam
...
-I RSFV1Z_match.bam -L NC_044998.1 --output-mode EMIT_ALL_SITES --stand-call-conf 30 -mbq 20 -hets 0.006 --spark-runner SPARK --spark-master local[2] --num-executors 5 --executor-cores 2 --executor-memory 4g --conf spark.executor.memoryOverhead=600 -o raw_variantsZF_NC_044998.1.vcf
and got the error
20/07/22 15:28:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Appreciate any ideas on what could be wrong here.
EDIT:
Got rid of the first warning, 20/07/22 15:28:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
by
export LD_LIBRARY_PATH=/users/mfariasv/data/mfariasv/install/hadoop-2.7.3/lib/native
Still getting errors
Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
EDIT 2:
Got rid of some of the warnings by installing from https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
instead of pip install pyspark
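The switch described above — from pip to the Apache tarball — might look something like this (a sketch; the install location and SPARK_HOME path are assumptions, not from the thread):

```shell
# Sketch: install Spark from the Apache tarball instead of `pip install pyspark`,
# so the full launcher scripts and bundled jars are on hand.
wget https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xzf spark-3.0.0-bin-hadoop2.7.tgz
export SPARK_HOME="$PWD/spark-3.0.0-bin-hadoop2.7"   # assumed install path
export PATH="$SPARK_HOME/bin:$PATH"
```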
Errors now are:
Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
20/07/22 21:27:56 INFO ShutdownHookManager: Shutdown hook called
20/07/22 21:27:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-54f18356-fba9-4ff1-9212-6f90b15b385f
-
madzayasodara we have a HaplotypeCaller version using spark that you may want to use instead: https://gatk.broadinstitute.org/hc/en-us/articles/360046222131-HaplotypeCallerSpark-BETA-
-
Hello, yeah, I still get the same error whether using HaplotypeCaller or HaplotypeCallerSpark:
Error: Failed to load org.broadinstitute.hellbender.Main: org/apache/logging/log4j/core/appender/AbstractAppender
20/07/23 15:09:01 INFO ShutdownHookManager: Shutdown hook called
20/07/23 15:09:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-6e79f273-39ce-41b1-8319-829820a57827
Also contacted people at the cluster, who suggested I do the below and use that PySpark installation, but I get the same errors.
module load python/3.7.4
virtualenv -p python3 Pyspark
source Pyspark/bin/activate
pip install pyspark
Thank you
-
This looks like an issue on your end with the machine and setting up the multi-threading.
The command you gave:
gatk HaplotypeCaller -R GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fa -I RSFV1A_match.bam
...
-I RSFV1Z_match.bam -L NC_044998.1 --output-mode EMIT_ALL_SITES --stand-call-conf 30 -mbq 20 -hets 0.006 --spark-runner SPARK --spark-master local[2] --num-executors 5 --executor-cores 2 --executor-memory 4g --conf spark.executor.memoryOverhead=600 -o raw_variantsZF_NC_044998.1.vcf
This will not work for HaplotypeCaller because you are using Spark options that are not available for the normal (non-Spark) HaplotypeCaller tool.
The GATK support team is focused on issues with the tools or abnormal results. So, please let me know if you find out that it is an issue with the tool. For other issues, we may not be able to provide a solution. Other community members can chime in if they have successfully set up HaplotypeCallerSpark and have advice for this issue!
For context, check out our support policy.
-
Let me see if I understood: GATK4 HaplotypeCaller can't multithread like GATK3 HaplotypeCaller, since GATK4 HC accepts neither the -nct nor the -nt option.
The multithreaded version of GATK4 HaplotypeCaller is actually HaplotypeCallerSpark, as Spark args aren't taken by GATK4 HC.
I'll continue the investigation with people at Brown U CCV. Thanks!
The general instructions I was using (see below) from https://github.com/broadinstitute/gatk are not exactly accurate, because not all GATK4 tools work like this.
Running GATK4 Spark tools on a Spark cluster:
./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments
-
Examples:
./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --spark-runner SPARK --spark-master <master_url>
./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --spark-runner SPARK --spark-master <master_url> \
  --num-executors 5 --executor-cores 2 --executor-memory 4g \
  --conf spark.executor.memoryOverhead=600
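By analogy with the PrintReadsSpark examples above, a HaplotypeCallerSpark invocation might look like this (a sketch only; file names and the master URL are placeholders, not from this thread):

```shell
# Sketch: HaplotypeCallerSpark, with Spark arguments after the "--" separator.
# Reference, BAM, and output names are placeholders.
gatk HaplotypeCallerSpark \
  -R reference.fa \
  -I sample.bam \
  -O output.vcf \
  -- \
  --spark-runner LOCAL --spark-master 'local[4]'
```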
-
Thanks for pointing out that clarification we can make, I'll send a note to our documentation team!
-
Hello Genevieve Brandt (she/her), I managed to get rid of that error. I had forgotten to module load the needed Python version and source my env in my SLURM submission:
module load python/3.7.4
source ~/Pyspark/bin/activate
module load gatk/4.1.6.0
However, the cluster personnel informed me that the tool still doesn't leverage the 4 cores I ask for (--spark-runner LOCAL --spark-master local[4] --conf spark.executor.memoryOverhead=600).
So, I decided to go with GATK4 without multithreading, which indeed took care of the initial error in this thread ("too many alternative alleles found"), but it took 5 days to finish my 64 tasks, doing 15 tasks in parallel.
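Pulling the pieces above together, a SLURM submission script with the environment loaded inside it might look like this (a sketch; the SBATCH resource values and file names are assumptions, while the module/source lines come from this thread):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=4    # assumed: cores requested for local[4]
#SBATCH --mem=16G            # assumed memory request

# Load the environment inside the submission script, as described above.
module load python/3.7.4
source ~/Pyspark/bin/activate
module load gatk/4.1.6.0

# Placeholder file names; Spark args follow the "--" separator.
gatk HaplotypeCallerSpark -R reference.fa -I sample.bam -O output.vcf \
  -- --spark-runner LOCAL --spark-master 'local[4]' \
  --conf spark.executor.memoryOverhead=600
```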
-
Hi madzayasodara, what is your complete command? And from that command, what is going wrong?