HaplotypeCaller traversing all chromosomes, but they aren't in the output gvcf...
Hi there,
I am using HaplotypeCaller in GATK v 4.0.11.0. I am giving each run 34Gb of memory and 12 threads, but each run hardly makes it through chromosome 2 (there are 17 total chromosomes) given several days of run time.
In fact, even before it runs out of time, the job 'finishes' with no obvious errors. But there is only chromosome 1 and 2 in the output, despite the log claiming it has traversed all the way through chromosome 17.
Now, these genotypes are pretty divergent from the reference genome, so the numerous variants are probably factoring into the long run time - but this doesn't explain why I only seem to call up til chr1 and chr2, which I've figured out is happening by running this command for several of my files:
less file.g.vcf.gz | grep -v '#' | awk '{print $1}' | sort | uniq
The tail end of the log file looks like this (it's a large file):
less HaplotypeCaller_3499113 | tail -n 50
08:36:36.366 INFO ProgressMeter - chr17A:34029100 2843.9 3926180 1380.5
08:36:37.628 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.628 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.628 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.629 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.688 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.689 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.690 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.691 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:37.692 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:39.623 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:39.623 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
08:36:39.682 INFO HaplotypeCaller - No reads filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
08:36:39.682 INFO ProgressMeter - chr17A:34064773 2844.0 3926346 1380.6
08:36:39.682 INFO ProgressMeter - Traversal complete. Processed 3926346 total regions in 2844.0 minutes.
08:36:39.868 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 20.238113409
08:36:39.868 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 12863.461859667
08:36:39.868 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 132245.88 sec
08:36:39.868 INFO HaplotypeCaller - Shutting down engine
[July 7, 2024 8:36:39 AM CDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 2,844.03 minutes.
Runtime.totalMemory()=1026031616
And here is my bash script to our slurm system:
#!/bin/bash --login
#SBATCH -J hapcall_e
#SBATCH --mail-user=cgoeckeritz@hudsonalpha.org
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=36g
#SBATCH --time=72:00:00
#SBATCH -o /cluster/home/cgoeckeritz/bwa_full/embryos/gatk/output_files/HaplotypeCaller_%j
#SBATCH --export=INFILE=/cluster/home/cgoeckeritz/bwa_full/embryos/unfinished_embryo_bams_gatk.txt
#SBATCH -a [1-427]%100
FILE=`/bin/sed -n ${SLURM_ARRAY_TASK_ID}p ${INFILE}`
echo ${FILE}
Name=$(echo ${FILE}| sed 's/\/cluster\/home\/cgoeckeritz\/bwa_full\/embryos\///' | sed 's/_sorted.bam//')
echo ${Name}
module purge
module load cluster/gatk
cd /cluster/home/cgoeckeritz/bwa_full/embryos/gatk/
gatk --java-options "-Xmx34g" HaplotypeCaller \
-R /cluster/home/cgoeckeritz/honeycrisp_full_genome/Malus_x_domestica_Honeycrisp_HAP1_v1.1.a1_scaffolded.fasta \
-I ${FILE} \
-O ${Name}.g.vcf.gz \
-ERC GVCF \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation \
--native-pair-hmm-threads 12
I planned to combine the gvcfs later and do joint genotyping on certain groups of these samples. Any idea how I can 1) get HaplotypeCaller to truly output sites for all 17 chromosomes, and 2) get the function to run faster/more efficiently? If this isn't enough information to determine potential problems, please let me know.
Thanks so much for your time!
Charity
REQUIRED for all errors and issues:
a) GATK version used:
b) Exact command used:
c) Entire program log:
-
We cannot do anything about what is permitted on your slurm cluster however we can give you our recommendations. It may be possible that your runs are going slow and exceeding the time permitted by slurm therefore they are prematurely stopped and you do not get any additional contigs beyond the first 2.
We recommend you to run HaplotypeCaller in parallel per contig or even per subcontig that is split by long runs of hard masked (Ns) regions. Once each shard is complete, you can gather those pieces to make the complete callset per sample which you can later on combine with GenomicsDBImport.
I hope this helps.
-
Hi Gökalp,
Thanks so much for your response! I was checking the files this morning and found that some of them are getting further - I saw one had reached up to chromosome 8. As far as I can tell it doesn't seem to be a memory or runtime issue.. I can't find any such error in the log files. And they quit earlier than the time allotted for each. I'm unsure what's going on, I'll talk about this with our cluster manager too. Wouldn't GATK report, at the end of the log file, some kind of oom or time limit error?
Thanks for your time, and I appreciate any other ideas you may have!
Kindly,
Charity -
GATK can only tell java or user errors but slurm errors can only be seen by slurm logs which you may need to get help from IT support.
If your read depths and genomic complexity is high then it is expected to observe long runtimes and higher heap size requirements by HaplotypeCaller. It would be better if you can split your genome into managable shards and call them separately.
Please sign in to leave a comment.
3 comments