CombineGVCFs on chrX and chrY (hg38) running forever
I have been running HaplotypeCaller (HC) on 25 WGS samples aligned to hg38. HC was run on the entire genome, including the autosomes and sex chromosomes, in two different ways: 1) HC on each chromosome with the `-L` parameter
a) GATK version used: v4.1.9.0
b) Exact command used:
HaplotypeCaller command, shown below per chromosome:
java -Xmx6G -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller -R $REF -I ${f}_hg38.bam -O ${f}_chrX_hg38.vcf.gz -ERC GVCF -L chrX -ploidy 2 -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
and the second approach: 2) HC on all chromosomes together, without the `-L` parameter.
HC produced GVCF files successfully in both approaches described above, without any errors in the log files.
Then CombineGVCFs was run to combine the GVCFs from all samples: 1) combining the per-chromosome VCFs from HC approach 1, and 2) combining the all-chromosome VCFs from HC approach 2.
CombineGVCFs command, shown below per chromosome:
java -Xmx6G -jar gatk-package-4.1.9.0-local.jar CombineGVCFs -R $REF --variant chrX.list -O WGS_chrX_hg38_AS.g.vcf.gz
The issue is that CombineGVCFs generates the combined GVCF for all autosomal chromosomes in a few hours in both approaches, but the chrX and chrY jobs keep running for 3-4 days without completing. I have re-run them 2-3 times already, in case this was some generic glitch on the HPC cluster, but the situation remains the same: CombineGVCFs never finishes on chrX and chrY.
The log file from CombineGVCFs looks normal, except that the `ProgressMeter` is quite slow.
The log files from HaplotypeCaller are normal, with the "HaplotypeCaller done" note at the end. However, the gVCF sizes vary considerably among samples, as shown below for chrY; the chrX VCF files show a similar pattern.
70M Mar 19 21:24 1-chrY_hg38_AS.g.vcf.gz
84M Mar 19 21:32 2-chrY_hg38_AS.g.vcf.gz
87M Mar 19 21:29 3-chrY_hg38_AS.g.vcf.gz
87M Mar 19 21:28 4-chrY_hg38_AS.g.vcf.gz
4.5M Mar 19 21:21 5-chrY_hg38_AS.g.vcf.gz
5.0M Mar 19 21:17 6-chrY_hg38_AS.g.vcf.gz
4.4M Mar 19 21:13 7-chrY_hg38_AS.g.vcf.gz
74M Mar 19 21:21 8-chrY_hg38_AS.g.vcf.gz
4.4M Mar 19 21:13 9-chrY_hg38_AS.g.vcf.gz
4.1M Mar 19 21:27 10-chrY_hg38_AS.g.vcf.gz
68M Mar 19 21:34 11-chrY_hg38_AS.g.vcf.gz
76M Mar 19 21:37 11-chrY_hg38_AS.g.vcf.gz
60M Mar 19 21:41 12-chrY_hg38_AS.g.vcf.gz
3.8M Mar 19 21:35 13-chrY_hg38_AS.g.vcf.gz
32K Mar 19 21:26 14-chrY_hg38_AS.g.vcf.gz
4.3M Mar 19 21:39 15-chrY_hg38_AS.g.vcf.gz
82M Mar 19 21:48 16-chrY_hg38_AS.g.vcf.gz
83M Mar 19 21:52 17-chrY_hg38_AS.g.vcf.gz
3.9M Mar 19 21:44 18-chrY_hg38_AS.g.vcf.gz
83M Mar 19 22:06 19-chrY_hg38_AS.g.vcf.gz
4.9M Mar 19 21:51 20-chrY_hg38_AS.g.vcf.gz
86M Mar 19 22:45 21-chrY_hg38_AS.g.vcf.gz
72M Mar 19 21:54 22-chrY_hg38_AS.g.vcf.gz
4.2M Mar 19 21:59 23-chrY_hg38_AS.g.vcf.gz
67M Mar 19 22:02 24-chrY_hg38_AS.g.vcf.gz
86M Mar 19 22:43 25-chrY_hg38_AS.g.vcf.gz
85M Mar 19 22:43 26-chrY_hg38_AS.g.vcf.gz
32M Mar 19 22:00 27-chrY_hg38_AS.g.vcf.gz
The log files of these relatively small samples also look normal. Could someone help with what the cause of this odd issue might be? I have never seen it before in my experience working with GATK.
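One quick sanity check before combining is to count the record lines in each per-sample gVCF, since the ~4 MB files in the listing could be either sparsely called or truncated. A minimal sketch, assuming the file naming shown above:

```shell
# Count the non-header records in each per-sample chrY gVCF; large differences
# in record counts (not just file size) would be worth investigating before
# blaming CombineGVCFs.
count_records() {
    # grep -vc '^#' counts lines that are NOT header lines
    gzip -dc "$1" | grep -vc '^#'
}

for f in ./*-chrY_hg38_AS.g.vcf.gz; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    printf '%s\t%s\n' "$f" "$(count_records "$f")"
done
```

Low record counts in the small files would be expected for samples with little chrY coverage (e.g. female samples), which would make the size spread above unremarkable.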
-
Hi mehar,
Are you running the X and Y chromosomes together or with different commands?
Also, are you running them haploid or diploid? There are diploid performance optimizations that do not exist for haploid, which could be causing this issue.
We don't have any best practices recommendations for genomes and the Y chromosome, here is another post where this is discussed: https://gatk.broadinstitute.org/hc/en-us/community/posts/360057847692-Calling-variants-on-chromosome-Y
Best,
Genevieve
-
I am running HaplotypeCaller and CombineGVCFs on X and Y separately, with the commands shown above.
I have used -ploidy 2 on chrX and chrY which is also the default with HaplotypeCaller.
"We don't have any best practices recommendations for genomes and the Y chromosome" Does it mean GATK doesn't work on chrX and chrY on human genomes?
The above thread suggests using -ploidy 1 on chrY, which I can try. However, I do not understand why I still have the issue on chrX using -ploidy 2 with HC.
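For reference, a haploid re-run of the chrY command might look like the sketch below; the sample file names are placeholders, following the per-chromosome command earlier in the thread:

```shell
# Sketch: haploid HaplotypeCaller run on chrY for a male sample
# (only -ploidy and the output name change relative to the diploid command)
java -Xmx6G -jar gatk-package-4.1.9.0-local.jar HaplotypeCaller \
    -R "$REF" \
    -I sample_hg38.bam \
    -O sample_chrY_haploid_hg38.g.vcf.gz \
    -ERC GVCF \
    -L chrY \
    -ploidy 1 \
    -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation
```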
-
Hi mehar, can you provide your specific commands without the variables? I only see a command for the X chromosome, not the Y.
-
Below are the HC and CombineGVCF commands shown for chrX. For chrY, chrX is replaced with chrY in the commands.
HaplotypeCaller command:
$gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" \
HaplotypeCaller \
-R hg38.fasta \
-I hg38_analysisReady.bam \
-O chrX_hg38_AS.g.vcf.gz \
-ERC GVCF \
-L chrX \
-ploidy 2 \
-G StandardAnnotation \
-G AS_StandardAnnotation \
-G StandardHCAnnotation
CombineGVCFs command:
gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" \
CombineGVCFs \
-R hg38.fasta \
--variant X.list \
-L chrX \
-O WGS_chrX_hg38_AS.g.vcf.gz
-
Are the stack traces continuing to update after the 3 days (the job is still running) or have they stopped (the job is hung)?
-
It seems to be updating, but very slowly. There were never issues on the autosomal chromosomes on the same cluster. Assuming the job might be hung, it was re-run 3-4 times already, but the behaviour is the same. Below is the tail of the log after 26 hrs:
$ less array_job_err_3596_4294967294.txt | tail
00:52:16.256 INFO ProgressMeter - chrX:120038567 1442.9 734127000 508773.7
00:52:31.582 INFO ProgressMeter - chrX:120038796 1443.2 734128000 508684.3
00:52:55.470 INFO ProgressMeter - chrX:120046913 1443.6 734129000 508544.7
00:53:08.004 INFO ProgressMeter - chrX:120047099 1443.8 734130000 508471.8
00:53:20.105 INFO ProgressMeter - chrX:120047275 1444.0 734131000 508401.5
00:53:34.444 INFO ProgressMeter - chrX:120047479 1444.2 734132000 508318.1
00:53:50.712 INFO ProgressMeter - chrX:120047720 1444.5 734133000 508223.4
00:54:21.730 INFO ProgressMeter - chrX:120049746 1445.0 734134000 508042.2
00:54:55.438 INFO ProgressMeter - chrX:120056687 1445.6 734135000 507845.5
00:55:12.098 INFO ProgressMeter - chrX:120056922 1445.9 734136000 507748.6
-
OK, thanks for the added information. I'll look into it and let you know when I have updates.
-
Hi mehar,
When I say we don't have best practices for X and Y, that is to say that we don't have any recommendations for how to treat them differently. At Broad we produce diploid X and Y calls for male and female samples and expect users to do some filtering and post processing downstream.
Furthermore, when we run our WGS calling, we don't run on the *whole* genome. Specifically, hg38 has a lot more centromere and telomere sequence with messy satellite repeats compared with hg19/b37/GRCh37. Can you try excluding the centromere (https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=1&hgta_group=map&hgta_track=centromeres&hgta_table=centromeres&hgsid=1064576769_mGEKsUbCAX9DtZCcdhWf5lz00VYc) and telomere locations from your X and Y combine tasks? These have a higher density of variants due to the variable nature of the satellite repeats and most analyses exclude them anyway. Maybe start with Y since it's a small chromosome. If they're still running slow, can you compare the variants per minute in the logs between the slow Y run and a faster autosome run? The two variables controlling the runtime are the density of variants, which I'm trying to correct by excluding repeat regions, and the speed of processing, which we can observe from the logs. And of course the size of the chromosomes, but there's nothing we can do about that. :-)
-Laura
-
"When I say we don't have best practices for X and Y, that is to say that we don't have any recommendations for how to treat them differently." I understand that, sorry for the nasty comment :)
" At Broad we produce diploid X and Y calls for male and female samples and expect users to do some filtering and post processing downstream."
That means using the default '-ploidy 2' option, which is what we do here as well.
"Furthermore, when we run our WGS calling, we don't run on the *whole* genome."
Does this mean the telomeric and centromeric regions are also excluded at Broad on hg38?
-
I believe that the centromeric and telomeric regions are in the GVCFs generated by HaplotypeCaller, but excluded from joint calling.
Let me know how the argument changes go.
-
Hi Laura,
I have provided the centromere and telomere regions to exclude; however, it does not seem to have any impact.
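For concreteness, the exclusion can be supplied with `-XL` (`--exclude-intervals`); this is a sketch, and the interval file name is a placeholder for a BED built from the UCSC centromere/telomere tables Laura linked:

```shell
# Sketch: chrY combine with centromere/telomere intervals excluded
# (centromeres_telomeres_hg38.bed is a hypothetical BED of those regions)
gatk --java-options "-Xmx32G" CombineGVCFs \
    -R hg38.fasta \
    --variant Y.list \
    -L chrY \
    -XL centromeres_telomeres_hg38.bed \
    -O WGS_chrY_hg38_AS.g.vcf.gz
```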
Job without excluding telomeres/centromeres, running for 28 hrs:
$ tail array_job_err_3840_4294967294.txt
15:36:00.214 INFO ProgressMeter - chrY:10755554 1701.7 1690000 993.1
15:36:23.907 INFO ProgressMeter - chrY:10755801 1702.1 1691000 993.5
15:36:52.297 INFO ProgressMeter - chrY:10756100 1702.5 1692000 993.8
15:37:09.807 INFO ProgressMeter - chrY:10756282 1702.8 1693000 994.2
15:37:34.914 INFO ProgressMeter - chrY:10756530 1703.2 1694000 994.6
15:37:54.069 INFO ProgressMeter - chrY:10756742 1703.6 1695000 995.0
15:38:13.471 INFO ProgressMeter - chrY:10756945 1703.9 1696000 995.4
15:38:34.393 INFO ProgressMeter - chrY:10757160 1704.2 1697000 995.8
15:38:58.948 INFO ProgressMeter - chrY:10757468 1704.6 1698000 996.1
15:39:37.826 INFO ProgressMeter - chrY:10757925 1705.3 1699000 996.3
Job excluding telomeres/centromeres, running for 3 hrs:
$ less array_job_err_4071_4294967294.txt | tail
15:27:20.168 INFO ProgressMeter - chrY:3249945 134.0 100000 746.4
15:28:52.077 INFO ProgressMeter - chrY:3254144 135.5 101000 745.3
15:30:28.143 INFO ProgressMeter - chrY:3263419 137.1 102000 743.9
15:32:09.308 INFO ProgressMeter - chrY:3272477 138.8 103000 742.1
15:33:43.451 INFO ProgressMeter - chrY:3275012 140.4 104000 740.9
15:34:44.530 INFO ProgressMeter - chrY:3276626 141.4 105000 742.7
15:35:52.962 INFO ProgressMeter - chrY:3278525 142.5 106000 743.7
15:37:04.926 INFO ProgressMeter - chrY:3279618 143.7 107000 744.5
15:38:06.581 INFO ProgressMeter - chrY:3280424 144.8 108000 746.1
15:39:43.006 INFO ProgressMeter - chrY:3285509 146.4 109000 744.8
In both cases the Variants/Minute don't look great, and in fact the Variants/Minute are lower in the run that excludes the centromeric/telomeric regions.
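The variants-per-minute comparison suggested earlier can be pulled straight from the ProgressMeter columns (elapsed minutes is field 6, records processed is field 7). A small sketch, assuming the log format shown above:

```shell
# Compute average records/minute between the first and last ProgressMeter
# data lines read on stdin (fields: time, INFO, ProgressMeter, -, locus,
# elapsed minutes, records processed, records/minute).
throughput() {
    awk '/ProgressMeter/ && $5 ~ /^chr/ {
             if (!seen++) { t0 = $6; r0 = $7 }  # first data line
             t1 = $6; r1 = $7                   # last data line so far
         }
         END { if (t1 > t0) printf "%.1f\n", (r1 - r0) / (t1 - t0) }'
}
# Usage: tail -n 200 array_job_err_3840_4294967294.txt | throughput
```

Running this on a chrY log and an autosome log from the same cluster gives the per-minute figures Laura asked to compare.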
I really wonder whether this issue is specific to my setup, or whether others have seen this behaviour before.
-
mehar, I was wondering if there is a reason you are not using GenomicsDBImport?
-
This could be tough to profile/benchmark. Can you run 3,000,000-28,000,000 for both chr1 and chrY? Hopefully it's relatively quick. Then I'm interested in the runtimes for the two and the number of variants in the resulting file for the two. If that's not informative, then the next thing I would compare is the number of multi-allelics between the two. If X and Y have a lot more alleles per site that will really slow things down when we merge variants and generate the big fat PL arrays for each sample.
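The multi-allelic comparison can be scripted. Note that in a gVCF every variant row carries `<NON_REF>` as an extra ALT allele, so two or more commas in the ALT column mark a genuinely multi-allelic site. A sketch, with the file name as a placeholder:

```shell
# Count genuinely multi-allelic sites in a gVCF: ALT is column 5, every
# variant row ends in ",<NON_REF>", so a biallelic site has 2 ALT fields
# and a multi-allelic site has 3 or more.
multiallelic_count() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ && split($5, a, ",") > 2 { n++ }
                                END { print n + 0 }'
}
# Usage: multiallelic_count 1-chrY_hg38_AS.g.vcf.gz
```

Comparing this count between the chr1 and chrY windows would show whether allele density explains the slowdown.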
-
CombineGVCFs succeeded on chrY after 3 days, but chrX didn't.
I am now running GenomicsDBImport on chrX. This is also weird: it has been showing the same log for 56 hrs. Very strange :(
$ java -version
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

java -Xmx6G -jar /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar \
GenomicsDBImport \
--genomicsdb-workspace-path wgs-chrX-db \
--sample-name-map BGIgVCF.map \
-L chrX

$ cat array_job_err_4419.txt
14:18:40.740 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Mar 30, 2021 2:18:41 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
14:18:41.060 INFO GenomicsDBImport - ------------------------------------------------------------
14:18:41.061 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.9.0
14:18:41.061 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
14:18:41.061 INFO GenomicsDBImport - Executing as arumilli@hackman-1-20.it.helsinki.fi on Linux v5.4.0-67-generic amd64
14:18:41.068 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
14:18:41.068 INFO GenomicsDBImport - Start Date/Time: March 30, 2021 2:18:40 PM EEST
14:18:41.068 INFO GenomicsDBImport - ------------------------------------------------------------
14:18:41.068 INFO GenomicsDBImport - ------------------------------------------------------------
14:18:41.069 INFO GenomicsDBImport - HTSJDK Version: 2.23.0
14:18:41.069 INFO GenomicsDBImport - Picard Version: 2.23.3
14:18:41.069 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:18:41.069 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:18:41.085 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:18:41.085 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:18:41.085 INFO GenomicsDBImport - Deflater: IntelDeflater
14:18:41.085 INFO GenomicsDBImport - Inflater: IntelInflater
14:18:41.085 INFO GenomicsDBImport - GCS max retries/reopens: 20
14:18:41.085 INFO GenomicsDBImport - Requester pays: disabled
14:18:41.085 INFO GenomicsDBImport - Initializing engine
14:18:41.757 INFO IntervalArgumentCollection - Processing 156040895 bp from intervals
14:18:41.758 INFO GenomicsDBImport - Done initializing engine
14:18:42.364 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63
14:18:42.389 INFO GenomicsDBImport - Vid Map JSON file will be written to /chrX/wgs-chrX-db/vidmap.json
14:18:42.389 INFO GenomicsDBImport - Callset Map JSON file will be written to /chrX/wgs-chrX-db/callset.json
14:18:42.389 INFO GenomicsDBImport - Complete VCF Header will be written to /chrX/wgs-chrX-db/vcfheader.vcf
14:18:42.389 INFO GenomicsDBImport - Importing to workspace - /chrX/wgs-chrX-db
14:18:42.389 INFO ProgressMeter - Starting traversal
14:18:42.390 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
14:18:45.652 INFO GenomicsDBImport - Importing batch 1 with 28 samples
-
Hi mehar,
Please see these suggestions for performance guidelines with GenomicsDBImport. In addition, you can try the following suggestions for runtime and memory usage issues:
- Check memory/disk space availability on your end.
- Specify Java memory usage using the java option -Xmx.
- Run the gatk command with the gatk wrapper script command line.
- Specify a --tmp-dir that has room for all necessary temporary files.
- Verify this issue persists with the latest version of GATK.
- Check the depth of coverage of your sample at the area of interest.
The progress meter for GATK tools does not update until the tool has processed a certain number of sites. It is possible that a memory issue at the beginning of chromosome X is preventing your job from continuing, either because of the density of variants or a large number of alleles per variant.
I would check the memory allocation you are giving this command again, because 6G may not be enough. As described in the GDBI usage and performance guidelines, you also want to make sure there is more physical memory available than the amount you specify with the -Xmx parameter. How much physical memory are you allocating for these commands?
Another check you could do is to run on the middle of chromosome X (change the interval parameter) to see whether the beginning of the chromosome is causing the issue.
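Both checks can be expressed as small variations of the import command; the heap size, batch size, thread count, and interval below are illustrative, not recommendations:

```shell
# Sketch 1: batched import with more heap, leaving physical memory headroom
# (--batch-size and --reader-threads are the options described in the
# GenomicsDBImport performance guidelines)
gatk --java-options "-Xmx24G -Djava.io.tmpdir=./tmp" \
    GenomicsDBImport \
    --genomicsdb-workspace-path wgs-chrX-db \
    --sample-name-map BGIgVCF.map \
    -L chrX \
    --batch-size 10 \
    --reader-threads 4

# Sketch 2: import only a mid-chrX window to test whether the start of the
# chromosome is the bottleneck (interval chosen arbitrarily)
gatk --java-options "-Xmx24G" \
    GenomicsDBImport \
    --genomicsdb-workspace-path wgs-chrX-mid-db \
    --sample-name-map BGIgVCF.map \
    -L chrX:60000000-90000000
```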
Best,
Genevieve
-