Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

CombineGVCF on chrX and chrY hg38 running for ever

0

15 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi mehar,

    Are you running the X and Y chromosomes together or with different commands?

    Also, are you running them haploid or diploid? There are diploid performance optimizations that do not exist for haploid, which could be causing this issue.

    We don't have any best practices recommendations for genomes and the Y chromosome, here is another post where this is discussed: https://gatk.broadinstitute.org/hc/en-us/community/posts/360057847692-Calling-variants-on-chromosome-Y

    Best,

    Genevieve

     

    0
    Comment actions Permalink
  • Avatar
    mehar

    I am running HaplotypeCaller andCombineGVCFs on X and Y separately with the commands shown above.

    I have used -ploidy 2 on chrX and chrY which is also the default with HaplotypeCaller.

    "We don't have any best practices recommendations for genomes and the Y chromosome" Does it mean GATK doesn't work on chrX and chrY on human genomes? 

    The above thread suggests to use -ploidy 1 on chrY which i can try. However, i do not understand why i still have the issue with chrX using -ploidy 2 with HC.

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi mehar, can you provide your specific commands without the variables? I only see a command for the X chromosome, not the Y.

    0
    Comment actions Permalink
  • Avatar
    mehar

    Below are the HC and CombineGVCF commands shown for chrX. For chrY, chrX is replaced with chrY in the commands.

    HaplotypeCaller command:

    $gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" \

    HaplotypeCaller \

    -R hg38.fasta \

    -I hg38_analysisReady.bam \

    -O chrX_hg38_AS.g.vcf.gz \

    -ERC GVCF \

    -L chrX \

    -ploidy 2 \

    -G StandardAnnotation \

    -G AS_StandardAnnotation \

    -G StandardHCAnnotation

    CombineGVCF command

    gatk4 --java-options "-Xms10G -Xmx32G -Djava.io.tmpdir=./tmp" \\

    CombineGVCFs \

    -R hg38.fasta \

    --variant X.list \

    -L chrX \

    -O WGS_chrX_hg38_AS.g.vcf.gz

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Are the stack traces continuing to update after the 3 days (the job is still running) or have they stopped (the job is hung)?

    0
    Comment actions Permalink
  • Avatar
    mehar

    It seems to be updating, but very slow. It never had issues on autosomal chromosomes on the same cluster. Assuming the job might be hung, it was re-run 3-4 times already but same behaviour. Below is the tail of the stack trace after 26hrs of the job:

    $ less array_job_err_3596_4294967294.txt | tail

    00:52:16.256 INFO  ProgressMeter -       chrX:120038567           1442.9             734127000         508773.7

    00:52:31.582 INFO  ProgressMeter -       chrX:120038796           1443.2             734128000         508684.3

    00:52:55.470 INFO  ProgressMeter -       chrX:120046913           1443.6             734129000         508544.7

    00:53:08.004 INFO  ProgressMeter -       chrX:120047099           1443.8             734130000         508471.8

    00:53:20.105 INFO  ProgressMeter -       chrX:120047275           1444.0             734131000         508401.5

    00:53:34.444 INFO  ProgressMeter -       chrX:120047479           1444.2             734132000         508318.1

    00:53:50.712 INFO  ProgressMeter -       chrX:120047720           1444.5             734133000         508223.4

    00:54:21.730 INFO  ProgressMeter -       chrX:120049746           1445.0             734134000         508042.2

    00:54:55.438 INFO  ProgressMeter -       chrX:120056687           1445.6             734135000         507845.5

    00:55:12.098 INFO  ProgressMeter -       chrX:120056922           1445.9             734136000         507748.6

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ok thanks for this added information, I'll look into it and let you know when I have updates

    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    Hi mehar,

    When I say we don't have best practices for X and Y, that is to say that we don't have any recommendations for how to treat them differently.  At Broad we produce diploid X and Y calls for male and female samples and expect users to do some filtering and post processing downstream.

    Furthermore, when we run our WGS calling, we don't run on the *whole* genome.  Specifically, hg38 has a lot more centromere and telomere sequence with messy satellite repeats compared with hg19/b37/GRCh37.  Can you try excluding the centromere (https://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=1&hgta_group=map&hgta_track=centromeres&hgta_table=centromeres&hgsid=1064576769_mGEKsUbCAX9DtZCcdhWf5lz00VYc) and telomere locations from your X and Y combine tasks?  These have a higher density of variants due to the variable nature of the satellite repeats and most analyses exclude them anyway.  Maybe start with Y since it's a small chromosome.  If they're still running slow, can you compare the variants per minute in the logs between the slow Y run and a faster autosome run?  The two variables controlling the runtime are the density of variants, which I'm trying to correct by excluding repeat regions, and the speed of processing, which we can observe from the logs.  And of course the size of the chromosomes, but there's nothing we can do about that.  :-)

    -Laura

    0
    Comment actions Permalink
  • Avatar
    mehar

    "When I say we don't have best practices for X and Y, that is to say that we don't have any recommendations for how to treat them differently." I understand that, sorry for the nasty comment :)

    " At Broad we produce diploid X and Y calls for male and female samples and expect users to do some filtering and post processing downstream."

    It means using the default '-ploidy 2' option and we do the same here.

    "Furthermore, when we run our WGS calling, we don't run on the *whole* genome."

    Does this mean the telomeric, centromeric regions are excluded at Broad as well on hg38?

    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    I believe that the centromeric and telometric regions are in the GVCFs generated by HaplotypeCaller, but excluded from joint calling.

    Let me know how the argument changes go.

    0
    Comment actions Permalink
  • Avatar
    mehar

    Hi Laura,

    I have provided the centromere and telomere regions to exclude. However, it does not seem to have any impact.

    Job without excluding telomeres/centromeres running since 28hrs:

    $ tail array_job_err_3840_4294967294.txt

    15:36:00.214 INFO  ProgressMeter -        chrY:10755554           1701.7               1690000            993.1

    15:36:23.907 INFO  ProgressMeter -        chrY:10755801           1702.1               1691000            993.5

    15:36:52.297 INFO  ProgressMeter -        chrY:10756100           1702.5               1692000            993.8

    15:37:09.807 INFO  ProgressMeter -        chrY:10756282           1702.8               1693000            994.2

    15:37:34.914 INFO  ProgressMeter -        chrY:10756530           1703.2               1694000            994.6

    15:37:54.069 INFO  ProgressMeter -        chrY:10756742           1703.6               1695000            995.0

    15:38:13.471 INFO  ProgressMeter -        chrY:10756945           1703.9               1696000            995.4

    15:38:34.393 INFO  ProgressMeter -        chrY:10757160           1704.2               1697000            995.8

    15:38:58.948 INFO  ProgressMeter -        chrY:10757468           1704.6               1698000            996.1

    15:39:37.826 INFO  ProgressMeter -        chrY:10757925           1705.3               1699000            996.3

    $ less array_job_err_4071_4294967294.txt

    Job excluding telomeres/centromeres running since 3hrs:

    $ less array_job_err_4071_4294967294.txt | tail

    15:27:20.168 INFO  ProgressMeter -         chrY:3249945            134.0                100000            746.4

    15:28:52.077 INFO  ProgressMeter -         chrY:3254144            135.5                101000            745.3

    15:30:28.143 INFO  ProgressMeter -         chrY:3263419            137.1                102000            743.9

    15:32:09.308 INFO  ProgressMeter -         chrY:3272477            138.8                103000            742.1

    15:33:43.451 INFO  ProgressMeter -         chrY:3275012            140.4                104000            740.9

    15:34:44.530 INFO  ProgressMeter -         chrY:3276626            141.4                105000            742.7

    15:35:52.962 INFO  ProgressMeter -         chrY:3278525            142.5                106000            743.7

    15:37:04.926 INFO  ProgressMeter -         chrY:3279618            143.7                107000            744.5

    15:38:06.581 INFO  ProgressMeter -         chrY:3280424            144.8                108000            746.1

    15:39:43.006 INFO  ProgressMeter -         chrY:3285509            146.4                109000            744.8

    In both cases the Variants/Minute aren't looking great and infact Variants/Minute are less in the command after excluding centromere/telomeric regions.

    I really wonder if this is a generic issue only to me and nobody else has seen such a behaviour earlier.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    mehar I was wondering if there is a reason you are not using GenomicsDBImport?

    0
    Comment actions Permalink
  • Avatar
    Laura Gauthier

    This could be tough to profile/benchmark.  Can you run 3,000,000-28,000,000 for both chr1 and chrY?  Hopefully it's relatively quick.  Then I'm interested in the runtimes for the two and the number of variants in the resulting file for the two.  If that's not informative, then the next thing I would compare is the number of multi-allelics between the two.  If X and Y have a lot more alleles per site that will really slow things down when we merge variants and generate the big fat PL arrays for each sample.

    0
    Comment actions Permalink
  • Avatar
    mehar

    CombineGVCFs suceeded on chrY after 3 days but chrX didn't.

    I did run GenomicsDBImport on chrX now. This is also weird, it is showing the same log since 56hrs. Very strange :(

    $ java -version
    openjdk version "1.8.0_282"
    OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
    OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
    java -Xmx6G -jar  /gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar \
    GenomicsDBImport \
    --genomicsdb-workspace-path wgs-chrX-db \
    --sample-name-map BGIgVCF.map \
    -L chrX

    $ cat array_job_err_4419.txt

    14:18:40.740 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so

    Mar 30, 2021 2:18:41 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine

    INFO: Failed to detect whether we are running on Google Compute Engine.

    14:18:41.060 INFO  GenomicsDBImport - ------------------------------------------------------------

    14:18:41.061 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.9.0

    14:18:41.061 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/

    14:18:41.061 INFO  GenomicsDBImport - Executing as arumilli@hackman-1-20.it.helsinki.fi on Linux v5.4.0-67-generic amd64

    14:18:41.068 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_282-8u282-b08-0ubuntu1~20.04-b08

    14:18:41.068 INFO  GenomicsDBImport - Start Date/Time: March 30, 2021 2:18:40 PM EEST

    14:18:41.068 INFO  GenomicsDBImport - ------------------------------------------------------------

    14:18:41.068 INFO  GenomicsDBImport - ------------------------------------------------------------

    14:18:41.069 INFO  GenomicsDBImport - HTSJDK Version: 2.23.0

    14:18:41.069 INFO  GenomicsDBImport - Picard Version: 2.23.3

    14:18:41.069 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2

    14:18:41.069 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

    14:18:41.085 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

    14:18:41.085 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

    14:18:41.085 INFO  GenomicsDBImport - Deflater: IntelDeflater

    14:18:41.085 INFO  GenomicsDBImport - Inflater: IntelInflater

    14:18:41.085 INFO  GenomicsDBImport - GCS max retries/reopens: 20

    14:18:41.085 INFO  GenomicsDBImport - Requester pays: disabled

    14:18:41.085 INFO  GenomicsDBImport - Initializing engine

    14:18:41.757 INFO  IntervalArgumentCollection - Processing 156040895 bp from intervals

    14:18:41.758 INFO  GenomicsDBImport - Done initializing engine

    14:18:42.364 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.3.2-e18fa63

    14:18:42.389 INFO  GenomicsDBImport - Vid Map JSON file will be written to /chrX/wgs-chrX-db/vidmap.json

    14:18:42.389 INFO  GenomicsDBImport - Callset Map JSON file will be written to /chrX/wgs-chrX-db/callset.json

    14:18:42.389 INFO  GenomicsDBImport - Complete VCF Header will be written to /chrX/wgs-chrX-db/vcfheader.vcf

    14:18:42.389 INFO  GenomicsDBImport - Importing to workspace - /chrX/wgs-chrX-db

    14:18:42.389 INFO  ProgressMeter - Starting traversal

    14:18:42.390 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute

    14:18:45.652 INFO  GenomicsDBImport - Importing batch 1 with 28 samples

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi mehar,

    Please see these suggestions for performance guidelines with GenomicsDBImport. In addition, you can try these following suggestions for runtime and memory usage issues:

    1. Check memory/disk space availability on your end.

    2. Specify java memory usage using java option -Xmx.

    3. Run the gatk command with the gatk wrapper script command line.

    4. Specify a --tmp-dir that has room for all necessary temporary files.

    5. Verify this issue persists with the latest version of GATK.

    6. Check the depth of coverage of your sample at the area of interest.

    The progress meter for GATK tools does not update until the tool processes a certain number of sites. It is possible that there is a memory issue happening at the beginning of chromosome X that is causing your job to not being able to continue, either because of the density of variants or a large number of alleles per variant.

    I would check again the memory allocation you are giving this command because 6G may not be enough. As it describes in the GDBI usage and performance guidelines, you also want to make sure there is more available physical memory than the amount you specify in the Xmx parameter. How much physical memory are you setting for these commands?

    Another check you could do is try to run in the middle of chromosome X (change the interval parameter) to see if the beginning of the chromosome is causing the issue.

    Best,

    Genevieve

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk