Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GenotypeGVCFs only accepts single combinedGVCF? CombineGVCFs & genomicsdb errors... .interval_list subset

Answered

12 comments

  • Genevieve Brandt (she/her)

    Hello Mingzhou,

    Thanks for writing into the forum! I think we can figure out this issue and get GATK working for you. I'm going to address your main concerns:

    1. I would recommend running GenomicsDBImport instead of CombineGVCFs. GenomicsDBImport is much better suited to a large number of samples like yours, and it is also much easier to add samples to a GenomicsDB workspace later if needed.
    2. We have a usage and performance guidelines document available here with many helpful tips for getting GenomicsDBImport working with your large number of samples. I would recommend setting a smaller --batch-size and potentially breaking your analysis up into intervals (a minimal command sketch follows this list).
    3. Using --intervals chr11 should be supported; could you please share your complete stack trace so I can see the error message?
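
    For reference, a minimal GenomicsDBImport command along those lines might look like the sketch below. The workspace path, sample map, and interval file here are placeholders rather than your actual file names:

    gatk --java-options "-Xmx32g" \
        GenomicsDBImport \
        --genomicsdb-workspace-path /path/to/chr11_gdb \
        --sample-name-map cohort.sample_map \
        -L chr11.interval_list \
        --batch-size 50 \
        --tmp-dir /path/to/large/tmp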

    Hope this addresses everything for now, please let me know if you have further questions and I can be of more help.

    Best,

    Genevieve

  • Mingzhou Fu

    Hi Genevieve,

    Thank you for the reply!

    1. I actually ran GenomicsDBImport over the weekend, but it got stuck and took forever. It was quick for the first 5-6 batches, but then the time per batch grew dramatically; it seemed to increase exponentially from one batch to the next. I checked the forum post about optimizing GenomicsDB (the link you attached), but unfortunately my GATK 4.1.2 doesn't have the --genomicsdb-shared-posixfs-optimizations option, and our HPC doesn't have the newest version of GATK4 installed. Is there an equivalent option to --genomicsdb-shared-posixfs-optimizations that I can add to my bash script?

    2. I chose a batch size of 50, which is recommended on the tool page. Do you think I should reduce the batch size even further? In terms of intervals, yes, I used the intervals you provided in the bundle.

    3. I figured it out. For some reason the command didn't let me specify chr11 directly and asked for an interval list file instead; --intervals chr.list (where chr.list contains only the single line chr11) worked. A minimal example is just after this list.
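
    For anyone hitting the same thing, the workaround is just to put the contig name in a one-line file and pass that file to -L/--intervals. A minimal sketch, shown here with CombineGVCFs as an example (the file names are only placeholders):

    echo "chr11" > chr.list
    gatk CombineGVCFs \
        -R ref.fasta \
        -V sample1.g.vcf.gz \
        -V sample2.g.vcf.gz \
        --intervals chr.list \
        -O combined.chr11.g.vcf.gz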

    Since GenomicsDBImport didn't work for me, I went back to CombineGVCFs, and it worked for some chromosomes. I first ran CombineGVCFs on small batches of 300 samples, and then ran CombineGVCFs again to combine all the small batches together (a sketch of the two levels is below). Sometimes I got the error "The provided variant file(s) have inconsistent references for the same position(s)", although I was 100% sure the same reference genome was used. I checked the genotypes of my small batches at the same position and did find that the reference alleles differed between some batches. However, the error went away if I doubled the batch size to 600. I was wondering how these inconsistent references could happen; it seems that CombineGVCFs sometimes puts a random reference allele at some positions.
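
    For concreteness, the two-level combining looked roughly like this (the file names are placeholders, not my real paths):

    # level 1: combine each batch of ~300 gVCFs into one per-batch gVCF
    gatk CombineGVCFs \
        -R ref.fasta \
        -V sampleA.g.vcf.gz \
        -V sampleB.g.vcf.gz \
        -O batch01.g.vcf.gz

    # level 2: combine the per-batch gVCFs into a single cohort gVCF
    gatk CombineGVCFs \
        -R ref.fasta \
        -V batch01.g.vcf.gz \
        -V batch02.g.vcf.gz \
        -O cohort.g.vcf.gz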

    After CombineGVCFs, I used chr21 to test GenotypeGVCFs and it worked pretty well. However, at the VariantRecalibrator stage I sometimes got a "no data" error even with ~3000 samples. Every time this error occurred, the .tranches file had zero size while the .recal file was intact. Following some GATK threads, I reduced --max-gaussians by 1-2 and it always resolved the error. Why would a --max-gaussians value that is too large lead to the "no data" error, when that error is supposedly caused by not having enough data?
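
    For reference, the fix was just lowering --max-gaussians (default 8) in the SNP recalibration command, along these lines; the resource files below are placeholders for whichever bundle files are used:

    gatk VariantRecalibrator \
        -R ref.fasta \
        -V cohort.chr21.vcf.gz \
        --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.vcf.gz \
        --resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg38.vcf.gz \
        --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
        --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf \
        -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
        -mode SNP \
        --max-gaussians 6 \
        -O cohort.chr21.snp.recal \
        --tranches-file cohort.chr21.snp.tranches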

    Sorry, I just keep having questions and problems one after another...

    Thank you very much!

  • Genevieve Brandt (she/her)

    Hi Mingzhou Fu,

    1. I would really recommend sticking with GenomicsDBImport and working out how to speed it up, because CombineGVCFs won't be any better. If you can get your HPC to install the newer version of GATK for that option, I think you will see a lot of improvement; there is no equivalent option in older versions of GATK. If you are seeing an exponential time increase, try giving the job more memory and also make sure you are specifying a temporary directory.
    2. You can try decreasing the batch size. The overall run will take longer, but it might help with the issue where the job slows down dramatically.
    3. Glad you figured it out!

    I'm not exactly sure why you are getting the "no data" error. It could be because you are only running it on chr21. You can try to run it on a different chromosome and see if you get the same issue.

    Best,

    Genevieve

  • Mingzhou Fu

    Hi Genevieve,

    Thank you very much for the reply! I will ask the HPC staff to install the newer version of GATK. 

  • Mingzhou Fu

    Hi Genevieve & others in the GATK team,

    My GenomicsDBImport still takes forever to run, even with the newest version of GATK (4.2.1.0) and the optimization options you suggested before. Could you please have a look at my exact command and let me know what to do to speed up the process? Thank you!

    command:

    Note: I have ~3000 WES samples to combine, and I'm running the import in parallel by chromosome (one job per chromosome). For the intervals, I have at most 22232 intervals on a single chromosome.

    gatk --java-options "-Xmx120g -Xms120g -XX:ParallelGCThreads=3" \
        GenomicsDBImport \
        --genomicsdb-workspace-path ${out_dir}/${chr}_gdb \
        -R $ref_genome \
        --batch-size 30 \
        -L $target_union \
        --interval-padding 10 \
        --sample-name-map $cohort_map \
        --tmp-dir $tmp \
        --max-num-intervals-to-import-in-parallel 3 \
        --genomicsdb-shared-posixfs-optimizations true \
        --merge-input-intervals true

    It worked for the first several batches (1-5), but later batches took longer and longer to import. For example, it took ~1 min to import batch 6, ~40 min to import batch 7, and even longer to import batch 8...

    Thank you!

  • Genevieve Brandt (she/her)

    Mingzhou Fu, how much space is available in the temp directory? How many intervals do you have? And how much physical memory is available for this job?

  • Mingzhou Fu

    Hi Genevieve, 

    Thank you for the timely reply! The tmp directory has an upper limit of 1 TB. In terms of intervals, for one job/run there are at most 22232 intervals. In terms of memory, I asked for 150 GB for the entire run and allocated 120 GB to GATK.

  • Genevieve Brandt (she/her)

    Great, thanks for that info. I'll speak to our GenomicsDB experts to figure out how we can get this working for you.

  • Genevieve Brandt (she/her)

    Mingzhou Fu I have some more questions:

    1. It looks like you are creating one workspace per chromosome; is that the case? Could you share what the workspace directory looks like, to confirm that there is a single folder there (since you are using --merge-input-intervals)?
    2. Is the 120 GB for Java and 150 GB for the entire run for all chromosomes, or for one chromosome? Is it possible to monitor the job on your cluster to see how much memory is being used while the import is running?
    3. Can you share the complete log of the run with the command you shared above?
    4. Is there any data outside of your interval regions?
    5. Are any of your samples significantly different from the others, in terms of how much data they have or anything else? Is there a reason why some batches would be slower?

    Thank you,

    Genevieve

  • Mingzhou Fu

    Hi Genevieve, thank you for the timely reply! 

    1. Yes, I'm creating one workspace per chromosome. I have directories chr1_gdb, chr2_gdb, ..., chr22_gdb.

    cd chr1_gdb:

    drwxr-sr-x 10 mf3903 chaklab  4096 Aug 10 15:33 chr1$12180$248918434
    -rwxr-xr-x  1 mf3903 chaklab     0 Aug 10 13:02 __tiledb_workspace.tdb
    -rwxr-xr-x  1 mf3903 chaklab 23492 Aug 10 13:02 vcfheader.vcf
    -rwxr-xr-x  1 mf3903 chaklab 19696 Aug 10 13:02 vidmap.json

    2. Yes, that is for one chromosome. I submit chr1-22 as a SLURM job array:

    #!/bin/bash
    #SBATCH --job-name=GenomicsDBImport           # Job name
    #SBATCH --mail-type=END,FAIL                  # Mail events (NONE, BEGIN, END, FAIL, ALL)
    #SBATCH --mail-user=mf3903@nyu.edu            # Where to send mail
    #SBATCH --nodes=1                             # Run on a single node
    #SBATCH --ntasks=1                            # Run a single task
    #SBATCH --cpus-per-task=3                     # Must match the number of threads used by the tool
    #SBATCH --mem=150gb                           # Job memory request
    #SBATCH --time=5-00:00:00                     # Time limit (days-hrs:min:sec)
    #SBATCH --output=log/GenomicsDBImport_%j.log  # Standard output and error log
    #SBATCH --partition=cpu_medium,cpu_long
    #SBATCH --array=1-22

    I'm not sure exactly how much memory is being used by that specific command/process, since I have multiple tasks running and can't easily find the PID of the particular import process. I used top and pmap -x and found that the process with the largest memory usage was at ~19 GB. (A generic SLURM-side way to check peak memory is sketched in the P.S. at the end of this comment.)

    3. I stopped all the other chromosomes and kept only chr1 running. Here's the log. You can see that it worked for the first several batches, but batch 10 took a long time and it is now stuck at batch 11. It is midnight, and the import has been stuck since around 7 PM.

    (base) [mf3903@bigpurple-ln2 log]$ cat GenomicsDBImport_test_14053426.log
    _START_ Tue Aug 10 16:58:47 EDT 2021
    Processing array index: 1 subsample_name for chr1
    16:58:51.951 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gpfs/share/apps/gatk/4.2.1.0/gatk/build/libs/gatk-package-4.2.1.0-2-g796af9f-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 10, 2021 4:58:52 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    16:58:52.114 INFO  GenomicsDBImport - ------------------------------------------------------------
    16:58:52.114 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.2.1.0-2-g796af9f-SNAPSHOT
    16:58:52.114 INFO  GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    16:58:52.114 INFO  GenomicsDBImport - Executing as mf3903@cn-0035 on Linux v3.10.0-693.17.1.el7.x86_64 amd64
    16:58:52.115 INFO  GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-b12
    16:58:52.115 INFO  GenomicsDBImport - Start Date/Time: August 10, 2021 4:58:51 PM EDT
    16:58:52.115 INFO  GenomicsDBImport - ------------------------------------------------------------
    16:58:52.115 INFO  GenomicsDBImport - ------------------------------------------------------------
    16:58:52.119 INFO  GenomicsDBImport - HTSJDK Version: 2.24.1
    16:58:52.119 INFO  GenomicsDBImport - Picard Version: 2.25.4
    16:58:52.119 INFO  GenomicsDBImport - Built for Spark Version: 2.4.5
    16:58:52.119 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:58:52.119 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:58:52.119 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:58:52.119 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:58:52.119 INFO  GenomicsDBImport - Deflater: IntelDeflater
    16:58:52.119 INFO  GenomicsDBImport - Inflater: IntelInflater
    16:58:52.120 INFO  GenomicsDBImport - GCS max retries/reopens: 20
    16:58:52.120 INFO  GenomicsDBImport - Requester pays: disabled
    16:58:52.120 INFO  GenomicsDBImport - Initializing engine
    16:58:52.861 INFO  FeatureManager - Using codec IntervalListCodec to read file file:///gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/target_region/chr1.interval_list
    16:58:53.265 INFO  IntervalArgumentCollection - Processing 6099894 bp from intervals
    16:58:53.483 INFO  GenomicsDBImport - Done initializing engine
    16:58:54.014 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.1-d59e886
    16:58:54.016 INFO  GenomicsDBImport - Vid Map JSON file will be written to /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG/GenomicsDBImport/chr1_tested_gdb/vidmap.json
    16:58:54.016 INFO  GenomicsDBImport - Callset Map JSON file will be written to /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG/GenomicsDBImport/chr1_tested_gdb/callset.json
    16:58:54.016 INFO  GenomicsDBImport - Complete VCF Header will be written to /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG/GenomicsDBImport/chr1_tested_gdb/vcfheader.vcf
    16:58:54.016 INFO  GenomicsDBImport - Importing to workspace - /gpfs/data/chaklab/data/HSCR_WES_2020_MF/Psomagen_broad_Macrogen_all/combined_gVCF_Psomagen_broad_Macrogen_1KG/GenomicsDBImport/chr1_tested_gdb
    16:58:58.294 INFO  GenomicsDBImport - Importing batch 1 with 30 samples
    17:02:47.874 INFO  GenomicsDBImport - Done importing batch 1/112
    17:02:51.532 INFO  GenomicsDBImport - Importing batch 2 with 30 samples
    17:06:51.830 INFO  GenomicsDBImport - Done importing batch 2/112
    17:06:55.556 INFO  GenomicsDBImport - Importing batch 3 with 30 samples
    17:10:54.768 INFO  GenomicsDBImport - Done importing batch 3/112
    17:10:58.491 INFO  GenomicsDBImport - Importing batch 4 with 30 samples
    17:14:48.746 INFO  GenomicsDBImport - Done importing batch 4/112
    17:14:52.532 INFO  GenomicsDBImport - Importing batch 5 with 30 samples
    17:18:37.171 INFO  GenomicsDBImport - Done importing batch 5/112
    17:18:41.027 INFO  GenomicsDBImport - Importing batch 6 with 30 samples
    17:22:42.222 INFO  GenomicsDBImport - Done importing batch 6/112
    17:22:46.720 INFO  GenomicsDBImport - Importing batch 7 with 30 samples
    17:26:35.287 INFO  GenomicsDBImport - Done importing batch 7/112
    17:26:38.977 INFO  GenomicsDBImport - Importing batch 8 with 30 samples
    17:30:00.776 INFO  GenomicsDBImport - Done importing batch 8/112
    17:30:04.554 INFO  GenomicsDBImport - Importing batch 9 with 30 samples
    17:33:47.628 INFO  GenomicsDBImport - Done importing batch 9/112
    17:33:52.570 INFO  GenomicsDBImport - Importing batch 10 with 30 samples
    19:29:52.724 INFO  GenomicsDBImport - Done importing batch 10/112
    19:29:56.753 INFO  GenomicsDBImport - Importing batch 11 with 30 samples

    4. Yes, I think so. What I'm actually doing is combining ~1000 WES samples (as gVCFs) with ~2000 1KG 30x WGS samples (as gVCFs), with specific intervals specified. The gVCFs of my own 1000 WES samples were generated by me with HaplotypeCaller, and each is around 200 MB in size. The gVCFs of the 30x WGS samples were generated by the New York Genome Center (also with HaplotypeCaller) and were downloaded from Globus or Terra; each of those gVCFs is around 6 GB in size. Another thing is that I took the union of all the target intervals with 10 bp padding for my WES samples (since they were sequenced at different facilities with somewhat different target regions). I think the actual sequencing used bait regions/intervals, which could be a little larger than the target regions.

    5. Yes, there are samples significantly larger than others: the 1KG 30x WGS gVCFs are much larger than the gVCFs of my own WES samples. However, I think my batches are more or less the same. I shuffled the list of gVCF samples (my own + 1KG) before importing and checked that my own gVCFs are well mixed with the 1KG ones.

    Thank you very much!
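
    P.S. Regarding the memory question in #2: I haven't tried it yet, but SLURM accounting should be able to report the job's peak memory directly, without hunting for the PID. Something like the following (assuming sstat/sacct are enabled on the cluster; the job ID is a placeholder):

    # peak resident memory of a running job step
    sstat --format=JobID,MaxRSS,MaxVMSize -j <jobid>.batch

    # or, after the job has finished
    sacct --format=JobID,MaxRSS,Elapsed -j <jobid>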

  • Genevieve Brandt (she/her)

    Thank you! This is really helpful in determining why it is running so long.

    Our team thinks the issue comes from mixing your WGS and WES data while using --merge-input-intervals. If you had only WES samples, there would likely be no data between the intervals, so --merge-input-intervals would help speed things up. In your case, however, there is a lot of WGS data between the intervals (you can also see this from how different the file sizes are). Merging the input intervals with WGS samples present is probably what is greatly increasing the runtime.

    The WGS and WES samples might not actually be shuffled evenly amongst the batches, because GenomicsDBImport sorts the samples lexicographically; the order of the sample map doesn't matter. If you want to confirm that the WGS samples are the culprit for your runtime, you can check which samples are being imported in each batch with the option --verbosity DEBUG.

    There isn't an easy way to speed this up with the samples you have. Since you already got CombineGVCFs to work, you could go forward with that option and we can troubleshoot the other problem you had. Another, more time-consuming option would be to manually merge the intervals into as few intervals as possible (fewer than 100 would be best, but even around 1000 would definitely help). You'll want to balance lowering the number of intervals against adding as few extra loci as possible. If you do end up doing this, you can remove --merge-input-intervals true from your command; one possible way to do the merging is sketched below.
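
    As one possible way to do that merging (this is only a sketch and assumes bedtools is available, which hasn't come up in this thread; the 1000 bp gap is an illustrative threshold you would tune to balance interval count against extra loci):

    # Convert a Picard-style interval_list to BED (drop the @ header lines, shift start to 0-based),
    # then collapse intervals separated by gaps of up to 1000 bp into single larger intervals.
    grep -v '^@' chr1.interval_list \
        | awk 'BEGIN{OFS="\t"} {print $1, $2-1, $3}' \
        | sort -k1,1 -k2,2n \
        | bedtools merge -d 1000 -i - \
        > chr1.merged.bed

    The merged BED file can then be passed to -L in place of the original interval list.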

    Hope this helps with your research, let me know if there is anything else you need.

    Best,

    Genevieve

  • Mingzhou Fu

    Hi Genevieve,

    Thank you so much for the explanation! You really helped me understand GenomicsDBImport much better :) I guess sometimes we have to live with imperfect data... I actually tried to reduce the number of intervals, but our HPC has a limit on the maximum number of jobs, and it was hard to balance efficiency against wait time, as you said.

    I will try to remove --merge-input-intervals true and add --verbosity DEBUG as you suggested. 

    Thanks again!

    Mingzhou

