Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

GenotypeGVCFs stalls while using --all-sites

Answered
0

37 comments

  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay have you tried increasing your memory allocation to be larger (--java-options "-Xmx8G") ? If there is an issue with memory allocation another trick is to specify a temporary directory with the option --tmp-dir

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Hi Genevieve-Brandt-she-her - thanks for the response! We've tried with up to 96G for java and both with and without the --tmp-dir option. Any other ideas?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay since it stalls at the same location each time, are there many alleles at that location? It might be useful to check out DepthofCoverage or another tool and look at that area in the genome. 

    Another thing you can try is to update your GATK version since we continually improve the tools.

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Hi Genevieve-Brandt-she-her, I don't think it's the location, since we stall at the exact same position (200000) when we try a different chromosome. I've also tried updating my GATK version and same issue. We're really stuck here, any help would be great!

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay have you had any successful runs using --all-sites with any of your chromosomes? Or do they all fail at position 200000?

    Does the job fail, or just stall? And for how long?

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Genevieve-Brandt-she-her I have only tried the first 2 chromosomes, but they both stalled at position 200000. The jobs do not actually fail, they just stall. They reach 200000 quickly (an hour or so) and I let them run for 24 hours.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay thank you for this info. I am continuing to look into solutions and will let you know when I get more information.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Rachael Bay

    Sometimes early positions in the chromosome do not have data and we are wondering if this is an issue with writing an enormous amount of information once it starts outputting data.

    • Could you try writing to a .gz VCF file? It should have better performance.
    • How big is your genomicsDB, how many samples are you working with?
    • Could you give more information about your machine that you are running this on? How much memory is available to the machine and is it the exact amount that you are giving for the java memory? The machine will need slightly more memory than you give in the java options to account for the C++ memory.
    • Could you start in the middle of the chromosome and see if it runs? Perhaps from position 500,000 to 1,000,000
    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Hi Genevieve-Brandt-she-her,

    *I tried writing to a .gz file - same issue

    *The genomicsDB directory is large - looks like 793G. We have 471 individuals.

    *I am running on the Bridges cluster through the NSF XSEDE program. I have tried both regular memory and large memory. On large memory I requested 128G from the machine while asking for 96 in the java options, so there should be extra memory there.

    *Does the genomicsDB have to be remade to start in the middle of the chromosome? I tried specifying the interval you suggested above with Haplotype caller, but it JUST stalled (never got the progress meter showing position)

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Rachael Bay, thank you for those updates. I have some follow up questions to try to get to the bottom of this. 

    • We want to determine if this is an issue with the genomicsDB or GenotypeGVCFs with --all-sites. Could you run SelectVariants at position 200k-201k and see if it gets any results?
    • What species are you working with? Is there anything weird happening at the reference at position 200k?
    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Hi Genevieve-Brandt-she-her. When I look at that region, it actually looks like the last position is 1999988. This is exactly the same across multiple chromosomes (I have tried three of six). I am working with Zostera marina, but I don't really see why there would be something weird happening at that position in multiple chromosomes.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay 

    Did you run SelectVariants yet?

    I am confused about what you mean by the last position is 1999988, since we are looking into position 200K? Am I misunderstanding, or did you possibly make a typo?

     

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Genevieve-Brandt-she-her Sorry, it was a typo sort of. Just looking right now. The last position recorded for Chr01 is 199988 and for Chr03 is 199998, so not EXACTLY the same position. Let me know if it would be at all useful for me to try this on the other chromosomes. I'm really at a loss for what else to try!

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Also, I guess I'm confused about how running SelectVariants would see anything different since it would just return a subset of the raw vcf file, right?

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay thanks for clarifying! 

    Running SelectVariants, though not helpful for your final result, helps us in our troubleshooting of this issue. We are trying to determine if this runtime issue is from the GenomicsDB or if it is from GenotypeGVCFs. 

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Genevieve-Brandt-she-her thanks, that make sense. When I run SelectVariants for Chr01:200000-201000 it stalls with no variants output.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Rachael Bay thanks for the update, I'll bring this up with my team and let you know when I have more information

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Rachael Bay, here are a few more things that you could check that would give us more information:

    1. While you run GenotypeGVCFs and it is stalled, could you inspect the java process and what is going on? The command is jstack with the process ID: (https://docs.oracle.com/en/java/javase/13/docs/specs/man/jstack.html). Look at the jstack a few different times to try to figure out what is going on while it is stalled.
    2. Could you give more information about how you ran the import to your GenomicsDB? Was it in batches?
    3. Could you create a new GenomicsDB workspace with all the individuals on a small interval, for example 195,000-200,000 and see if the same issue persists?

    And something to try for a possible improvement:

    1. What version of GATK did you use to create the GenomicsDB? We have been improving GenomicsDB quite a bit and our most up to date version is 4.1.9.0. If you re-run the import you may see better performance and it may help the GenotypeGVCFs process. [You will need to use the option --genomicsdb-shared-posixfs-optimizations for your cluster]
    0
    Comment actions Permalink
  • Avatar
    Andrius Jonas Dagilis

    Hi,

    Reviving this thread since I've run into a nearly identical issue. I am running GATK 4.1.9.0, and have a data-set of 807 flies for which I am trying to genotype individual chromosomes using the all-sites option. Each chromosome is in a separate database folder, and each chromosome stalls roughly 200,000 bases in. I'm running the following commands for GenotypeGVCF:

    gatk --java-options "-Xmx190g" GenotypeGVCFs \
    -all-sites \
    -R /proj/matutelb/projects/drosophila/melanogaster/dmel6_ref.fasta \
    -V gendb://all_mels_chr2L \
    -L chr2L \
    -O all_mels_chr2L.vcf.gz

    This is on a large memory node on our cluster, for which I request 200g of memory to give some overhead. 

    I ran GenomicsDBImport in batches of 50, have not yet tried subsampling the chromosomes into smaller chunks for database/genotypegvcf. Currently trying to login to the nodes and take a look at jstack, will update once I have some results.

    Mainly reviving the thread to see if a solution was ever found.

    Cheers,

    Andrius J. Dagilis

    0
    Comment actions Permalink
  • Avatar
    Rachael Bay

    Hi Andrius,

    Sorry, this is really frustrating and I'm not sure I have a good answer for you. I did finally get it to work, but only because our lab switched to a new cluster. It then ran without any issue. I hope you can find a solution!

    Best,

    Rachael

    1
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the update Rachael Bay!

    Andrius Jonas Dagilis my first recommendation would be use a lower percentage of the available physical memory for your GenotypeGVCFs Java Xmx value. Since you have allocated 200G, you'll want your Xmx value between 160-180. 

    We now have an article with all of our performance recommendations using GenomicsDB here: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GenomicsDBImport-usage-and-performance-guidelines

     

    0
    Comment actions Permalink
  • Avatar
    Andrius Jonas Dagilis

    Thanks Genevieve-Brandt-she-her

    I've started another job using a different amount of memory, but since it's still in queue, here's the jstack output from one of the jobs:

    Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed mode):

    "Attach Listener" #11 daemon prio=9 os_prio=0 tid=0x00007fa394001000 nid=0x1dbb6 waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "pool-2-thread-1" #10 prio=5 os_prio=0 tid=0x00007fd3653a4000 nid=0x48e6 runnable [0x00007fa398188000]
    java.lang.Thread.State: RUNNABLE
    at org.genomicsdb.importer.GenomicsDBImporterJni.jniImportBatch(Native Method)
    at org.genomicsdb.importer.GenomicsDBImporter.doSingleImport(GenomicsDBImporter.java:588)
    at org.genomicsdb.importer.GenomicsDBImporter.lambda$null$2(GenomicsDBImporter.java:703)
    at org.genomicsdb.importer.GenomicsDBImporter$$Lambda$84/1713988669.get(Unknown Source)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007fd36414a000 nid=0x4661 runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007fd364147000 nid=0x4660 waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007fd364145800 nid=0x465f waiting on condition [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007fd364136800 nid=0x465e runnable [0x0000000000000000]
    java.lang.Thread.State: RUNNABLE

    "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fd36410a000 nid=0x465d in Object.wait() [0x00007fa3ac65e000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00007fb3aa15bc70> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    - locked <0x00007fb3aa15bc70> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

    "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fd364105800 nid=0x4659 in Object.wait() [0x00007fa3ac75f000]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00007fb3aa1596d8> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x00007fb3aa1596d8> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

    "main" #1 prio=5 os_prio=0 tid=0x00007fd364054800 nid=0x4652 waiting on condition [0x00007fd36bc0e000]
    java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007fb3aae4fad8> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
    at org.genomicsdb.importer.GenomicsDBImporter$$Lambda$85/1662312252.apply(Unknown Source)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
    at org.genomicsdb.importer.GenomicsDBImporter.iterateOverSamplesInBatches(GenomicsDBImporter.java:679)
    at org.genomicsdb.importer.GenomicsDBImporter.executeImport(GenomicsDBImporter.java:637)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:748)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1049)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    "VM Thread" os_prio=0 tid=0x00007fd3640fb800 nid=0x4656 runnable

    "VM Periodic Task Thread" os_prio=0 tid=0x00007fd36414c800 nid=0x4662 waiting on condition

    JNI global references: 428


    Out of curiosity, I also ran a job without including all-sites, and that seems to be running just fine, currently well past 5 Mb, so definitely an issue with the -all-sites option specifically. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Andrius Jonas Dagilis Let me know what you find with the run that is going! Hopefully the memory change will help. 

    It makes sense that -all-sites would greatly affect the resources needed. It usually adds a significant number of sites for analysis.

    0
    Comment actions Permalink
  • Avatar
    Andrius Jonas Dagilis

    Hey Genevieve-Brandt-she-her, I let a few different jobs try and run over the weekend and they have all stalled at 200kb in. I have tried:

    Giving the node more memory overhead (190 for GATK, 260 for the job)
    Giving more overhead with the same memory (160 for GATK, 200 for the job)
    Creating a region database for a smaller portion of a chromosome (the first 10Mb)

    And the one thing that worked, but doesn't seem like a good solution:
    Use only half of the individuals to create a database. 

    I'm tempted to split the calling into separate calls for different geographic regions, but we're asking, in part, how much gene-flow between populations there is, so joint-calling would be preferred. 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the update Andrius Jonas Dagilis. There have been some GenomicsDB changes in the most recent versions of GATK (we are currently on 4.2.4.1) that might help you out here. First, use the option --genomicsdb-shared-posixfs-optimizations, which can help performance in shared filesystems. 

    Another option to make sure you can keep all the individuals would be to decrease the --max-alternate-alleles option in GenotypeGVCFs. The default is 6. If you are using -all-sites and many of your sites have a high number of alternate alleles, this can cause major slow downs with GenotypeGVCFs. One caveat to this option: this option was broken before 4.2.4.1 so you will definitely have to update)

    0
    Comment actions Permalink
  • Avatar
    Andrius Jonas Dagilis

    Updating to 4.2.4.1 has at least changed the errors, so that's something.

    I created the databases using the -genomicsdb-shared-poixfs-optimizations parameter, and attempted to run GenotypeGVCFs using several different cutoffs for the --max-alternate-alleles option, going from 5, to 4, to 3. In all cases, I am still getting errors and crashing out, but now for entirely new reasons. For each chromosome, I get an error of the type:


    Genotype [LA69_2nd_aliq AGTGAGACCAATATCTAGG|AGTGAGACCAATATCTAGG GQ 3 DP 1 AD 0,0,1,0,0,0,0 {PGT=0|1, PID=5669_G_A, PS=5669, SB=0,0,0,1}] does not contain likelihoods necessary to calculate posteriors.


    At first I thought perhaps this info was missing for a particular individual and therefore causing errors. However, this information seems to be missing for different individuals for different chromosomes, and changes depending on the --max-alternate-alleles option I use.

    One big caveat - I have not re-generated the gvcfs for each sample using GATK 4.2.4.1, so just in case that's at fault I am re-running HaplotypeCaller in case that makes a difference.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Andrius Jonas Dagilis can you post the entire program log that contains that error? [You can delete some lines if there are too many duplicates, but it would be helpful to see the complete output]

    0
    Comment actions Permalink
  • Avatar
    Andrius Jonas Dagilis

    Here you go - tried running without the --max-alternate-alleles option and had the same result, so I don't think that option is causing it.

    Using GATK jar /nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx190g -jar /nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar GenotypeGVCFs -R /proj/matutelb/projects/drosophila/melanogaster/dmel6_ref.fasta -V gendb://all_mels_chr2L -L chr2L -O all_mels_chr2L.vcf.gz
    19:40:48.803 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Jan 13, 2022 7:40:50 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    19:40:50.088 INFO GenotypeGVCFs - ------------------------------------------------------------
    19:40:50.089 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.4.1
    19:40:50.090 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    19:40:50.102 INFO GenotypeGVCFs - Executing as adagilis@t0601.ll.unc.edu on Linux v3.10.0-1160.2.2.el7.x86_64 amd64
    19:40:50.103 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_292-b10
    19:40:50.103 INFO GenotypeGVCFs - Start Date/Time: January 13, 2022 7:40:48 PM EST
    19:40:50.103 INFO GenotypeGVCFs - ------------------------------------------------------------
    19:40:50.103 INFO GenotypeGVCFs - ------------------------------------------------------------
    19:40:50.104 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
    19:40:50.104 INFO GenotypeGVCFs - Picard Version: 2.25.4
    19:40:50.104 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
    19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    19:40:50.105 INFO GenotypeGVCFs - Deflater: IntelDeflater
    19:40:50.105 INFO GenotypeGVCFs - Inflater: IntelInflater
    19:40:50.105 INFO GenotypeGVCFs - GCS max retries/reopens: 20
    19:40:50.105 INFO GenotypeGVCFs - Requester pays: disabled
    19:40:50.105 INFO GenotypeGVCFs - Initializing engine
    19:40:51.727 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
    19:41:10.142 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    19:41:10.143 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    19:41:10.143 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    19:41:28.066 INFO IntervalArgumentCollection - Processing 23513712 bp from intervals
    19:41:28.137 INFO GenotypeGVCFs - Done initializing engine
    19:41:28.487 INFO ProgressMeter - Starting traversal
    19:41:28.488 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    19:42:19.471 INFO ProgressMeter - chr2L:5364 0.8 1000 1176.9
    19:42:32.573 INFO ProgressMeter - chr2L:6364 1.1 2000 1872.5
    19:42:48.677 INFO ProgressMeter - chr2L:8364 1.3 4000 2992.9
    19:43:03.248 INFO ProgressMeter - chr2L:10364 1.6 6000 3799.1
    Chromosome chr2L position 11353 (TileDB column 11352) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:43:18.779 INFO ProgressMeter - chr2L:12364 1.8 8000 4352.1
    Chromosome chr2L position 13464 (TileDB column 13463) has too many alleles in the combined VCF record : 11 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:43:32.483 INFO ProgressMeter - chr2L:14364 2.1 10000 4838.9
    19:43:43.998 INFO ProgressMeter - chr2L:16364 2.3 12000 5313.3
    Chromosome chr2L position 16479 (TileDB column 16478) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    Chromosome chr2L position 17004 (TileDB column 17003) has too many alleles in the combined VCF record : 13 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:43:57.857 INFO ProgressMeter - chr2L:17364 2.5 13000 5222.0
    Chromosome chr2L position 18754 (TileDB column 18753) has too many alleles in the combined VCF record : 9 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    Chromosome chr2L position 19311 (TileDB column 19310) has too many alleles in the combined VCF record : 16 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:44:17.535 INFO ProgressMeter - chr2L:19364 2.8 15000 5324.0
    Chromosome chr2L position 19835 (TileDB column 19834) has too many alleles in the combined VCF record : 11 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:44:34.904 INFO ProgressMeter - chr2L:21364 3.1 17000 5471.6
    19:44:47.867 INFO ProgressMeter - chr2L:23364 3.3 19000 5717.8
    Chromosome chr2L position 25349 (TileDB column 25348) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

    19:45:00.952 INFO GenotypeGVCFs - Shutting down engine
    GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),57.82864317899994,Cpu time(s),49.25545264700008
    [January 13, 2022 7:45:01 PM EST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 4.22 minutes.
    Runtime.totalMemory()=2076049408
    java.lang.IllegalStateException: Genotype [Ark CTTT/CTTT GQ 24 DP 8 AD 0,8,0,0,0,0,0,0 {SB=0,0,4,4}] does not contain likelihoods necessary to calculate posteriors.
    at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.log10NormalizedGenotypePosteriors(AlleleFrequencyCalculator.java:89)
    at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.effectiveAlleleCounts(AlleleFrequencyCalculator.java:258)
    at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.calculate(AlleleFrequencyCalculator.java:141)
    at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:147)
    at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.calculateGenotypes(GenotypeGVCFsEngine.java:244)
    at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.regenotypeVC(GenotypeGVCFsEngine.java:152)
    at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:135)
    at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:283)
    at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    (base) [adagilis@longleaf-login4 logs]$
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490)
    at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Andrius Jonas Dagilis,

    Thanks for sharing this. It seems like there might be a bug in this newest GenotypeGVCFs version so I have created a ticket on the GATK github repo: https://github.com/broadinstitute/gatk/issues/7639. I'll follow up next week regarding what the team finds.

    Are you able to test to see what version this problem originated in? This would help us to narrow down the problem.

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Eric C. Anderson

    Hi Genevieve, thank you for working on this. And thank you Andrius for posting the error.

    I have the same problem.  It is not just associated with --all-sites. The error seems to occur when the number of alleles is exactly one greater than the value of ` --max-alternate-alleles`.  Note that is the case in Andrius's output: it throws a warning, but no error when there are more than 7 alleles, but when there are exactly 7 alleles (one more than the default max of 6) it errors out. 

    I have tried this on a number of different chromosomes (small test case with 4 Chinook salmon) and always get the error immediately after a warning about having 7 alleles.  On one chromosome, an earlier warning about 8 alleles causes no error, UNLESS I set  `--max-alternate-alleles` to 7, in which case, the position with 8 alleles causes the error immediately after the warning.

    I will copy this the GitHub issue.  It seems like it should be pretty easy to find and correct this bug, though I couldn't find it in the source code.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk