GenomicsDBImport: Attempting to genotype more than 50 alleles
AnsweredI'm joint calling ~10K samples, where I expected there to be MANY possible alleles/genotypes. How do I increase the # of alleles supported by GenomicsDBImport?
Perhaps this one?
--genomicsdb-segment-size?
a) GATK version used
The Genome Analysis Toolkit (GATK) v4.1.8.1
HTSJDK Version: 2.23.0
Picard Version: 2.22.8
b) Exact GATK commands used
gatk \
--java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx16g" \
--spark-runner LOCAL \
GenomicsDBImport \
-L <redacted> \
--genomicsdb-workspace-path /path/to/db \
--arguments_file args.txt;
-
It looks like HTSJDK has the limit:
Here's the place GATK checks:
But someone thought it might be a command line argument in the future, but didn't wire it to the above the check above:
https://github.com/broadinstitute/gatk/blob/bc0994c180312cdca7afbe45b410b2c6fc312043/src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBArgumentCollection.java#L18-L22 -
Hi Nils Homer, I have looked into both of your requests, and unfortunately right now it is not possible to increase the number of alleles supported in GenomicsDB import. One option you might try is to look into the joint-calling WDL https://github.com/gatk-workflows. Using the gnarly genotyper (not genotype GVCFs), you will be able to run your analysis with more alleles. For your current workflow, there is not a good workaround at this point, since this limit involves more than just GATK.
-
Hi, I am in the same position. Not as many samples as Nils, but I have 37 samples that I put together with CombineGVCFs, and got the error message of more than 50 alleles when trying GenotypeGVCFs.
What should I do? Can I subset the combined-file into like ~50/50 samples for joint-genotyping or do I have to make a combined vcf file again for each of the subsets?
-
Juan Pablo Aguilar Cabezas could you share the warning/error message you are getting along with your GenotypeGVCFs command line and GATK version?
-
Hi, my GATK version is v4.2.6.1
Everything goes normal at the beginning...06:32:00.602 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
06:32:00.759 INFO GenotypeGVCFs - ------------------------------------------------------------
06:32:00.760 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.6.1
06:32:00.760 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
06:32:00.760 INFO GenotypeGVCFs - Executing as jpac1984@p0731.ten.osc.edu on Linux v3.10.0-1160.71.1.el7.x86_64 amd64
06:32:00.760 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
06:32:00.761 INFO GenotypeGVCFs - Start Date/Time: August 5, 2022 6:32:00 AM GMT
06:32:00.761 INFO GenotypeGVCFs - ------------------------------------------------------------
06:32:00.761 INFO GenotypeGVCFs - ------------------------------------------------------------
06:32:00.761 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
06:32:00.761 INFO GenotypeGVCFs - Picard Version: 2.27.1
06:32:00.762 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
06:32:00.762 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
06:32:00.762 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
06:32:00.762 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
06:32:00.762 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
06:32:00.762 INFO GenotypeGVCFs - Deflater: IntelDeflater
06:32:00.762 INFO GenotypeGVCFs - Inflater: IntelInflater
06:32:00.762 INFO GenotypeGVCFs - GCS max retries/reopens: 20
06:32:00.762 INFO GenotypeGVCFs - Requester pays: disabled
06:32:00.762 INFO GenotypeGVCFs - Initializing engine
06:32:01.062 INFO FeatureManager - Using codec VCFCodec to read file file:///fs/scratch/PHS0338/appz/sam-bams/combine.8-03.vcf.gz
06:32:01.219 INFO GenotypeGVCFs - Done initializing engine
06:32:01.249 INFO ProgressMeter - Starting traversal
06:32:01.249 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
06:32:11.264 INFO ProgressMeter - scaffold_m19_p_1_polished:216602 0.2 138000 826759.9
06:32:21.289 INFO ProgressMeter - scaffold_m19_p_1_polished:430071 0.3 287000 859281.4
06:32:31.350 INFO ProgressMeter - scaffold_m19_p_1_polished:629602 0.5 407000 811268.7
06:32:41.358 INFO ProgressMeter - scaffold_m19_p_1_polished:803478 0.7 524000 783864.0
06:32:51.423 INFO ProgressMeter - scaffold_m19_p_1_polished:986652 0.8 661000 790449.2
06:33:01.464 INFO ProgressMeter - scaffold_m19_p_1_polished:1176581 1.0 810000 807107.9
06:33:11.514 INFO ProgressMeter - scaffold_m19_p_1_polished:1373664 1.2 970000 828292.9
06:33:21.525 INFO ProgressMeter - scaffold_m19_p_1_polished:1580757 1.3 1133000 846828.4
06:33:31.603 INFO ProgressMeter - scaffold_m19_p_1_polished:1761222 1.5 1240000 823427.9
06:33:41.651 INFO ProgressMeter - scaffold_m19_p_1_polished:1933731 1.7 1354000 809147.2
06:33:51.668 INFO ProgressMeter - scaffold_m19_p_1_polished:2093926 1.8 1456000 791168.206:45:24.387 INFO ProgressMeter - scaffold_m19_p_1_polished:11647941 13.4 10073000 752523.2
06:45:34.444 INFO ProgressMeter - scaffold_m19_p_1_polished:11760892 13.6 10182000 751258.9
06:45:42.314 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:11852407
06:45:44.507 INFO ProgressMeter - scaffold_m19_p_1_polished:11877318 13.7 10295000 750311.6
06:45:49.862 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:11944851
06:45:54.541 INFO ProgressMeter - scaffold_m19_p_1_polished:12006213 13.9 10411000 749630.1
06:46:04.574 INFO ProgressMeter - scaffold_m19_p_1_polished:12129463 14.1 10531000 749248.5
08:02:42.247 INFO ProgressMeter - scaffold_m19_p_1_polished:76429628 90.7 69193000 763018.1
08:02:52.275 INFO ProgressMeter - scaffold_m19_p_1_polished:76566508 90.9 69314000 762946.3
08:03:02.338 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:76703111
08:03:02.352 INFO ProgressMeter - scaffold_m19_p_1_polished:76703358 91.0 69443000 762955.8
08:03:12.368 INFO ProgressMeter - scaffold_m19_p_1_polished:76857639 91.2 69578000 763039.5
09:04:28.471 INFO ProgressMeter - scaffold_m19_p_1_polished:132040753 152.5 117758000 772418.1
09:04:38.508 INFO ProgressMeter - scaffold_m19_p_1_polished:132181643 152.6 117884000 772397.1
09:04:48.534 INFO ProgressMeter - scaffold_m19_p_1_polished:132322502 152.8 118014000 772403.3
09:04:48.595 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:132323432
09:04:58.613 INFO ProgressMeter - scaffold_m19_p_1_polished:132460248 153.0 118139000 772372.1
09:05:08.626 INFO ProgressMeter - scaffold_m19_p_1_polished:132605483 153.1 118261000 772327.1
09:18:21.906 INFO ProgressMeter - scaffold_m19_p_1_polished:143923857 166.3 128415000 771983.2
09:18:31.964 INFO ProgressMeter - scaffold_m19_p_1_polished:144059322 166.5 128542000 771968.8
09:18:42.042 INFO ProgressMeter - scaffold_m19_p_1_polished:144198649 166.7 128674000 771982.8
09:18:50.470 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:144314281
09:18:52.092 INFO ProgressMeter - scaffold_m19_p_1_polished:144337435 166.8 128800000 771963.0
09:19:02.124 INFO ProgressMeter - scaffold_m19_p_1_polished:144478382 167.0 128933000 771986.509:28:03.781 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:151982426
-
Sorry the command is:
singularity exec /users/PHS0338/jpac1984/appz/gatk_latest.sif gatk --java-options "-Xmx145g" GenotypeGVCFs \
-R myse-hapog.fasta \
-V combine.8-03.vcf.gz \
-O gvcf.8-03.vcf.gz -
Juan Pablo Aguilar Cabezas thank you for sharing! These are just warnings, not errors. Please see this forum post for more details: https://gatk.broadinstitute.org/hc/en-us/community/posts/360061038732-WARN-MinimalGenotypingEngine-Attempting-to-genotype-more-than-50-alleles-Site-will-be-skipped-at-location-chr11-294512
-
Genevieve Brandt (she/her) You are welcome!
Yes, I know that it is not an error but since I am doing demographic inference, and those sites will be skipped, right? Then, I will have less SNPs and variant information that is the goal of my study, since many sites are skipped. That is why I asked about making subsets for joint genotyping that will be merged latter.
Any suggestions on how to make subsets to avoid those warnings?
-
Most likely the reason that you have so many genotypes as these sites is because the data is low quality there. With 37 samples, I wouldn't expect that high quality variant sites would have over 50 genotypes.
If you decrease the number of samples in your GVCF, you are losing computational power for genotyping. I wouldn't recommend that solution so you might want to spend some time to determine why you have so many genotypes at these sites.
If you do decide to break up the GVCF, you can do this with SelectVariants and select a subset of your samples with --sample-name: https://gatk.broadinstitute.org/hc/en-us/articles/5358856605339-SelectVariants
-
Hi Genevieve Brandt (she/her): I'm running the GATK joint genotyping WARP pipeline, using GATK predefined interval list on 174 human samples. While I get multiple regions with multiple genotypes I get the below exception on one sample (12345) on a bunch of sites on chrX what do you suggest i do ?
java.lang.IllegalStateException: Genotype has no likelihoods: [12345 C*/T GQ 35 DP 3 AD 1,2,0,0,0,0 {SB=[0, 1, 1, 1]}] at org.broadinstitute.hellbender.utils.GenotypeUtils.computeDiploidGenotypeCounts(GenotypeUtils.java:89) at org.broadinstitute.hellbender.tools.walkers.annotator.ExssHet.calculateEH(ExcessHet.java:96) at org.broadinstitute.hellbender.tools.walkers.annotator.ExcessHet.annotate(ExcessHet.java:84) at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.addInfoAnnotations(VariantAnnotatorEngine.java:357) at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:335) at org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotatorEngine.annotateContext(VariantAnnotatorEngine.java:306) at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.regenotypeVC(GenotypeGVCFsEngine.java:188) at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:138) at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:287) at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135) at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Iterator.forEachRemaining(Iterator.java:116) at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490) at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1095) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203) at org.broadinstitute.hellbender.Main.main(Main.java:289) Using GATK jar /gatk/gatk-package-4.3.0.0-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms8000m -Xmx25000m -jar /gatk/gatk-package-4.3.0.0-local.jar GenotypeGVCFs -R /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -O wgs_joint_genotyping_adu-sed.64.vcf.gz -D gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf -G StandardAnnotation -G AS_StandardAnnotation --only-output-calls-starting-in-intervals -V gendb://genomicsdb -L gs://batch1/JointGenotyping/265/call-SplitIntervalList/glob-d928cd0b/0064-scattered.interval_list --allow-old-rms-mapping-quality-annotation-data --merge-input-intervals
below are the intervals on the interval list which is telomericchrX 114331199 115738949 + .
chrX 115838950 116557779 + .
chrX 116595567 120879381 + .
chrX 120929382 144425606 + .
chrX 144475607 156030895 + .
Thanks
Archana
-
Hi anr,
This is a surprise to me, especially since the AD indicates 5 ALT alleles, which is manageable. How were the GVCFs generated? Was the --max-alt-alleles parameter changed from the default? It's going to be hard to process this variant with no likelihoods, so you can try to regenerate the GVCF to fix the issue, or exclude the site from the callset by modifying your -L intervals.
-Laura
Please sign in to leave a comment.
11 comments