Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

GenomicsDBImport: Attempting to genotype more than 50 alleles

Answered

  • Genevieve Brandt (she/her)

    Hi Nils Homer, I have looked into both of your requests, and unfortunately it is not currently possible to increase the number of alleles supported in GenomicsDBImport. One option is to look into the joint-calling WDL at https://github.com/gatk-workflows. Using GnarlyGenotyper (not GenotypeGVCFs), you will be able to run your analysis with more alleles. For your current workflow, there is no good workaround at this point, since this limit involves more than just GATK.

  • Juan Pablo Aguilar Cabezas

    Hi, I am in the same position. I do not have as many samples as Nils, but I have 37 samples that I combined with CombineGVCFs, and I got the "more than 50 alleles" message when running GenotypeGVCFs.

    What should I do? Can I split the combined file into two roughly equal subsets of samples for joint genotyping, or do I have to build a new combined VCF file for each subset?

  • Genevieve Brandt (she/her)

    Juan Pablo Aguilar Cabezas, could you share the warning/error message you are getting, along with your GenotypeGVCFs command line and GATK version?

  • Juan Pablo Aguilar Cabezas

    Hi, my GATK version is v4.2.6.1.
    Everything runs normally at the beginning...

    06:32:00.602 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    06:32:00.759 INFO  GenotypeGVCFs - ------------------------------------------------------------
    06:32:00.760 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.6.1
    06:32:00.760 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    06:32:00.760 INFO  GenotypeGVCFs - Executing as jpac1984@p0731.ten.osc.edu on Linux v3.10.0-1160.71.1.el7.x86_64 amd64
    06:32:00.760 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
    06:32:00.761 INFO  GenotypeGVCFs - Start Date/Time: August 5, 2022 6:32:00 AM GMT
    06:32:00.761 INFO  GenotypeGVCFs - ------------------------------------------------------------
    06:32:00.761 INFO  GenotypeGVCFs - ------------------------------------------------------------
    06:32:00.761 INFO  GenotypeGVCFs - HTSJDK Version: 2.24.1
    06:32:00.761 INFO  GenotypeGVCFs - Picard Version: 2.27.1
    06:32:00.762 INFO  GenotypeGVCFs - Built for Spark Version: 2.4.5
    06:32:00.762 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    06:32:00.762 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    06:32:00.762 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    06:32:00.762 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    06:32:00.762 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    06:32:00.762 INFO  GenotypeGVCFs - Inflater: IntelInflater
    06:32:00.762 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    06:32:00.762 INFO  GenotypeGVCFs - Requester pays: disabled
    06:32:00.762 INFO  GenotypeGVCFs - Initializing engine
    06:32:01.062 INFO  FeatureManager - Using codec VCFCodec to read file file:///fs/scratch/PHS0338/appz/sam-bams/combine.8-03.vcf.gz
    06:32:01.219 INFO  GenotypeGVCFs - Done initializing engine
    06:32:01.249 INFO  ProgressMeter - Starting traversal
    06:32:01.249 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    06:32:11.264 INFO  ProgressMeter - scaffold_m19_p_1_polished:216602              0.2                138000         826759.9
    06:32:21.289 INFO  ProgressMeter - scaffold_m19_p_1_polished:430071              0.3                287000         859281.4
    06:32:31.350 INFO  ProgressMeter - scaffold_m19_p_1_polished:629602              0.5                407000         811268.7
    06:32:41.358 INFO  ProgressMeter - scaffold_m19_p_1_polished:803478              0.7                524000         783864.0
    06:32:51.423 INFO  ProgressMeter - scaffold_m19_p_1_polished:986652              0.8                661000         790449.2
    06:33:01.464 INFO  ProgressMeter - scaffold_m19_p_1_polished:1176581              1.0                810000         807107.9
    06:33:11.514 INFO  ProgressMeter - scaffold_m19_p_1_polished:1373664              1.2                970000         828292.9
    06:33:21.525 INFO  ProgressMeter - scaffold_m19_p_1_polished:1580757              1.3               1133000         846828.4
    06:33:31.603 INFO  ProgressMeter - scaffold_m19_p_1_polished:1761222              1.5               1240000         823427.9
    06:33:41.651 INFO  ProgressMeter - scaffold_m19_p_1_polished:1933731              1.7               1354000         809147.2
    06:33:51.668 INFO  ProgressMeter - scaffold_m19_p_1_polished:2093926              1.8               1456000         791168.2

    06:45:24.387 INFO  ProgressMeter - scaffold_m19_p_1_polished:11647941             13.4              10073000         752523.2
    06:45:34.444 INFO  ProgressMeter - scaffold_m19_p_1_polished:11760892             13.6              10182000         751258.9
    06:45:42.314 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:11852407
    06:45:44.507 INFO  ProgressMeter - scaffold_m19_p_1_polished:11877318             13.7              10295000         750311.6
    06:45:49.862 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:11944851
    06:45:54.541 INFO  ProgressMeter - scaffold_m19_p_1_polished:12006213             13.9              10411000         749630.1
    06:46:04.574 INFO  ProgressMeter - scaffold_m19_p_1_polished:12129463             14.1              10531000         749248.5


    08:02:42.247 INFO  ProgressMeter - scaffold_m19_p_1_polished:76429628             90.7              69193000         763018.1
    08:02:52.275 INFO  ProgressMeter - scaffold_m19_p_1_polished:76566508             90.9              69314000         762946.3
    08:03:02.338 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:76703111
    08:03:02.352 INFO  ProgressMeter - scaffold_m19_p_1_polished:76703358             91.0              69443000         762955.8
    08:03:12.368 INFO  ProgressMeter - scaffold_m19_p_1_polished:76857639             91.2              69578000         763039.5


    09:04:28.471 INFO  ProgressMeter - scaffold_m19_p_1_polished:132040753            152.5             117758000         772418.1
    09:04:38.508 INFO  ProgressMeter - scaffold_m19_p_1_polished:132181643            152.6             117884000         772397.1
    09:04:48.534 INFO  ProgressMeter - scaffold_m19_p_1_polished:132322502            152.8             118014000         772403.3
    09:04:48.595 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:132323432
    09:04:58.613 INFO  ProgressMeter - scaffold_m19_p_1_polished:132460248            153.0             118139000         772372.1
    09:05:08.626 INFO  ProgressMeter - scaffold_m19_p_1_polished:132605483            153.1             118261000         772327.1


    09:18:21.906 INFO  ProgressMeter - scaffold_m19_p_1_polished:143923857            166.3             128415000         771983.2
    09:18:31.964 INFO  ProgressMeter - scaffold_m19_p_1_polished:144059322            166.5             128542000         771968.8
    09:18:42.042 INFO  ProgressMeter - scaffold_m19_p_1_polished:144198649            166.7             128674000         771982.8
    09:18:50.470 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:144314281
    09:18:52.092 INFO  ProgressMeter - scaffold_m19_p_1_polished:144337435            166.8             128800000         771963.0
    09:19:02.124 INFO  ProgressMeter - scaffold_m19_p_1_polished:144478382            167.0             128933000         771986.5

    09:28:03.781 WARN  MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location scaffold_m19_p_1_polished:151982426
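
To see how many sites are affected and where they are, the WARN lines in a log like the one above can be collected into a list. The following is a minimal sketch, assuming the GenotypeGVCFs log was saved to a file; the function name `skipped_sites` and the file names in the usage note are hypothetical, and the pattern relies only on the warning text shown in this thread.

```shell
# Hypothetical helper: read a GenotypeGVCFs log on standard input and print
# "contig<TAB>position" for every site reported as skipped.
skipped_sites() {
  grep 'Site will be skipped at location' |
    awk '{
      loc = $NF                      # last field, e.g. contig:11852407
      i = match(loc, /:[0-9]+$/)     # split on the final ":<digits>"
      if (i) print substr(loc, 1, i - 1) "\t" substr(loc, i + 1)
    }'
}
```

For example, `skipped_sites < genotype.log > skipped_sites.tsv` would write one line per skipped site, and `wc -l < skipped_sites.tsv` would give the total, which helps judge how much the skipped sites matter for a downstream analysis.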

  • Juan Pablo Aguilar Cabezas

    Sorry, the command is:

    singularity exec /users/PHS0338/jpac1984/appz/gatk_latest.sif gatk --java-options "-Xmx145g" GenotypeGVCFs \
       -R myse-hapog.fasta \
       -V combine.8-03.vcf.gz \
       -O gvcf.8-03.vcf.gz

  • Juan Pablo Aguilar Cabezas

    Genevieve Brandt (she/her) You are welcome!

    Yes, I know that it is not an error, but I am doing demographic inference, and those sites will be skipped, right? That means I will have fewer SNPs and less variant information, which is the goal of my study, since many sites are skipped. That is why I asked about making subsets for joint genotyping that would be merged later.

    Any suggestions on how to make subsets to avoid those warnings?

    0
    Comment actions Permalink
  • Genevieve Brandt (she/her)

    Most likely, the reason you have so many genotypes at these sites is that the data is low quality there. With 37 samples, I wouldn't expect high-quality variant sites to have over 50 genotypes.

    If you decrease the number of samples in your GVCF, you are losing computational power for genotyping. I wouldn't recommend that solution, so you might want to spend some time determining why you have so many genotypes at these sites.

    If you do decide to break up the GVCF, you can do this with SelectVariants and select a subset of your samples with --sample-name: https://gatk.broadinstitute.org/hc/en-us/articles/5358856605339-SelectVariants
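
If the GVCF is split despite the caveat above, the SelectVariants call could be assembled as in the sketch below. It only prints the command so it can be checked before running. The helper name `select_samples_cmd`, the output file name, and the sample names are hypothetical; `-R`, `-V`, `-O`, and `--sample-name` are the standard SelectVariants arguments, and the reference and input file names are taken from the command posted earlier in this thread.

```shell
# Hypothetical helper: print a SelectVariants command that keeps only the
# listed samples. Usage: select_samples_cmd REF IN_VCF OUT_VCF SAMPLE...
# The body runs in a subshell so its variables do not leak into the caller.
select_samples_cmd() (
  ref=$1
  in_vcf=$2
  out_vcf=$3
  shift 3
  cmd="gatk SelectVariants -R $ref -V $in_vcf -O $out_vcf"
  # One --sample-name argument per sample to keep.
  for s in "$@"; do
    cmd="$cmd --sample-name $s"
  done
  printf '%s\n' "$cmd"
)
```

Calling `select_samples_cmd myse-hapog.fasta combine.8-03.vcf.gz subset_A.g.vcf.gz sampleA sampleB` prints the full command for a two-sample subset; the sample names must match those in the combined file's header.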
