GenomicsDBImport error - non-standard non-IUPAC base
Hi there,
I am trying to output a multisample VCF from a GenomicsDB workspace.
The GenomicsDB workspace was created from 108 g.vcf files, which were in turn generated using the Clara Parabricks accelerated germline pipeline (v4.0.1). The commands used are:
#to make the DB
gatk --java-options -Xmx150G GenomicsDBImport --genomicsdb-workspace-path germline_0323 --sample-name-map map.map -L intervals.list --reader-threads 20 --max-num-intervals-to-import-in-parallel 3
#to make final VCF
gatk GenotypeGVCFs -R ../ref/Genome.fasta -V gendb://germline_0323 -O output.vcf --max-alternate-alleles 5
The bottom ~100 lines of the log (it is very long) are below. When I run GenotypeGVCFs on a merged GVCF (the old way), the same error occurs at a very close position (CM023248:28015217 vs CM023248:28016217).
I have searched the genome for non-IUPAC characters and found none. Similarly, when I manually examine a few thousand bp in this region, all of the characters are A, C, T or G.
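For anyone wanting to repeat that search, here is a minimal Python sketch of one way to do it. The function name and the exact set of accepted characters are my own assumptions for illustration, not something taken from GATK. Note that a scan like this only inspects characters that are present in the file; it says nothing about sequence that is missing.

```python
from collections import Counter

# IUPAC nucleotide codes. Lower-case (soft-masked) bases are folded to upper case.
# Assumption: this strict set is what we want to accept; adjust if needed.
IUPAC = set("ACGTURYSWKMBDHVN")

def non_iupac_characters(fasta_lines):
    """Count characters on sequence lines that are not IUPAC nucleotide codes."""
    bad = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        for ch in line.rstrip("\r\n"):
            if ch.upper() not in IUPAC:
                bad[ch] += 1
    return bad

# Usage on a real file (path taken from the commands above):
#   with open("../ref/Genome.fasta") as handle:
#       print(non_iupac_characters(handle))
```

An empty counter means every sequence character in the file was an IUPAC code.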
Do you think it may be to do with the warnings about there being too many alternative alleles?
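For context on the numbers in those warnings: the 1081 reported for (num_alleles, ploidy) = (46, 2) matches the number of unordered genotypes that can be formed from 46 alleles at ploidy 2, which is the length of a PL field. A small sketch of that arithmetic (the function name is my own):

```python
from math import comb

def genotype_count(num_alleles, ploidy):
    """Number of unordered genotypes: combinations with repetition."""
    return comb(num_alleles + ploidy - 1, ploidy)

print(genotype_count(46, 2))  # 1081, as reported in the warnings
print(genotype_count(44, 2))  # 990, below the 1024 limit in the log
print(genotype_count(45, 2))  # 1035, above it
```

So for diploid samples, any site with 45 or more alleles in the combined record would exceed the 1024-genotype limit quoted in the log.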
GATK version is 4.3.0
Any help would be greatly appreciated - thank you!
Tristan
Log here:
Sample/Callset C1aSud049( TileDB row idx 32) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud050( TileDB row idx 33) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud051( TileDB row idx 34) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud052( TileDB row idx 35) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud053( TileDB row idx 36) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud054( TileDB row idx 37) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud055( TileDB row idx 38) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud056( TileDB row idx 39) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud058( TileDB row idx 40) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud059( TileDB row idx 41) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud060( TileDB row idx 42) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud061( TileDB row idx 43) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud063( TileDB row idx 44) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud064( TileDB row idx 45) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud065( TileDB row idx 46) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
Sample/Callset C1aSud066( TileDB row idx 47) at Chromosome CM023248 position 26846163 (TileDB column 26846162) has too many genotypes in the combined VCF record : 1081 : current limit : 1024 (num_alleles, ploidy) = (46, 2). Fields, such as PL, with length equal to the number of genotypes will NOT be added for this sample for this location.
13:06:20.682 WARN MinimalGenotypingEngine - No genotype contained sufficient data to recalculate site and allele qualities. Site will be skipped at location CM023248:26846163
13:06:30.188 INFO ProgressMeter - CM023248:26860204 216.5 26364000 121768.9
13:06:40.231 INFO ProgressMeter - CM023248:26877204 216.7 26381000 121753.3
13:06:50.552 INFO ProgressMeter - CM023248:26896214 216.8 26400000 121744.3
13:07:00.684 INFO ProgressMeter - CM023248:26917218 217.0 26421000 121746.4
13:07:10.845 INFO ProgressMeter - CM023248:26933223 217.2 26437000 121725.1
13:07:21.114 INFO ProgressMeter - CM023248:26951224 217.4 26455000 121712.1
13:07:31.557 INFO ProgressMeter - CM023248:26970260 217.5 26474000 121702.0
13:07:41.848 INFO ProgressMeter - CM023248:26990262 217.7 26494000 121698.0
13:07:52.202 INFO ProgressMeter - CM023248:27013457 217.9 26517000 121707.2
13:08:02.461 INFO ProgressMeter - CM023248:27031467 218.0 26535000 121694.3
13:08:12.598 INFO ProgressMeter - CM023248:27048469 218.2 26552000 121678.0
13:08:22.884 INFO ProgressMeter - CM023248:27066497 218.4 26570000 121664.9
13:08:33.266 INFO ProgressMeter - CM023248:27090947 218.6 26586000 121641.8
13:08:43.548 INFO ProgressMeter - CM023248:27110991 218.7 26606000 121637.9
13:08:53.601 INFO ProgressMeter - CM023248:27127006 218.9 26622000 121617.9
13:09:04.899 INFO ProgressMeter - CM023248:27144047 219.1 26639000 121591.0
13:09:14.950 INFO ProgressMeter - CM023248:27166064 219.3 26661000 121598.4
13:09:25.449 INFO ProgressMeter - CM023248:27195553 219.4 26684000 121606.3
13:09:35.923 INFO ProgressMeter - CM023248:27212205 219.6 26700000 121582.4
13:09:46.055 INFO ProgressMeter - CM023248:27227250 219.8 26715000 121557.3
13:09:56.326 INFO ProgressMeter - CM023248:27244563 219.9 26732000 121540.0
13:10:06.726 INFO ProgressMeter - CM023248:27271173 220.1 26752000 121535.1
13:10:17.242 INFO ProgressMeter - CM023248:27291173 220.3 26772000 121529.2
13:10:27.372 INFO ProgressMeter - CM023248:27310181 220.5 26791000 121522.3
13:10:37.550 INFO ProgressMeter - CM023248:27329181 220.6 26810000 121515.0
13:10:48.548 INFO ProgressMeter - CM023248:27352181 220.8 26833000 121518.3
13:10:58.661 INFO ProgressMeter - CM023248:27371181 221.0 26852000 121511.6
13:11:08.741 INFO ProgressMeter - CM023248:27392212 221.2 26873000 121514.2
13:11:18.813 INFO ProgressMeter - CM023248:27415019 221.3 26894000 121517.0
13:11:29.514 INFO ProgressMeter - CM023248:27433236 221.5 26912000 121500.4
13:11:39.693 INFO ProgressMeter - CM023248:27455236 221.7 26934000 121506.6
13:11:49.712 INFO ProgressMeter - CM023248:27476473 221.8 26953000 121500.8
13:11:59.984 INFO ProgressMeter - CM023248:27492473 222.0 26969000 121479.2
13:12:10.226 INFO ProgressMeter - CM023248:27515032 222.2 26991000 121484.9
13:12:20.498 INFO ProgressMeter - CM023248:27533033 222.3 27009000 121472.3
13:12:30.953 INFO ProgressMeter - CM023248:27551033 222.5 27027000 121458.1
13:12:41.274 INFO ProgressMeter - CM023248:27570034 222.7 27046000 121449.6
13:12:51.577 INFO ProgressMeter - CM023248:27586138 222.9 27062000 121427.8
13:13:01.606 INFO ProgressMeter - CM023248:27601141 223.0 27077000 121404.0
13:13:12.173 INFO ProgressMeter - CM023248:27620141 223.2 27096000 121393.4
13:13:22.340 INFO ProgressMeter - CM023248:27640141 223.4 27116000 121390.8
13:13:32.837 INFO ProgressMeter - CM023248:27656141 223.6 27132000 121367.4
13:13:43.144 INFO ProgressMeter - CM023248:27676141 223.7 27152000 121363.6
13:13:53.579 INFO ProgressMeter - CM023248:27696141 223.9 27172000 121358.7
13:14:03.944 INFO ProgressMeter - CM023248:27713150 224.1 27189000 121341.0
13:14:14.377 INFO ProgressMeter - CM023248:27730152 224.2 27206000 121322.7
13:14:25.005 INFO ProgressMeter - CM023248:27749158 224.4 27225000 121311.6
13:14:35.210 INFO ProgressMeter - CM023248:27771158 224.6 27247000 121317.7
Chromosome CM023248 position 27776521 (TileDB column 27776520) has too many alleles in the combined VCF record : 54 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
13:14:39.313 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location CM023248:27776521
13:14:45.456 INFO ProgressMeter - CM023248:27787160 224.8 27263000 121296.7
13:14:55.585 INFO ProgressMeter - CM023248:27810794 224.9 27283000 121294.6
13:15:05.870 INFO ProgressMeter - CM023248:27831107 225.1 27303000 121291.0
13:15:15.968 INFO ProgressMeter - CM023248:27852122 225.3 27324000 121293.6
13:15:26.346 INFO ProgressMeter - CM023248:27871282 225.4 27343000 121284.9
13:15:36.732 INFO ProgressMeter - CM023248:27893284 225.6 27365000 121289.3
13:15:47.359 INFO ProgressMeter - CM023248:27911284 225.8 27383000 121273.9
13:15:57.738 INFO ProgressMeter - CM023248:27928284 226.0 27400000 121256.3
13:16:08.138 INFO ProgressMeter - CM023248:27947360 226.1 27419000 121247.4
Chromosome CM023248 position 27954185 (TileDB column 27954184) has too many alleles in the combined VCF record : 85 : current limit : 50. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.
13:16:11.766 WARN MinimalGenotypingEngine - Attempting to genotype more than 50 alleles. Site will be skipped at location CM023248:27954185
13:16:18.678 INFO ProgressMeter - CM023248:27968185 226.3 27439000 121241.6
13:16:29.128 INFO ProgressMeter - CM023248:27983185 226.5 27454000 121214.6
13:16:39.280 INFO ProgressMeter - CM023248:27999202 226.7 27470000 121194.7
13:16:49.283 INFO ProgressMeter - CM023248:28015217 226.8 27486000 121176.2
13:16:57.733 INFO GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),3971.471023722291,Cpu time(s),3943.851816477288
[22 March 2023 13:16:58 GMT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 227.36 minutes.
Runtime.totalMemory()=2109734912
***********************************************************************
A USER ERROR has occurred: Bad input: We encountered a non-standard non-IUPAC base in the provided input sequence: '0'
-
The non-IUPAC characters are probably in the alleles in the VCFs coming from the Parabricks pipeline. I think you'll probably have to delete those lines before running GenomicsDBImport.
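One way to look for such lines is sketched below in Python. This is only an illustration under my own assumptions: the function name is hypothetical, plain alleles are expected to contain only A, C, G, T or N, and `.`, `*` and symbolic alleles such as `<NON_REF>` are skipped. Breakend-style alleles would also be flagged by this check, so treat hits as candidates to inspect, not as confirmed errors.

```python
import re

# Assumption: plain (non-symbolic) alleles should consist of these bases only.
BASES_ONLY = re.compile(r"[ACGTNacgtn]+")

def suspicious_alleles(vcf_lines):
    """Return (chrom, pos, allele) for REF/ALT alleles with unexpected characters."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
        for allele in [ref] + alts.split(","):
            is_symbolic = allele.startswith("<") and allele.endswith(">")
            if allele in (".", "*") or is_symbolic:
                continue
            if not BASES_ONLY.fullmatch(allele):
                hits.append((chrom, int(pos), allele))
    return hits

# Usage on a compressed GVCF (file name is a placeholder):
#   import gzip
#   with gzip.open("sample.g.vcf.gz", "rt") as handle:
#       print(suspicious_alleles(handle))
```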
-
Hi there - sorry for the delay. I found the source of the error: the genome FASTA was truncated around halfway through the first chromosome. I had used a different copy of the file from the one I have on the GPU, and at some point that copy had been corrupted. GenotypeGVCFs crashed and reported the error once it reached that point.
Thanks for your help!
Tristan
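In case it helps someone with the same symptom, a truncation like this can be caught by comparing the sequence lengths actually present in the FASTA with the lengths recorded in its `.fai` index (first column is the sequence name, second is its length). The sketch below is a rough illustration with function names of my own choosing. It only works if the index was built from an intact copy of the genome; an index made from the truncated file would agree with it.

```python
def fasta_lengths(fasta_lines):
    """Sum the sequence characters found under each FASTA header."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def fai_lengths(fai_lines):
    """Read expected lengths from .fai lines (NAME<TAB>LENGTH<TAB>...)."""
    expected = {}
    for line in fai_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 2:
            expected[fields[0]] = int(fields[1])
    return expected

def length_mismatches(fasta_lines, fai_lines):
    """Map sequence name -> (expected, observed) where the two disagree."""
    observed = fasta_lengths(fasta_lines)
    expected = fai_lengths(fai_lines)
    return {
        name: (expected[name], observed.get(name))
        for name in expected
        if observed.get(name) != expected[name]
    }

# Usage (the .fai path is an assumption about where the index lives):
#   with open("../ref/Genome.fasta") as fa, open("../ref/Genome.fasta.fai") as fai:
#       print(length_mismatches(fa, fai))
```

A sequence that is missing entirely from the FASTA shows up with an observed length of `None`. Comparing checksums of the two copies of the file would be another option.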