GATK4 ModelSegment Error Allelic-count sites must be identical across all samples
AnsweredI am running ModelSegments command on a pair of Tumor and Normal BAMs. Both BAMs should available at the GDC portal. (Normal BAM uuid: 36948955-34e7-47f9-ac5d-e8b003e8070d, Tumor BAM uuid: 036d6edd-6555-4a89-9b78-14dc66c9be93) Both tumor and normal allelic count plus tumor's copy ratio were provided as the inputs of the ModelSegments.
For these 3 files: we checked the following things.
- both tumor and normal allelicCounts input files have exactly the same 10M+ sites (first two columns of content after @SQ lines)
- all 3 input files (two allelicCounts and denoisedCR) were generated from the same reference file, with exactly the same @SQ lines.
this issue has been tested on GATK versions: 4.2.4.0, 4.2.5.0, and 4.2.6.1. The following command is run in board institute gatk docker image:
gatk --java-options -Xmx29000m ModelSegments --allelic-counts ./TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --denoised-copy-ratios ./TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.denoisedCR.tsv --output-prefix TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn --genotyping-base-error-rate 0.05 --genotyping-homozygous-log-ratio-threshold -10.0 --kernel-approximation-dimension 100 --kernel-scaling-allele-fraction 1.0 --kernel-variance-allele-fraction 0.025 --kernel-variance-copy-ratio 0.0 --maximum-number-of-segments-per-chromosome 1000 --maximum-number-of-smoothing-iterations 10 --minimum-total-allele-count-case 30 --minimum-total-allele-count-normal 30 --minor-allele-fraction-prior-alpha 25.0 --normal-allelic-counts ./TCGA-97-8179-10A-01D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --number-of-burn-in-samples-allele-fraction 50 --number-of-burn-in-samples-copy-ratio 50 --number-of-changepoints-penalty-factor 1.0 --number-of-samples-allele-fraction 100 --number-of-samples-copy-ratio 100 --number-of-smoothing-iterations-per-fit 0 --output out --smoothing-credible-interval-threshold-allele-fraction 2.0 --smoothing-credible-interval-threshold-copy-ratio 2.0 --window-size 8 --window-size 16 --window-size 32 --window-size 64 --window-size 128 --window-size 256 && touch out/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.hets.normal.tsv
The error message "Allelic-count sites must be identical across all samples." looks like raised by https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/copynumber/utils/genotyping/NaiveHeterozygousPileupGenotypingUtils.java#L51-L57 . I also dig into the intermediate variable values of the input and output of this line, there are 3726 SNP sites unique in normal and 1 site unique in tumor. But all of these sites exist in the input TSV file that means they should not be recognized as unique. I also tried to remove these 3726+1 sites from both of the input files. But will still hit this error.
At the last, this is the FULL error log from GATK ModelSegments
gatk --java-options -Xmx29000m ModelSegments --verbosity DEBUG --allelic-counts /gatk/testworkflow/tmp07WSctC6/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --denoised-copy-ratios /gatk/testworkflow/tmp07hbUgzb/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.denoisedCR.tsv --output-prefix TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn --genotyping-base-error-rate 0.05 --genotyping-homozygous-log-ratio-threshold -10.0 --kernel-approximation-dimension 100 --kernel-scaling-allele-fraction 1.0 --kernel-variance-allele-fraction 0.025 --kernel-variance-copy-ratio 0.0 --maximum-number-of-segments-per-chromosome 1000 --maximum-number-of-smoothing-iterations 10 --minimum-total-allele-count-case 30 --minimum-total-allele-count-normal 30 --minor-allele-fraction-prior-alpha 25.0 --normal-allelic-counts /gatk/testworkflow/tmp07bxSaIR/TCGA-97-8179-10A-01D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --number-of-burn-in-samples-allele-fraction 50 --number-of-burn-in-samples-copy-ratio 50 --number-of-changepoints-penalty-factor 1.0 --number-of-samples-allele-fraction 100 --number-of-samples-copy-ratio 100 --number-of-smoothing-iterations-per-fit 0 --output out --smoothing-credible-interval-threshold-allele-fraction 2.0 --smoothing-credible-interval-threshold-copy-ratio 2.0 --window-size 8 --window-size 16 --window-size 32 --window-size 64 --window-size 128 --window-size 256 && touch out/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.hets.normal.tsv
Using GATK jar /gatk/gatk-package-4.2.6.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx29000m -jar /gatk/gatk-package-4.2.6.1-local.jar ModelSegments --verbosity DEBUG --allelic-counts /gatk/testworkflow/tmp07WSctC6/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --denoised-copy-ratios /gatk/testworkflow/tmp07hbUgzb/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.denoisedCR.tsv --output-prefix TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn --genotyping-base-error-rate 0.05 --genotyping-homozygous-log-ratio-threshold -10.0 --kernel-approximation-dimension 100 --kernel-scaling-allele-fraction 1.0 --kernel-variance-allele-fraction 0.025 --kernel-variance-copy-ratio 0.0 --maximum-number-of-segments-per-chromosome 1000 --maximum-number-of-smoothing-iterations 10 --minimum-total-allele-count-case 30 --minimum-total-allele-count-normal 30 --minor-allele-fraction-prior-alpha 25.0 --normal-allelic-counts /gatk/testworkflow/tmp07bxSaIR/TCGA-97-8179-10A-01D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv --number-of-burn-in-samples-allele-fraction 50 --number-of-burn-in-samples-copy-ratio 50 --number-of-changepoints-penalty-factor 1.0 --number-of-samples-allele-fraction 100 --number-of-samples-copy-ratio 100 --number-of-smoothing-iterations-per-fit 0 --output out --smoothing-credible-interval-threshold-allele-fraction 2.0 --smoothing-credible-interval-threshold-copy-ratio 2.0 --window-size 8 --window-size 16 --window-size 32 --window-size 64 --window-size 128 --window-size 256
15:48:20.617 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
15:48:20.648 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression8917519267259328425.so
15:48:20.803 INFO ModelSegments - ------------------------------------------------------------
15:48:20.804 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.2.6.1
15:48:20.804 INFO ModelSegments - For support and documentation go to https://software.broadinstitute.org/gatk/
15:48:20.804 INFO ModelSegments - Executing as root@ad50722014ad on Linux v4.15.0-118-generic amd64
15:48:20.805 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
15:48:20.805 INFO ModelSegments - Start Date/Time: July 8, 2022 3:48:20 PM GMT
15:48:20.805 INFO ModelSegments - ------------------------------------------------------------
15:48:20.805 INFO ModelSegments - ------------------------------------------------------------
15:48:20.806 INFO ModelSegments - HTSJDK Version: 2.24.1
15:48:20.806 INFO ModelSegments - Picard Version: 2.27.1
15:48:20.806 INFO ModelSegments - Built for Spark Version: 2.4.5
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.BUFFER_SIZE : 131072
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.CREATE_INDEX : false
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.CREATE_MD5 : false
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.CUSTOM_READER_FACTORY :
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
15:48:20.808 INFO ModelSegments - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.REFERENCE_FASTA : null
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:48:20.809 INFO ModelSegments - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
15:48:20.809 DEBUG ConfigFactory - Configuration file values:
15:48:20.815 DEBUG ConfigFactory - gcsMaxRetries = 20
15:48:20.815 DEBUG ConfigFactory - gcsProjectForRequesterPays =
15:48:20.815 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
15:48:20.815 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
15:48:20.815 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
15:48:20.815 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
15:48:20.816 DEBUG ConfigFactory - samjdk.compression_level = 2
15:48:20.816 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
15:48:20.816 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
15:48:20.816 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
15:48:20.816 DEBUG ConfigFactory - spark.io.compression.codec = lzf
15:48:20.816 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
15:48:20.816 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
15:48:20.816 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
15:48:20.816 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
15:48:20.816 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
15:48:20.817 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
15:48:20.817 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
15:48:20.817 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
15:48:20.817 DEBUG ConfigFactory - createOutputBamIndex = true
15:48:20.817 INFO ModelSegments - Deflater: IntelDeflater
15:48:20.817 INFO ModelSegments - Inflater: IntelInflater
15:48:20.817 INFO ModelSegments - GCS max retries/reopens: 20
15:48:20.817 INFO ModelSegments - Requester pays: disabled
15:48:20.817 INFO ModelSegments - Initializing engine
15:48:20.818 INFO ModelSegments - Done initializing engine
15:48:20.818 INFO ModelSegments - Used memory (MB) after initializing engine: 215
15:48:20.831 INFO ModelSegments - Reading file (/gatk/testworkflow/tmp07hbUgzb/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.denoisedCR.tsv)...
15:48:26.018 INFO ModelSegments - Reading file (/gatk/testworkflow/tmp07WSctC6/TCGA-97-8179-01A-11D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv)...
15:48:40.123 INFO ModelSegments - Reading file (/gatk/testworkflow/tmp07bxSaIR/TCGA-97-8179-10A-01D-A46J-10_wgs_Illumina_gdc_realn.allelicCounts.tsv)...
15:51:21.134 INFO ModelSegments - Used memory (MB) after reading files: 5476
15:51:25.989 INFO ModelSegments - Used memory (MB) after validating data: 3332
15:51:29.478 INFO NaiveHeterozygousPileupGenotypingUtils - Genotyping heterozygous sites from available allelic counts...
15:51:29.478 INFO NaiveHeterozygousPileupGenotypingUtils - Matched normal was provided, running in matched-normal mode...
15:51:32.886 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 9051371 / 10551661 sites after filtering allelic counts with total count less than 30 in matched-normal sample TCGA-97-8179-10A-01D-A46J-10...
15:51:39.874 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 9015219 / 10551661 sites after filtering on overlap with copy-ratio intervals in matched-normal sample TCGA-97-8179-10A-01D-A46J-10...
15:53:46.466 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 1532481 / 10551661 sites after filtering on heterozygosity in matched-normal sample TCGA-97-8179-10A-01D-A46J-10...
15:53:50.434 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 10375547 / 10551661 sites after filtering allelic counts with total count less than 30 in case sample TCGA-97-8179-01A-11D-A46J-10...
15:53:55.133 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 10013210 / 10551661 sites after filtering on overlap with copy-ratio intervals in case sample TCGA-97-8179-01A-11D-A46J-10...
15:53:55.146 INFO NaiveHeterozygousPileupGenotypingUtils - Retaining allelic counts for case sample TCGA-97-8179-10A-01D-A46J-10 at heterozygous sites in matched-normal sample TCGA-97-8179-01A-11D-A46J-10...
15:53:59.229 INFO NaiveHeterozygousPileupGenotypingUtils - Retained 1531934 / 10551661 sites after applying all filters to case sample TCGA-97-8179-01A-11D-A46J-10.
15:53:59.948 INFO ModelSegments - Shutting down engine
[July 8, 2022 3:53:59 PM GMT] org.broadinstitute.hellbender.tools.copynumber.ModelSegments done. Elapsed time: 5.66 minutes.
Runtime.totalMemory()=11715739648
java.lang.IllegalArgumentException: Allelic-count sites must be identical across all samples.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.utils.genotyping.NaiveHeterozygousPileupGenotypingUtils$NaiveHeterozygousPileupGenotypingResult.<init>(NaiveHeterozygousPileupGenotypingUtils.java:51)
at org.broadinstitute.hellbender.tools.copynumber.utils.genotyping.NaiveHeterozygousPileupGenotypingUtils$NaiveHeterozygousPileupGenotypingResult.<init>(NaiveHeterozygousPileupGenotypingUtils.java:41)
at org.broadinstitute.hellbender.tools.copynumber.utils.genotyping.NaiveHeterozygousPileupGenotypingUtils.genotypeHets(NaiveHeterozygousPileupGenotypingUtils.java:204)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$ModelSegmentsData.<init>(ModelSegments.java:722)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments$ModelSegmentsData.<init>(ModelSegments.java:638)
at org.broadinstitute.hellbender.tools.copynumber.ModelSegments.doWork(ModelSegments.java:492)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
Hi Linghao Song,
Thank you for writing to the GATK forum! We hope we can help you sort this out.
Please try running ModelSegments again with the --minimum-total-allele-count-case parameter set as the current default value of 0.
I hope this helps; please let me know if that works!
Best,
Anthony -
Thank you Anthony!
I think reducing -minimum-total-allele-count-case resolves our issue!
Best,
Linghao
Please sign in to leave a comment.
2 comments