What intervals were used to create GATK's reference panel of normals for WGS CNV calling?
I'm using GATK 4.2.5.0 and trying to use the hg38 PoN_4.0_WGS_for_public.pon.hdf5 to normalize CNV calls for some WGS samples I have. I keep getting an error saying the sample intervals must match the original intervals used to build the panel of normals. Since I didn't build this panel, I don't know what those original intervals are, and I haven't been able to find good documentation of this.
These are the commands I've tried:
1. Create intervals with 1000 bp bins using GATK's publicly available interval list (wgs_coverage_regions.hg38.interval_list)
~/bin/gatk-4.2.5.0/gatk PreprocessIntervals -L ../reference_files/hg38/wgs_coverage_regions.hg38.interval_list -R ../reference_files/hg38/Homo_sapiens_assembly38.fasta --bin-length 1000 --interval-merging-rule OVERLAPPING_ONLY -O wgs_coverage_regions.hg38.1k.preprocessed.interval_list
2. Get read counts over those intervals for my sample
~/bin/gatk-4.2.5.0/gatk CollectReadCounts -I bam/SL497849.hg38.bam -L wgs_coverage_regions.hg38.1k.preprocessed.interval_list --interval-merging-rule OVERLAPPING_ONLY -O frag_counts_wgscov2/SL497849.counts.hdf5
3. Try to use the public PoN to denoise the counts
~/bin/gatk-4.2.5.0/gatk --java-options "-Xmx12g" DenoiseReadCounts -I frag_counts_wgscov2/SL497849.counts.hdf5 --count-panel-of-normals ~/reference_files/hg38/PoN_4.0_WGS_for_public.pon.hdf5 --standardized-copy-ratios cnv_results/SL497846.standardizedCR.tsv --denoised-copy-ratios cnv_results/SL497846.denoisedCR.tsv
This is the error log I get at the last step:
Using GATK jar /home/ballint2/bin/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx12g -jar /home/ballint2/bin/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar DenoiseReadCounts -I frag_counts_wgscov2/SL497849.counts.hdf5 --count-panel-of-normals /home/ballint2/reference_files/hg38/PoN_4.0_WGS_for_public.pon.hdf5 --standardized-copy-ratios cnv_results/SL497846.standardizedCR.tsv --denoised-copy-ratios cnv_results/SL497846.denoisedCR.tsv
12:23:52.839 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ballint2/bin/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Mar 14, 2022 12:23:53 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
12:23:53.376 INFO DenoiseReadCounts - ------------------------------------------------------------
12:23:53.376 INFO DenoiseReadCounts - The Genome Analysis Toolkit (GATK) v4.2.5.0
12:23:53.376 INFO DenoiseReadCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
12:23:53.376 INFO DenoiseReadCounts - Executing as ballint2@ip-172-25-130-233.rdcloud.bms.com on Linux v4.18.0-365.el8.x86_64 amd64
12:23:53.376 INFO DenoiseReadCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_322-b06
12:23:53.377 INFO DenoiseReadCounts - Start Date/Time: March 14, 2022 12:23:52 PM EDT
12:23:53.377 INFO DenoiseReadCounts - ------------------------------------------------------------
12:23:53.377 INFO DenoiseReadCounts - ------------------------------------------------------------
12:23:53.377 INFO DenoiseReadCounts - HTSJDK Version: 2.24.1
12:23:53.378 INFO DenoiseReadCounts - Picard Version: 2.25.4
12:23:53.378 INFO DenoiseReadCounts - Built for Spark Version: 2.4.5
12:23:53.378 INFO DenoiseReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:23:53.378 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:23:53.378 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:23:53.378 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:23:53.378 INFO DenoiseReadCounts - Deflater: IntelDeflater
12:23:53.378 INFO DenoiseReadCounts - Inflater: IntelInflater
12:23:53.378 INFO DenoiseReadCounts - GCS max retries/reopens: 20
12:23:53.378 INFO DenoiseReadCounts - Requester pays: disabled
12:23:53.379 INFO DenoiseReadCounts - Initializing engine
12:23:53.379 INFO DenoiseReadCounts - Done initializing engine
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
12:23:53.529 INFO DenoiseReadCounts - Reading read-counts file (frag_counts_wgscov2/SL497849.counts.hdf5)...
12:23:55.676 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
12:23:59.360 INFO DenoiseReadCounts - Shutting down engine
[March 14, 2022 12:23:59 PM EDT] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=2469920768
java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:798)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDReadCountPanelOfNormals.denoise(SVDReadCountPanelOfNormals.java:88)
at org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts.doWork(DenoiseReadCounts.java:204)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
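For anyone debugging the same failure: the check that throws here is the interval comparison in SVDDenoisingUtils, so the first thing to look at is which interval datasets each HDF5 file actually contains. Below is a minimal h5py sketch that just walks both files and prints their layout; it doesn't assume any GATK-specific dataset names.
# inspect_hdf5.py -- print every group/dataset in a GATK HDF5 file so the
# interval data in the sample counts file and the PoN can be located and compared.
import sys
import h5py

def dump(path):
    with h5py.File(path, "r") as f:
        # visititems walks the whole file; print each object's name and, for datasets, its shape
        f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))

if __name__ == "__main__":
    # e.g. python inspect_hdf5.py frag_counts_wgscov2/SL497849.counts.hdf5 PoN_4.0_WGS_for_public.pon.hdf5
    for hdf5_path in sys.argv[1:]:
        print(f"=== {hdf5_path} ===")
        dump(hdf5_path)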
Thank you for any help.
-
Hi Tracy Ballinger,
Thanks for writing in. Could you take a look at this previous forum post related to the same issue? It looks like it may be the same problem occurring here. Could you try the suggestion by Samuel in that post and let me know if it is helpful?
Kind regards,
Pamela
-
Hi Pamela,
I am having a different issue because I'm trying to use a panel of normals that has already been constructed and made available through GATK. I don't have my own panel of normals, so I'm trying the available resource I found referenced in the GATK workflow "gatk4-somatic-with-preprocessing/FullSomaticPipeline_public-urls.json" with this link: gs://gatk-test-data/cnv/somatic/PoN_4.0_WGS_for_public.pon.hdf5.
I was struggling to find which intervals were used to construct it, or any other documentation on it. I have since been able to get the intervals out of the HDF5 file, but any documentation on how the GATK PoN_4.0_WGS_for_public.pon.hdf5 was constructed would still be appreciated.
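In case it helps anyone else, here is a rough sketch of how the intervals can be read out of the PoN HDF5 with h5py. The dataset paths under original_data/intervals are an assumption based on GATK's SVD panel-of-normals HDF5 layout, so confirm them against a structure dump of the file before relying on this.
# extract_pon_intervals.py -- read the original intervals out of the public PoN HDF5.
# NOTE: the dataset paths under "original_data/intervals" are an assumption about
# GATK's SVD panel-of-normals layout -- confirm them against a structure dump first.
import h5py

PON = "PoN_4.0_WGS_for_public.pon.hdf5"

with h5py.File(PON, "r") as f:
    # assumed layout: an indexed list of contig names plus a matrix of (contig_index, start, end)
    contigs = [c.decode() if isinstance(c, bytes) else str(c)
               for c in f["original_data/intervals/indexed_contig_names"][:]]
    ise = f["original_data/intervals/transposed_index_start_end"][:]
    if ise.shape[0] == 3:  # stored transposed (3 x N); flip to one interval per row
        ise = ise.T

# write a simple contig/start/end TSV, keeping the coordinates exactly as stored
with open("pon_original_intervals.tsv", "w") as out:
    for contig_idx, start, end in ise:
        out.write(f"{contigs[int(contig_idx)]}\t{int(start)}\t{int(end)}\n")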
Kind regards,
Tracy
-
Hi Tracy Ballinger,
Okay, thank you for clarifying, and I'm glad that you were still able to find the intervals. At the moment, I don't believe we have documentation explicitly describing how this panel of normals and its intervals were constructed. However, you can find some general information about how the Broad creates the public panel of normals and how WGS intervals are determined in the Panel of Normals and the Intervals and Interval Lists documentation. Please let me know if this is helpful and if you are still experiencing any issues with running the pipeline.
Kind regards,
Pamela
-
I think Tracy raises a great question. If you only provide the PoN but not the intervals that were used to generate it, DenoiseReadCounts will error out.
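For completeness, here is a sketch of one way to turn those extracted intervals into a Picard-style interval_list so that CollectReadCounts can be run over exactly the bins the panel was built on; the file names are placeholders taken from this thread, not an official GATK recipe.
# make_pon_interval_list.py -- convert the extracted PoN intervals into a Picard-style
# .interval_list. File names below are placeholders, not part of any documented workflow.
DICT = "Homo_sapiens_assembly38.dict"   # hg38 sequence dictionary shipped with the reference
TSV = "pon_original_intervals.tsv"      # output of the extraction sketch above
OUT = "pon_original_intervals.interval_list"

with open(OUT, "w") as out:
    # a Picard interval_list begins with the @HD/@SQ header copied from the sequence dictionary
    with open(DICT) as d:
        out.writelines(line for line in d if line.startswith("@"))
    # then one interval per line: contig, start, end (1-based inclusive), strand, name
    with open(TSV) as t:
        for line in t:
            contig, start, end = line.rstrip("\n").split("\t")
            out.write(f"{contig}\t{start}\t{end}\t+\t.\n")
Re-running CollectReadCounts with -L pon_original_intervals.interval_list --interval-merging-rule OVERLAPPING_ONLY should then produce counts over the bins the panel expects, which is exactly what DenoiseReadCounts validates.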