PostprocessGermlineCNVCalls extremely slow after JointGermlineCNVSegmentation and produces many artifacts on sex chromosomes
Hello, I am a regular user of the GATK4 gCNV pipeline. Since GATK 4.2.0.0, you have introduced joint germline CNV calling, which I wanted to try, and I encountered several problems.
I ran, in order, DetermineGermlineContigPloidy and GermlineCNVCaller (splitting my targets into 8 shards), version 4.3.0.0, on a cohort of 540 patients. Then I used PostprocessGermlineCNVCalls to produce the VCF files for these patients. Next, I used the beta tool JointGermlineCNVSegmentation to produce a multisample VCF, which I passed back to PostprocessGermlineCNVCalls with --clustered-breakpoints to produce joint VCFs.
The problem is that the time needed to produce each VCF file has increased about 20-fold (on average 120 minutes instead of 6), which makes the approach difficult to use on large cohorts.
Here is an extract of the logs for one sample, first without and then with the --clustered-breakpoints option:
14:23:53.500 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:23:54.242 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
14:23:54.242 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.3.0.0
14:23:54.242 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/
14:23:54.262 INFO PostprocessGermlineCNVCalls - Executing as tintest@dahu132.u-ga.fr on Linux v5.10.0-18-amd64 amd64
14:23:54.262 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
14:23:54.263 INFO PostprocessGermlineCNVCalls - Start Date/Time: December 2, 2022 2:23:53 PM GMT
14:23:54.263 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
14:23:54.263 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
14:23:54.263 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 3.0.1
14:23:54.263 INFO PostprocessGermlineCNVCalls - Picard Version: 2.27.5
14:23:54.264 INFO PostprocessGermlineCNVCalls - Built for Spark Version: 2.4.5
14:23:54.264 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2
14:23:54.264 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:23:54.264 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:23:54.264 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:23:54.264 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater
14:23:54.264 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater
14:23:54.264 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20
14:23:54.264 INFO PostprocessGermlineCNVCalls - Requester pays: disabled
14:23:54.264 INFO PostprocessGermlineCNVCalls - Initializing engine
14:27:20.051 INFO PostprocessGermlineCNVCalls - Done initializing engine
14:27:21.530 INFO ProgressMeter - Starting traversal
14:27:21.531 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
14:27:21.533 INFO ProgressMeter - unmapped 0.0 0 NaN
14:27:21.533 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes.
14:27:21.533 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file...
14:27:21.643 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /bettik/tintest/CNV_Hyperexome/intervals/genotyped-intervals-SAMPLE_6.vcf.gz...
14:27:21.643 INFO PostprocessGermlineCNVCalls - Analyzing shard 1 / 8...
14:27:22.840 INFO PostprocessGermlineCNVCalls - Analyzing shard 2 / 8...
14:27:23.602 INFO PostprocessGermlineCNVCalls - Analyzing shard 3 / 8...
14:27:24.430 INFO PostprocessGermlineCNVCalls - Analyzing shard 4 / 8...
14:27:24.879 INFO PostprocessGermlineCNVCalls - Analyzing shard 5 / 8...
14:27:26.062 INFO PostprocessGermlineCNVCalls - Analyzing shard 6 / 8...
14:27:26.849 INFO PostprocessGermlineCNVCalls - Analyzing shard 7 / 8...
14:27:27.893 INFO PostprocessGermlineCNVCalls - Analyzing shard 8 / 8...
14:27:28.412 INFO PostprocessGermlineCNVCalls - Generating segments...
14:29:52.532 INFO PostprocessGermlineCNVCalls - Parsing Python output...
14:29:52.537 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /bettik/tintest/CNV_Hyperexome/segments/genotyped-segments-SAMPLE_6.vcf.gz...
14:29:52.703 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios...
14:29:53.592 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /bettik/tintest/CNV_Hyperexome/ratios/denoised-copy-ratios-SAMPLE_6.tsv...
14:29:55.274 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete.
14:29:55.275 INFO PostprocessGermlineCNVCalls - Shutting down engine
[December 2, 2022 2:29:55 PM GMT] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 6.03 minutes.
Runtime.totalMemory()=2820145152
Using GATK jar /gatk/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.3.0.0-local.jar PostprocessGermlineCNVCalls --model-shard-path GermlineCNVCaller/GermlineCNVCaller_1_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_2_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_3_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_4_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_5_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_6_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_7_of_8-model/ --model-shard-path GermlineCNVCaller/GermlineCNVCaller_8_of_8-model/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_1_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_2_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_3_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_4_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_5_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_6_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_7_of_8-calls/ --calls-shard-path GermlineCNVCaller/GermlineCNVCaller_8_of_8-calls/ --allosomal-contig chrX --allosomal-contig chrY --autosomal-ref-copy-number 2 --contig-ploidy-calls DetermineGermlineContigPloidy/DetermineGermlineContigPloidy-calls/ --sample-index 6 --output-genotyped-intervals intervals/genotyped-intervals-SAMPLE_6.vcf.gz --output-genotyped-segments segments/genotyped-segments-SAMPLE_6.vcf.gz --output-denoised-copy-ratios ratios/denoised-copy-ratios-SAMPLE_6.tsv --sequence-dictionary hg19_min_oldM.fa.dict
#PostprocessGermlineCNVCalls_joint
23:45:30.659 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:45:31.000 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
23:45:31.001 INFO PostprocessGermlineCNVCalls - The Genome Analysis Toolkit (GATK) v4.3.0.0
23:45:31.001 INFO PostprocessGermlineCNVCalls - For support and documentation go to https://software.broadinstitute.org/gatk/
23:45:31.002 INFO PostprocessGermlineCNVCalls - Executing as testardqu@chu-lyon.fr@ge95142-vm1 on Linux v5.18.0-0.bpo.1-amd64 amd64
23:45:31.002 INFO PostprocessGermlineCNVCalls - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
23:45:31.002 INFO PostprocessGermlineCNVCalls - Start Date/Time: December 5, 2022 11:45:30 PM GMT
23:45:31.002 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
23:45:31.002 INFO PostprocessGermlineCNVCalls - ------------------------------------------------------------
23:45:31.003 INFO PostprocessGermlineCNVCalls - HTSJDK Version: 3.0.1
23:45:31.003 INFO PostprocessGermlineCNVCalls - Picard Version: 2.27.5
23:45:31.003 INFO PostprocessGermlineCNVCalls - Built for Spark Version: 2.4.5
23:45:31.003 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:45:31.003 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:45:31.003 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:45:31.003 INFO PostprocessGermlineCNVCalls - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:45:31.004 INFO PostprocessGermlineCNVCalls - Deflater: IntelDeflater
23:45:31.004 INFO PostprocessGermlineCNVCalls - Inflater: IntelInflater
23:45:31.004 INFO PostprocessGermlineCNVCalls - GCS max retries/reopens: 20
23:45:31.004 INFO PostprocessGermlineCNVCalls - Requester pays: disabled
23:45:31.004 INFO PostprocessGermlineCNVCalls - Initializing engine
23:46:06.321 INFO PostprocessGermlineCNVCalls - Done initializing engine
23:46:07.433 INFO ProgressMeter - Starting traversal
23:46:07.433 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
23:46:07.434 INFO ProgressMeter - unmapped 0.0 0 NaN
23:46:07.434 INFO ProgressMeter - Traversal complete. Processed 0 total records in 0.0 minutes.
23:46:07.434 INFO PostprocessGermlineCNVCalls - Generating intervals VCF file...
23:46:07.460 INFO PostprocessGermlineCNVCalls - Writing intervals VCF file to /srv/scratch/testardqu/CNV_Hyperexome/intervals_joint/genotyped-intervals-SAMPLE_6.vcf.gz...
23:46:07.460 INFO PostprocessGermlineCNVCalls - Analyzing shard 1 / 8...
23:46:08.946 INFO PostprocessGermlineCNVCalls - Analyzing shard 2 / 8...
23:46:09.725 INFO PostprocessGermlineCNVCalls - Analyzing shard 3 / 8...
23:46:10.380 INFO PostprocessGermlineCNVCalls - Analyzing shard 4 / 8...
23:46:11.132 INFO PostprocessGermlineCNVCalls - Analyzing shard 5 / 8...
23:46:11.901 INFO PostprocessGermlineCNVCalls - Analyzing shard 6 / 8...
23:46:12.730 INFO PostprocessGermlineCNVCalls - Analyzing shard 7 / 8...
23:46:14.288 INFO PostprocessGermlineCNVCalls - Analyzing shard 8 / 8...
23:46:15.617 INFO PostprocessGermlineCNVCalls - Generating segments...
01:48:30.792 INFO PostprocessGermlineCNVCalls - Parsing Python output...
01:48:30.875 INFO PostprocessGermlineCNVCalls - Writing segments VCF file to /srv/scratch/testardqu/CNV_Hyperexome/segments_joint/genotyped-segments-SAMPLE_6.vcf.gz...
01:48:46.860 INFO PostprocessGermlineCNVCalls - Generating denoised copy ratios...
01:48:47.487 INFO PostprocessGermlineCNVCalls - Writing denoised copy ratios to /srv/scratch/testardqu/CNV_Hyperexome/ratios_joint/denoised-copy-ratios-SAMPLE_6.tsv...
01:48:47.773 INFO PostprocessGermlineCNVCalls - PostprocessGermlineCNVCalls complete.
01:48:47.773 INFO PostprocessGermlineCNVCalls - Shutting down engine
[December 6, 2022 1:48:47 AM GMT] org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls done. Elapsed time: 123.29 minutes.
Runtime.totalMemory()=7257194496
Using GATK jar /gatk/gatk-package-4.3.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms4g -Djava.io.tmpdir=/srv/scratch/testardqu/CNV_Hyperexome/tmp/ -jar /gatk/gatk-package-4.3.0.0-local.jar PostprocessGermlineCNVCalls --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_1_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_2_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_3_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_4_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_5_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_6_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_7_of_8-model/ --model-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_8_of_8-model/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_1_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_2_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_3_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_4_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_5_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_6_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_7_of_8-calls/ --calls-shard-path /srv/scratch/testardqu/CNV_Hyperexome/GermlineCNVCaller/GermlineCNVCaller_8_of_8-calls/ --clustered-breakpoints /srv/scratch/testardqu/CNV_Hyperexome/CNV_Hyperexome.vcf.gz --input-intervals-vcf /srv/scratch/testardqu/CNV_Hyperexome/intervals/genotyped-intervals-SAMPLE_6.vcf.gz --allosomal-contig chrX --allosomal-contig chrY --autosomal-ref-copy-number 2 --contig-ploidy-calls /srv/scratch/testardqu/CNV_Hyperexome/DetermineGermlineContigPloidy/DetermineGermlineContigPloidy-calls/ --sample-index 6 --output-genotyped-intervals /srv/scratch/testardqu/CNV_Hyperexome/intervals_joint/genotyped-intervals-SAMPLE_6.vcf.gz --output-genotyped-segments /srv/scratch/testardqu/CNV_Hyperexome/segments_joint/genotyped-segments-SAMPLE_6.vcf.gz --output-denoised-copy-ratios /srv/scratch/testardqu/CNV_Hyperexome/ratios_joint/denoised-copy-ratios-SAMPLE_6.tsv --sequence-dictionary /srv/scratch/testardqu/CNV_Hyperexome/hg19_min_oldM.dict
Is this normal? Is there a way to reduce the computation time?
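Judging from the timestamps in the two logs, almost all of the extra time falls between the "Generating segments..." and "Parsing Python output..." messages: about 2.4 minutes without --clustered-breakpoints versus about 122 minutes with it. Below is a minimal sketch for extracting that duration from a log, assuming the log was saved with one entry per line and each entry starts with an HH:MM:SS.mmm timestamp; the function name `segment_minutes` and the log file name are hypothetical.

```shell
#!/bin/sh
# segment_minutes LOGFILE
# Prints the minutes elapsed between the "Generating segments..." entry and
# the following "Parsing Python output..." entry of a GATK log.
# Assumption: one log entry per line, first field is HH:MM:SS.mmm.
segment_minutes() {
  LC_ALL=C awk '
    # Convert HH:MM:SS.mmm to seconds since midnight.
    function secs(ts,   p) {
      split(ts, p, ":")
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    /Generating segments/ { start = secs($1); have_start = 1 }
    /Parsing Python output/ && have_start {
      d = secs($1) - start
      if (d < 0) d += 86400   # the run crossed midnight
      printf "%.1f\n", d / 60
      exit
    }
  ' "$1"
}
```

Usage would be something like `segment_minutes postprocess_SAMPLE_6.log`, run once per log to compare the two modes.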
In addition, I noticed that an abnormally large number of most likely artifactual CNVs were called on the sex chromosomes in the joint VCFs, and none of the calls there are usable, whereas some CNVs were (apparently) called correctly in the VCFs produced by the first run of PostprocessGermlineCNVCalls.
Here are the commands I ran on the segments VCFs produced by the second run (with --clustered-breakpoints); they show a large number of artifactual CNVs on the sex chromosomes in my data (for the autosomes, everything looks normal):
zgrep -v "#" *.gz | grep chrY | sort | uniq | cut -f 3 | sort -V | uniq -c
540 CNV_chrY_7042509_7064541
540 CNV_chrY_9357472_9360034
...
540 CNV_chrY_24795591_24796548
540 CNV_chrY_24795591_24893824

zgrep -v "#" *.gz | grep chrY | sort | uniq | cut -f 3 | sort -V | uniq -c | wc -l
27

zgrep -v "#" *.gz | grep chrY | sort | uniq | grep PASS | cut -f 3 | sort -V | uniq -c
540 CNV_chrY_7042509_7064541
288 CNV_chrY_9357472_9360034
...
287 CNV_chrY_24795591_24796548
285 CNV_chrY_24795591_24893824

zgrep -v "#" *.gz | grep chrX | sort | uniq | cut -f 3 | sort -V | uniq -c
540 CNV_chrX_283817_313314
540 CNV_chrX_283817_387075
...
540 CNV_chrX_156022580_156025724
540 CNV_chrX_156024682_156025724

zgrep -v "#" *.gz | grep chrX | sort | uniq | cut -f 3 | sort -V | uniq -c | wc -l
181
Some of them are removed by the PASS filter, but the vast majority remain, which is a problem. Moreover, no CNVs other than these artifacts appear on these chromosomes; there is nothing but noise.
Can you help me out?
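For anyone who wants to reproduce these counts, here is a slightly stricter sketch of the same tally. It matches the contig against the CHROM column exactly and, optionally, requires PASS in the FILTER column (column 7 of a VCF), instead of grepping the whole line. The function name `count_segment_ids` and the PASS/ANY argument are my own conventions, and it assumes gzip-compressed segments VCFs, one per sample.

```shell
#!/bin/sh
# count_segment_ids CONTIG MODE FILE...
# For each segment ID on CONTIG, prints "<number of files containing it> <ID>".
# MODE is PASS (keep only records whose FILTER column is PASS) or ANY.
count_segment_ids() {
  contig=$1
  mode=$2
  shift 2
  for f in "$@"; do
    # Decompress, skip header lines, keep the ID column (3) of matching
    # records, and deduplicate within the file so each file counts once.
    gzip -dc "$f" |
      awk -F '\t' -v c="$contig" -v m="$mode" \
        '!/^#/ && $1 == c && (m == "ANY" || $7 == "PASS") { print $3 }' |
      sort -u
  done | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, `count_segment_ids chrY PASS segments_joint/*.vcf.gz` should correspond to the third command above, and piping the result to `wc -l` gives the number of distinct IDs.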
Regards.
-
Hello,
Should I open a GitHub issue instead to get an answer?
Sincerely.