Issue with combining g.vcf files
AnsweredI have an issue with combining my g.vcfs. It is 96 samples called against an amplicon based reference so it's approximately 38k (called the 50k) sequences of 250 bp or less. These codes previously worked using the full genome or using our 365 amplicon based reference of known SNPs. However, since I'm 100% sure which amplicons were amplified in this experiment since it was outsourced, I'm using the whole 50k as a reference. I have tried with and without making the output a g.vcf.gz, and with and without for the inputs g.vcf.gz files. Five of the samples are blanks which I've tried using as controls or removing them and it doesn't work.
GATK v4.2.3.0
Code used with g.vcf.gz output:
gatk CombineGVCFs --java-options "-Xmx20g" -R ${ref}.fasta --variant gvcf.list -O ${project}.g.vcf.gz
All the files seem fine after running GenotypeGVCFs. But after running CombineGVCfs using the code above, it never gets passed initializing the engine, it just reads all of the samples as shown below (only copied in the first few, but it reads all 96) and stalls:
Using GATK jar /home/barlex/miniconda3/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx20g -jar /home/barlex/miniconda3/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar CombineGVCFs -R 50k.fasta --variant gvcf.list -O CIxGP_SNPs.g.vcf.gz
09:11:30.941 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/barlex/miniconda3/share/gatk4-4.2.3.0-0/gatk-package-4.2.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 14, 2022 9:11:31 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:11:31.089 INFO CombineGVCFs - ------------------------------------------------------------
09:11:31.089 INFO CombineGVCFs - The Genome Analysis Toolkit (GATK) v4.2.3.0
09:11:31.089 INFO CombineGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
09:11:31.090 INFO CombineGVCFs - Executing as barlex@DESKTOP-VDKQE0G on Linux v4.4.0-19041-Microsoft amd64
09:11:31.090 INFO CombineGVCFs - Java runtime: OpenJDK 64-Bit Server VM v11.0.11+9-Ubuntu-0ubuntu2.20.04
09:11:31.090 INFO CombineGVCFs - Start Date/Time: January 14, 2022 at 9:11:30 AM PST
09:11:31.090 INFO CombineGVCFs - ------------------------------------------------------------
09:11:31.090 INFO CombineGVCFs - ------------------------------------------------------------
09:11:31.091 INFO CombineGVCFs - HTSJDK Version: 2.24.1
09:11:31.091 INFO CombineGVCFs - Picard Version: 2.25.4
09:11:31.091 INFO CombineGVCFs - Built for Spark Version: 2.4.5
09:11:31.091 INFO CombineGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:11:31.091 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:11:31.091 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:11:31.092 INFO CombineGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:11:31.092 INFO CombineGVCFs - Deflater: IntelDeflater
09:11:31.092 INFO CombineGVCFs - Inflater: IntelInflater
09:11:31.092 INFO CombineGVCFs - GCS max retries/reopens: 20
09:11:31.092 INFO CombineGVCFs - Requester pays: disabled
09:11:31.092 INFO CombineGVCFs - Initializing engine
09:11:32.042 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/CI_1_S50_R1_001.sorted.g.vcf
09:11:32.507 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/CI_2_S40_R1_001.sorted.g.vcf
09:11:33.163 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/CI_3_S18_R1_001.sorted.g.vcf
09:11:33.434 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/CI_4_S21_R1_001.sorted.g.vcf
09:11:33.793 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/CI_5_S94_R1_001.sorted.g.vcf
09:11:34.125 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/F2-A10_S10_R1_001.sorted.g.vcf
09:11:34.473 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/F2-A11_S11_R1_001.sorted.g.vcf
09:11:34.884 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/F2-A12_S12_R1_001.sorted.g.vcf
09:11:35.125 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/F2-A1_S1_R1_001.sorted.g.vcf
09:11:35.502 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/Data_Shaun/CIxGP_F2_Mapping/Genotyping/F2-A2_S2_R1_001.sorted.g.vcf
-
Hi Shaun Clare,
Is there a reason you are not using GenomicsDBImport? GenomicsDBImport has a lot of performance improvements that CombineGVCFs doesn't have.
Best,
Genevieve
-
Our pipeline was built using CombineGVCFs along time ago so I was familiar with it. Isn't the intervals option required to make the GenomicsDB datastore? I just want all SNP across the whole 'genome'
-
Yes, intervals are required with GenomicsDB. Many people use it to analyze the whole genome. You can provide an interval list that is just a list of your chromosomes. Here is an article containing information on how to use intervals with WGS: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
-
It seems to be working but I'm guessing this is going to take a long time. It got passed initializing the engine but gave a warning to not use more than 100 intervals, this uses 44,000 amplicons as intervals. Do you have any advice?
-
We have a usage and performance guidelines article for GenomicsDB. Generally, when you have many intervals, we recommend using the option --merge-input-intervals which can really help to speed up the import.
-
Unfortunately since its 44k separate amplicons, --merge-input-intervals wasn't able to work since it can't detect if they were abutting. I used --merge-contigs-into-num-partitions 50 instead and I got all the way to QTL mapping with results that make sense so I think it worked.
-
Oh good! Glad to hear it's working so far. Please let me know if you have further questions.
Please sign in to leave a comment.
7 comments