Problems combining gVCFs that have thousands of scaffolds
Dear all,
I have multiple g.vcf files that were obtained with HaplotypeCaller version from BAM files that were previously aligned (with bwa) to a reference genome of 497,616 scaffolds. The data is gDNA, obtained from shotgun sequencing.
Now I want to combine all gVCF files into a single “cohort” VCF file. My problem arises when I run GenomicsDBImport. I am getting thousands of subfolders/files, which exceeds the maximum number of files that my server allows. Here is the exact GATK command I used:
# java version 1.8.0
gatk- --java-options "-Xmx8G" GenomicsDBImport \
$(for file in *.g.vcf.gz; do echo "-V $file "; done) \
--genomicsdb-workspace-path ${GenW} \
--tmp-dir=${temp} \
-L NODE_19999
I tried CombineGVCFs instead, but it has been running for several days and is frozen at the following lines:
19:53:31.697 INFO FeatureManager - Using codec VCFCodec to read file file:file1.vcf
9:53:40.044 INFO FeatureManager - Using codec VCFCodec to read file file: file2.vcf
Here is the exact GATK command I used for CombineGVCFs:
gatk- --java-options "-Xmx20G" \
CombineGVCFs \
-R ${Reference} \
$(for file in *.g.vcf.gz; do echo "-V $file "; done) \
-O ${Gvcf}/Ouput.g.vcf.gz
Also, I tried `HaplotypeCaller` on previous versions of GATK such as GATK 3.7.0, but it has been frozen for days when I try to call genotypes from my BAM files. It gets stuck at the following line:
INFO 12:18:08,378 GenomeAnalysisEngine - Strictness is SILENT
Here is the exact GATK command I used for HaplotypeCaller with GATK v3.7.0:
java -jar -Xmx16g GenomeAnalysisTK-3.7-0-gcfedb67/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-mbq 20 \
-out_mode EMIT_ALL_SITES \
--dontUseSoftClippedBases \
-I ${BAM} \
-o output.g.vcf.gz
Do you have any suggestions for cases where the gVCFs have a very large number of scaffolds? Thanks in advance for your help.
Hi Dch,
Please retry GenomicsDBImport with the `--sample-name-map` argument. See the tool docs for more info:
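In case it helps, here is a minimal sketch of how such a map file could be built. `--sample-name-map` expects a tab-delimited file with one sample name and one gVCF path per line. The helper name `make_sample_map`, the output name `cohort.sample_map`, and the choice to derive each sample name from its file name are assumptions for illustration only; GenomicsDBImport takes the sample names from this file, so adjust them to whatever names you want in the cohort.

```shell
# Hypothetical helper: write "sample<TAB>path" for every *.g.vcf.gz in a directory.
# Assumption: the sample name is the file name without the .g.vcf.gz suffix.
make_sample_map() {
    dir="$1"
    out="$2"
    : > "$out"                          # start with an empty map file
    for f in "$dir"/*.g.vcf.gz; do
        [ -e "$f" ] || continue         # skip if the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f" .g.vcf.gz)" "$f" >> "$out"
    done
}

# Possible usage (workspace, tmp dir and interval taken from the question):
# make_sample_map . cohort.sample_map
# gatk --java-options "-Xmx8G" GenomicsDBImport \
#     --sample-name-map cohort.sample_map \
#     --genomicsdb-workspace-path "${GenW}" \
#     --tmp-dir "${temp}" \
#     -L NODE_19999
```

The map replaces the long list of `-V` arguments produced by the `for` loop in the original command.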
Hi Dch, have you tried the `--sample-name-map` argument, and did it solve the problem? I am running into the same issue. It would be great to hear whether you sorted it out.
Hi mmcui, have you tried the newest version of GenomicsDBImport? If you are using a shared cluster system, you can try the `--genomicsdb-shared-posixfs-optimizations` argument to optimize how GenomicsDB runs. (more info here:
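As a rough sketch, the flag could be added to the import command like this. The function `import_cmd` is a hypothetical dry-run helper that only prints the command so it can be reviewed before running; the workspace, map file and interval names are placeholders, and the memory setting is copied from the original question. This assumes a GATK version recent enough to include the flag.

```shell
# Hypothetical dry-run helper: print the GenomicsDBImport command, do not run it.
import_cmd() {
    workspace="$1"
    sample_map="$2"
    interval="$3"
    echo gatk --java-options -Xmx8G GenomicsDBImport \
        --sample-name-map "$sample_map" \
        --genomicsdb-workspace-path "$workspace" \
        --genomicsdb-shared-posixfs-optimizations true \
        -L "$interval"
}

# Placeholder arguments; replace with your own workspace, map file and scaffold.
import_cmd my_workspace cohort.sample_map NODE_19999
```

Once the printed command looks right, it can be pasted into the job script as-is.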
Hi Genevieve, The newest version installed on the cluster is gatk/ But I used because is not good with the next step GenotypeGVCFs (in my experience).
mmcui the best version right now is, which is our current version. Please try that if possible.