Problems combining gVCFs that have thousands of scaffolds
Dear all,
I have multiple g.vcf files that were generated with HaplotypeCaller (GATK 4.1.4.1) from BAM files previously aligned with bwa to a reference genome of 497,616 scaffolds. The data are gDNA from shotgun sequencing.
Now I want to combine all the gVCF files into a single “cohort” VCF file. My problem arises when I run GenomicsDBImport: it creates thousands of subfolders/files, which exceeds the maximum number of files that my server allows. Here is the exact GATK command I used:
# java version 1.8.0
gatk-4.1.4.1/gatk --java-options "-Xmx8G" GenomicsDBImport \
$(for file in *.g.vcf.gz; do echo "-V $file "; done) \
--genomicsdb-workspace-path ${GenW} \
--tmp-dir=${temp} \
-L NODE_19999
I tried CombineGVCFs instead, but it has been running for several days and appears to be frozen at the following lines:
19:53:31.697 INFO FeatureManager - Using codec VCFCodec to read file file:file1.vcf
9:53:40.044 INFO FeatureManager - Using codec VCFCodec to read file file: file2.vcf
Here is the exact GATK command I used for CombineGVCFs:
gatk-4.1.4.1/gatk --java-options "-Xmx20G" \
CombineGVCFs \
-R ${Reference} \
$(for file in *.g.vcf.gz; do echo "-V $file "; done) \
-O ${Gvcf}/Ouput.g.vcf.gz \
--tmp-dir=${temp}
I also tried `HaplotypeCaller` from an earlier GATK version (3.7.0), but it has been frozen for days when I try to call genotypes from my BAM files. It gets stuck at the following line:
INFO 12:18:08,378 GenomeAnalysisEngine - Strictness is SILENT
Here is the exact command I used for HaplotypeCaller with GATK 3.7.0:
java -jar -Xmx16g GenomeAnalysisTK-3.7-0-gcfedb67/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R reference.fa \
-ERC BP_RESOLUTION \
-mbq 20 \
-out_mode EMIT_ALL_SITES \
--dontUseSoftClippedBases \
-I ${BAM} \
-o output.g.vcf.gz
Do you have any suggestions for cases where the gVCFs have a very large number of scaffolds? Thanks in advance for your help.
Dch
-
Hi Dch,
Please retry GenomicsDBImport with the `--sample-name-map` argument. See the tool docs for more info: https://gatk.broadinstitute.org/hc/en-us/articles/360036712071-GenomicsDBImport
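For anyone unsure what the sample map looks like: it is a tab-delimited text file with one line per sample, giving a sample name and the path to that sample's gVCF. Below is a minimal sketch of one way to build it. The function name `make_sample_map`, the output name `cohort.sample_map`, and the choice of deriving each sample name from the file name are assumptions for illustration, not something taken from the original post.

```shell
#!/bin/sh
# Sketch: build a tab-separated sample map (sample name <TAB> gVCF path)
# for GenomicsDBImport's --sample-name-map argument.
# Assumption: the sample name is the file name minus ".g.vcf.gz".

make_sample_map() {
    # $1 = directory holding the gVCFs, $2 = sample map file to write
    gvcf_dir="$1"
    map_file="$2"
    : > "$map_file"                       # start from an empty file
    for gvcf in "$gvcf_dir"/*.g.vcf.gz; do
        [ -e "$gvcf" ] || continue        # the glob matched nothing
        sample="$(basename "$gvcf" .g.vcf.gz)"
        printf '%s\t%s\n' "$sample" "$gvcf" >> "$map_file"
    done
}

make_sample_map "." "cohort.sample_map"

# The map then replaces the long list of -V arguments, for example:
#   gatk --java-options "-Xmx8G" GenomicsDBImport \
#       --sample-name-map cohort.sample_map \
#       --genomicsdb-workspace-path ${GenW} \
#       --tmp-dir=${temp} \
#       -L NODE_19999
```

It is worth checking that the names written to the map are the ones you want in the final cohort VCF before running the import.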
-
Hi Dch, have you tried the `--sample-name-map` argument, and did it solve the problem? I am running into the same issue. It would be great to hear whether you sorted it out.
-
Hi mmcui, have you tried the newest version of GenomicsDBImport? If you are using a shared cluster file system, you can check out the `--genomicsdb-shared-posixfs-optimizations` argument to optimize running GenomicsDB (more info here: https://gatk.broadinstitute.org/hc/en-us/articles/360051305591-GenomicsDBImport).
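As a rough sketch, the import command from the original post with this argument added might look like the following. The wrapper function `run_import`, the `DRY_RUN` switch, and the placeholder values (`cohort.sample_map`, `genomicsdb_workspace`, `/tmp`) are assumptions for illustration. The flag is only present in recent releases, so please confirm it appears in `gatk GenomicsDBImport --help` for your installed version.

```shell
#!/bin/sh
# Sketch: GenomicsDBImport with the shared-filesystem optimizations enabled.
# By default this only prints the command; set DRY_RUN=0 to execute it.

run_import() {
    # $1 = sample map, $2 = workspace path, $3 = tmp dir, $4 = interval
    if [ "${DRY_RUN:-1}" = "1" ]; then
        runner="echo"     # print the command instead of running it
    else
        runner=""         # run gatk for real
    fi
    $runner gatk --java-options "-Xmx8G" GenomicsDBImport \
        --sample-name-map "$1" \
        --genomicsdb-workspace-path "$2" \
        --tmp-dir "$3" \
        --genomicsdb-shared-posixfs-optimizations \
        -L "$4"
}

run_import cohort.sample_map "${GenW:-genomicsdb_workspace}" "${temp:-/tmp}" NODE_19999
```

Printing the command first makes it easy to check the arguments before submitting a long job to the cluster.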
-
Hi Genevieve, the newest version installed on the cluster is gatk/4.1.7.0. However, I used 4.0.8.1 because, in my experience, 4.1.7.0 does not work well with the next step, GenotypeGVCFs.
-
mmcui, the best version right now is 4.1.9.0, which is our current release. Please try that if possible.