GenotypeGVCFs takes too long
I am trying to run GenotypeGVCFs on human WGS data.
I use GATK 4.2.0.0 with the following command:
gatk --java-options "-Xmx20g -Xms20g" GenotypeGVCFs \
-R ${REF_Genome} \
-V gendb://${VCF_database_DIR} \
-O ${VCF_OUPUT_DIR}/gentaumix_raw2.vcf.gz \
--tmp-dir ${TMP_DIR} \
-D ${DBSNP} \
--sequence-dictionary ${Dict} \
-L ${Interval} \
-G StandardAnnotation -G AS_StandardAnnotation \
--only-output-calls-starting-in-intervals \
--merge-input-intervals
The GenomicsDB workspace contains only 6 GVCFs. GenomicsDBImport took 2 days and 13 hours to finish with this command:
gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KUP_HY322DSXX.DUAL312.g.vcf \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KV9_HYGGKDSXX.DUAL234.g.vcf \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KVP_HYH2CDSXX.DUAL251.g.vcf \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KVU_HYH2CDSXX.DUAL256.g.vcf \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KW0_HYH2CDSXX.DUAL262.g.vcf \
-V ${GVCF_INPUT_DIR}/G632_DA_C001KW6_HYGHLDSXX.DUAL269.g.vcf \
--genomicsdb-workspace-path ${VCF_database_DIR} \
--tmp-dir ${TMP_DIR} \
--batch-size 6 \
--reader-threads 5 \
-L ${Interval} \
--merge-input-intervals true
But after a few hours GenotypeGVCFs has still not finished and, according to the log file, is still on chr1.
Is there something wrong with the command? Is it normal for it to take this long?
-
We have created a usage guidelines article for GenomicsDB that also applies when using GenotypeGVCFs with a GenomicsDB workspace: https://gatk.broadinstitute.org/hc/en-us/articles/360056138571-GDBI-usage-and-performance-guidelines. Please take a look and let me know if you have more questions!
Best,
Genevieve
-
Hi Geneviève,
In the article you cite there is a point about the number of contigs. What if there is a large number of contigs but no data on most of them? To be precise, I did the alignment with the GRCh38 reference containing the decoys and the other unlocalized and unplaced contigs, but then I removed all the reads that aligned to these contigs. Nevertheless, these contigs remain present in the header. I also used intervals for HaplotypeCaller. So in the end, even though these contigs remain in the header of the GVCF, there is no data on them.
Best,
Quentin
-
Yes, more contigs can definitely slow down the job. If possible, I would recommend deleting those lines from the header.
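A minimal sketch of that header cleanup, not an official GATK procedure. It assumes the only contigs worth keeping are chr1 to chr22, chrX, chrY and chrM (adjust the pattern to your interval list), and that the GVCF body really contains no records on the dropped contigs. The function name filter_contig_header is hypothetical; it only rewrites header text read from standard input.

```shell
# Hypothetical helper: reads a VCF header on stdin and drops every
# ##contig line whose ID is not a primary GRCh38 contig.
# Assumption: primary contigs are named chr1..chr22, chrX, chrY, chrM.
filter_contig_header() {
  awk '
    /^##contig=<ID=/ {
      id = $0
      sub(/^##contig=<ID=/, "", id)   # strip the prefix
      sub(/[,>].*$/, "", id)          # strip everything after the ID
      if (id !~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$/) next
    }
    { print }                         # keep all other lines unchanged
  '
}

# Possible usage with bcftools (file names are placeholders):
#   bcftools view -h sample.g.vcf | filter_contig_header > new_header.txt
#   bcftools reheader -h new_header.txt -o sample.clean.g.vcf sample.g.vcf
```

After rewriting the header, the GVCFs would need to be re-indexed and imported into a fresh GenomicsDB workspace for the change to have any effect.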
Another argument I would like to point out is --genomicsdb-shared-posixfs-optimizations true, which helps if you are using a shared filesystem or a cluster.
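As an illustration, here is the GenotypeGVCFs command from the question with that argument added. This is a sketch only: the variables are the ones from the original post, the wrapper function name run_genotype_gvcfs is hypothetical, and whether the flag actually speeds things up depends on the filesystem in use.

```shell
# Hypothetical wrapper around the command from the question; the only
# change is the added --genomicsdb-shared-posixfs-optimizations flag.
run_genotype_gvcfs() {
  gatk --java-options "-Xmx20g -Xms20g" GenotypeGVCFs \
    -R "${REF_Genome}" \
    -V "gendb://${VCF_database_DIR}" \
    -O "${VCF_OUPUT_DIR}/gentaumix_raw2.vcf.gz" \
    --tmp-dir "${TMP_DIR}" \
    -D "${DBSNP}" \
    --sequence-dictionary "${Dict}" \
    -L "${Interval}" \
    -G StandardAnnotation -G AS_StandardAnnotation \
    --only-output-calls-starting-in-intervals \
    --merge-input-intervals \
    --genomicsdb-shared-posixfs-optimizations true
}
```

The same argument can also be passed to GenomicsDBImport when the workspace is created on a shared filesystem.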