CombineGVCFs or GenomicDBImport for consolidating haploid bacterial GVCFs
Hi all, I am new to bioinformatics and genomics analyses, forgive me if this question is a bit basic.
I was given the task of running short-read WGS data of E. coli sample reads through a variant-calling pipeline based closley on GATK Best Practices. I have reached the variant-calling step itself, namely consolidating gVCFs, however there was a difference in some of the codes I was provided for this step. An older code uses the CombineGVCFs tool, whereas the newer code uses GenomicsDBImport, both for consolidating gVCFs generated by Haplotype Caller, using GATK 4.2.5.0.
My supervisor mentioned that in their previous use of this code it was recommended by this forum that for haploid data CombineGVCFs is the best option, whereas from reading through the forum and several tutorials I have found the standard to be GenomicsDBImport. From what I can understand this may have been a feature of GATK 3, as from looking ahead in the codes provided, it calls VariantAnnotator for variant annotation using the syntax;
java <memory> -jar <path_to_gatk>/GenomeAnalysisTK.jar \
-T VariantAnnotator...
.. Which I read was an older syntax.
It was also suggested that the version of GATK I am using is more than capable of working with non-diploid sample data, which was why I used HaplotypeCaller with the "-ploidy 1" flag.
From reading the article below, it was suggested that CombineGVCFs would only be used in the case where GenomicsDBImport can not, and that it is not as suited to larger datasets as GenomicsDBImport also.
(https://gatk.broadinstitute.org/hc/en-us/articles/360035889971--How-to-Consolidate-GVCFs-for-joint-calling-with-GenotypeGVCFs )
For context, my data is split into 2 datasets. One containing 727 samples, the other 352, of paired-end sequenced data with individual directories containing gvcfs as generated by HaplotypeCaller.
HaplotypeCaller Script:
java -Xmx40g -jar $sw_GATKpath HaplotypeCaller \
-R $reference_genome \
-I $bam_files/$filename \
-ploidy 1\
--emit-ref-confidence GVCF\
-O analysis/5_VariantCalling/5.1_HaplotypeCaller/$filename.g.vcf \
--tmp-dir $TMPDIR > $TMPDIR/$filename.HaplotypeCaller.txt 2>&1
mv $TMPDIR/$filename.HaplotypeCaller.txt analysis/5_VariantCalling/5.1_HaplotypeCaller/HC_logs
Description of variables;
- $sw_GATK path to my GATK 4.2.5.0 jar file
- $reference_genome path to the ".fna" reference file of E. Coli MG1655
- $bam_files/$filename path to bam file-containing directory, with $filename being individual files passed to this script by another script I use for parallelising the process.
If there is any further information required for this query please let me know and thanks in advance.
-
Hi Conor Sexton
Welcome to the Wonderland! We wish you a successful journey for your endeavors.
Both CombineGVCFs and GenomicsDBImport are capable of different ploidies so for ploidy 1 both are perfectly usable for your case.
We are currently on version 4.4.0.0 and GATK version 3.x is declared deprecated and not supported anymore. If you wish to use GATK4 our recommendation would be to stay with versions 4.2 and above but beware that version 4.4 requires java version 17 to run whereas older versions work with java version 8.
In order to run GATK versions 4 and above you don't need to call the jar files directly but you can run the toolset from the supplied python script named "gatk" and you may run any tool by just typing gatk ToolName ...
Once you collect your GVCF files using HaplotypeCaller with -ERC GVCF and -ploidy 1 parameters you can import them to a GenomicsDB file using GenomicsDBImport tool. Since you have large number of samples it is recommended to use GenomicsDBImport tool due to the fact that you can limit the number of concurrently imported samples to avoid too many open files errors to be thrown from your compute environment. It also provides a proper memory management by reducing the amount of memory needed to import all files. Once imported GenotypeGVCFs tool can be used to generated multisample VCF files from the GenomicsDB file.
Below are the documents that you may wish to consult if you have any doubts of how to use parameters.
Feel free to send us any questions about GATK and tools.
Regards.
-
Hi Gökalp Çelik ,
Thank you very much for your in depth reply and further advice. I look forward to learning more.
Regards,
Conor.
Please sign in to leave a comment.
2 comments