I am trying to call variants from 12 RNA-seq samples in an allohexaploid species without a reference genome. I've followed the RNA-seq best practices pipeline. Briefly, I aligned reads to a de novo transcriptome assembly of 100k+ transcripts. Since I do not have a genome and I only have 12 samples, I thought that CombineGVCFs would be a better way to combine the per-sample GVCFs than GenomicsDB. I am working on a computing cluster that limits the number of directories a user can create, so I figured that would rule out GenomicsDB as a solution. The problem is that CombineGVCFs takes a very long time to run: it has been running for 4+ days and never seems to finish. In fact, it seems like it never gets past initializing the engine. The standard output reads "Using codec VCFCodec to read file" for all 12 files, but does not move past that. I've run this same pipeline on 96 WGS samples for a diploid species with an equally messy genome, and CombineGVCFs completed in less than 24 hours.
I know CombineGVCFs is inefficient, but I am wondering if there are any workarounds to speed CombineGVCFs up, or to use GenomicsDB without creating a directory for each sequence in the reference transcriptome. There was a similar post a few years back, and a user documented their strategy (https://github.com/paulmaier/GATK-Joint-Genotyping-Pipeline). I've tried the approach described there as well, but the run times are still very long.
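For reference, one generic way to break the job up is to split the transcript names into a few interval files, so that each batch can be run as its own CombineGVCFs job with -L. The sketch below only illustrates that batching step. The function name split_contigs, the batch count, and the batch_N.list file names are hypothetical, it assumes a samtools-style .fai index exists for the transcriptome FASTA, and whether batching actually gets past the slow initialization in this case is untested.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): distribute the contig names from a
# .fai index round-robin across N files named batch_0.list ... batch_{N-1}.list.
# Column 1 of a .fai file is the sequence name.
split_contigs() {
    local fai="$1" n_batches="$2" out_dir="$3"
    mkdir -p "$out_dir"
    awk -v n="$n_batches" -v dir="$out_dir" \
        '{ print $1 > (dir "/batch_" ((NR - 1) % n) ".list") }' "$fai"
}

# Example usage (paths are placeholders):
#   split_contigs "${fa_file}.fai" 20 batches
# Each batch file could then be given to a separate job, for example:
#   $GATK_HOME CombineGVCFs -R $fa_file -L batches/batch_0.list -V ... -O ...
```

Round-robin assignment keeps the number of transcripts per batch roughly even, although it does not balance batches by total sequence length.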
Can you please provide
a) GATK version used - 188.8.131.52
b) Exact GATK commands used
$GATK_HOME CombineGVCFs \
-R $fa_file \
-V 11664_5751_118393_HHVHYBGXF_WTSetar_drought_E05_19TY0004_1_TGGCGA_R1_.all.vcf.gz \
-V 11664_5751_118399_HHVHYBGXF_WTSetar_drought_C06_19TY0174_1_AAGACA_R1_.all.vcf.gz \
-V 11664_5751_118394_HHVHYBGXF_WTSetar_drought_F05_19TY0004_2_ACCGTG_R1_.all.vcf.gz \
-V 11664_5751_118400_HHVHYBGXF_WTSetar_drought_D06_19TY0174_2_ACAGAT_R1_.all.vcf.gz \
-V 11664_5751_118395_HHVHYBGXF_WTSetar_drought_G05_19TY0126_1_CAACAG_R1_.all.vcf.gz \
-V 11664_5751_118401_HHVHYBGXF_WTSetar_drought_E06_19TY0164_1_TAGGCT_R1_.all.vcf.gz \
-V 11664_5751_118396_HHVHYBGXF_WTSetar_drought_H05_19TY0126_2_GATTGT_R1_.all.vcf.gz \
-V 11664_5751_118402_HHVHYBGXF_WTSetar_drought_F06_19TY0164_2_CTCCAT_R1_.all.vcf.gz \
-V 11664_5751_118397_HHVHYBGXF_WTSetar_drought_A06_19TY0203_1_CTCTCG_R1_.all.vcf.gz \
-V 11664_5751_118403_HHVHYBGXF_WTSetar_drought_G06_19TY0181_1_GCATGG_R1_.all.vcf.gz \
-V 11664_5751_118398_HHVHYBGXF_WTSetar_drought_B06_19TY0203_2_TGACAC_R1_.all.vcf.gz \
-V 11664_5751_118404_HHVHYBGXF_WTSetar_drought_H06_19TY0181_2_AATAGC_R1_.all.vcf.gz \
c) The entire error log if applicable.