Missing samples from the output vcf file produced with GenotypeGVCFs
Hi,
I am using GenotypeGVCFs to perform joint genotyping from samples that I previously analyzed using HaplotypeCaller with the option -ERC GVCF.
I created a genomics database using GenomicsDBImport and a list of 113 whole genomes of S. cerevisiae (16 chromosomes of 0.23 – 1.53 Mb) using the following code:
$GATK gatk --java-options "-Xmx32g" GenomicsDBImport --genomicsdb-workspace-path $RESULTS --batch-size 50 --intervals $ROOT_DIR/$INTERVALS -sample-name-map $ROOT_DIR/$GENOME_MAP --validate-sample-name-map --tmp-dir $TMP
$GATK gatk --java-options "-Xmx32g" GenotypeGVCFs -R $REF/$REF_FILE -V gendb://$RESULTS_ROOT -O $RESULTS_ROOT/$OUT_VCF
where $INTERVALS is a bed file indicating 16 intervals corresponding to the 16 chromosomes of S. cerevisiae ( I did not identified intervals within chromosomes).
Yet, without any error messages, I am consistently missing samples from the output vcf file produced with GenotypeGVCFs
I know the missing samples are in the GenomicsDB because if I try to add them using GenomicsDBImport --genomicsdb-update-workspace-path I get an error message saying the samples already exist. Yet they are not present in the output vcf file.From the dataset of 113 genomes the vcf file misses the last 13 samples.
I repeated the analysis with a smaller dataset of 62 samples and got the same problem, this time the vcf was missing the last 12 samples. Please note that the 12 samples missing here were included in the larger list and were processed with no problems.
I also repeated the analysis using the list of 113 genomes and adding the parameter "--consolidate" but got the same result with missing 13 samples.
$GATK gatk --java-options "-Xmx32g" GenomicsDBImport --genomicsdb-workspace-path $RESULTS --batch-size 50 --intervals $ROOT_DIR/$INTERVALS -sample-name-map $ROOT_DIR/$GENOME_MAP --validate-sample-name-map --reader-threads 4 --tmp-dir $TMP
The Genome Analysis Toolkit (GATK) v4.1.8.1
HTSJDK Version: 2.23.0
Picard Version: 2.22.8
Using GATK jar /gatk/gatk-package-4.1.8.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.8.1-local.jar GenotypeGVCFs -version
$GATK gatk --java-options "-Xmx32g" GenotypeGVCFs -R $REF/$REF_FILE -V gendb://$RESULTS_ROOT -O $RESULTS_ROOT/$OUT_VCF
I am concluding that somehow it seems that there is a limit of the number of samples included in the vcf file which does not make any sense as this code of was made to deal much larger datasets.
I would appreciate any help and I would be glad to send more information if needed.
Thanks
-
That an interesting problem. Can you run SelectVariants to convert the genomicsdb to a vcf and see if all the samples are present in the vcf. You could do this on a subset of variants. more info on how to use the tool is here: https://gatk.broadinstitute.org/hc/en-us/articles/360051305531-SelectVariants
-
Hi,
I run SelectVariants on my database of 113 samples and the output is still a vcf file with 100 samples.
I am checking the number of samples using bcftools query -l filename.vcf | wc -l -
- Are you on a shared NFS?
- How much physical memory do you have on your machine? I suspect that the tool is running out of memory which is why it is behaving this way. Can you also share the specs of the machine you are using?
- Take a look at this thread and try to implement the steps mentioned here to fix genomicsdb memory issues: https://gatk.broadinstitute.org/hc/en-us/community/posts/360074224671-GenomicsDBImport-running-out-of-memory-
- Can you try to recreate this issue with a smaller number of samples? Maybe try with 10 samples and let me know if this issue persists.
-
Hi Bhanu Gandham,
After fighting with my pipeline I realized that there was an error in the sampleMap for some of the samples. I also added “--consolidate --reader-threads 4” to the GenomicsDBImport command line. Everything together thankfully fixed the problem. My pipeline is working fine now. Sorry for the time you spent with this. Anyway, GATK helpdesk is great, many good ideas for troubleshooting here!
-
Good to know you were able to solve the issue and thanks for posting the solution!
Please sign in to leave a comment.
5 comments