GenotypeGVCFs: too many genotypes
I am using gatk-4.2.0.0 with pooled ddRADseq data.
I have performed the pipeline as follows:
./gatk HaplotypeCaller \
-I /home/bams/4_12.bam \
-R /home/gatk/genomic_refseq.fna \
-O /home/gatk/gvcf_by_sample/4_12_WG.g.vcf \
-ERC GVCF \
-ploidy 60
Then:
./gatk GenomicsDBImport \
--genomicsdb-workspace-path /home/gatk/gvcf_by_sample/genomic_work_space/ \
-L /home/gatk/gvcf_by_sample/intervals.list \
--sample-name-map /home/gatk/gvcf_by_sample/gvcf.sample_map \
--tmp-dir /home/gatk/gvcf_by_sample/tmp
Then GenotypeGVCFs:
./gatk GenotypeGVCFs \
-R /home/gatk/GCF_003254395.2_Amel_HAv3.1_genomic_refseq.fna \
-V gendb:///home/gatk/gvcf_by_sample/genomic_work_space/ \
--sample-ploidy 60 \
-O /home/gatk/pooled_colony.vcf.gz
I get this error many times:
Sample/Callset 29_10( TileDB row idx 1) at Chromosome NC_037638.1 position 21745 (TileDB column 21744) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be addedfor this sample for this location.
Sample/Callset 29_11( TileDB row idx 2) at Chromosome NC_037638.1 position 21745 (TileDB column 21744) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be addedfor this sample for this location.
Sample/Callset 29_12( TileDB row idx 3) at Chromosome NC_037638.1 position 21745 (TileDB column 21744) has too many genotypes in the combined VCF record : 1891 : current limit : 1024 (num_alleles, ploidy) = (3, 60). Fields, such as PL, with length equal to the number of genotypes will NOT be addedfor this sample for this location.
My resulting VCF is empty.
I have seen in another post that there is no good workaround. But could someone explain to me why there are so many genotypes? Is this simply every single way you can produce those 3 alleles? Some sites have over 635,000 genotypes. How is that possible?
What I ultimately need is allele frequencies for downstream analysis. Is there a way to build a VCF file at all? Can I build the file and discard genotypes that are unreliable?
Many thanks
-
For pooled samples our recommendation is to use Mutect2. Can you try to use Mutect2 instead of HaplotypeCaller for your variant calling and let me know if that works better?
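On why there are so many genotypes: the counts in the log are consistent with the standard formula for the number of unordered genotypes, C(ploidy + alleles - 1, alleles - 1). So yes, it is every distinct way of distributing the alleles over the 60 chromosome copies. Below is a minimal Python sketch of that arithmetic. The function name genotype_count is only illustrative and is not part of GATK. It reproduces the 1891 from the error message, and 5 alleles at ploidy 60 would account for the "over 635,000" figure.

```python
from math import comb  # available in Python 3.8+


def genotype_count(num_alleles: int, ploidy: int) -> int:
    """Number of unordered genotypes for a given allele count and ploidy.

    This is the number of multisets of size `ploidy` drawn from
    `num_alleles` alleles: C(ploidy + num_alleles - 1, num_alleles - 1).
    """
    return comb(ploidy + num_alleles - 1, num_alleles - 1)


# (num_alleles, ploidy) = (3, 60), as reported in the error message
print(genotype_count(3, 60))  # 1891

# A biallelic site at ploidy 60 stays under the limit of 1024
print(genotype_count(2, 60))  # 61

# Five alleles at ploidy 60
print(genotype_count(5, 60))  # 635376
```

Because fields such as PL carry one value per genotype, at ploidy 60 any site with 3 or more alleles already exceeds the limit of 1024 quoted in the message, while biallelic sites (61 genotypes) do not.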