I'm fairly new with gatk and bioinformatics in general so apologies in advance if this doesn't make any sense. I'll provide whatever additional information I can! The server I work on currently runs version 126.96.36.199.
Some "quick" backstory... For a previous project, I called variants using Freebayes in a dataset of about 400 individuals and from this VCF, I have a set of ~3300 positions I used for downstream analysis.
A collaborator has given me an additional 24 individuals that I'd like to eventually add to the larger set and run the same analyses with. I made the transition over to gatk and first used HaplotypeCaller to generate a GVCF for each individual:
gatk HaplotypeCaller --reference reference.fa \
--intervals targetSNPs.bed \
--input sample.bam \
--output sample.g.vcf \
This seemed to work fine for each of the 24 samples (I received no errors) so I moved on to CombineGVCFs as suggested:
gatk CombineGVCFs --reference reference.fa \
--variant sample1.g.vcf \
--variant sample2.g.vcf \
--variant ... etc ... \
This also worked just fine with the exception that the output file had 3307 positions rather than the expected 3300. I was able to see the 7 random positions that were not in my original target bed file (visible missing data) and figured I would just extract them on the next step using GenotypeGVCF:
gatk GenotypeGVCFs --reference reference.fa \
--variant combinedGVCFs.g.vcf \
--output genotypedVCF.vcf \
Again, I received no errors when running and the report even says "Traversal complete. Processed 3300 total variants". However, my output VCF file only has calls for 1028 positions.
Is there something I might be misinterpreting or missing in my commands? Thank you so much!
Please sign in to leave a comment.