HaplotypeCaller produces VCFs from 1KG data that have low SNP overlap
REQUIRED for all errors and issues:
a) GATK version used:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/nils/gatk/build/libs/gatk-package-4.3.0.0-30-g9f77b1f-SNAPSHOT-local.jar --version
The Genome Analysis Toolkit (GATK) v4.3.0.0-30-g9f77b1f-SNAPSHOT
HTSJDK Version: 3.0.1
Picard Version: 2.27.5
b) Exact command used:
HG02026
python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller --native-pair-hmm-threads 24 -R GRCh38_full_analysis_set_plus_decoy_hla.fa -I HG02026.final.bam -L chr1 -O HG02026.final.vcf -ERC GVCF
HG02025
python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller --native-pair-hmm-threads 24 -R GRCh38_full_analysis_set_plus_decoy_hla.fa -I HG02025.final.bam -L chr1 -O HG02025.final.vcf -ERC GVCF
c) Entire program log:
HG02026
https://pastebin.com/5Zz4nKSn
HG02025 (Unfortunately, I can't post the full GATK calling output for HG02025 because the RStudio console output was truncated. To the best of my knowledge, it can't be retrieved without running the command again.)
I am attempting to identify variant calls for a trio from the 1000 Genomes Project. So far, I have called variants for both the father and the mother. However, their output GVCF files share only a limited number of SNP positions: I found only around 9.5 million overlapping SNPs out of a total of 33 million. I could not determine any reason why the two VCF files should not share exactly the same positions. Is it possible that my usage of HaplotypeCaller is incorrect? Are there any additional parameters I should consider using?
Best regards,
Nils
-
Hello,
I don't think we have enough information to understand your problem. Are you checking the overlap in SNPs between the two unrelated parents? I'm not sure what the expected overlap there should be. Or are you comparing a parent to their child, where you would expect more overlap?
Could you also explain how you're comparing the files? The g.vcf output will include a large number of reference blocks, which you would not expect to have much overlap. If you want to compare SNPs, you really need to run genotyping on the GVCFs to get a final VCF.
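To illustrate the point about reference blocks, here is a minimal Python sketch. It assumes the comparison was done on record positions, which may not be what you actually did. The function names and the toy records are made up for illustration. The only GVCF convention it relies on is that a reference block has `<NON_REF>` as its sole ALT allele. It is not a substitute for genotyping the GVCFs.

```python
# Toy illustration: comparing raw GVCF record positions vs. real SNP sites.
# All records below are invented examples, not data from HG02025/HG02026.

def all_positions(vcf_lines):
    """Return (chrom, pos) for every non-header record, reference blocks included."""
    positions = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        positions.add((fields[0], int(fields[1])))
    return positions


def snp_positions(vcf_lines):
    """Return (chrom, pos) only for records that carry a real single-base ALT allele."""
    positions = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Ignore the symbolic <NON_REF> allele that GVCF mode appends.
        real_alts = [a for a in alt.split(",") if a not in ("<NON_REF>", ".")]
        if not real_alts:
            continue  # pure reference block, not a variant call
        if len(ref) == 1 and any(len(a) == 1 and a != "*" for a in real_alts):
            positions.add((chrom, pos))
    return positions


father = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=149\tGT\t0/0",
    "chr1\t150\t.\tG\tA,<NON_REF>\t50\t.\t.\tGT\t0/1",
    "chr1\t151\t.\tT\t<NON_REF>\t.\t.\tEND=300\tGT\t0/0",
    "chr1\t301\t.\tC\tT,<NON_REF>\t50\t.\t.\tGT\t0/1",
]
mother = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=120\tGT\t0/0",
    "chr1\t121\t.\tC\tG,<NON_REF>\t50\t.\t.\tGT\t0/1",
    "chr1\t122\t.\tT\t<NON_REF>\t.\t.\tEND=149\tGT\t0/0",
    "chr1\t150\t.\tG\tA,<NON_REF>\t50\t.\t.\tGT\t0/1",
    "chr1\t151\t.\tT\t<NON_REF>\t.\t.\tEND=400\tGT\t0/0",
]

naive_shared = all_positions(father) & all_positions(mother)
snp_shared = snp_positions(father) & snp_positions(mother)

print("shared record positions:", len(naive_shared))
print("shared SNP positions:   ", len(snp_shared))
```

In this toy case the start positions of the reference blocks depend on each sample's coverage and call confidence, so they line up only by chance. Position 301 is a SNP in the father but sits inside a hom-ref block in the mother, which is exactly the kind of site two unrelated parents are not expected to share.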