Differences between -ERC GVCF and -ERC BP_RESOLUTION in chr7:142458451 region from HaplotypeCaller GVCF.
REQUIRED for all errors and issues:
a) GATK version used:
4.1.9.0
b) Exact command used:
b-1) With -ERC GVCF:
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --sample-ploidy 2 --gvcf-gq-bands 10 --gvcf-gq-bands 20 --gvcf-gq-bands 30 --gvcf-gq-bands 40 --gvcf-gq-bands 60 --gvcf-gq-bands 80 --native-pair-hmm-threads 1 --emit-ref-confidence GVCF --output /storm/Analysis/Genome_Health/24GHP021/GHP/work/bcbiotx/tmpcxuqxrha/24GHP021-121_20220120-171-0014-joint-chr1_0_18097745.vcf.gz --intervals /storm/Analysis/Genome_Health/24GHP021/GHP/work/gatk-haplotype/chr1/24GHP021-121_20220120-171-0014-joint-chr1_0_18097745-regions.bed --interval-set-rule INTERSECTION --input /storm/Analysis/Genome_Health/24GHP021/GHP/work/align/24GHP021-121_20220120-171-0014/24GHP021-121_20220120-171-0014-sort-recal.bam --reference /storm/Tools/bcbio/bcbio_v1.2.4/genomes/Hsapiens/hg19_nocontig/seq/hg19_nocontig.fa --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --num-reference-samples-if-no-call 0 --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-reads-per-alignment-start 50 --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.9.0",Date="February 8, 2024 5:03:19 AM KST">
b-2) With -ERC BP_RESOLUTION:
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --sample-ploidy 2 --gvcf-gq-bands 10 --gvcf-gq-bands 20 --gvcf-gq-bands 30 --gvcf-gq-bands 40 --gvcf-gq-bands 60 --gvcf-gq-bands 80 --native-pair-hmm-threads 1 --emit-ref-confidence BP_RESOLUTION --output /storm/Analysis/Genome_Health/24GHP021/vcf/24GHP021-121_20220120-171-0014-joint.vcf.gz --intervals /storm/Analysis/Genome_Health/Script/bed/GHP.v2_target_300bp_4delSNV.bed --interval-set-rule INTERSECTION --input /storm/Analysis/Genome_Health/24GHP021/GHP/work/align/24GHP021-121_20220120-171-0014/24GHP021-121_20220120-171-0014-sort.bam --reference /storm/Tools/bcbio/bcbio_v1.2.4/genomes/Hsapiens/hg19_nocontig/seq/hg19_nocontig.fa --tmp-dir /storm/Analysis/Genome_Health/24GHP021/GHP/work/bcbiotx/tmp --annotation MappingQualityRankSumTest --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest --annotation RMSMappingQuality --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation MappingQuality --annotation DepthPerAlleleBySample --annotation Coverage --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-alternate-alleles 6 --max-genotype-count 1024 --num-reference-samples-if-no-call 0 --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --allele-informative-reads-overlap-margin 2 --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-reads-per-alignment-start 50 --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.9.0",Date="February 15, 2024 3:21:49 PM KST">
c) Entire program log:
I created a GVCF using HaplotypeCaller because I wanted to check the variant information in the region chr7:142458451. Here's what I found:
1) With -ERC GVCF:
In this case, the information for the region chr7:142458451 does not appear:
```
chr7 142458151 . C <NON_REF> . . END=142458403 GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0
chr7 142458699 . C <NON_REF> . . END=142458751 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
```
2) With -ERC BP_RESOLUTION:
In this case, the information for the region chr7:142458451 is present:
```
chr7 142458450 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,84
chr7 142458451 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,84
chr7 142458452 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:2,0:2:6:0,6,84
```
When using GenotypeGVCFs with the --include-non-variant-sites option:
1) The region chr7:142458451 information is missing in the output from case 1).
2) The region chr7:142458451 information is present only in the output from case 2), as shown below:
```
chr7 142458451 . A . . . DP=2 GT:AD:DP:RGQ 0/0:2:2:6
```
The difference in results between -ERC GVCF and -ERC BP_RESOLUTION in the GVCF creation is typically attributed to the file size and the granularity of representation in base units. However, in this case, the presence or absence of information for the region chr7:142458451 in the GVCF file results in different outputs in the VCF file. Could you please explain why this is happening?
-
Hi 차주영
Using BP_RESOLUTION will produce output per base as the parameter implies and regardless of any variant site found in the loci --include-non-variant-sites will produce every single base as a entry in the VCF file.
However in GVCF mode that particular site is kept inside a long stretch of reference block therefore genotyping that long reference block will not return any single nucleotide as a HOMREF and you will not be able to get that entry in the VCF.
Regardless, looking at the counts at that position, it is almost a HOMREF or can even be a NO_CALL if there is not enough evidence.
I hope this helps.
-
Thank you for your response. However, what I'm curious about is slightly different from your explanation.
What I'm wondering is, when analyzing the region chr7:142458451 without any variants:
- In the gVCF generated with -ERC BP_RESOLUTION, the information for the region chr7:142458451 is present, and in the resulting VCF file from genotypeGVCF --include-non-variant-sites, there is a non-variant result as HOMREF. On the other hand,
- In the gVCF generated with -ERC GVCF, the information for the region chr7:142458451 is not even present within the reference block, so in the resulting VCF file from genotypeGVCF --include-non-variant-sites, there is no result for that region at all.
Whether it's -ERC BP_RESOLUTION or -ERC GVCF, if genotypeGVCF --include-non-variant-sites is applied, there should be a non-variant result as HOMEREF in the VCF. However, I'm confused why the results differ between the two approaches. Could you please provide more details on this?
The reason stated for recommending -ERC GVCF is solely due to server resources and file size. However, if the analysis results differ, would it be correct to assume that the accuracy of the analysis varies? Additionally, could you please provide guidance on determining which option yields higher accuracy? -
Hi again. Looks like you might have hit a bug present in earlier versions of the tool. Looking at the output of 4.5 here -ERC GVCF also produces HOM_REF sites for each nucleotide position. In this case we recommend you to upgrade your workflow to the latest version.
For the other question you have BP_RESOLUTION would be the most accurate since it keeps depth information individually for each site.
Please sign in to leave a comment.
3 comments