Unphased homozyous sites between phased heterozygous sites
Hello,
I am seeing unphased sites sandwiched between phased sites. I am pasting below 4 sites from my vcf file for 1 sample. The phase set 1:403618_A_AT:66,0,2542:403618 extends up to position 403622. Taking the variants alone, the site seem to contain one insertion and one deletion both of which cause frameshifts. Obviously this is not the case when haplotypes are taken into account. They are just consecutive SNPs. I am trying to re-build the haplotype from the phasing data but the sites between the two positions are unphased homozygous while they must be within the same phase set, hence phased.
I am not following the best practices because my data set is not standard. I generated this vcf with BAM -> HaplotypeCaller -> GenomicsDBImport -> GenotypeGvcfs
GATK version used 4.1.4.0
Thank you
Oz
vcf:
---
chr7 403618 . A AT 535.54 . AC=9;AF=0.088;AN=100;BaseQRankSum=0;DP=19268;ExcessHet=4.6847;FS=1.107;InbreedingCoeff=-0.053;MLEAC=10;MLEAF=0.098;MQ=60;MQRankSum=0;QD=1.53;ReadPosRankSum=-0.486;SOR=0.58 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:34,4:38:66:0|1:403618_A_AT:66,0,2542:403618
chr7 403620 . G T 484.77 . AC=0;AF=0.009804;AN=100;BaseQRankSum=0;DP=19268;ExcessHet=3.0103;FS=0;InbreedingCoeff=0.2242;MLEAC=1;MLEAF=0.009804;MQ=60;MQRankSum=0;QD=4.04;ReadPosRankSum=7.14;SOR=0.042 GT:AD:DP:GQ:PL 0/0:38,0:38:99:0,114,1646
chr7 403621 . A G 509.77 . AC=0;AF=0.009804;AN=100;BaseQRankSum=0;DP=19268;ExcessHet=3.0103;FS=0;InbreedingCoeff=0.2242;MLEAC=1;MLEAF=0.009804;MQ=60;MQRankSum=0;QD=4.25;ReadPosRankSum=7.14;SOR=0.042 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:38,0:38:99:.:.:0,114,1646:.
chr7 403622 . AT A 383.89 . AC=9;AF=0.088;AN=100;BaseQRankSum=-0.541;DP=19268;ExcessHet=4.6847;FS=0;InbreedingCoeff=-0.053;MLEAC=8;MLEAF=0.078;MQ=60;MQRankSum=0;QD=1.1;ReadPosRankSum=-0.486;SOR=0.613 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:34,4:38:66:0|1:403618_A_AT:66,0,2542:403618
-
Hi Oz,
I'll need to refer to the dev team for this post, mind explicitly stating what the question or problem is. Also, please post the commands used for HaplotypeCaller, GenomicsDBImport, and GenotypeGvcfs.
-
Hi Beri,
Thanks for the help. The problem is two fold. First, the variant sites that are called are wrong. There is no insertion or deletion, but 4 of 5 consecutive nucleotides are single nucleotide variations. This is likely due to the cost of the indel being lower compared to cost of the SNPs in the alignment step. I believe this type of calls were handled by realigning the reads around indels in the past but I think this is not part of the current variant calling pipelines. This first problem is not very important because it can be corrected by rebuilding the haplotype from phased variants which brings me to the second problem. The 4 sites I posted must be phased because it should not be possible to have unphased sites sandwiched between phased sites. So what the genotypes at positions 403620 and 403621 must be 0|0, but they are 0/0. I guess my question is that why are these genotypes unphased and is there a way to fix it. I am providing the full commands from the vcf header below.
---
##GATKCommandLine=<ID=GenomicsDBImport,CommandLine="GenomicsDBImport --genomicsdb-workspace-path /opt/analysis/my_database.gdb --sample-name-map /opt/analysis/sample_map.txt --max-num-intervals-to-import-in-parallel 20 --intervals /opt/analysis/intervals.list --genomicsdb-segment-size 1048576 --genomicsdb-vcf-buffer-size 16384 --overwrite-existing-genomicsdb-workspace false--batch-size 0 --consolidate false --validate-sample-name-map false --merge-input-intervals false --reader-threads 1 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true--create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.1.4.0",Date="January10, 2020 at 8:51:15 PM GMT">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs --output /opt/analysis/temp.vcf.gz --heterozygosity 0.2 --standard-min-confidence-threshold-for-calling 0.0 --variant gendb://my_database.gdb --intervals /opt/analysis/intervals.list --reference /opt/species_resources/genomes/genome.fa --include-non-variant-sites false --merge-input-intervals false --input-is-somatic false --tumor-lod-to-emit 3.5 --allele-fraction-error 0.001 --keep-combined-raw-annotations false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --only-output-calls-starting-in-intervals false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.4.0",Date="January 10, 2020 at 8:51:29 PM GMT">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --heterozygosity 0.2 --max-alternate-alleles 10 --native-pair-hmm-threads 1 --emit-ref-confidence GVCF --output /opt/analysis/padded_bams/D10-JJJ-1.srt.g.vcf.gz --max-reads-per-alignment-start 0 --intervals /opt/analysis/intervals.list --input /opt/analysis/padded_bams/D10-JJJ-1.srt.bam --reference /opt/species_resources/genomes/genome.fa --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-genotype-count 1024 --sample-ploidy 2 --num-reference-samples-if-no-call 0 --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --use-filtered-reads-for-annotations false --correct-overlapping-quality false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --dont-trim-active-regions false --max-extension 25 --padding-around-indels 150 --padding-around-snps 20 --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --max-num-haplotypes-in-population 128 --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --max-unpruned-variants 100 --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --pair-hmm-gap-continuation-penalty 10 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --native-pair-hmm-use-double-precision false --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --min-base-quality-score 10 --smith-waterman JAVA --max-mnp-distance 0 --force-call-filtered-alleles false --min-assembly-region-size 50 --max-assembly-region-size 300 --assembly-region-padding 100 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.4.0",Date="January 10, 2020 at 8:50:14 PM GMT">
-
Hi Ozay,
The team says you should be able to use the output as it currently is; phasing the homozygous variant wouldn’t actually add any information.That said, they'll try to look into why the tool leaves them unphased (it may be easier, or unintentional). Perhaps it would better to output them as phased.
-
Hi Beri,
Thanks for the update.
I am aware that phasing the homozygous variants would not add any information. I can manually create haplotypes for the sites I included in my example. However, the idea is not to do this by hand for each variant and each sample as this would be time consuming, if at all possible. The downstream tools creating haplotypes from variant sites automatically would not be able to use the vcf as it is because of the intervening unphased sites.
Oz
-
Sorry for the inconvenience but we don't phase 0/0 homozygous reference. You might find it helpful to look at the phase set (PS) tag which is located at the end of the variant. For example the set below links the two phased sites listed above. More info about this here.
403618
Out of curiosity what downstream tools are using? The tool being used downstream should account for the homozygous sights.
-
This is not an inconvenience, it is a bug in GATK pipeline. It is OK if you or the team does not have time to deal with it or do not know the source, but it is not true that you "don't phase 0/0 homozygous sites". I have plenty of 0|0 genotypes in the very same file. In any case, thank you for your time looking into this.
-
@OzAy the GATK currently does a very spotty job of phasing because it assembles so many extraneous haplotypes (under the hood, that is -- I'm not talking about the emitted PGT/PID tags) that phasing between alleles gets lost. There's a PR currently in review that overhauls the assembly engine, greatly improving phasing. When this goes in, the ``--linked-de-bruijn-graph`` argument ought to help.
-
David,
Thanks for the information. Can you provide a link to the PR? I'd be interested to take a look and get updates on it.
Oz
Please sign in to leave a comment.
9 comments