GATK 4.1.7.0 does not annotate ID using dbSNP build 153 VCF
I am trying to annotate a multi-sample VCF with GATK 4.1.7.0 VariantAnnotator, using the dbSNP build 153 VCF file provided by NCBI, with GRCh38 as the reference.
The dbSNP VCF entry is shown below:
chr1 6469122 rs113541584 TTCCTCCTCCTCCTCCTCC T,TTCCTCC,TTCCTCCTCC,TTCCTCCTCCTCC,TTCCTCCTCCTCCTCC,TTCCTCCTCCTCCTCCTCCTCC,TTCCTCCTCCTCCTCCTCCTCCTCC,TTCCTCCTCCTCCTCCTCCTCCTCCTCC,TTCCTCCTCCTCCTCCTCCTCCTCCTCCTCC . . RS=113541584;dbSNPBuildID=132;SSR=0;GENEINFO=PLEKHG5:57449;VC=INDEL;GNO;FREQ=1000Genomes:.,.,.,.,.,0.09724,.,.,.,.|ALSPAC:.,.,.,.,0.008044,.,.,.,.,.|TWINSUK:.,.,.,.,0.00836,.,.,.,.,.;CLNVI=.,,.,.,,,,,.,.;CLNORIGIN=.,1,.,.,1,1,1,1,.,.;CLNSIG=.,3,.,.,2|2,2|3,0|3|0,0|3|0,.,.;CLNDISDB=.,MedGen:CN169374,.,.,MedGen:CN169374|MedGen:C1970211/Orphanet:206580/OMIM:611067/MedGen:C3809309/Orphanet:369867/OMIM:615376,MedGen:CN169374|MedGen:C0393541,MedGen:C0393541|MedGen:CN169374|MedGen:C1970211/Orphanet:206580/OMIM:611067/MedGen:C3809309/Orphanet:369867/OMIM:615376,MedGen:C0393541|MedGen:CN169374|MedGen:C1970211/Orphanet:206580/OMIM:611067/MedGen:C3809309/Orphanet:369867/OMIM:615376,.,.;CLNDN=.,not_specified,.,.,not_specified|Charcot-Marie-Tooth_disease\x2c_recessive_intermediate_c/Distal_spinal_muscular_atrophy\x2c_autosomal_recessive_4,not_specified|Distal_spinal_muscular_atrophy,Distal_spinal_muscular_atrophy|not_specified|Charcot-Marie-Tooth_disease\x2c_recessive_intermediate_c/Distal_spinal_muscular_atrophy\x2c_autosomal_recessive_4,Distal_spinal_muscular_atrophy|not_specified|Charcot-Marie-Tooth_disease\x2c_recessive_intermediate_c/Distal_spinal_muscular_atrophy\x2c_autosomal_recessive_4,.,.;CLNREVSTAT=.,single,.,.,single|single,mult|single,single|single|single,single|single|single,.,.;CLNACC=.,RCV000606042.1,.,.,RCV000175470.1|RCV000544028.1,RCV000175472.3|RCV000320159.1,RCV000373626.1|RCV000483074.1|RCV000534497.2,RCV000281434.1|RCV000605126.1|RCV000688917.1,.,.;CLNHGVS=NC_000001.11:g.6469123_6469148=,NC_000001.11:g.6469125_6469127CTC[2],NC_000001.11:g.6469125_6469127CTC[4],NC_000001.11:g.6469125_6469127CTC[5],NC_000001.11:g.6469125_6469127CTC[6],NC_000001.11:g.6469125_6469127CTC[7],NC_000001.11:g.6469125_6469127CTC[9],NC_000001.11:g.6469125_6469127CTC[10],NC_000001.11:g.6469125_6469127CTC[11],NC_000001.11:g.6469125_6469127CTC[12]
The multi-sample VCF entry looks like this:
chr1 6469122 . T TTCC . . ALLELE_A=0;ALLELE_B=1;ASSAY_TYPE=0;BEADSET_ID=2061;FRAC_A=0.0813008;FRAC_C=0.439024;FRAC_G=0.170732;FRAC_T=0.308943;GC_SCORE=0.590469;NORM_ID=5;N_AA=0;N_AB=0;N_BB=472;devR_AA=0.215838;devR_AB=0.332669;devR_BB=0.244361;devTHETA_AA=0.0223607;devTHETA_AB=0.0223607;devTHETA_BB=0.0244666;meanR_AA=2.2749;meanR_AB=2.94805;meanR_BB=2.05585;meanTHETA_AA=0.0179602;meanTHETA_AB=0.368183;meanTHETA_BB=0.718406 GT:BAF:GQ:IGC:LRR:NORMX:NORMY:R:THETA:X:Y 1/1:0.940894:1:0.203725:-0.0600903:0.740701:1.33245:2.07315:0.677006:10332:7530 1/1:0.955561:1:0.203725:-0.106186:0.69133:1.29231:1.98364:0.68728:9660:8623 1/1:0.885949:1:0.186081:-0.122022:0.808562:1.26757:2.07613:0.63852:10504:7114 1/1:0.935345:1:0.203725:-0.00709165:0.779058:1.38152:2.16058:0.673119:12184:11007 1/1:0.943678:1:0.203725:-0.045608:0.742991:1.34626:2.08925:0.678956:11283:10035 1/1:0.905459:1:0.202323:-0.021832:0.828566:1.36258:2.19114:0.652185:11567:11332 1/1:0.962606:1:0.203725:-0.163742:0.652345:1.2425:1.89484:0.692214:9565:8617 1/1:0.932604:1:0.203725:-0.133853:0.718355:1.26494:1.9833:0.671199:10669:7779 1/1:0.962261:1:0.203725:-0.126659:0.669929:1.27481:1.94474:0.691972:9650:7007 1/1:0.924675:1:0.203725:-0.161304:0.718569:1.24:1.95857:0.665645:10704:8200
The observed variation T -> TTCC (insTCC) is equivalent to one of the dbSNP alleles in the record above: TTCCTCCTCCTCCTCCTCC -> TTCCTCCTCCTCCTCCTCCTCC (insTCC). Nevertheless, the command below fails to retrieve rsID rs113541584 and annotate the multi-sample VCF with it.
The same thing happens wherever a complex variation is observed (AB -> A,ABC).
/home/wgs/Tools/gatk-4.1.7.0/gatk --java-options "-Xmx30G -Xms1G" VariantAnnotator --reference /home/wgs/Genomes/hg38/bwa/hg38.fa --dbsnp /home/wgs/Tools/Supplementary/dbsnp-153-hgvs.sorted.hg38.vcf.gz --variant ./barcode.raw.noID.vcf.gz --output ./barcode.raw.gatk_annotated_latest.vcf.gz
Evidently, VariantAnnotator does not take such variations into account. It would be good to see them handled correctly in a future release.
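The equivalence claimed above can be checked by hand. The sketch below is only an illustration, not GATK's matching logic: `minimal_representation` is a hypothetical helper that trims the bases shared by REF and ALT. It assumes that trimming alone is enough for this site; a full normalization would also left-align the indel against the reference FASTA, which is omitted here.

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by REF and ALT: first from the right, then from the left."""
    # Drop shared trailing bases, always keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; each base dropped shifts POS right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# REF of rs113541584 and the ALT allele that carries one extra TCC unit.
dbsnp_key = minimal_representation(
    6469122, "TTCCTCCTCCTCCTCCTCC", "TTCCTCCTCCTCCTCCTCCTCC"
)
# The allele pair observed in the multi-sample VCF at the same position.
sample_key = minimal_representation(6469122, "T", "TTCC")

print(dbsnp_key)
print(sample_key)
print("same variant:", dbsnp_key == sample_key)
```

After trimming, both records describe the same position, REF and ALT, so a comparison on the trimmed alleles would find the match that a comparison on the raw multiallelic REF/ALT strings misses.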
-
Hi danilovkiri
We are looking into fixing this. Will share a PR with you for it soon. Stay tuned!
-
Thank you Bhanu Gandham
In case it is helpful for other users, the quickest and easiest workaround is to use
bcftools norm --fasta-ref <path_to_fasta> -c wx --multiallelics - <input_vcf>
to split all multiallelic sites in the dbSNP VCF and normalize them. The resulting VCF can be used for annotation without any of the problems described above.
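To show why splitting helps, here is a rough sketch of what the workaround does to one dbSNP record. It is not bcftools' implementation: `split_multiallelic` is a hypothetical helper, only three of the ALT alleles of rs113541584 are used for brevity, and left-alignment against the reference is skipped on the assumption that trimming is sufficient at this site.

```python
def split_multiallelic(chrom, pos, rsid, ref, alts):
    """Yield one trimmed biallelic (chrom, pos, ref, alt, rsid) tuple per ALT allele."""
    for alt in alts.split(","):
        p, r, a = pos, ref, alt
        # Trim shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim shared leading bases and shift the position accordingly.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a = r[1:], a[1:]
            p += 1
        yield chrom, p, r, a, rsid


# Subset of the ALT alleles from the dbSNP record (assumption: shortened for brevity).
records = split_multiallelic(
    "chr1", 6469122, "rs113541584",
    "TTCCTCCTCCTCCTCCTCC",
    "T,TTCCTCC,TTCCTCCTCCTCCTCCTCCTCC",
)

# Index the biallelic records by (chrom, pos, ref, alt) so an exact lookup works.
index = {(c, p, r, a): rsid for c, p, r, a, rsid in records}

# The variant from the multi-sample VCF now matches one of the split records.
print(index.get(("chr1", 6469122, "T", "TTCC")))
```

Once each ALT allele has its own trimmed record, the sample variant T -> TTCC matches exactly, which is consistent with the annotation working on the normalized dbSNP file.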
-
Hi danilovkiri
Thank you for sharing the workaround! And here is the PR we promised: https://github.com/broadinstitute/gatk/pull/6626