We use the term spanning deletion or overlapping deletion to refer to a deletion that spans a position of interest.
The presence of a spanning deletion affects how we can represent genotypes at any site(s) that it spans for those samples that carry the deletion, whether in heterozygous or homozygous variant form. Page 8, item 5 of the VCF v4.3 specification reserves the * allele to reference overlapping deletions. This is not to be confused with the bracketed asterisk <*> used to denote symbolic alternate alleles.
Here we illustrate with four human samples. Bob and Lian each have a heterozygous A to T single nucleotide polymorphism at position 20, our position of interest. Kyra has a 9 bp deletion from position 15 to 23 on both homologous chromosomes that extends across position 20. Lian and Omar are each heterozygous for the same 9 bp deletion. Omar's and Bob's other allele is the reference A.
What are the genotypes for each individual at position 20? For Bob, the reference A and variant T alleles are clearly present for a genotype of A/T.
What about Lian? Lian has a variant T allele plus a 9 bp deletion overlapping position 20. Notating the deletion the way we notate single nucleotide deletions would be technically inaccurate. We need a placeholder notation to signify absent sequence that extends beyond the position of interest and that is listed for an earlier position, in our case position 14. The solution is to use a star or asterisk * at position 20 to refer to the spanning deletion. Using this convention, Lian's genotype is T/*.
At the sample level, Kyra and Omar would not have records for position 20. However, because we are comparing multiple samples, we indicate the spanning deletion at position 20 with *. Omar's genotype is A/* and Kyra's is */*.
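The reasoning above can be sketched in a few lines of Python. This is a toy model, not how any variant caller works: the haplotype encoding, the names `allele_at` and `genotype`, and the ordering of alleles within a genotype are all assumptions made for illustration. Only the sample names, the A to T SNP at position 20, and the deletion from position 15 to 23 come from the example.

```python
REF_BASE = "A"   # reference base at position 20 (from the example)
POSITION = 20    # position of interest

# Hypothetical encoding: each haplotype is either ("base", X), giving the
# base observed at position 20, or ("del", start, end) for a deletion that
# removes positions start..end inclusive.
DELETION = ("del", 15, 23)

samples = {
    "Bob":  [("base", "A"), ("base", "T")],
    "Kyra": [DELETION, DELETION],
    "Lian": [("base", "T"), DELETION],
    "Omar": [("base", "A"), DELETION],
}


def allele_at(haplotype, pos):
    """Return the allele a haplotype contributes at pos."""
    if haplotype[0] == "del":
        _, start, end = haplotype
        if start <= pos <= end:
            return "*"       # the deletion spans the position
        return REF_BASE      # assumption: reference outside the deletion
    return haplotype[1]


def genotype(haplotypes, pos):
    """Join both alleles, listing reference first and '*' last."""
    alleles = sorted(
        (allele_at(h, pos) for h in haplotypes),
        key=lambda a: (a == "*", a != REF_BASE),
    )
    return "/".join(alleles)


genotypes = {name: genotype(haps, POSITION) for name, haps in samples.items()}
for name, gt in genotypes.items():
    print(name, gt)
```

Under these assumptions the sketch reproduces the genotypes described in the text: A/T for Bob, T/* for Lian, A/* for Omar, and */* for Kyra.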
In the VCF, depending on the format used by tools, positions equivalent to our example position 20 may or may not be listed. If listed, such as in the first example VCF shown, the spanning deletion is noted with the asterisk * under the ALT column. The spanning deletion is then referred to in the genotype GT for Kyra, Lian and Omar. Alternatively, a VCF may avoid referencing the spanning deletion altogether by listing the spanned variant in the same record as the deletion. This is shown in the second example VCF at position 14.
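To make the first representation concrete, here is a minimal Python sketch that decodes the GT indices of a single VCF data line into allele strings. The record is a hypothetical reconstruction written for this illustration, not a copy of the example VCF: the contig name `chr1`, the sample column order, and the helper name `decode_genotypes` are assumptions. The GT values are chosen to match the genotypes described above.

```python
# Hypothetical VCF data line for position 20 (tab-separated columns:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then one column per sample).
line = "chr1\t20\t.\tA\tT,*\t.\t.\t.\tGT\t0/1\t2/2\t1/2\t0/2"
sample_names = ["Bob", "Kyra", "Lian", "Omar"]  # assumed column order


def decode_genotypes(vcf_line, names):
    """Translate GT allele indices into allele strings for each sample."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    alleles = [ref] + alt.split(",")        # index 0 is REF, 1.. are ALT
    gt_index = fields[8].split(":").index("GT")

    decoded = {}
    for name, sample_field in zip(names, fields[9:]):
        gt = sample_field.split(":")[gt_index]
        sep = "|" if "|" in gt else "/"     # phased or unphased separator
        decoded[name] = sep.join(
            "." if i == "." else alleles[int(i)] for i in gt.split(sep)
        )
    return decoded


decoded = decode_genotypes(line, sample_names)
for name, gt in decoded.items():
    print(name, gt)
```

Here allele index 2 resolves to `*`, so any sample whose GT contains a 2 carries the spanning deletion at this site. The deletion itself would be described in a separate record at the earlier position.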
7 comments
I am getting an error with Genome Analysis Toolkit (GATK) v4.1.4.1
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 78746: unparsable vcf record with allele *CCCCCCCCCGCCCCTCCCCC, for input source: test.vcf.gz
How can I solve this error? The start of the line of the vcf is
NC_036443.1 78983 . ACCCCCCCCCCCCCCCCCCCGCCCCTCCCCC ACCCCCCCCCGCCCCTCCCCC,*CCCCCCCCCGCCCCTCCCCC,ACCCCCCCGCCCCTCCCCC,*
Representing a spanning deletion by * is good in itself, but most of the downstream bioinformatics software I know cannot handle spanning deletions. Therefore, is there a tool to convert * into an INDEL?
Hello, my question is also related to the question posted by Degang. I am using GATK 4.1.7.0 and I was wondering if there is a flag that could be used to choose between the two VCF formats mentioned in your article (i.e., with or without the * designation). Thanks.
GATK Team
The VCF entry shown for position 14 seems problematic. Lian is assigned a GT of 0/1 indicating that for the reference sequence at that position (i.e. GCCCCCACCC) one of his haplotypes is the reference allele, which it is not. I wonder what would be the proper way to allow representing Lian's two variant alleles as separate VCF records? Perhaps we need to use <*> like this:
The entry at position 14 is similarly problematic for Bob
Why do some variant callers not call these spanning deletions when "Dels =0.25", and instead call a heterozygous SNP?
Thanks.
Like Degang Wu and Sam Khalouei, I have encountered several difficulties with the spanning or overlapping deletion allele notation (*). It is not recognised by downstream analysis tools, in my case the ANGSD population genetics software. Is there a way to convert these alleles to indels?
The effort of the author describing all these terms is unprecedented! Thank you so much for this