Picard LiftOverVCF 2.22.3. hs37d5_to_GRCh38. Many mismatched reference alleles
Hi, I am using Picard LiftoverVcf 2.22.3 to lift over a VCF file (hs37d5 reference genome) to the GRCh38 genome.
I am using chromosome 21 of this VCF file as a toy dataset,
with the GRCh37_to_GRCh38 chain file.
The VCF file uses 1, 2, 3, ..., X, Y chromosome naming.
java -jar picard.jar LiftoverVcf R=hs37d5.fa CHAIN=chain/GRCh37_to_GRCh38.chain I=21_chr_ku_vcf.gz O=21_chr_ku.remmaped.vcf.gz REJECT=out.unmapped.vcf
However, many of the variants have mismatching reference alleles.
Example of the output:
INFO 2020-06-22 15:06:53 LiftOver Interval 21:24364134-24364135 failed to match chain 382 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
INFO 2020-06-22 15:07:54 LiftoverVcf Processed 6550685 variants.
INFO 2020-06-22 15:07:54 LiftoverVcf 3055298 variants failed to liftover.
INFO 2020-06-22 15:07:54 LiftoverVcf 2634638 variants lifted over but had mismatching reference alleles after lift over.
INFO 2020-06-22 15:07:54 LiftoverVcf 86.8602% of variants were not successfully lifted over and written to the output.
INFO 2020-06-22 15:07:54 LiftoverVcf liftover success by source contig:
INFO 2020-06-22 15:07:54 LiftoverVcf 21: 860749 / 6550685 (13.1398%)
INFO 2020-06-22 15:07:54 LiftoverVcf lifted variants by target contig:
INFO 2020-06-22 15:07:54 LiftoverVcf 21: 860749
WARNING 2020-06-22 15:07:54 LiftoverVcf 0 variants with a swapped REF/ALT were identified, but were not recovered. See RECOVER_SWAPPED_REF_ALT and associated caveats.
INFO 2020-06-22 15:07:55 LiftoverVcf Writing out sorted records to final VCF.
[Mon Jun 22 15:08:04 GMT 2020] picard.vcf.LiftoverVcf done. Elapsed time: 2.59 minutes.
Runtime.totalMemory()=4265607168
Is the chain file I am using correct? What would be your suggestion to improve the liftover results?
Thanks!
-
Hi.
Try running `bcftools norm` before the liftover. It might help.
Also, have a look at the rejected VCF file (there is an argument REJECT to specify the file which will contain all rejected VCF entries) after you try normalizing with bcftools. It might help discover the problem.
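As a rough sketch of both suggestions, the helpers below wrap a `bcftools norm` call and tally the FILTER column of the REJECT file. The function names are made up for illustration, and the `-c w` choice (warn on a reference mismatch instead of stopping) is an assumption, not something recommended in this thread. The FASTA passed to `normalize_vcf` is assumed to be the build the VCF is currently on.

```shell
# Hypothetical helper: normalize a VCF against the FASTA of the build it is
# currently on. -c w only warns when REF disagrees with the FASTA.
# $1: source-build FASTA, $2: input VCF, $3: output VCF (bgzip-compressed)
normalize_vcf() {
  bcftools norm -f "$1" -c w -Oz -o "$3" "$2"
}

# Hypothetical helper: count rejected records per FILTER value.
# Reads an uncompressed VCF on stdin, prints "<count><TAB><filter>",
# most frequent reason first.
count_reject_reasons() {
  awk -F'\t' '!/^#/ { n[$7]++ }
              END   { for (f in n) printf "%d\t%s\n", n[f], f }' |
    sort -k1,1nr -k2,2
}

# Usage (file names taken from the command earlier in the thread):
#   count_reject_reasons < out.unmapped.vcf
```

If nearly all rejects carry MismatchedRefAllele rather than NoTarget, the coordinates are probably being mapped but compared against the wrong sequence, which points at the reference FASTA or the contig naming rather than the chain file.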
-
Thank you for the feedback. I will try bcftools norm. Should I also create a custom hs37d5_to_GRCh38 chain file, or is that not an option?
-
That was my mistake. In one bioinformatics book I had read that the reference file should be the one the VCF file was mapped to. Because of that, I did not read the original GATK documentation properly and got so many rejected variants (only 18% lifted over). Changing the reference file to the target FASTA (GRCh38) increased the successful liftover rate to ~95%.
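For anyone landing here with the same problem, the corrected call might look like the sketch below. `GRCh38.fa` is a hypothetical file name for the target-build FASTA, and `run_liftover` is just an illustrative wrapper; the other arguments are the ones from the original command. The FASTA's contig names are assumed to match the target names used in the chain file, and Picard also expects a sequence dictionary (.dict) next to the FASTA.

```shell
# Sketch only: R must point at the FASTA of the build you are lifting TO
# (GRCh38 here), not the build the VCF was called against.
# $1: target-build FASTA (hypothetical name: GRCh38.fa)
run_liftover() {
  target_ref=$1
  java -jar picard.jar LiftoverVcf \
    R="$target_ref" \
    CHAIN=chain/GRCh37_to_GRCh38.chain \
    I=21_chr_ku_vcf.gz \
    O=21_chr_ku.remmaped.vcf.gz \
    REJECT=out.unmapped.vcf
}

# Usage:
#   run_liftover GRCh38.fa
```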
-
I am facing the same issue, even though I am using the right target reference sequence.
Many variants were lifted over but had mismatching reference alleles after liftover. Only about 30% of the variants were lifted over successfully.
When using bcftools 'norm' prior to LiftOver, it gives an error:
Reference allele mismatch at chr1:743268 .. REF_SEQ:'C' vs VCF:'A'
I am also getting: Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.) But I assume this has nothing to do with the failure to lift over.
This is what the output of bcftools norm looks like:
##fileformat=VCFv4.2
##FILTER=<ID=CannotLiftOver,Description="Liftover of a variant that needed reverse-complementing failed for unknown reasons.">
##FILTER=<ID=IndelStraddlesMultipleIntevals,Description="Reference allele in Indel is straddling multiple intervals in the chain, and so the results are not well defined.">
##FILTER=<ID=MismatchedRefAllele,Description="Reference allele does not match reference genome sequence after liftover.">
##FILTER=<ID=NoTarget,Description="Variant could not be lifted between genome builds.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AttemptedAlleles,Number=1,Type=String,Description="The alleles of the variant in the TARGET prior to failing due to reference allele mismatching to the target reference.">
##INFO=<ID=AttemptedLocus,Number=1,Type=String,Description="The locus of the variant in the TARGET prior to failing due to reference allele mismatching to the target reference.">
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##contig=<ID=1,length=247177331>
##contig=<ID=10,length=135312574>
##contig=<ID=11,length=134433813>
##contig=<ID=12,length=132288870>....
And so on. Most of the variants have mismatched reference alleles.
How should I interpret this? My VCF file is on build hg18. I checked with both the hg18 and hg19 reference FASTA files in case I was making a mistake, and the results are no better with either.
Your help is appreciated.
Thanks!
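One thing that may be worth checking here: the error mentions contig 'chr1', while the header above declares contigs as 1, 10, 11, and so on. A small, hedged sketch for spotting such a naming mismatch is below. `undeclared_contigs` is a hypothetical helper name; it reads an uncompressed VCF on stdin and assumes header lines of the form `##contig=<ID=...>`.

```shell
# Hypothetical helper: print each CHROM value used in the records that has
# no matching ##contig=<ID=...> line in the header (each name printed once).
undeclared_contigs() {
  awk -F'\t' '
    /^##contig=<ID=/ {
      id = $0
      sub(/^##contig=<ID=/, "", id)   # drop the prefix
      sub(/[,>].*/, "", id)           # drop everything after the ID
      declared[id] = 1
      next
    }
    /^#/ { next }                     # skip the remaining header lines
    !($1 in declared) && !seen[$1]++ { print $1 }
  '
}

# Usage (input.vcf is a placeholder for your own file):
#   undeclared_contigs < input.vcf
```

If this prints names such as chr1, the records and the header (and possibly the reference FASTA or chain file) disagree on chromosome naming, which would need to be made consistent before normalizing or lifting over.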