LiftoverVcf: hg19 to hg38 all variants mismatched
openjdk version "1.8.0_112"
OpenJDK Runtime Environment (Zulu 188.8.131.52-linux64) (build 1.8.0_112-b16)
OpenJDK 64-Bit Server VM (Zulu 184.108.40.206-linux64) (build 25.112-b16, mixed mode)
All variants are written to the rejected variants file, with "MismatchedRefAllele" when using the Broad chain file b37ToHg38.over.chain, or "NoTarget" when using the UCSC chain file hg19ToHg38.over.chain.gz. For both attempts I'm using the UCSC hg38 target sequence, chr*.fa.gz. Any idea what I'm doing wrong?
(Input files are UK Biobank .bgen files converted to .vcf (4.2) with PLINK 2.0.)
gatk CreateSequenceDictionary -R data/chr22.fa.gz &> logs/liftover_hg19_to_hg38_chr22.log
gatk LiftoverVcf -I data/chr22.vcf.gz -O data/chr22_hg38.vcf.gz -CHAIN data/hg19ToHg38.over.chain.gz -REJECT data/chr22_hg38_liftover_rejected_variants.vcf -R data/chr22.fa.gz &>> logs/liftover_hg19_to_hg38_chr22.log
Hi Ken Hanscombe, what was the original reference used to align your file?
Hi Genevieve Brandt (she/her),
From UKB documentation:
Genotypes were imputed into the dataset using computationally efficient methods combined with the Haplotype Reference Consortium (HRC) and UK10K haplotype resource. This increased the number of testable variants over 100-fold to ~96 million variants, which are stored in the compressed and indexed BGENv1.2 format. The imputed genotypes are aligned to the + strand of the reference and the positions are in GRCh37 coordinates.
From other UKB documentation:
The alleles in the imputation are aligned with REF/ALT, first_allele is the ref allele on the fwd strand.
From Bycroft et al. 2018:
We used the Haplotype Reference Consortium (HRC) data as the main imputation reference panel (...) We also imputed the UK Biobank using the merged UK10K and 1000 Genomes phase 3 reference panels, which has 87,696,888 bi-allelic markers. We combined this imputed data with that from the HRC panel, using the HRC imputation when a SNP was present in both panels. (...) The SNP database (dbSNP) reference SNP (rs) IDs were assigned to as many markers as possible using reference SNP ID lists available from the UCSC genome annotation database for the GRCh37 assembly of the human genome (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/)
Hi Ken Hanscombe,
LiftoverVcf can only work if the chain file you use matches the original reference that the VCF coordinates are based on. From what you wrote above, your data are in GRCh37 coordinates, which should be similar to hg19. There is more information about reference versions at this link.
If all of your variants are rejected with "NoTarget", that most likely means your -R reference file does not match your chain file. The -R option should be the target, that is, the new reference version you are lifting over to.
You should also check your VCFs, chain file, and reference naming conventions to verify that the naming is consistent so that the LiftOver tool will work.
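One rough way to do that naming check is sketched below in Python (standard library only). The function names are hypothetical helpers and are not part of GATK or Picard. The "22" versus "chr22" mismatch in the demo data is only an assumed example of an inconsistency, not something confirmed from your files. The sketch relies on the UCSC chain header layout, where the third field is the source contig and the eighth field is the destination contig.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def vcf_contigs(lines):
    """Contig names taken from the CHROM column of VCF data lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def chain_contigs(lines):
    """Return (source, destination) contig name sets from chain header lines."""
    source, destination = set(), set()
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "chain":
            source.add(fields[2])       # contig name in the old assembly
            destination.add(fields[7])  # contig name in the new assembly
    return source, destination


def fasta_contigs(lines):
    """Contig names taken from FASTA header lines (first token after '>')."""
    return {line[1:].split()[0] for line in lines if line.startswith(">")}


def naming_report(vcf_lines, chain_lines, fasta_lines):
    """Summarise contig names that cannot be matched between the three inputs."""
    source, destination = chain_contigs(chain_lines)
    return {
        # VCF contigs the chain file has no entry for
        "vcf_not_in_chain": vcf_contigs(vcf_lines) - source,
        # Chain destination contigs that are present in the -R reference
        "chain_targets_in_reference": destination & fasta_contigs(fasta_lines),
    }


# Small in-memory demo with assumed data (a VCF using "22", UCSC files using "chr22").
demo_vcf = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n", "22\t16050075\trs1\tA\tG\n"]
demo_chain = ["chain 1000 chr22 51304566 + 0 100 chr22 50818468 + 0 100 1\n", "100\n"]
demo_fasta = [">chr22\n", "ACGT\n"]
report = naming_report(demo_vcf, demo_chain, demo_fasta)
print(sorted(report["vcf_not_in_chain"]))
```

To run it on real files, each argument can be an open file handle, for example `naming_report(open_text("data/chr22.vcf.gz"), open_text("data/hg19ToHg38.over.chain.gz"), open_text("data/chr22.fa.gz"))`. A non-empty `vcf_not_in_chain` set would suggest that the VCF contig names need to be renamed, or a different chain file used, before running the liftover.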