Question about handling genotypes during VCF liftover from hg38 to hg19 using the LiftoverVcf command
Dear GATK Team,
Thank you for developing a valuable resource.
Recently, I have been using Picard's LiftoverVcf command to lift over a large VCF file from hg38 to hg19. I am aware that the command handles the change of coordinates, as well as the Ref/Alt allele changes required by the new reference genome. This has been incredibly helpful, and I appreciate your team's attention to detail.
However, I have some questions regarding the genotypes in the dataset. Specifically, I am curious about what happens when there is a switch of major/minor alleles.
For example, if the variant is A/C and the liftOver changes it to C/A, how will the genotypes be affected? I believe that if the VCF is unphased, the homozygous genotypes will switch (0/0 to 1/1, 1/1 to 0/0), since the major/minor alleles have been switched. On the other hand, if the VCF is phased, all genotypes will need to change (0|0 to 1|1, 0|1 to 1|0, 1|0 to 0|1, 1|1 to 0|0).
Additional examples are in the attached image.
Although the attached image assumes phased genotypes (which is the case for us), even with unphased genotypes the alleles (0, 1) may sometimes need to be switched, as described in the example above. Does Picard's LiftoverVcf tool handle this correctly?
Thank you in advance for any information you can provide on this matter. I appreciate your team's expertise and look forward to hearing from you soon.
On a slightly different note, we have VCF files as large as 1.1 GB. Whenever I attempt to run LiftoverVcf from hg38 to hg19, even with 16 GB of memory, the job fails with a segmentation fault. Is there a way to estimate the approximate amount of memory required from the size of the VCF file?
Thank you for your expertise!
The Picard code says:
"If this interval is in the opposite orientation, all alleles and genotypes will be reverse complemented and indels will be left-aligned."
I don't see any explicit mention of phasing, but based on the code I would expect the phasing to be maintained.
I have run this tool myself on real data and I do remember it being "memory hungry", but I don't remember how much I actually required. Sorry.
Hi Laura Gauthier,
Thank you for your response.
My concern was whether Picard's LiftoverVcf tool correctly modifies genotypes when REF/ALT alleles are swapped between reference genomes. In my original post, I suggested that reverse-complemented alleles might require genotype changes, but I may have been mistaken.
After successfully running the tool, I can confirm that it modifies genotypes when REF/ALT allele swaps occur (e.g. 0|0 -> 1|1, 0|1 -> 1|0, 1|0 -> 0|1, 1|1 -> 0|0). Note that REF/ALT allele swaps are only recovered if the "RECOVER_SWAPPED_REF_ALT" argument is used.
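The genotype remapping listed above can be sketched in a few lines of Python. This is only an illustration of the index swap at a biallelic site, not Picard's implementation; the function name swap_genotype is hypothetical, and the sketch assumes the GT field contains only the allele indices 0 and 1 or the missing value ".".

```python
import re


def swap_genotype(gt: str) -> str:
    """Swap allele indices 0 and 1 in a biallelic VCF GT string.

    Separators ('/' unphased, '|' phased) and missing alleles ('.') are
    left unchanged, so allele order, and therefore phase, is preserved.
    """
    swap = {"0": "1", "1": "0"}
    # Split on the separators but keep them, so they are re-joined as-is.
    parts = re.split(r"([/|])", gt)
    return "".join(swap.get(part, part) for part in parts)


for gt in ["0|0", "0|1", "1|0", "1|1", "0/0", "./."]:
    print(gt, "->", swap_genotype(gt))
```

For unphased calls the allele order carries no meaning, so a heterozygous 0/1 that becomes 1/0 describes the same genotype; only the homozygous calls change in substance.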
Here are a few additional comments that might be helpful. Please note that they are accurate as of April 2023 and may not apply in the future.
- I encountered a memory error when attempting to install Picard via conda, so I ended up using CrossMap (v0.6.4) instead, as it required less memory for VCF liftOver. However, I discovered that the genotypes were not successfully modified as expected with CrossMap (v0.6.4).
- I ultimately installed the Picard v2.27.5 binaries (https://github.com/broadinstitute/picard/releases/tag/2.27.5). The job took approximately 45 minutes to run for 1,230,000 variants x 4,200 samples. See the end of this post for the code that was used to run it.
- Unfortunately, I cannot determine the amount of memory actually used, as multiple jobs were run in parallel on the HPC cluster.
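For anyone who wants to capture the memory figure for a single job, one option is to launch the command from a small wrapper and read the peak resident set size of the child process afterwards. The sketch below is a generic, Unix-only illustration using Python's standard library; the helper name run_and_report_peak_rss is hypothetical, and the command shown is a placeholder rather than the Picard invocation from this thread.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd, wait for it to finish, and return (exit code, peak RSS).

    The peak is the largest resident set size among this process's
    terminated children, so other jobs on the same node do not affect it.
    """
    completed = subprocess.run(cmd)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, peak


if __name__ == "__main__":
    # Placeholder child process; substitute the real command as a list.
    code, peak = run_and_report_peak_rss([sys.executable, "-c", "pass"])
    print(f"exit code {code}, peak RSS {peak} (kB on Linux, bytes on macOS)")
```

The number reported is what the operating system saw for the whole process, which for a Java program includes more than the heap, so it will not match the -Xmx setting exactly.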
Code that was used to run:
java -jar /path/to/picard_v2.27.5/picard.jar \
Hope this helps!
Great, that memory benchmarking will likely be helpful to other users in the future -- thanks!