humanG1Kv37 to hg38 liftover issues
Hey there,
I am trying to lift over a set of older VCF files that were made using humanG1Kv37 (human_g1k_v37.fasta) to hg38. There seems to be an issue with the contig names in the chain file I used. Is there a chain file that I can use to go to hg19 before going to hg38?
[edit] After a fair bit of searching I found a GATK forum thread (https://gatkforums.broadinstitute.org/gatk/discussion/12523/liftovervcf-chain-file-for-b37-to-hg38) that references a chain file going directly from b37 to hg38 (https://raw.githubusercontent.com/broadinstitute/gatk/master/scripts/funcotator/data_sources/gnomAD/b37ToHg38.over.chain). Is this file up to date?
I am currently reading more about how chain files work and about other issues people have had with liftovers, but any help on this would be greatly appreciated.
Version Information:
The Genome Analysis Toolkit (GATK) v4.1.7.0
HTSJDK Version: 2.21.2
Picard Version: 2.21.9
Command Used:
#note that this was part of a larger script, hence the $1 input and the other variables used
java -jar GenomeAnalysisTK4.jar LiftoverVcf \
-I=$1 \
-O=$full_path_output_vcf \
--CHAIN=hg19ToHg38.over_renamed.chain \
--REJECT=$full_path_reject_vcf \
-R=GCF_000001405.39_GRCh38.p13_genomic_renamed.fna
The chain file and reference genome were altered slightly (hence the _renamed suffix in the file names). I made sure that the chromosome names were the same as in the input VCF file that I am lifting over. Both the chain file and the reference use the notation "chr1" instead of just "1", so I changed everything to just the numerical representation.
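For reference, a minimal sketch of the kind of renaming described above, assuming the standard UCSC chain header layout (chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id), where the t fields describe the source assembly and the q fields the destination. The helper names strip_chr and rename_chain_line are hypothetical, and mapping chrM to MT is an assumption (hg19 and b37 are known to use different mitochondrial sequences, so MT results should be treated with care).

```python
def strip_chr(name):
    """Convert a UCSC-style contig name to a b37-style one (assumed mapping)."""
    if name == "chrM":
        return "MT"  # assumption: b37 calls the mitochondrion MT
    return name[3:] if name.startswith("chr") else name


def rename_chain_line(line, rename_source=True, rename_target=False):
    """Rewrite contig names on a chain header line; other lines pass through."""
    fields = line.split()
    # Only header lines start with the literal word "chain".
    if len(fields) < 12 or fields[0] != "chain":
        return line
    if rename_source:
        fields[2] = strip_chr(fields[2])  # tName: source assembly contig
    if rename_target:
        fields[7] = strip_chr(fields[7])  # qName: destination assembly contig
    newline = "\n" if line.endswith("\n") else ""
    return " ".join(fields) + newline


if __name__ == "__main__":
    # Illustrative header only; the numbers are not taken from a real chain file.
    header = "chain 1000 chr1 5000 + 0 4000 chr1 6000 + 0 4000 1\n"
    print(rename_chain_line(header), end="")
```

Renaming only the source side keeps the input VCF and the chain in agreement while leaving the destination names untouched; setting rename_target=True is needed only if the destination reference was renamed as well, as it was here.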
Relevant error log for one of the vcfs that I am trying to liftover:
INFO 2020-06-05 13:07:47 LiftoverVcf Loading up the target reference genome.
INFO 2020-06-05 13:08:31 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 failed to match chain 2 because intersection length 84964 < minMatchSize 189809.0 (0.44762895 < 1.0)
INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 failed to match chain 392 because intersection length 522 < minMatchSize 189809.0 (0.002750133 < 1.0)
INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 failed to match chain 3340 because intersection length 322 < minMatchSize 189809.0 (0.0016964423 < 1.0)
INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 failed to match chain 240 because intersection length 24125 < minMatchSize 189809.0 (0.12710145 < 1.0)
INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 failed to match chain 1769 because intersection length 11594 < minMatchSize 189809.0 (0.061082456 < 1.0)
... ~500 more lines like this
INFO 2020-06-05 13:08:51 LiftoverVcf Processed 18739 variants.
INFO 2020-06-05 13:08:51 LiftoverVcf 18739 variants failed to liftover.
INFO 2020-06-05 13:08:51 LiftoverVcf 0 variants lifted over but had mismatching reference alleles after lift over.
INFO 2020-06-05 13:08:51 LiftoverVcf 100.0000% of variants were not successfully lifted over and written to the output.
INFO 2020-06-05 13:08:51 LiftoverVcf liftover success by source contig:
INFO 2020-06-05 13:08:51 LiftoverVcf 1: 0 / 1401 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 10: 0 / 741 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 11: 0 / 866 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 12: 0 / 830 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 13: 0 / 593 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 14: 0 / 594 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 15: 0 / 499 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 16: 0 / 700 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 17: 0 / 577 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 18: 0 / 464 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 19: 0 / 555 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 2: 0 / 1583 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 20: 0 / 412 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 21: 0 / 222 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 22: 0 / 258 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 3: 0 / 1246 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 4: 0 / 1188 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 5: 0 / 1174 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 6: 0 / 1147 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 7: 0 / 1130 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 8: 0 / 1046 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf 9: 0 / 830 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf X: 0 / 683 (0.0000%)
INFO 2020-06-05 13:08:51 LiftoverVcf lifted variants by target contig:
INFO 2020-06-05 13:08:51 LiftoverVcf no successfully lifted variants
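The "failed to match chain" lines above can be read as simple arithmetic: the record spans end - start + 1 bases, and the reported fraction is the overlap with one chain divided by that span. The sketch below reproduces the numbers from the first such line; the function name match_fraction and the regular expression are my own and assume the log format shown above. The "< 1.0" in the log suggests the whole interval has to fall inside a single chain, which appears to correspond to LiftoverVcf's LIFTOVER_MIN_MATCH setting (an assumption worth checking against the tool documentation).

```python
import re

# Assumed shape of the log message, based on the lines quoted above.
LOG_PATTERN = re.compile(
    r"Interval (\S+):(\d+)-(\d+) failed to match chain (\d+) "
    r"because intersection length (\d+) < minMatchSize"
)


def match_fraction(log_line):
    """Return (contig, interval_length, overlap_fraction), or None if no match."""
    m = LOG_PATTERN.search(log_line)
    if m is None:
        return None
    contig, start, end, _chain_id, overlap = m.groups()
    length = int(end) - int(start) + 1  # coordinates are 1-based and inclusive
    return contig, length, int(overlap) / length


if __name__ == "__main__":
    line = (
        "INFO 2020-06-05 13:08:32 LiftOver Interval 1:12948307-13138115 "
        "failed to match chain 2 because intersection length 84964 "
        "< minMatchSize 189809.0 (0.44762895 < 1.0)"
    )
    contig, length, fraction = match_fraction(line)
    print(contig, length, round(fraction, 4))
```

For the first line this gives an interval of 189809 bases, of which about 45% overlaps chain 2, so a record this long cannot be lifted as a single piece under a 1.0 threshold.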
Note that, of all the VCFs I tried to lift over, this one had the worst results. Other VCFs returned different errors, such as:
ERROR 2020-06-05 13:11:48 LiftoverVcf Encountered a contig, 22_KI270879v1_alt that is not part of the target reference.
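One plausible reading of this error is that the chain file maps some variants to a destination contig whose name does not appear in the renamed reference. A quick way to check is to compare the destination names in the chain headers against the contig names in the reference's .fai index. This is only a sketch: missing_target_contigs and contigs_from_fai are hypothetical helpers, and it assumes the reference has been indexed so that a tab-separated .fai file exists.

```python
def contigs_from_fai(fai_lines):
    """Collect contig names from .fai index lines (the name is the first column)."""
    return {line.split("\t")[0] for line in fai_lines if line.strip()}


def missing_target_contigs(chain_lines, reference_contigs):
    """List destination contigs named in chain headers but absent from the reference."""
    missing = set()
    for line in chain_lines:
        fields = line.split()
        # Header layout: chain score tName tSize tStrand tStart tEnd qName ...
        if len(fields) >= 12 and fields[0] == "chain":
            if fields[7] not in reference_contigs:
                missing.add(fields[7])
    return sorted(missing)


if __name__ == "__main__":
    import sys

    # Usage (paths are placeholders): python check_contigs.py my.chain my.fna.fai
    chain_path, fai_path = sys.argv[1], sys.argv[2]
    with open(fai_path) as fai:
        reference = contigs_from_fai(fai)
    with open(chain_path) as chain:
        for name in missing_target_contigs(chain, reference):
            print(name)
```

Any name this prints would need to be either renamed consistently in both files or dropped from the chain before lifting over.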
-
- "java -jar GenomeAnalysisTK4.jar LiftoverVcf \" This usage tells me you are using GATK3, which we do not support. Please upgrade to the latest version of GATK.
- The latest resources can be found here: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle or here: https://github.com/broadinstitute/gatk/tree/master/scripts/funcotator/data_sources.
If a resource is not available there, then we don't provide it.
-
When I run the following on the jar file it says that I am using version 4:
[millebri@exahead1 wgs_variant_calls]$ java -jar /home/groups/prime-seq/pipeline_tools/bin/GenomeAnalysisTK4.jar --version
The Genome Analysis Toolkit (GATK) v4.1.7.0
HTSJDK Version: 2.21.2
Picard Version: 2.21.9
Is GenomeAnalysisTK4 not version 4? If so, that is very confusing.
Thank you for the links to your resource bundles; I will look for what I need in there.
-
No, that's not GATK4. Take a look at this doc: https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4