Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Picard Liftover hg38 to hg19 Error.

Answered
0

13 comments

  • Avatar
    Bhanu Gandham

    Hi Wu, Tsung-Jung

     

    Two things to check: (i) contig naming conventions are as expected and match between the data and chain file (e.g. chr1 vs 1, chrM vs MT, etc) and (ii) sort orders of these match between the data and the chain file.

    0
    Comment actions Permalink
  • Avatar
    Scott Kulm

    I am also experiencing this error.  Could you please be more specific with your second point " sort orders of these match between the data and the chain file." Thank you.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Scott Kulm, I believe Bhanu is referring to the fact that the chain file and your data must be sorted in the same order.

    0
    Comment actions Permalink
  • Avatar
    Malachi Griffith

    Even more basic than the good suggestions above...  I got this same error by selecting the wrong reference genome fasta file.  I was doing an hg38 to hg19 conversion.  But for the reference parameter (--REFERENCE_SEQUENCE / -R) I specified my *current* reference  instead of the *TARGET* reference file for the build I was trying to convert to.  

    The docs do state this very clearly, but it is still an easy mistake to make.  Because, all the time one is asked to supply a reference file and you instinctively reach for the one you are usually using, not an older one that you are only converting to because some tool/resource is making you.  :)

    Based on the example command above, I think this is the same mistake that was made by Tsung-Jung Wu.  They specify *hg38ToHg19.over.chain* but then supply what looks like a GRCh38 (hg38) reference instead of the target hg19. 

    The hint that allowed me to catch my mistake was that my command produced a large number of rejected variants before crashing.  All of these rejected variants had a note in the VCF record stating that the reference base did not match the reference allele.  The tool presumably runs partially because both hg19 and hg38 have a "chr1" sequence but then it crashes when it encounters a contig in my incorrect reference that doesn't match up with the chain file I supplied.  So its *kind* of like what happens if the sort order is off.

    Documenting this here for when I make this silly mistake again.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for posting your solution, Malachi Griffith! I'm sure this will be helpful for other users as well.

    0
    Comment actions Permalink
  • Avatar
    Brian Fulton-Howard

    I'm getting the same problem, and I know my vcf file is sorted sequentially, and that all of the target contigs are in the reference fasta. What's more, it only fails on chr18. All of the other chromosomes work with the exact same chain and fasta files.

    I'm using the UCSC chain and fasta files, but I have modified the target contig names (in both the fasta and chain files) such that the names are GATK b37 contig names instead of hg19 contig names.

    Can you please help me? Even being able to isolate what variant is at fault would be amazing. The failure happens after it claims that liftover is done, so maybe it's during sorting?

    The full error follows:
    15:30:32.050 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [Fri May 28 15:30:32 GMT 2021] LiftoverVcf --INPUT temp/sema4Batch1.b38.chr18.UNFILTERED.vcf.gz --OUTPUT temp/sema4Batch1.from-b38_to-b37.chr18.vcf.gz --CHAIN data/ref/b38_to_b37.over.chain.gz --REJECT output/rejects/sema4Batch1.from-b38_to-b37.chr18.rejects.vcf.gz --TMP_DIR temp/liftOver/sema4Batch1.from-b38_to-b37.chr18 --MAX_RECORDS_IN_RAM 7000 --REFERENCE_SEQUENCE data/ref/b37.fa.gz --WARN_ON_MISSING_CONTIG false --LOG_FAILED_INTERVALS true --WRITE_ORIGINAL_POSITION false --WRITE_ORIGINAL_ALLELES false --LIFTOVER_MIN_MATCH 1.0 --ALLOW_MISSING_FIELDS_IN_HEADER false --RECOVER_SWAPPED_REF_ALT false --TAGS_TO_REVERSE AF --TAGS_TO_DROP MAX_AF --DISABLE_SORT false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
    May 28, 2021 3:30:42 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    [Fri May 28 15:30:42 GMT 2021] Executing as fultob01@lh06c31 on Linux 3.10.0-957.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.8.1
    INFO 2021-05-28 15:30:42 LiftoverVcf Loading up the target reference genome.
    INFO 2021-05-28 15:31:09 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
    INFO 2021-05-28 17:34:56 LiftOver Interval chr18:31086560-31086563 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    INFO 2021-05-28 18:45:34 LiftOver Interval chr18:46253735-46253738 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    [Fri May 28 19:27:50 GMT 2021] picard.vcf.LiftoverVcf done. Elapsed time: 237.31 minutes.
    Runtime.totalMemory()=71274856448
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.StringIndexOutOfBoundsException: String index out of range: 3936
    at java.lang.String.checkBounds(String.java:385)
    at java.lang.String.<init>(String.java:324)
    at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
    at picard.vcf.LiftoverVcf.tryToAddVariant(LiftoverVcf.java:558)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:453)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    Using GATK jar /gatk/gatk-package-4.1.8.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms74G -jar /gatk/gatk-package-4.1.8.1-local.jar LiftoverVcf --I temp/sema4Batch1.b38.chr18.UNFILTERED.vcf.gz --O temp/sema4Batch1.from-b38_to-b37.chr18.vcf.gz --CHAIN data/ref/b38_to_b37.over.chain.gz --R data/ref/b37.fa.gz --REJECT output/rejects/sema4Batch1.from-b38_to-b37.chr18.rejects.vcf.gz --TMP_DIR temp/liftOver/sema4Batch1.from-b38_to-b37.chr18 --MAX_RECORDS_IN_RAM 7000

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Brian Fulton-Howard,

    This isn't happening after the program is finished running, it looks like your error (java.lang.StringIndexOutOfBoundsException) is causing the command to exit. It's occurring with the interval chr18:46253735-46253738, which is failing to match chain 19. There probably is some issue with your chain file not matching up your intervals with the target intervals properly because there is no match between the sequences (intersection length 1 < minMatchSize 4.0).

    Here is another forum thread that is similar:

    https://gatk.broadinstitute.org/hc/en-us/community/posts/360062428371-LiftoverVcf-fails-to-lift-over-a-GVCF-file-Error-java-lang-ArrayIndexOutOfBoundsException-1

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Brian Fulton-Howard

    Thanks for the quick reply!

    I have done some further troubleshooting and determined that it is NOT due to the source genotype file (it also fails with the HGDP WGS vcf files. Though I am modifying the chain files myself using my own scripts, my chain file is identical to the ones generated for the gnomAD v2.1 liftover.

    A script that will produce an identical file to my chain is here: https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/data_sources/createLiftoverForHg38ToB37.sh (written by John Smith at the Broad). The only difference is that I am changing the qName field in all of the chain file headers.


    When I lift over the same files using the unmodified chain and fasta files, it successfully completes. but is identical up until the error.

    This is an excerpt from the successful run:

    INFO 2021-06-01 17:45:13 LiftoverVcf Loading up the target reference genome.
    INFO 2021-06-01 17:45:37 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
    INFO 2021-06-01 17:56:29 LiftOver Interval chr18:15781697-15781698 failed to match chain 1430656 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
    INFO 2021-06-01 18:04:13 LiftOver Interval chr18:31086560-31086563 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    INFO 2021-06-01 18:10:36 LiftOver Interval chr18:41988848-41988850 failed to match chain 19 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
    INFO 2021-06-01 18:12:30 LiftoverVcf read 1,000,000 records. Elapsed time: 00:26:53s. Time for last 1,000,000: 1,612s. Last read position: chr18:45,338,287
    INFO 2021-06-01 18:13:08 LiftOver Interval chr18:46253735-46253738 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    INFO 2021-06-01 18:18:15 LiftOver Interval chr18:54542842-54542844 failed to match chain 19 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
    INFO 2021-06-01 18:30:59 LiftOver Interval chr18:74622728-74622736 failed to match chain 19 because intersection length 6 < minMatchSize 9.0 (0.6666667 < 1.0)
    INFO 2021-06-01 18:30:59 LiftOver Interval chr18:74622732-74622736 failed to match chain 19 because intersection length 2 < minMatchSize 5.0 (0.4 < 1.0)

    ...

     

    And here is an excerpt from the failed liftover:
    INFO 2021-06-01 14:29:17 LiftoverVcf Loading up the target reference genome.
    INFO 2021-06-01 14:29:41 LiftoverVcf Lifting variants over and sorting (not yet writing the output file.)
    INFO 2021-06-01 14:40:30 LiftOver Interval chr18:15781697-15781698 failed to match chain 1430656 because intersection length 1 < minMatchSize 2.0 (0.5 < 1.0)
    INFO 2021-06-01 14:48:24 LiftOver Interval chr18:31086560-31086563 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    INFO 2021-06-01 14:54:45 LiftOver Interval chr18:41988848-41988850 failed to match chain 19 because intersection length 2 < minMatchSize 3.0 (0.6666667 < 1.0)
    INFO 2021-06-01 14:56:40 LiftoverVcf read 1,000,000 records. Elapsed time: 00:26:59s. Time for last 1,000,000: 1,619s. Last read position: chr18:45,338,287
    INFO 2021-06-01 14:57:17 LiftOver Interval chr18:46253735-46253738 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)
    [Tue Jun 01 15:02:24 GMT 2021] picard.vcf.LiftoverVcf done. Elapsed time: 33.26 minutes.
    Runtime.totalMemory()=77942751232
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.StringIndexOutOfBoundsException: String index out of range: 4203

    As you can see, the only difference is that it fails after that INFO line.

    I am rerunning my original command replacing my fasta file with the one from the GATK resource bundle (Homo_sapiens_assembly19.fasta) and will let you know the results when that finishes.

    UPDATE: The liftover has gotten past where it fails before. I don't know why it's working with the GATK fasta, but not with the UCSC fasta that just has the contig names changed in the same way as the chain qName fields.

    Thanks,

    Brian

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for posting this information Brian Fulton-Howard. It can be tricky to determine what is causing these issues, but if you get the problem when you modify the qName, then it must be something in that step. 

    How are you modifying the qName? If it's by hand, there could be a formatting issue. Or some other small problem that you are not able to find yourself but causes issues with GATK.

    0
    Comment actions Permalink
  • Avatar
    Brian Fulton-Howard

    I am modifying the qName field programmatically using python3 with a lookup table in pandas. I am pretty sure at this point that it's not the chain file that's the problem though, but the fasta file.

    This is because when I use the b37 fasta file from the GATK resource bundle, there is no error. Also, my chain file is identical to the funcotator chain file. However, my code is flexible and updates the UCSC fasta file, so I would like to get this working. I also use a similar python script for the fasta.

    Interestingly, the UCSC fasta file has the contigs out of order, so maybe that's the issue? It's just strange that it does not fail with the unmodified chain and fasta, which you would think would happen if the order of contigs in the fasta was an issue.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Brian Fulton-Howard how is your script updating the FASTA file?

    0
    Comment actions Permalink
  • Avatar
    Brian Fulton-Howard

    It's a fairly simple algorithm with some modifications for speed. Basically it reads in the fasta file line by line, directly from compressed in binary mode. If the line starts with ">", it will decode the line into ASCII and replace the contig name (first thing after the ">") based on a LUT from the GATK article about the differences between builds. For non hg19/b37 builds, it just removes "chr" and changes M to MT. If the line does not start with ">", it will write the line to output without decoding it. Finally, it uses bgzip and tabix to compress and index the temporary output.

    The code follows (pastes expire in a week):
    Main script: https://pastebin.ubuntu.com/p/kjh2YVbzqP/
    Function to change contig name: https://pastebin.ubuntu.com/p/KwQ6x4FZ3j/
    LUT: https://pastebin.ubuntu.com/p/WXtyTpHjqJ/

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    There could be some small issue with the updating script, and it seems like it might be with chr18:46253735-46253738. I would recommend checking what is going on around there:

    INFO 2021-06-01 14:57:17 LiftOver Interval chr18:46253735-46253738 failed to match chain 19 because intersection length 1 < minMatchSize 4.0 (0.25 < 1.0)

    There could be some issue with how the naming was updated in that location with your script that causes this location not to match the chain file. 

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk