Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Unparsable vcf record with allele *

Answered

10 comments

  • Bhanu Gandham

    Hi Madison Strain

     

    You are using a very old version of GATK3 that we do not support anymore. Please upgrade to the latest version of GATK4.

  • Madison Strain

    Hello Bhanu,

    Yes, I agree it's very old, but it is the only version that provides the VariantAnnotator function. For the rest of my pipeline I use version 4.1.4.0, which has only a BETA version of VariantAnnotator.

    Madison

  • Madison Strain

    Bhanu Gandham

    Additionally, even the BETA version does not have the AlleleBalanceBySample annotation, which is what I need for this analysis.

  • Matt Armstrong

    Madison Strain, did you find a solution for this?

    I am running into a very similar error, "The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source..."

  • Pamela Bretscher

    Hi Matt Armstrong,

    Could you please provide more information about the full command you are running, the full error message, and the location of the VCF file that is producing the error? Additionally, could you run ValidateVariants on your VCF file to pinpoint any issues with formatting?

    Kind regards,

    Pamela

  • Matt Armstrong

    Hi Pamela,

    Thanks for the response! Here are the commands I am using and the errors I am getting for both LiftoverVcf and ValidateVariants. Please let me know if you need any more information.

    I am running the following commands with java/1.8.0 and GATK/4.2.0.0.

    Liftover: 


    java -Xmx15G -jar /sw/hgcc/Pkgs/picardtools/2.6.0/picard.jar LiftoverVcf \
    I=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz \
    O=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/lifted_over_mergeSnpIndel.vcf \
    CHAIN=/mnt/icebreaker/data2/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg38ToHg19.over.chain \
    REJECT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/rejected_variants.vcf \
    REFERENCE_SEQUENCE=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg19.fa \
    MAX_RECORDS_IN_RAM=500000 \
    WARN_ON_MISSING_CONTIG=true

    Liftover output: 

    [Thu Aug 26 14:48:12 EDT 2021] picard.vcf.LiftoverVcf INPUT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz OUTPUT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/lifted_over_mergeSnpIndel.vcf CHAIN=/mnt/icebreaker/data2/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg38ToHg19.over.chain REJECT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/rejected_variants.vcf REFERENCE_SEQUENCE=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg19.fa WARN_ON_MISSING_CONTIG=true MAX_RECORDS_IN_RAM=500000 WRITE_ORIGINAL_POSITION=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    [Thu Aug 26 14:48:12 EDT 2021] Executing as marmstrong@node04.local on Linux 3.10.0-1160.36.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.6.0-SNAPSHOT
    INFO 2021-08-26 14:48:13 LiftoverVcf Loading up the target reference genome.
    INFO 2021-08-26 14:48:39 LiftoverVcf Lifting variants over and sorting.
    [Thu Aug 26 14:48:39 EDT 2021] picard.vcf.LiftoverVcf done. Elapsed time: 0.45 minutes.
    Runtime.totalMemory()=6957826048
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source: /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:783)
    at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:569)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:609)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:539)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:336)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:60)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:161)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:194)
    at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:136)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:206)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    ValidateVariants: 

    gatk ValidateVariants \
    -R /home/marmstrong/Jin/genomes/unzip_hg38.fasta \
    -V /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf \
    --warn-on-errors

    ValidateVariants output:

    Using GATK jar /mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar ValidateVariants -R /home/marmstrong/Jin/genomes/unzip_hg38.fasta -V /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf --warn-on-errors
    15:41:31.498 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 31, 2021 3:41:31 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    15:41:31.660 INFO ValidateVariants - ------------------------------------------------------------
    15:41:31.660 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.0.0
    15:41:31.660 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    15:41:31.661 INFO ValidateVariants - Executing as marmstrong@node02.local on Linux v3.10.0-1160.36.2.el7.x86_64 amd64
    15:41:31.661 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
    15:41:31.661 INFO ValidateVariants - Start Date/Time: August 31, 2021 3:41:31 PM EDT
    15:41:31.661 INFO ValidateVariants - ------------------------------------------------------------
    15:41:31.661 INFO ValidateVariants - ------------------------------------------------------------
    15:41:31.662 INFO ValidateVariants - HTSJDK Version: 2.24.0
    15:41:31.662 INFO ValidateVariants - Picard Version: 2.25.0
    15:41:31.662 INFO ValidateVariants - Built for Spark Version: 2.4.5
    15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:41:31.662 INFO ValidateVariants - Deflater: IntelDeflater
    15:41:31.662 INFO ValidateVariants - Inflater: IntelInflater
    15:41:31.662 INFO ValidateVariants - GCS max retries/reopens: 20
    15:41:31.662 INFO ValidateVariants - Requester pays: disabled
    15:41:31.662 INFO ValidateVariants - Initializing engine
    15:41:32.184 INFO FeatureManager - Using codec VCFCodec to read file file:///home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf
    15:41:32.416 INFO ValidateVariants - Done initializing engine
    15:41:32.417 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
    15:41:32.417 WARN ValidateVariants - Other possible validations will still be performed
    15:41:32.417 INFO ProgressMeter - Starting traversal
    15:41:32.417 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    15:41:32.724 WARN ValidateVariants - ***** Input /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position chr1:115697580 are not observed at all in the sample genotypes *****
    15:41:32.746 WARN ValidateVariants - ***** Input /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position chr1:116490511 are not observed at all in the sample genotypes *****
    15:41:32.788 INFO ValidateVariants - Shutting down engine
    [August 31, 2021 3:41:32 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.02 minutes.
    Runtime.totalMemory()=2063073280
    htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source: /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf
    at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
    at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:706)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:648)
    at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443)
    at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
    at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
    at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:375)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
    at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
    at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1058)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

     

  • Pamela Bretscher

    Hi Matt Armstrong,

    Thank you for running ValidateVariants and providing this information! The output confirms that there is an issue with your VCF file at approximately line 160, involving the allele G. You may want to examine this portion of the VCF file and troubleshoot the issue before running LiftoverVcf or other GATK tools. Please let me know what you find.

    Kind regards,

    Pamela
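
One way to examine a single line closely is to print each tab-separated field through `repr()`, so that stray quotes, carriage returns, or other hidden characters become visible. The sketch below is only an illustration: the function name `show_fields` and the sample file contents are hypothetical, and it assumes an uncompressed, tab-delimited VCF.

```python
import os
import tempfile


def show_fields(vcf_path, line_number):
    """Return the tab-separated fields of one 1-based line, each passed through repr().

    The file is read in binary mode so that only a line feed ends a line and any
    carriage return, quote, or other stray byte stays visible in the output.
    """
    with open(vcf_path, "rb") as handle:
        for current, raw in enumerate(handle, start=1):
            if current == line_number:
                text = raw.decode("utf-8", errors="backslashreplace").rstrip("\n")
                return [repr(field) for field in text.split("\t")]
    raise ValueError(f"{vcf_path} has fewer than {line_number} lines")


if __name__ == "__main__":
    # Hypothetical two-line file: the second line has a quoted ALT and a CRLF ending.
    with tempfile.NamedTemporaryFile("wb", suffix=".vcf", delete=False) as tmp:
        tmp.write(b"##fileformat=VCFv4.0\n")
        tmp.write(b'chr1\t100\t.\tC\t"G,T"\t.\tPASS\tNS=1\r\n')
    try:
        for field in show_fields(tmp.name, 2):
            print(field)
    finally:
        os.remove(tmp.name)
```

For the file in this thread, the call would be `show_fields(path, 160)` on the unzipped VCF. The parser reports the line number as approximate, so neighbouring lines may be worth checking too.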

  • Matt Armstrong

    Hi Pamela,

    I've gone through my VCF file and looked at line 160, and it looks fine to me. I've checked other VCF files available, and from what I can tell, comma-separated alleles in the ALT column are standard and should not be creating an issue.

    Line 160 from my VCF file:

    chr1  1.66E+08  .  C  G,T  .  PASS  NS=1343  GT:GQ  ./.:1  ./.:1  0/0:1  0/0:1  0/0:1  0/0:1  0/0:1  ./.:1

    The only thing I can think of is that I might need to add a line to the VCF header specifying that the ALT column will contain comma-separated alleles. I have not seen this in any other VCFs with comma-separated ALT alleles, so I doubt that this is the solution.

    VCF Header:

    ##fileformat=VCFv4.0 $
    ##fileDate=20201119 $
    ##reference=/home/dcutler/hg38/hg38.sdx $
    ##phasing=none $
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> $
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> $
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

    Thanks for the help,

    Matt
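
A record can look fine by eye and still fail field-level rules. The sketch below checks the fixed columns of one data line against a simplified reading of the VCF specification: POS must be an integer, and each comma-separated ALT allele must be a base string, `*`, `.`, or a symbolic `<ID>`. Two assumptions are not confirmed in this thread. First, the posted record shows POS as `1.66E+08`, which would not be a valid POS if the file on disk contains it in that form rather than only the pasted copy. Second, if the error message reports the allele verbatim, the offending string would be `"G`, which would suggest a quoted ALT field such as `"G,T"`. The name `check_record` is hypothetical, and breakend notation is deliberately not handled.

```python
import re

# Simplified allele rules: bases, '*', '.', or a symbolic <ID>.
# Breakend notation is not covered by this sketch.
BASES = re.compile(r"[ACGTNacgtn]+")
ALT_ALLELE = re.compile(r"[ACGTNacgtn]+|\*|\.|<[^<>\s]+>")


def check_record(fields):
    """Return a list of human-readable problems for one VCF data line."""
    if len(fields) < 8:
        return [f"expected at least 8 columns, found {len(fields)}"]
    problems = []
    pos, ref, alt = fields[1], fields[3], fields[4]
    if not re.fullmatch(r"[0-9]+", pos):
        problems.append(f"POS {pos!r} is not an integer")
    if not BASES.fullmatch(ref):
        problems.append(f"REF {ref!r} contains non-base characters")
    for allele in alt.split(","):
        if not ALT_ALLELE.fullmatch(allele):
            problems.append(f"ALT allele {allele!r} is not a valid allele string")
    return problems


if __name__ == "__main__":
    # The record as posted in the thread (first sample column only).
    posted = ["chr1", "1.66E+08", ".", "C", "G,T", ".", "PASS", "NS=1343", "GT:GQ", "./.:1"]
    print(check_record(posted))
    # Hypothetical variant of the same record with a quoted ALT field.
    quoted = ["chr1", "166000000", ".", "C", '"G,T"', ".", "PASS", "NS=1343"]
    print(check_record(quoted))
```

Splitting the quoted field on commas yields `"G` and `T"`, neither of which is a valid allele, so this case produces two ALT problems while the unquoted `G,T` produces none.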

  • Pamela Bretscher

    Hi Matt Armstrong,

    I can't see any obvious errors in the VCF line. However, the problem may be related to the comma, since the error from ValidateVariants reports an issue with allele "G," (including the comma). Could you try your suggestion of specifying the comma-separated alleles in the header? Also, are there any comma-separated ALT alleles before line 160? It's possible that the tools fail at the first comma they encounter and stop before reading the rest of the lines.

    Kind regards,

    Pamela
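
The question about earlier comma-separated ALT alleles can be answered with a short scan of the file. The sketch below lists every data line up to a given line number whose ALT column contains a comma. The function name `multiallelic_lines` is hypothetical, and the code assumes an uncompressed, tab-delimited VCF.

```python
import os
import tempfile


def multiallelic_lines(vcf_path, stop_line):
    """Return (line_number, ALT) for data lines up to stop_line whose ALT has a comma."""
    hits = []
    # newline="\n" keeps line counting tied to line feeds only.
    with open(vcf_path, "r", encoding="utf-8", errors="replace", newline="\n") as handle:
        for number, line in enumerate(handle, start=1):
            if number > stop_line:
                break
            if line.startswith("#"):
                continue  # header and column-name lines
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > 4 and "," in fields[4]:
                hits.append((number, fields[4]))
    return hits


if __name__ == "__main__":
    # Hypothetical five-line file with two multi-allelic records.
    rows = [
        "##fileformat=VCFv4.0",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t.\tPASS\tNS=1",
        "chr1\t200\t.\tC\tG,T\t.\tPASS\tNS=1",
        "chr1\t300\t.\tC\tA,T\t.\tPASS\tNS=1",
    ]
    with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False, newline="\n") as tmp:
        tmp.write("\n".join(rows) + "\n")
    try:
        print(multiallelic_lines(tmp.name, 5))
    finally:
        os.remove(tmp.name)
```

If `multiallelic_lines(path, 160)` returns only line 160, that record is the first multi-allelic one in the file. That would be consistent with the tools failing at the first comma, though it would not by itself show why.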
