Unparsable vcf record with allele *
Answered
Can you please provide
a) GATK version used
The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
b) Exact GATK commands used
/usr/bin/java -Xmx4g -jar /usr/analysis/src/GATK/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
-R /usr/analysis/data/ngs_references/GRCh38_ncbi/genome.fa \
-T VariantAnnotator \
-I ../bams"$chr".list \
-V dpsnp"$chr".recode.vcf \
-A AlleleBalanceBySample \
-o absnp"$chr".vcf >> absnp"$chr".log
c) The entire error log if applicable.
INFO 22:12:28,202 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:12:28,206 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 22:12:28,206 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 22:12:28,206 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 22:12:28,212 HelpFormatter - Program Args: -R /usr/analysis/data/ngs_references/GRCh38_ncbi/genome.fa -T VariantAnnotator -I ../bams1.list -V dpsnp1.recode.vcf -A AlleleBalanceB$
INFO 22:12:28,232 HelpFormatter - Executing as mcs88@amino.dhe.duke.edu on Linux 2.6.32-754.28.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_242-b07.
INFO 22:12:28,233 HelpFormatter - Date/Time: 2020/04/15 22:12:28
INFO 22:12:28,233 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:12:28,233 HelpFormatter - --------------------------------------------------------------------------------
INFO 22:12:28,766 GenomeAnalysisEngine - Strictness is SILENT
INFO 22:12:29,559 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 22:12:29,573 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 22:12:30,327 SAMDataSource$SAMReaders - Init 50 BAMs in last 0.75 s, 50 of 55 in 0.75 s / 0.01 m (66.54 tasks/s). 5 remaining with est. completion in 0.08 s / 0.00 m
INFO 22:12:30,389 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.82
INFO 22:12:31,622 GATKRunReport - Uploaded run statistics report to AWS S3
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 718: unparsable vcf record with allele *
##### ERROR ------------------------------------------------------------------------------------------
-
You are using a very old version of GATK3 that we do not support anymore. Please upgrade to the latest version of GATK4.
-
Hello Bhanu,
Yes, I agree it's very old, but this is the only version that performs the VariantAnnotator function. For the rest of my pipeline I use version 4.1.4.0, which only has the BETA version of the VariantAnnotator function.
Madison
-
Additionally, even the BETA version does not have the AlleleBalanceBySample annotation feature, which is what I need for this analysis.
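For context on the first error in this thread: `*` is the ALT allele that the VCF 4.2 specification reserves for a position overlapped by an upstream deletion. GATK 3.3 appears to predate support for it, which would explain why it rejects the record, though that is an assumption and not something confirmed above. The Python sketch below lists the records that carry a `*` allele so they can be inspected before deciding how to handle them. The function name and the two example records are hypothetical and not taken from the poster's data.

```python
import io


def find_star_allele_records(lines):
    """Yield (line_number, chrom, pos, alt) for records whose ALT contains '*'."""
    for line_number, line in enumerate(lines, start=1):
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        # ALT is a comma-separated list of alleles.
        if "*" in fields[4].split(","):
            yield line_number, fields[0], fields[1], fields[4]


# Hypothetical two-record VCF, used only to illustrate the call.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n"
    "chr1\t101\t.\tC\t*,T\t.\tPASS\t.\n"
)
star_hits = list(find_star_allele_records(example))
print(star_hits)  # [(4, 'chr1', '101', '*,T')]
```

For a real file, pass an open text handle, for example `open("dpsnp1.recode.vcf")`. The reported line number should fall close to the "approximately line number" in the GATK error.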
-
-
Madison Strain, did you find a solution for this?
I am running into a very similar error, "The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source..."
-
Hi Matt Armstrong,
Could you please provide more information about the full command you are running, the full error message, and the location of the VCF file that is producing the error? Additionally, could you run ValidateVariants on your VCF file to pinpoint any issues with formatting?
Kind regards,
Pamela
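Because the parser only reports an approximate line number, it can help to look at the raw text of the lines around it before anything else. The sketch below returns those lines with `repr`, which makes tabs, quotes, and other characters that are easy to miss in an editor or spreadsheet visible. The function name and the sample lines are hypothetical.

```python
def show_raw_lines(lines, approx_line, window=1):
    """Return (line_number, repr_of_line) pairs around approx_line (1-based)."""
    lo = max(1, approx_line - window)
    hi = approx_line + window
    return [
        (number, repr(line.rstrip("\n")))
        for number, line in enumerate(lines, start=1)
        if lo <= number <= hi
    ]


# Hypothetical lines; the second one has a quoted ALT field.
sample_lines = [
    "chr1\t4\t.\tA\tG\t.\tPASS\t.\n",
    'chr1\t5\t.\tC\t"G,T"\t.\tPASS\t.\n',
    "chr1\t6\t.\tT\tC\t.\tPASS\t.\n",
]
nearby = show_raw_lines(sample_lines, approx_line=2, window=1)
for number, text in nearby:
    print(number, text)
```

With a file on disk, pass an open handle instead of a list. For a compressed `.vcf.gz`, `gzip.open(path, "rt")` from the standard library gives a text handle that works the same way.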
-
Hi Pamela,
Thanks for the response! Here are the commands I am using and the errors I am getting for both LiftoverVcf and ValidateVariants. Please let me know if you need any more information.
I am running the following commands on java/1.8.0 and GATK/4.2.0.0.
Liftover:
java -Xmx15G -jar /sw/hgcc/Pkgs/picardtools/2.6.0/picard.jar LiftoverVcf \
I=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz \
O=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/lifted_over_mergeSnpIndel.vcf \
CHAIN=/mnt/icebreaker/data2/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg38ToHg19.over.chain \
REJECT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/rejected_variants.vcf \
REFERENCE_SEQUENCE=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg19.fa \
MAX_RECORDS_IN_RAM=500000 \
WARN_ON_MISSING_CONTIG=true
Liftover output:
[Thu Aug 26 14:48:12 EDT 2021] picard.vcf.LiftoverVcf INPUT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz OUTPUT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/lifted_over_mergeSnpIndel.vcf CHAIN=/mnt/icebreaker/data2/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg38ToHg19.over.chain REJECT=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/rejected_variants.vcf REFERENCE_SEQUENCE=/home/marmstrong/Jin/Tet_target_seq/Thomas/data/hg19.fa WARN_ON_MISSING_CONTIG=true MAX_RECORDS_IN_RAM=500000 WRITE_ORIGINAL_POSITION=false LIFTOVER_MIN_MATCH=1.0 ALLOW_MISSING_FIELDS_IN_HEADER=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Aug 26 14:48:12 EDT 2021] Executing as marmstrong@node04.local on Linux 3.10.0-1160.36.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.6.0-SNAPSHOT
INFO 2021-08-26 14:48:13 LiftoverVcf Loading up the target reference genome.
INFO 2021-08-26 14:48:39 LiftoverVcf Lifting variants over and sorting.
[Thu Aug 26 14:48:39 EDT 2021] picard.vcf.LiftoverVcf done. Elapsed time: 0.45 minutes.
Runtime.totalMemory()=6957826048
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source: /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/adj_mergeSnpIndel.vcf.gz
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:783)
at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:569)
at htsjdk.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:609)
at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:539)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:336)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:60)
at htsjdk.tribble.TabixFeatureReader$FeatureIterator.readNextRecord(TabixFeatureReader.java:161)
at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:194)
at htsjdk.tribble.TabixFeatureReader$FeatureIterator.next(TabixFeatureReader.java:136)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:206)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
ValidateVariants:
gatk ValidateVariants \
-R /home/marmstrong/Jin/genomes/unzip_hg38.fasta \
-V /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf \
--warn-on-errors
ValidateVariants output:
Using GATK jar /mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar ValidateVariants -R /home/marmstrong/Jin/genomes/unzip_hg38.fasta -V /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf --warn-on-errors
15:41:31.498 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/icebreaker/data2/sw/hgcc/Pkgs/GATK/4.2.0.0/gatk-package-4.2.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Aug 31, 2021 3:41:31 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:41:31.660 INFO ValidateVariants - ------------------------------------------------------------
15:41:31.660 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.2.0.0
15:41:31.660 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
15:41:31.661 INFO ValidateVariants - Executing as marmstrong@node02.local on Linux v3.10.0-1160.36.2.el7.x86_64 amd64
15:41:31.661 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
15:41:31.661 INFO ValidateVariants - Start Date/Time: August 31, 2021 3:41:31 PM EDT
15:41:31.661 INFO ValidateVariants - ------------------------------------------------------------
15:41:31.661 INFO ValidateVariants - ------------------------------------------------------------
15:41:31.662 INFO ValidateVariants - HTSJDK Version: 2.24.0
15:41:31.662 INFO ValidateVariants - Picard Version: 2.25.0
15:41:31.662 INFO ValidateVariants - Built for Spark Version: 2.4.5
15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:41:31.662 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:41:31.662 INFO ValidateVariants - Deflater: IntelDeflater
15:41:31.662 INFO ValidateVariants - Inflater: IntelInflater
15:41:31.662 INFO ValidateVariants - GCS max retries/reopens: 20
15:41:31.662 INFO ValidateVariants - Requester pays: disabled
15:41:31.662 INFO ValidateVariants - Initializing engine
15:41:32.184 INFO FeatureManager - Using codec VCFCodec to read file file:///home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf
15:41:32.416 INFO ValidateVariants - Done initializing engine
15:41:32.417 WARN ValidateVariants - IDS validation cannot be done because no DBSNP file was provided
15:41:32.417 WARN ValidateVariants - Other possible validations will still be performed
15:41:32.417 INFO ProgressMeter - Starting traversal
15:41:32.417 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
15:41:32.724 WARN ValidateVariants - ***** Input /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position chr1:115697580 are not observed at all in the sample genotypes *****
15:41:32.746 WARN ValidateVariants - ***** Input /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position chr1:116490511 are not observed at all in the sample genotypes *****
15:41:32.788 INFO ValidateVariants - Shutting down engine
[August 31, 2021 3:41:32 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2063073280
htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 160: unparsable vcf record with allele "G, for input source: /home/marmstrong/Jin/Tet_target_seq/Thomas/data/why/unzip_adj_mergeSnpIndel.vcf
at htsjdk.variant.vcf.AbstractVCFCodec.generateException(AbstractVCFCodec.java:887)
at htsjdk.variant.vcf.AbstractVCFCodec.checkAllele(AbstractVCFCodec.java:678)
at htsjdk.variant.vcf.AbstractVCFCodec.parseSingleAltAllele(AbstractVCFCodec.java:706)
at htsjdk.variant.vcf.AbstractVCFCodec.parseAlleles(AbstractVCFCodec.java:648)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:443)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:384)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:48)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:70)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:375)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:354)
at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.next(TribbleIndexedFeatureReader.java:315)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1058)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
-
Hi Matt Armstrong,
Thank you for running ValidateVariants and providing this information! The output confirms that there is an issue with your VCF file at approximately line 160, involving the ALT allele G. You may want to look at this portion of the VCF file and troubleshoot it to understand the issue and how to fix it before running LiftoverVcf or other GATK tools. Please let me know what you find when examining this portion of the VCF.
Kind regards,
Pamela
-
Hi Pamela,
I've gone through my VCF file and looked at line 160 and it looks fine to me. I've checked other VCF files available, and from what I can tell, the comma separation of alleles in the ALT column appears to be standard and should not be creating an issue?
Line 160 from my VCF file:
chr1 1.66E+08 . C G,T . PASS NS=1343 GT:GQ ./.:1 ./.:1 0/0:1 0/0:1 0/0:1 0/0:1 0/0:1 ./.:1
The only thing I can think of is that I might need to add a line in the header of the VCF file that specifies that there will be comma separated alleles in the ALT column? I have not seen this in any VCFs that have comma separated ALT alleles, so I am doubtful that this is the solution.
VCF Header:
##fileformat=VCFv4.0 $
##fileDate=20201119 $
##reference=/home/dcutler/hg38/hg38.sdx $
##phasing=none $
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> $
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> $
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Thanks for the help,
Matt
-
Hi Matt Armstrong,
I can't see any obvious errors with the VCF line. However, it seems that it may have something to do with the comma as the error from ValidateVariants says there is an issue with allele "G," (including the comma). Could you try your solution of specifying the comma-separated alleles in the header? Are there any comma-separated ALT alleles prior to line 160? Because it's possible that the tools are running into the comma and stopping before reading the rest of the lines.
Kind regards,
Pamela
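One possible reading of the message, offered only as a guess from what is pasted in this thread: the parser names the allele as `"G`, with a leading double quote, which would be consistent with the ALT field being stored as `"G,T"` in the file, for example after a spreadsheet or CSV export. The POS of the record quoted earlier is also shown in scientific notation, which would fit the same cause. The VCF format itself needs no header line for comma-separated ALT alleles. The sketch below checks one data line for both symptoms. The function name, the allele pattern, and the example lines are hypothetical, and the pattern is deliberately simplified (breakend notation is not handled).

```python
import re

# Simplified allele pattern: bases, '*', '.', or a symbolic allele like <DEL>.
ALLELE_RE = re.compile(r"[ACGTNacgtn]+|\*|\.|<[^<>\s]+>")


def check_record(line):
    """Return a list of human-readable problems found in one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return [f"expected at least 8 tab-separated fields, found {len(fields)}"]
    problems = []
    pos, alt = fields[1], fields[4]
    # POS must be a plain integer, not scientific notation such as 1.66E+08.
    if not pos.isdigit():
        problems.append(f"POS is not a plain integer: {pos!r}")
    # ALT alleles are split on commas, the same way the VCF parser reads them.
    for allele in alt.split(","):
        if not ALLELE_RE.fullmatch(allele):
            problems.append(f"unexpected ALT allele: {allele!r}")
    return problems


# Hypothetical lines: one well-formed, one with a quoted ALT and a mangled POS.
clean_line = "chr1\t12345\t.\tC\tG,T\t.\tPASS\tNS=1343\n"
bad_line = 'chr1\t1.66E+08\t.\tC\t"G,T"\t.\tPASS\tNS=1343\n'
clean_problems = check_record(clean_line)
bad_problems = check_record(bad_line)
for problem in bad_problems:
    print(problem)
```

If a check like this flags quotes in the real file, the quotes would need to be removed and the original integer positions restored from the source data, since a value like `1.66E+08` no longer carries the exact coordinate.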