No headers found in VCF
Hi GATK Team,
I am working on a pipeline that has to write a minimal VCF to be funcotated. However, when I run `Funcotator` or `ValidateVariants`, I get the following error:
```
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
```
Below is part of my VCF file (where "#CHROM" does appear):
```
##normal_sample=759d9ab7-8584-45c6-8882-6b64697edfbf
##tumor_sample=e94bafbc-cfcc-4b38-ab40-c87f2c091466
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 759d9ab7-8584-45c6-8882-6b64697edfbf e94bafbc-cfcc-4b38-ab40-c87f2c091466
chr1 946293 . G T . PASS SOMATIC AD 79,0 105,20
```
In case there is any confusion between spaces and tabs, I have attached the file to this post: https://drive.google.com/file/d/1ckc_6qTAtepVJM4l9E7w6z_y1rUlf0fq/view?usp=sharing
Thanks!
-
Sorry I found the problem! I was not populating the headers correctly - I missed `##` in some lines. Thanks!
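For anyone hitting the same error, here is a minimal sketch of a header check, assuming the cause is the one described above: a line before `#CHROM` that does not start with `##`. The function name `check_vcf_header` and the exact messages are made up for illustration; this is not how GATK or htsjdk implement their parsing.

```python
# Hypothetical helper, not part of GATK or htsjdk.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def check_vcf_header(lines):
    """Return a list of problems found in the header of a VCF given as an iterable of lines."""
    problems = []
    saw_chrom = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("##"):
            continue  # well-formed meta-information line
        if line.startswith("#CHROM"):
            saw_chrom = True
            # The column header must be tab-delimited and begin with the 8 fixed columns.
            if line.split("\t")[:8] != REQUIRED_COLUMNS:
                problems.append(
                    f"line {number}: #CHROM line is not tab-delimited or lacks the 8 fixed columns"
                )
            break
        # Any other line before #CHROM (for example a meta line missing its '##') ends the scan.
        problems.append(f"line {number}: expected '##' or '#CHROM', got {line[:30]!r}")
        break
    if not saw_chrom and not problems:
        problems.append("no #CHROM line found")
    return problems
```

It can be run on a file with `with open(path) as handle: print(check_vcf_header(handle))`; an empty list means none of these particular problems were found, not that the file is fully valid.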
-
Hi Qing Zhang, thank you for the update and posting your solution! I am sure it will be helpful for members of the GATK community in the future.
-
Hi Genevieve,
I am experiencing the same problem while trying to use BaseRecalibrator.
Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
The VCF files I use for reference as '--known-sites' were downloaded from the GATK bucket and I really cannot get my head around this.
My script fails at this stage and a recal_data.table cannot be produced.
I checked the VCF files and they all have the typical header present, including #CHROM. Unlike in Qing's case, the VCF files look good, with every meta line in the header starting with ##.
Could you please see if everything looks correct in this chunk of code or if there is a problem with the files in the bucket? Please let me know if you need any other information.
Thank you
M
GATK version used: 4.3.0.0
genome_reference="resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
known_snps="resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf"
known_snps_2="resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf"
known_snps_3="resources_broad_hg38_v0_hapmap_3.3.hg38.vcf"
known_indels="resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf"
known_indels_2="resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf"
gatk BaseRecalibrator \
-I sample.bam \
-R $genome_reference \
--known-sites $known_indels \
--known-sites $known_indels_2 \
--known-sites $known_snps \
--known-sites $known_snps_2 \
--known-sites $known_snps_3 \
-O recal_data.table
-
Hello Matteo Costacurta. It's hard to tell from that error message which specific file is throwing the exception (our error message could probably be more verbose about the source VCF file). Could I ask you to isolate the issue by running ValidateVariants on each of your known-sites VCF files individually and report back if there are any validation failures? Most likely one of those input files is incorrect in some subtle way that is not obvious on visual inspection.
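One way to run the per-file check is sketched below, assuming `gatk` is on the `PATH`. The helper names `build_validate_command` and `validate_each` are made up for illustration; `-V` and `-R` are the ValidateVariants arguments for the variant file and the reference.

```python
import subprocess


def build_validate_command(vcf_path, reference=None):
    """Build the argument list for one ValidateVariants run on a single VCF."""
    command = ["gatk", "ValidateVariants", "-V", vcf_path]
    if reference is not None:
        command += ["-R", reference]
    return command


def validate_each(vcf_paths, reference=None, runner=subprocess.run):
    """Run ValidateVariants on each file separately and map each path to its exit code."""
    results = {}
    for path in vcf_paths:
        # One process per file, so a failure can be attributed to a specific VCF.
        completed = runner(build_validate_command(path, reference))
        results[path] = completed.returncode
    return results
```

Calling `validate_each` with the five known-sites paths and the reference FASTA from the script above returns a dictionary in which any non-zero exit code points at the file to inspect more closely.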
If your files validate but you still get this header format exception, could you let us know and post the entire stack trace that is printed when you set the environment variable `GATK_STACKTRACE_ON_USER_EXCEPTION=true`?
-
Hi James,
I had to re-download all the files from the bucket. I decompressed with bcftools, indexed with IndexFeatureFile and they all validated with ValidateVariants, except
resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf
which says
A USER ERROR has occurred: Input dx21/genomes/hg38/gatk_resources/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf fails strict validation of type ALL: the AC tag has the incorrect number of records at position chr1:833209, 1 vs. 2
For recalibration purposes it's sufficient for me to only use dbSNPs as a reference but I thought you guys might want to have a look at the file and see whether it has an issue.
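My reading of the message, which I have not confirmed against the GATK source, is that the number of AC values does not match the number of ALT alleles at that site (the VCF specification declares AC with one value per ALT allele). A minimal sketch of that check follows; the function name `ac_matches_alt` is made up, and the values in the example are invented rather than taken from the record at chr1:833209.

```python
def ac_matches_alt(alt_field, info_field):
    """Return True if AC is absent or has exactly one entry per ALT allele."""
    # A '.' in the ALT column means there is no alternate allele.
    alt_count = 0 if alt_field == "." else len(alt_field.split(","))
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "AC":
            return len(value.split(",")) == alt_count
    return True  # no AC tag, nothing to compare


# Invented example: two ALT alleles but only one AC value.
print(ac_matches_alt("T,C", "AC=3;AN=10"))
```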
Cheers
M
-
Hey Matteo Costacurta. Thank you for bringing this to our attention. We will look into why our resource files are failing to validate and see if we can fix that for future users.
-
Hi! I'm having the same problem. I am trying to use the following .vcf files from the GATK bucket:
hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf
hg38_v0_hapmap_3.3.hg38.vcf
hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
hg38_v0_Homo_sapiens_assembly38.known_indels.vcf
hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf
I get the same error message for all of them: "Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file"
Thank you
-
It might be a whitespace-related issue. I recall seeing something similar here: https://gatk.broadinstitute.org/hc/en-us/community/posts/21369679658267-Tribble-can-t-find-CHROM-header-but-line-is-present
You might want to see whether any of the solutions mentioned in the linked post work. Otherwise, running ValidateVariants on these files would be a good place to start.
-
The VCF files that I'm using are from the GATK bucket. Should I still run ValidateVariants on those files?
-
It might still be a good idea, just to make sure that whitespace and the like did not get mangled during the download process. Your system information and the command line you are using these files with would also be helpful.
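As a quick check before running ValidateVariants, the first bytes of a downloaded file can be inspected for a few common kinds of mangling. This is only a sketch: the function name `sniff_vcf_bytes` is made up, and the conditions it looks for (a compressed file saved under a `.vcf` name, a byte order mark, Windows line endings, spaces instead of tabs) are possibilities, not causes confirmed in this thread.

```python
def sniff_vcf_bytes(data):
    """Return a list of byte-level findings for the first chunk of a file."""
    findings = []
    if data[:2] == b"\x1f\x8b":
        # gzip and bgzip files both start with these two bytes.
        findings.append("gzip/bgzip magic bytes: file is compressed despite its name")
        return findings
    if data[:3] == b"\xef\xbb\xbf":
        findings.append("UTF-8 byte order mark before the first '#'")
    if b"\r\n" in data:
        findings.append("CRLF line endings")
    for line in data.splitlines():
        if line.startswith(b"#CHROM"):
            if b"\t" not in line:
                findings.append("#CHROM line contains no tabs")
            break
    return findings
```

Usage would look like `with open(path, "rb") as handle: print(sniff_vcf_bytes(handle.read(1 << 20)))`. The chunk needs to be large enough to reach the `#CHROM` line, which can sit well into the file when the header lists many contigs.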