Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data


Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

No headers found in VCF

10 comments

  • Qing Zhang

    Sorry, I found the problem! I was not populating the headers correctly - I missed `##` in some lines. Thanks!
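For future readers hitting the same error, the distinction Qing mentions can be made concrete: every meta-information line must start with `##`, and the column header line must start with a single `#CHROM`. A minimal sketch (the file name and record values below are invented for illustration, not one of the GATK bucket files):

```shell
# Build a tiny, well-formed VCF (printf keeps the required tab separators explicit).
printf '##fileformat=VCFv4.2\n'                                           >  example.vcf
printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n' >> example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'                  >> example.vcf
printf 'chr1\t10000\t.\tA\tG\t50\tPASS\tDP=30\n'                          >> example.vcf

# A header line that starts with "#" but is neither "##..." nor "#CHROM" is
# what triggers the "malformed header" error. This prints offenders, if any:
grep '^#' example.vcf | grep -v '^##' | grep -v '^#CHROM' || echo "header looks well-formed"
```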

  • Genevieve Brandt (she/her)

    Hi Qing Zhang, thank you for the update and for posting your solution! I am sure it will be helpful to members of the GATK community in the future.

  • Matteo Costacurta

    Hi Genevieve, 

    I am experiencing the same problem while trying to use BaseRecalibrator.

    Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

    The VCF files I pass as '--known-sites' were downloaded from the GATK bucket, and I really cannot get my head around this.

    My script fails at this stage and a recal_data.table cannot be produced.

    I checked the VCF files and they all have the typical header present, including #CHROM. Unlike in Qing's case, the VCF files look good, with every line in the header starting with ##.

    Could you please see if everything looks correct in this chunk of code or if there is a problem with the files in the bucket? Please let me know if you need any other information.

    Thank you 

    M

    GATK version used: 4.3.0.0

    genome_reference="resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
    known_snps="resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf"
    known_snps_2="resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf"
    known_snps_3="resources_broad_hg38_v0_hapmap_3.3.hg38.vcf"
    known_indels="resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf"
    known_indels_2="resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf"

    gatk BaseRecalibrator \
        -I sample.bam \
        -R $genome_reference \
        --known-sites $known_indels \
        --known-sites $known_indels_2 \
        --known-sites $known_snps \
        --known-sites $known_snps_2 \
        --known-sites $known_snps_3 \
        -O recal_data.table
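Before rerunning a script like the one above, a quick local pre-flight check of each known-sites file's header can narrow things down. A sketch (the helper name and demo file are made up for illustration, not part of GATK):

```shell
# check_vcf_header: report meta ("##"), column-header ("#CHROM"), and stray
# "#" lines for one VCF. Succeeds only when there is exactly one #CHROM line
# and no stray "#" lines; run it on each --known-sites file.
check_vcf_header() {
    meta=$(grep -c '^##' "$1")
    chrom=$(grep -c '^#CHROM' "$1")
    stray=$(grep '^#' "$1" | grep -v '^##' | grep -v '^#CHROM' | wc -l)
    echo "$1: $meta meta line(s), $chrom #CHROM line(s), $stray stray '#' line(s)"
    [ "$chrom" -eq 1 ] && [ "$stray" -eq 0 ]
}

# Demo on a tiny generated file; a real run would loop over $known_snps etc.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > demo.vcf
check_vcf_header demo.vcf && echo "demo.vcf passes"
```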

  • James Emery

    Hello Matteo Costacurta. It's hard to tell from that error message which specific file is throwing the exception (our error message could likely be more verbose about the source VCF file). Could I ask you to isolate this issue by running ValidateVariants on each of your known-sites VCF files individually and report back if there are any validation failures? Most likely one of those input files is incorrect in some subtle way that isn't obvious to spot visually.

    If your files validate but continue to throw this header format exception, could you let us know and post the entire stack trace that is printed when you set the environment variable `GATK_STACKTRACE_ON_USER_EXCEPTION=true`?
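The two suggestions above can be scripted roughly like this. This is a dry-run sketch: the `echo` prefix prints each command instead of executing it, since paths and the gatk install vary; remove it to validate for real (gatk 4.x assumed on PATH, file names taken from the script earlier in the thread):

```shell
# Validate each known-sites file individually to find the offending one.
for vcf in resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf \
           resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf \
           resources_broad_hg38_v0_hapmap_3.3.hg38.vcf \
           resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf \
           resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf; do
    echo gatk ValidateVariants -V "$vcf"
done | tee validate_commands.txt

# If validation passes but BaseRecalibrator still fails, export this and rerun
# the failing command to get the full stack trace:
export GATK_STACKTRACE_ON_USER_EXCEPTION=true
```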

  • Matteo Costacurta

    Hi James, 

    I had to re-download all the files from the bucket. I decompressed them with bcftools, indexed them with IndexFeatureFile, and they all passed ValidateVariants, except

    resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf

    which says 

    A USER ERROR has occurred: Input dx21/genomes/hg38/gatk_resources/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf fails strict validation of type ALL: the AC tag has the incorrect number of records at position chr1:833209, 1 vs. 2

    For recalibration purposes it's sufficient for me to only use dbSNPs as a reference but I thought you guys might want to have a look at the file and see whether it has an issue.

    Cheers

    M
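For readers wondering what that AC error means: in the VCF specification, AC is declared with `Number=A`, i.e. one value per ALT allele, so a record with two ALT alleles needs two comma-separated AC values. A sketch of the check ValidateVariants is doing (the record below is invented to reproduce the "1 vs. 2" mismatch; it is not the actual chr1:833209 line):

```shell
# A multi-allelic site (ALT = "A,T") whose AC field carries only one value.
printf 'chr1\t833209\t.\tG\tA,T\t50\tPASS\tAC=1;AN=100\n' > bad_record.txt

awk -F'\t' '{
    n_alt = split($5, alt, ",")        # number of ALT alleles
    match($8, /AC=[^;]*/)              # pull the AC field out of INFO
    n_ac = split(substr($8, RSTART + 3, RLENGTH - 3), ac, ",")
    if (n_ac != n_alt)
        printf "%s:%s AC has %d value(s), expected %d\n", $1, $2, n_ac, n_alt
}' bad_record.txt
```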

  • James Emery

    Hey Matteo Costacurta. Thank you for bringing this to our attention. We will look into why our resource files fail to validate and see if we can't fix that for future users.

  • Isadora Machado Ghilardi

    Hi! I'm having the same problem. I am trying to use the following .vcf files from the GATK bucket:

    hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf

    hg38_v0_hapmap_3.3.hg38.vcf

    hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf

    hg38_v0_Homo_sapiens_assembly38.known_indels.vcf

    hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf

    And in all of them I got the same error message: "Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file"

    Thank you

  • Can Kockan

    It might be a whitespace-related issue; I recall seeing something similar here: https://gatk.broadinstitute.org/hc/en-us/community/posts/21369679658267-Tribble-can-t-find-CHROM-header-but-line-is-present

    You might want to see whether any of the solutions mentioned in the linked post work. Otherwise, running ValidateVariants on these files would be a good place to start.
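One concrete way whitespace gets mangled is a download or editor converting line endings to CRLF: the parser then sees `#CHROM...\r` and never matches the expected header line. A quick check (the demo file below is a stand-in for a downloaded VCF):

```shell
# Fabricate a header with Windows line endings to show the symptom.
printf '##fileformat=VCFv4.2\r\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\r\n' > crlf.vcf

grep -c "$(printf '\r')" crlf.vcf     # non-zero means carriage returns are present
tr -d '\r' < crlf.vcf > fixed.vcf     # strip them (dos2unix works too)
grep -c "$(printf '\r')" fixed.vcf || echo "fixed.vcf is clean"
```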

  • Isadora Machado Ghilardi

    The VCF files that I'm using are from the GATK bucket; should I run ValidateVariants on those files?

  • Can Kockan

    It might still be a good idea, just to make sure that whitespace, etc. doesn't get mangled during the download process. Your system information and the command line you are using these files with would also be helpful.

