Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

Recal file with malformed header

9 comments

  • Bhanu Gandham

    Hi Erik Fasterius

     

    1. It looks like your recal file is malformed. VariantRecalibrator may have run out of memory, or something else may have caused it to write a malformed recal table. Can you recreate your recal file and try again?
    2. If you still see the same error, can you please share a few records from your recal file?
  • Erik Fasterius

    Hi, Bhanu!

    I already tried to re-create the recal file, and still get the same error when running ApplyVQSR. Here are parts of the recal header (with the tranches and contigs shortened for brevity) plus some records:

    ##fileformat=VCFv4.2
    ##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
    ##GATKCommandLine=<ID=VariantRecalibrator,CommandLine="VariantRecalibrator --mode INDEL --max-gaussians 4 --resource:mills,known=false,training=true,truth=true,prior=12 mills:/castor/project/proj/nbis-analysis/data/annotations/Mills_and_1000G_gold_standard.indels.b37.vcf --resource:dbsnp,known=true,training=false,truth=false,prior=2 dbsnp:/castor/project/proj/nbis-analysis/data/annotations/dbsnp_138.b37.vcf --output results/vqsr/jointGT.7of7-1.vqsr.indels.recal --tranches-file results/vqsr/jointGT.7of7-1.vqsr.indels.tranches --use-annotation FS --use-annotation ReadPosRankSum --use-annotation MQRankSum --use-annotation QD --use-annotation SOR --use-annotation DP --truth-sensitivity-tranche 100.0 --truth-sensitivity-tranche 99.95 --truth-sensitivity-tranche 99.9 --truth-sensitivity-tranche 99.0 --truth-sensitivity-tranche 97.0 --truth-sensitivity-tranche 95.0 --truth-sensitivity-tranche 90.0 --trust-all-polymorphic true --variant results/jointGT.7of7-1.ann.vcf.gz --use-allele-specific-annotations false --max-negative-gaussians 2 --max-iterations 150 --k-means-iterations 100 --standard-deviation-threshold 10.0 --shrinkage 1.0 --dirichlet 0.001 --prior-counts 20.0 --maximum-training-variants 2500000 --minimum-bad-variants 1000 --bad-lod-score-cutoff -5.0 --mq-cap-for-logit-jitter-transform 0 --mq-jitter 0.05 --debug-stdev-thresholding false --target-titv 2.15 --ignore-all-filters false --sample-every-Nth-variant 1 --output-tranches-for-scatter false --vqslod-tranche 10.0 --vqslod-tranche 9.9 --vqslod-tranche 9.8 --vqslod-tranche 9.700000000000001 (... lots and lots of tranche commands with many decimals ...) 
--replicate 200 --max-attempts 1 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false",Version="4.1.7.0",Date="14 May 2020 08:00:43 CEST">
    ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
    ##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
    ##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
    ##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
    ##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
    ##contig=<ID=1,length=249250621>
    (...)
    ##contig=<ID=NC_007605,length=171823>
    ##contig=<ID=hs37d5,length=35477943>
    ##source=VariantRecalibrator
    #CHROM POS ID REF ALT QUAL FILTER INFO
    1 10114 . N <VQSR> . . END=10115;NEGATIVE_TRAIN_SITE;VQSLOD=-1.8269;culprit=DP
    1 10146 . N <VQSR> . . END=10147;NEGATIVE_TRAIN_SITE;VQSLOD=-1.6088;culprit=MQRankSum
    1 10234 . N <VQSR> . . END=10235;NEGATIVE_TRAIN_SITE;VQSLOD=-0.9603;culprit=MQRankSum
    1 10403 . N <VQSR> . . END=10440;NEGATIVE_TRAIN_SITE;VQSLOD=-1.7045;culprit=DP
    1 10439 . N <VQSR> . . END=10440;NEGATIVE_TRAIN_SITE;VQSLOD=-1.5845;culprit=SOR
    1 10616 . N <VQSR> . . END=10637;NEGATIVE_TRAIN_SITE;VQSLOD=-1.7792;culprit=SOR
    1 10815 . N <VQSR> . . END=10815;NEGATIVE_TRAIN_SITE;VQSLOD=-0.9333;culprit=DP
    1 13656 . N <VQSR> . . END=13658;NEGATIVE_TRAIN_SITE;VQSLOD=-2.0377;culprit=DP
    1 13957 . N <VQSR> . . END=13958;NEGATIVE_TRAIN_SITE;VQSLOD=-1.3940;culprit=MQRankSum
    1 15219 . N <VQSR> . . END=15230;NEGATIVE_TRAIN_SITE;VQSLOD=-1.5739;culprit=DP
    1 15903 . N <VQSR> . . END=15903;NEGATIVE_TRAIN_SITE;VQSLOD=-1.5544;culprit=DP
    1 16911 . N <VQSR> . . END=16912;NEGATIVE_TRAIN_SITE;VQSLOD=-2.1121;culprit=DP
    1 17961 . N <VQSR> . . END=17962;NEGATIVE_TRAIN_SITE;VQSLOD=-1.5606;culprit=DP
    1 19190 . N <VQSR> . . END=19191;NEGATIVE_TRAIN_SITE;VQSLOD=-2.2864;culprit=DP

    It looked weird to me that there were so many tranche commands, since they were not the ones I entered. I have no clue if that is expected behaviour (maybe more tranches than entered are used so that all those entered can be output?), but I thought I'd mention it. They are written like so:
    --vqslod-tranche 9.700000000000001 --vqslod-tranche 9.600000000000001 --vqslod-tranche 9.500000000000002 --vqslod-tranche 9.400000000000002 --vqslod-tranche 9.300000000000002 --vqslod-tranche 9.200000000000003 --vqslod-tranche 9.100000000000003 --vqslod-tranche 9.000000000000004 --vqslod-tranche 8.900000000000004 --vqslod-tranche 8.800000000000004 --vqslod-tranche 8.700000000000005 --vqslod-tranche 8.600000000000005 --vqslod-tranche 8.500000000000005 --vqslod-tranche 8.400000000000006 --vqslod-tranche 8.300000000000006 --vqslod-tranche 8.200000000000006 --vqslod-tranche 8.100000000000007 --vqslod-tranche 8.000000000000007 --vqslod-tranche 7.9000000000000075 --vqslod-tranche 7.800000000000008 --vqslod-tranche 7.700000000000008

    Though with many more. Again, don't know if that is fine, but that was the one thing that stood out to me.
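
The same header also records options such as --help false and --version false, which suggests the GATKCommandLine line lists default values along with what was actually typed. The long decimals themselves look like ordinary double-precision rounding. As a minimal sketch, assuming (this is not confirmed anywhere in the thread) that the tranche list is produced by repeatedly subtracting 0.1 from 10.0, Python floats show the same pattern, with the fourth value printing as 9.700000000000001. `descending_steps` is a hypothetical helper written for this illustration:

```python
def descending_steps(start, step, count):
    """Return `count` values obtained by repeatedly subtracting `step` from `start`."""
    values = []
    current = start
    for _ in range(count):
        values.append(current)
        current -= step  # rounding error accumulates with every subtraction
    return values


if __name__ == "__main__":
    # Print the first few values with full precision.
    for value in descending_steps(10.0, 0.1, 6):
        print(repr(value))
```

If that assumption holds, the extra digits are cosmetic and say nothing about whether the recal file is valid.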

  • Bhanu Gandham

    Hi Erik Fasterius

     

    We see this error when the VCF index file is malformed. Try removing the vcf.idx file, re-indexing the VCF with the IndexFeatureFile tool, and then running ApplyVQSR again. That should resolve this.
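
A minimal sketch of the "find the old index" step, assuming the index sits next to the data file and is named by appending .idx or .tbi (the two suffixes that come up in this thread). `find_index_sidecars` is a hypothetical helper, not a GATK function, and it only reports files without deleting anything:

```python
from pathlib import Path

# Index suffixes mentioned in this thread: .idx and tabix (.tbi).
INDEX_SUFFIXES = (".idx", ".tbi")


def find_index_sidecars(data_file):
    """Return existing index files named <data_file>.idx or <data_file>.tbi."""
    data_path = Path(data_file)
    candidates = [Path(str(data_path) + suffix) for suffix in INDEX_SUFFIXES]
    return [path for path in candidates if path.is_file()]


if __name__ == "__main__":
    import sys

    # Usage: python find_index.py FILE [FILE ...]
    for name in sys.argv[1:]:
        for index_path in find_index_sidecars(name):
            print(index_path)
```

Once the reported files have been removed, IndexFeatureFile can be run on the data file to rebuild the index.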

  • Erik Fasterius

    I've now removed the index file (which was a tabix file, .tbi, not .idx) and run IndexFeatureFile, which also created a tabix file rather than an .idx file. The command ran successfully:

    $ gatk IndexFeatureFile -I jointGT.7of7-1.ann.vcf.gz
    Using GATK jar /castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar IndexFeatureFile -I jointGT.7of7-1.ann.vcf.gz
    10:57:20.225 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    May 19, 2020 10:57:20 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    10:57:20.632 INFO IndexFeatureFile - ------------------------------------------------------------
    10:57:20.632 INFO IndexFeatureFile - The Genome Analysis Toolkit (GATK) v4.1.7.0
    10:57:20.632 INFO IndexFeatureFile - For support and documentation go to https://software.broadinstitute.org/gatk/
    10:57:20.634 INFO IndexFeatureFile - Executing as erikfas@sens2020519-bianca.uppmax.uu.se on Linux v3.10.0-1127.el7.x86_64 amd64
    10:57:20.634 INFO IndexFeatureFile - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
    10:57:20.634 INFO IndexFeatureFile - Start Date/Time: 19 May 2020 10:57:20 CEST
    10:57:20.634 INFO IndexFeatureFile - ------------------------------------------------------------
    10:57:20.634 INFO IndexFeatureFile - ------------------------------------------------------------
    10:57:20.635 INFO IndexFeatureFile - HTSJDK Version: 2.21.2
    10:57:20.635 INFO IndexFeatureFile - Picard Version: 2.21.9
    10:57:20.635 INFO IndexFeatureFile - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    10:57:20.635 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    10:57:20.635 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    10:57:20.635 INFO IndexFeatureFile - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    10:57:20.635 INFO IndexFeatureFile - Deflater: IntelDeflater
    10:57:20.635 INFO IndexFeatureFile - Inflater: IntelInflater
    10:57:20.635 INFO IndexFeatureFile - GCS max retries/reopens: 20
    10:57:20.635 INFO IndexFeatureFile - Requester pays: disabled
    10:57:20.636 INFO IndexFeatureFile - Initializing engine
    10:57:20.636 INFO IndexFeatureFile - Done initializing engine
    10:57:21.021 INFO FeatureManager - Using codec VCFCodec to read file file:///castor/project/proj/nbis-analysis/results/jointGT.7of7-1.ann.vcf.gz
    10:57:21.064 INFO ProgressMeter - Starting traversal
    10:57:21.064 INFO ProgressMeter - Current Locus Elapsed Minutes Records Processed Records/Minute
    10:57:31.075 INFO ProgressMeter - 2:31269948 0.2 684000 4100719.4
    10:57:41.075 INFO ProgressMeter - 3:112831524 0.3 1503000 4506521.4
    10:57:51.076 INFO ProgressMeter - 5:23774328 0.5 2325000 4648295.6
    10:58:01.077 INFO ProgressMeter - 6:159886164 0.7 3146000 4717466.8
    10:58:11.081 INFO ProgressMeter - 8:132734703 0.8 3966000 4757582.4
    10:58:21.081 INFO ProgressMeter - 11:30287046 1.0 4812000 4810637.0
    10:58:31.090 INFO ProgressMeter - 13:105287826 1.2 5647000 4838488.6
    10:58:41.099 INFO ProgressMeter - 17:12243983 1.3 6426000 4817452.6
    10:58:51.109 INFO ProgressMeter - 21:32905354 1.5 7205000 4800986.2
    10:58:56.460 INFO ProgressMeter - GL000192.1:534015 1.6 7666139 4821723.8
    10:58:56.460 INFO ProgressMeter - Traversal complete. Processed 7666139 total records in 1.6 minutes.
    10:58:57.034 INFO IndexFeatureFile - Successfully wrote index to /castor/project/proj/nbis-analysis/results/jointGT.7of7-1.ann.vcf.gz.tbi
    10:58:57.034 INFO IndexFeatureFile - Shutting down engine
    [19 May 2020 10:58:57 CEST] org.broadinstitute.hellbender.tools.IndexFeatureFile done. Elapsed time: 1.61 minutes.
    Runtime.totalMemory()=87293952
    Tool returned:
    /castor/project/proj/nbis-analysis/results/jointGT.7of7-1.ann.vcf.gz.tbi

    I then tried to re-run ApplyVQSR, but I got exactly the same error:

    11:02:06.900 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    May 19, 2020 11:02:07 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    11:02:07.924 INFO ApplyVQSR - ------------------------------------------------------------
    11:02:07.924 INFO ApplyVQSR - The Genome Analysis Toolkit (GATK) v4.1.7.0
    11:02:07.925 INFO ApplyVQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
    11:02:07.926 INFO ApplyVQSR - Executing as erikfas@sens2020519-bianca.uppmax.uu.se on Linux v3.10.0-1127.el7.x86_64 amd64
    11:02:07.926 INFO ApplyVQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_192-b01
    11:02:07.927 INFO ApplyVQSR - Start Date/Time: 19 May 2020 11:02:06 CEST
    11:02:07.927 INFO ApplyVQSR - ------------------------------------------------------------
    11:02:07.927 INFO ApplyVQSR - ------------------------------------------------------------
    11:02:07.929 INFO ApplyVQSR - HTSJDK Version: 2.21.2
    11:02:07.929 INFO ApplyVQSR - Picard Version: 2.21.9
    11:02:07.929 INFO ApplyVQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    11:02:07.930 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    11:02:07.930 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    11:02:07.930 INFO ApplyVQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    11:02:07.931 INFO ApplyVQSR - Deflater: IntelDeflater
    11:02:07.931 INFO ApplyVQSR - Inflater: IntelInflater
    11:02:07.931 INFO ApplyVQSR - GCS max retries/reopens: 20
    11:02:07.932 INFO ApplyVQSR - Requester pays: disabled
    11:02:07.932 INFO ApplyVQSR - Initializing engine
    11:02:08.443 INFO FeatureManager - Using codec VCFCodec to read file file:///castor/project/proj/nbis-analysis/results/vqsr/jointGT.7of7-1.vqsr.indels.recal
    11:02:08.589 INFO FeatureManager - Using codec VCFCodec to read file file:///castor/project/proj/nbis-analysis/results/jointGT.7of7-1.ann.vcf.gz
    11:02:09.481 INFO ApplyVQSR - Done initializing engine
    11:02:09.559 INFO ApplyVQSR - Shutting down engine
    [19 May 2020 11:02:09 CEST] org.broadinstitute.hellbender.tools.walkers.vqsr.ApplyVQSR done. Elapsed time: 0.05 minutes.
    Runtime.totalMemory()=6227755008
    ***********************************************************************

    A USER ERROR has occurred: File /castor/project/proj/nbis-analysis/results/vqsr/jointGT.7of7-1.vqsr.indels.recal is malformed: Expected 11 elements in header line 1 10114 . N <VQSR> . . END=10115;NEGATIVE_TRAIN_SITE;VQSLOD=-1.8269;culprit=DP

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
    Using GATK jar /castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx6G -Xms6G -jar /castor/project/proj/nbis-analysis/5076-env/share/gatk4-4.1.7.0-0/gatk-package-4.1.7.0-local.jar ApplyVQSR -V results/jointGT.7of7-1.ann.vcf.gz --recal-file results/vqsr/jointGT.7of7-1.vqsr.indels.recal --tranches-file results/vqsr/jointGT.7of7-1.vqsr.indels.recal --truth-sensitivity-filter-level 90.0 --create-output-variant-index true -mode INDEL -O results//vqsr/vqsr.indel-applied.jointGT.7of7-1.vcf

    So, that didn't work. What else could be the problem? Or is it related to the index being a tabix file rather than the .idx file you mentioned?

  • Bhanu Gandham

    Can you reindex your jointGT.7of7-1.vqsr.indels.recal file and try again?

  • Erik Fasterius

    Just did, for both recal files (which indeed gives .idx files rather than .tbi), but the exact same error remains. I have also tried re-running the entire process, but the results are the same :-(

  • Bhanu Gandham

    Hi Erik Fasterius

     

    It seems that incorrect inputs provided to ApplyVQSR are causing this issue.

    1. The recal file and the tranches file are not the same file. The command you shared with us uses --tranches-file {input.indels_recal}; it should be something like --tranches-file {indels.tranches} instead.

    2. We expect a tranches file to look like this:
      # Variant quality score tranches file
      # Version number 5
      targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
      90.00,45054,9606,2.4320,2.2447,17.9056,VQSRTrancheSNP0.00to90.00,SNP,18665,16798,0.9000
      99.00,58388,71324,2.3497,2.2484,2.6101,VQSRTrancheSNP90.00to99.00,SNP,18665,18478,0.9900
      99.90,59200,71948,2.3147,2.2346,-1.4635,VQSRTrancheSNP99.00to99.90,SNP,18665,18646,0.9990
      100.00,59500,72321,2.2991,2.2239,-191.2854,VQSRTrancheSNP99.90to100.00,SNP,18665,18665,1.0000
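
To make the difference concrete, here is a minimal sketch that reads a file laid out like the example above: "#" comment lines, then a comma-separated header with 11 column names, then one row per tranche. `read_tranches` is a hypothetical helper for illustration and is not how GATK itself parses the file:

```python
import csv
import io

EXPECTED_COLUMNS = 11  # number of fields in the header of the example above


def read_tranches(handle):
    """Parse tranche rows into dicts keyed by the header's column names."""
    # Skip blank lines and '#' comment lines; the first remaining line is the header.
    lines = [line for line in handle if line.strip() and not line.lstrip().startswith("#")]
    reader = csv.reader(lines)
    header_row = next(reader, None)
    if header_row is None:
        raise ValueError("no header line found")
    header = [name.strip() for name in header_row]
    if len(header) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} header columns, found {len(header)}")
    return [dict(zip(header, (field.strip() for field in row))) for row in reader]


EXAMPLE = """\
# Variant quality score tranches file
# Version number 5
targetTruthSensitivity,numKnown,numNovel,knownTiTv,novelTiTv,minVQSLod,filterName,model,accessibleTruthSites,callsAtTruthSites,truthSensitivity
90.00,45054,9606,2.4320,2.2447,17.9056,VQSRTrancheSNP0.00to90.00,SNP,18665,16798,0.9000
99.00,58388,71324,2.3497,2.2484,2.6101,VQSRTrancheSNP90.00to99.00,SNP,18665,18478,0.9900
"""

if __name__ == "__main__":
    for row in read_tranches(io.StringIO(EXAMPLE)):
        print(row["targetTruthSensitivity"], row["minVQSLod"], row["filterName"])
```

Given a recal VCF instead, every header line starts with "#", so this sketch treats the first variant record as the header and the column check fails. That resembles the "Expected 11 elements in header line" message reported in this thread.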
  • Erik Fasterius

    Ah, that was indeed the problem! I was not giving it the tranches file, but rather the recal file once again. Definitely a user error, totally my fault, but I wonder why the error message is so uninformative. The problem is clearly that I gave it the wrong file, and now that I think about it, it does say that it requires 11 columns. It would probably be useful to add something about checking whether it really is a tranches file? Or maybe I'm just stupid for not realizing this myself :-P
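
The kind of check suggested here could look roughly like the sketch below. It is a hypothetical pre-flight wrapper, not part of GATK: `looks_like_vcf` and `build_apply_vqsr_args` are made-up names, and the only options used are ones that appear in the ApplyVQSR command earlier in the thread.

```python
def looks_like_vcf(path):
    """Return True if the first line of the file is a VCF '##fileformat' header."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.readline().startswith("##fileformat=VCF")


def build_apply_vqsr_args(variants, recal_file, tranches_file, output, mode, filter_level):
    """Build an ApplyVQSR argument list, refusing an obviously wrong tranches file."""
    if recal_file == tranches_file:
        raise ValueError("--recal-file and --tranches-file point at the same file")
    if looks_like_vcf(tranches_file):
        raise ValueError(f"{tranches_file} looks like a VCF (recal file?), not a tranches file")
    return [
        "gatk", "ApplyVQSR",
        "-V", variants,
        "--recal-file", recal_file,
        "--tranches-file", tranches_file,
        "--truth-sensitivity-filter-level", str(filter_level),
        "-mode", mode,
        "-O", output,
    ]
```

The returned list can be handed to `subprocess.run`. With the arguments from this thread, the first check would have failed immediately with a message naming the duplicated file.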

  • Bhanu Gandham

    Nah, it was an honest mistake. I didn't see it at first either. Glad it works now though.
