Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

No results of CollectvariantCallingMetrics

Answered
0

26 comments

  • Avatar
    Genevieve Brandt (she/her)

    Hi Yenan, could you try this with the newest GATK version (4.1.9.0) to see if this issue persists?

    This message: "INFO: Failed to detect whether we are running on Google Compute Engine." is not an error message, you can ignore it. 

    However, this message indicates an issue: "java.lang.NullPointerException

            at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)"

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Dear GATK Team,

    I am having a similar issue to Yenan, when using CollectVariantCallingMetrics, was a cause/workaround ever found for this?

    I have no problem when using this tool with the GRCh38 dbSNP (build 138) vcf file provided in the resource bundle, however whenever I try a different dbSNP build, it throws this error:

    [Fri Jul 23 13:25:03 CEST 2021] picard.vcf.CollectVariantCallingMetrics done. Elapsed time: 70.55 minutes.

    Runtime.totalMemory()=1623195648

    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

    java.lang.NullPointerException: Cannot invoke "htsjdk.samtools.SAMSequenceRecord.getSequenceLength()" because the return value of "htsjdk.samtools.SAMSequenceDictionary.getSequence(String)" is null

    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)

    at picard.util.DbSnpBitSetUtil.createSnpAndIndelBitSets(DbSnpBitSetUtil.java:131)

    at picard.vcf.CollectVariantCallingMetrics.doWork(CollectVariantCallingMetrics.java:101)

    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)

    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)

    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)

    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)

    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    As a bit of a background, I am trying to use the latest dbSNP release (build 155, GRCh38, GCF_000001405.39) and have tried using GATK version 4.1.9.0 and the latest 4.2.0.0, both having the same problem.

    To prepare the dbSNP file for use with the best practices workflow, I renamed the NCBI chromosome accession numbers  to UCSC style names using bcftools annotate, updated the vcf headers using UpdateVcfSequenceDictionary, and indexed the file using IndexFeatureFile.

    The dbSNP file worked well with both HaplotypeCaller and GenotypeGVCFs, with the rsids overlapping perfectly with those obtained when using the dbSNP resource bundle version.

    Any help with this would be greatly appreciated!

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi FranBC,

    The NullPointerException error likely means that there is a contig present in your vcf file that is not present in your reference dictionary. Could please check to be sure that your vcf headers and reference dictionary have matching chromosomes or follow these instructions to provide those files so the GATK team can look into it?

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Hi Pamela,

    The VCF contig headers do seem to match, so I've uploaded the files to the GATK FTP server as: CollectVariantCallingMetrics_dbsnp155_FranBC2.tar.gz

    Thanks and best regards,

    Francesca

     

     

    0
    Comment actions Permalink
  • Avatar
    Pamela Bretscher

    Hi FranBC,

    Thank you for uploading this file. I have created a GitHub ticket so the GATK engineers can look into your issue and hopefully find a solution. You can follow the progress here and please let me know if you have any additional questions.

    Kind regards,

    Pamela

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi FranBC,

    I was able to take a look at your bug report and found that there are contigs in your dbsnp file you are using that are not in the standard hg38 reference.

    The issue starts at this point in your dbsnp file:

    ##contig=<ID=HLA-A*01:01:01:01,length=3503,assembly=38>

    The contig HLA-A*01:01:01:01 is not present in the hg38 reference dictionary. You will need to update your files to match each other to use CollectVariantCallingMetrics. 

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Hi Genevieve,

    Thank you for the prompt reply. Apologies, I thought I had supplied this information - I am actually using the build hg38 + decoys fasta reference file (Homo_sapiens_assembly38.fasta) from the resource bundle which does contain that contig (and all the HLA loci following it) and was used to update the VCF headers in the dbSNP vcf file using Picard's UpdateVcfSequenceDictionary tool (the one supplied by dbSNP does not contain any contigs in the header).

    So I'm afraid that I am still a bit confused because those HLA contigs are also found in the dbSNP VCF file supplied in the resource bundle (Homo_sapiens_assembly38.dbsnp138.vcf) which doesn't generate any errors with CollectVariantCallingMetrics. 

    Thanks and best regards,

    Francesca

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    FranBC I see, thanks for letting me know, could you clarify which file you used? I used the hg38_v0_Homo_sapiens_assembly38.fasta file, I don't have the details about the specific path on google though.

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Genevieve Brandt (she/her) The file I used is:  Homo_sapiens_assembly38.fasta 

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    FranBC I downloaded the Homo_sapiens_assembly38.dict from that specific link and did not find the contig I mentioned above: 

    ##contig=<ID=HLA-A*01:01:01:01,length=3503,assembly=38>

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Genevieve Brandt (she/her) Isn't it this line in the dict file:

    franbc@MacBook-Pro bwa-gatk % grep "HLA" Homo_sapiens_assembly38.dict

    @SQ SN:HLA-A*01:01:01:01 LN:3503 M5:01cd0df602495b044b2c214d69a60aa2 AS:38 UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Ah! You are right! I missed adding a backslash while searching and that's why I didn't find that contig. I'll let you know when I have more information.

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    FranBC I'm sorry for not being able to make more progress but it looks like we will have to wait for a fix for this bug to be able to figure out what is going on.

    I did also want to point out that when running ValidateVariants with your files I encountered multiple errors that you may want to resolve:

    gatk --java-options "-Xmx4G" ValidateVariants -V HTS009.filtered.vcf

    A USER ERROR has occurred: Input HTS009.filtered.vcf fails strict validation of type ALL: the AC tag has the incorrect number of records at position chr1:54406036, 2 vs. 1
    gatk --java-options "-Xmx4G" ValidateVariants -R hg38_v0_Homo_sapiens_assembly38.fasta -V HTS009.filtered.vcf --dbsnp GCF_000001405.39.dbSNP155.GRCh38p13_updatedict.vcf

    A USER ERROR has occurred: Input HTS009.filtered.vcf fails strict validation of type ALL: the rsID rs55880629 for the record at position chr1:1704310 is not in dbSNP

    Thank you for writing into the forum regarding this issue and I am sorry we do not have an easy fix!

    0
    Comment actions Permalink
  • Avatar
    FranBC

    Genevieve Brandt (she/her) No worries, thank you for flagging it as a potential bug!

    Regarding the VCF errors - thank you for running ValidateVariants and highlighting them to me, much appreciated! 

    I have fixed my custom script which splits multiallelic variants in a VCF into separate variants (it correctly adjusted the sample genotypes and AD, but naively didn't adjust the AC,AN,AF,MLEAC,MLEAF and PL fields(!)) to more strictly follow the VCF specification and update those fields accordingly, which has fixed all the ValidateVariants errors  - though for some reason I couldn't reproduce the error you got when including the dbsnp vcf file, on my local laptop and server it's running fine.  

    Thanks for all the amazing help, I'll wait for the fix.

    Best regards,

    Francesca

     

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Great! Glad you didn't see the errors with the DBSNP errors on your end!

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    Hi!

    I'm actually getting the same error (using gatk 4.2.1.0), only running with a gvcf file while setting INPUT_GVCF ture as below. While a lot of time is spent processing, no file is produced, and the tool fails with the same NullPointerException message:

    ```

    $ gatk --java-options "-Xms2000m" CollectVariantCallingMetrics --INPUT mysample.g.vcf --OUTPUT mysample.g.vcf.metrics.txt --DBSNP GCF_000001405.39_ucsc-genbank-chrs.gz --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict --THREAD_COUNT 8 --GVCF_INPUT true >> mysample.merge_gvcfs_gatk.manual.log 2>&1

    $ tail -n13 mysample.merge_gvcfs_gatk.manual.log

    INFO 2021-10-04 12:59:14 CollectVariantCallingMetrics Read 1,091,000,000 variants. Elapsed time: 01:20:07s. Time for last 100,000: 0s. Last read position: chr22_KB663609v1_alt:51,502
    [Mon Oct 04 12:59:14 CEST 2021] picard.vcf.CollectVariantCallingMetrics done. Elapsed time: 80.18 minutes.
    Runtime.totalMemory()=4211081216
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.NullPointerException
    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)
    at picard.util.DbSnpBitSetUtil.createSnpAndIndelBitSets(DbSnpBitSetUtil.java:131)
    at picard.vcf.CollectVariantCallingMetrics.doWork(CollectVariantCallingMetrics.java:101)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    ```

    The input file is valid though, and no error/warning arose when checked as per the suggestion here,

    ```

    $ gatk --java-options "-Xmx4G" ValidateVariants -V mysample.g.vcf -R Homo_sapiens_assembly38.fasta --validation-type-to-exclude ALLELES

    ```

    Is it possible that the INPUT_GVCF filter is not activated? Any kind of help is appreciated.

    Thank you much.

    Best,
    Azza

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Azza,

    I think your error message is different than the one posted by FranBC so hopefully we can find a solution for you. Could you verify if your issue persists in the latest version of GATK, 4.2.2.0?

    Thank you,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    Hi Genevieve Brandt (she/her),

    Thank you for the prompt response! Yes, I do confirm the issue is persistent in the latest gatk version:

    ```

    [Wed Oct 06 18:56:06 CEST 2021] picard.vcf.CollectVariantCallingMetrics done. Elapsed time: 78.97 minutes.
    Runtime.totalMemory()=3925868544
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.NullPointerException
    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)
    at picard.util.DbSnpBitSetUtil.createSnpAndIndelBitSets(DbSnpBitSetUtil.java:131)
    at picard.vcf.CollectVariantCallingMetrics.doWork(CollectVariantCallingMetrics.java:101)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    Using GATK jar /home/azzaea/software/gatk/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms2000m -jar /home/azzaea/software/gatk/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar CollectVariantCallingMetrics --INPUT mysample.g.vcf --OUTPUT mysample.g.vcf.metrics.txt --DBSNP GCF_000001405.39_ucsc-genbank-chrs.gz --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict --THREAD_COUNT 8 --GVCF_INPUT true

    ```

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Thanks for the update Azza Ahmed! It looks like the issue is with your dbSNP file and not the input VCF (java.lang.NullPointerException
    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)). 

    What format is your dbSNP file in? Is it a VCF?

    You can try to solve this error first by validating the dbSNP file with the ValidateVariants command, then, if it is VCF format, you should re-name and re-index the file with .vcf.gz filename extension.

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    You are right, my dbsnp file is problematic. I'm a bit surprised since it worked fine in the BQSR stage upstream. I'm putting the logs from validation and obtaining the dbsnp file below, and I'd really appreciate any pointers as to how to work around this:

    ```

    $ ~/software/gatk/gatk-4.2.2.0/gatk --java-options "-Xmx4G" ValidateVariants -V GCF_000001405.39_ucsc-genbank-chrs.gz -R Homo_sapiens_assembly38.fasta

    :

    [October 8, 2021 at 3:19:08 PM CEST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 21.56 minutes.
    Runtime.totalMemory()=1059061760
    ***********************************************************************

    A USER ERROR has occurred: Input GCF_000001405.39_ucsc-genbank-chrs.gz fails strict validation of type ALL: the REF allele is incorrect for the record at position chr5:47309449, fasta says N vs. VCF says T

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

    $

    $ zcat GCF_000001405.39_ucsc-genbank-chrs.gz | grep rs1580016009
    chr5 47309449 rs1580016009 T A . . RS=1580016009;dbSNPBuildID=154;SSR=0;VC=SNV;GNO;FREQ=Korea1K:0.9677,0.03226

    $

    ```

    For completion, here is how the dbsnp file was obtained:

    ```

    # Download the global assembly report corresponding to the ref genome used

    # Create the mapping schema: Refseq --> UCSC (or GenBank when nothing available)
    awk -v RS="(\r)?\n" 'BEGIN { FS="\t" } !/^#/ { if ($10 != "na") print $7,$10; else print $7,$5 }' GCF_000001405.39_GRCh38.p13_assembly_report.txt > dbSNP_RefSeq-to-UCSC-or-GenBank_GRCh38.p13.map

    # Rename the chromosomes as per the desired schema
    bcftools annotate --rename-chrs dbSNP_RefSeq-to-UCSC-or-GenBank_GRCh38.p13.map -O z -o GCF_000001405.39_ucsc-genbank-chrs.gz GCF_000001405.39.gz

    # Index the resulting vcf
    gatk IndexFeatureFile -I GCF_000001405.39_ucsc-genbank-chrs.gz

    ```

    Thank you much in advance,

    Azza

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Azza,

    Since it worked with BQSR, it might work with CollectVariantCallingMetrics. First, I would recommend renaming the file to have the file extension .vcf.gz (and then make sure you re-create the index). The wrong file extension may be causing the issue with CollectVariantCallingMetrics.

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    Thank you Genevieve Brandt (she/her). I'm afraid changing the format does not really solve it. Below are my logs:

    ```

    $ mv GCF_000001405.39_ucsc-genbank-chrs.gz GCF_000001405.39_ucsc-genbank-chrs.vcf.gz

    $ gatk IndexFeatureFile -I GCF_000001405.39_ucsc-genbank-chrs.vcf.gz

    $ /home/azzaea/software/gatk/gatk-4.2.2.0/gatk --java-options "-Xms2000m" CollectVariantCallingMetrics --INPUT mysample.g.vcf --OUTPUT mysample.metrics.txt --DBSNP GCF_000001405.39_ucsc-genbank-chrs.vcf.gz --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict --THREAD_COUNT 8 --GVCF_INPUT true

    :

    INFO 2021-10-08 19:37:47 CollectVariantCallingMetrics Read 1,091,000,000 variants. Elapsed time: 01:07:16s. Time for last 100,000: 0s. Last read position: chr22_KB663609v1_alt:51,502
    [Fri Oct 08 19:37:47 CEST 2021] picard.vcf.CollectVariantCallingMetrics done. Elapsed time: 67.31 minutes.
    Runtime.totalMemory()=3519021056
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.NullPointerException
    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)
    at picard.util.DbSnpBitSetUtil.createSnpAndIndelBitSets(DbSnpBitSetUtil.java:131)
    at picard.vcf.CollectVariantCallingMetrics.doWork(CollectVariantCallingMetrics.java:101)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
    Using GATK jar /home/azzaea/software/gatk/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms2000m -jar /home/azzaea/software/gatk/gatk-4.2.2.0/gatk-package-4.2.2.0-local.jar CollectVariantCallingMetrics --INPUT mysampple.g.vcf --OUTPUT mysample.metrics.txt --DBSNP GCF_000001405.39_ucsc-genbank-chrs.vcf.gz --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict --THREAD_COUNT 8 --GVCF_INPUT true

    ```

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    I'm beginning to think it is actually something to do with the dbsnp file. I tried to use `CollectVariantCallingMetrics` on a vcf derived from the gvcf I used above (ie, after passing the gvcf file through GATK's : ` GenomicsDBImport`, `GenotypeGVCFs` (both per chromosome), then `GatherVcfs`). It still fails with a similar message, after spending almost an equal amount of time processing:

    ```

    /home/azzaea/software/gatk/gatk-4.2.2.0/gatk --java-options -Xms2000m CollectVariantCallingMetrics --INPUT mysample.vcf --OUTPUT mysample.collectvcfstats.txt --DBSNP GCF_000001405.39_ucsc-genbank-chrs.vcf.gz --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict --THREAD_COUNT 8

    :

    :

    INFO 2021-10-12 10:51:53 CollectVariantCallingMetrics Read 1,091,000,000 variants. Elapsed time: 01:10:19s. Time for last 100,000: 0s. Last read position: chr22_KB663609v1_alt:51,502
    [Tue Oct 12 10:51:53 CEST 2021] picard.vcf.CollectVariantCallingMetrics done. Elapsed time: 70.34 minutes.
    Runtime.totalMemory()=2105540608
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    java.lang.NullPointerException
    at picard.util.DbSnpBitSetUtil.loadVcf(DbSnpBitSetUtil.java:163)
    at picard.util.DbSnpBitSetUtil.createSnpAndIndelBitSets(DbSnpBitSetUtil.java:131)
    at picard.vcf.CollectVariantCallingMetrics.doWork(CollectVariantCallingMetrics.java:101)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

    ```

    Is there a different way to massage the dbsnp file to pass the REF validation? (it passes standard adherence to vcf format, but fails the REF check- when using `ValidateVariants` as was previously shown) .

    Thank you much,
    Azza

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Hi Azza,

    Thanks for trying those troubleshooting steps to see if the problem could be resolved. My colleague took a second look and identified that the program has a problem with the dbSNP file because there is a contig that is not in your provided reference. Unfortunately, the error message is not specific enough to tell which contig is causing the problem.

    I have created an issue ticket on github so that the developers can add a better error message so that finding the problem is easier: https://github.com/broadinstitute/picard/issues/1732

    For now, you can look more closely at your dbSNP file and resolve any issues where there are contigs not in your reference. I'm sorry there isn't an easier solution at this point, but we will use this case to continue to make GATK better!

    Best,

    Genevieve

    0
    Comment actions Permalink
  • Avatar
    Azza Ahmed

    Great, thank you!

    I took another detour and realized it is possible to get statistics on the vcf from the `VariantsEval` command (with the same dbsnp file I have at hand now). I will leave it as such for now, but happy to try other workarounds with `CollectVariantCallingMetrics`

    0
    Comment actions Permalink
  • Avatar
    Genevieve Brandt (she/her)

    Great! Thanks for providing an update. Once this issue gets resolved, I'll let you know. It will most likely take some time as there are many high priority issues right now.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk