Genome Analysis Toolkit

Variant Discovery in High-Throughput Sequencing Data

GATK process banner

Need Help?

Search our documentation

Community Forum

Hi, How can we help?

Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. Learn more

Missing RawGtCount annoation from GenotypeGVCF output

0

19 comments

  • Avatar
    Megan Shand

    Hello,

    I'd highly recommend using version 4.3.0.0 for this since that was the release that includes Ultima Genomics support. That said, I'm not sure why this annotation wouldn't be getting added when you use version 4.3.0.0. Did you try using that version for all of the steps including GenomicsDBImport? Would it be possible for you to share a very small snippet of the input GVCFs you have so I can try to reproduce the issue to debug? 

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Hi Megan, I ran GenotypeGVCFs from version 4.3.0.0 and that failed to annotate as well. I uploaded a tar.gz file to your ftp with a couple of small gVCFs that you can use for testing. Please let me know if there is anything I can do on my end. Thanks!

    ultima_jg_test.tar.gz

     

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Thank you I will take a look!

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Thanks for sharing your input data. I have discovered that in our pipeline the RAW_GT_COUNT annotation is added in the ReblockGVCFs step. This step is called here in our single sample pipeline. The task is here. So in our usual pipeline the GVCFs that are input to the Joint Genotyping pipeline already have this annotation, which is then propagated through GenotypeGVCFs to CalculateAverageAnnotations. 

    Would it be possible to run ReblockGVCFs before running the Joint Genotyping pipeline?

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Yes, I can run the ReblockGVCFs step on the input gVCFs and then run them through Joint Genotyping. Thanks for looking into it. I will get back to you on how that goes.

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Hi Megan, I ran ReblockGVCF as you suggested and that seems to have fixed the missing annotation error that I was getting at the CalculateAverageCombinedAnnotations step but the step takes a long time to run. I ran it on one of the test files that I shared with you which had ~1.4M records in the VCF and it took ~15hrs. I had kicked off the task on a single sample gVCF and it's been running for > 9 days, and based on the logs it's still on chr1. Is that the expected runtime? Our HaplotypeCaller command to generate the gVCF matches what you have in your Ultima WGS workflow but I noticed you have the ReblockGVCF task in your workflow and was wondering if you also noticed the long runtimes.

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Hmm, that is definitely not expected. When we run on typical gVCFs with ~90M records it takes about 30 minutes. Did you notice any other tasks that seemed particularly slow? What kind of environment are you running this tool in? Is the gVCF local to where you are running ReblockGVCFs? 

    I took a look at the gVCF that you sent previously and ran ReblockGVCFs on it locally and it processed around ~4M records per minute which is around how fast we see this tool run in our production WDL as well, so I don't think there's anything wrong with the input gVCF you have.

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    This is really weird. I am running this step on our on-prem cluster and the files are all local and I haven't noticed any slowness with other jobs. As you can see in the log below, it starts processing at ~173K variants per minute but the rate keeps going down. I will try to do some more debugging. 

    16:09:20.766 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/sw/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    16:09:20.941 INFO  ReblockGVCF - ------------------------------------------------------------
    16:09:20.941 INFO  ReblockGVCF - The Genome Analysis Toolkit (GATK) v4.3.0.0
    16:09:20.941 INFO  ReblockGVCF - For support and documentation go to https://software.broadinstitute.org/gatk/
    16:09:20.941 INFO  ReblockGVCF - Executing as usevani@pi3dc5-015.nygenome.org on Linux v3.10.0-1160.71.1.el7.x86_64 amd64
    16:09:20.942 INFO  ReblockGVCF - Java runtime: Java HotSpot(TM) 64-Bit Server VM v17.0.4+11-LTS-179
    16:09:20.942 INFO  ReblockGVCF - Start Date/Time: February 13, 2023 at 4:09:20 PM EST
    16:09:20.942 INFO  ReblockGVCF - ------------------------------------------------------------
    16:09:20.942 INFO  ReblockGVCF - ------------------------------------------------------------
    16:09:20.942 INFO  ReblockGVCF - HTSJDK Version: 3.0.1
    16:09:20.943 INFO  ReblockGVCF - Picard Version: 2.27.5
    16:09:20.943 INFO  ReblockGVCF - Built for Spark Version: 2.4.5
    16:09:20.943 INFO  ReblockGVCF - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    16:09:20.943 INFO  ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    16:09:20.943 INFO  ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    16:09:20.943 INFO  ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    16:09:20.943 INFO  ReblockGVCF - Deflater: IntelDeflater
    16:09:20.944 INFO  ReblockGVCF - Inflater: IntelInflater
    16:09:20.944 INFO  ReblockGVCF - GCS max retries/reopens: 20
    16:09:20.945 INFO  ReblockGVCF - Requester pays: disabled
    16:09:20.945 INFO  ReblockGVCF - Initializing engine
    16:09:21.242 INFO  FeatureManager - Using codec VCFCodec to read file file:///gpfs/commons/home/usevani/gatk4jg_ug/usevani-test/sample1.chr22.20M.g.vcf.gz
    16:09:21.436 INFO  ReblockGVCF - Done initializing engine
    16:09:21.601 INFO  ProgressMeter - Starting traversal
    16:09:21.601 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    16:09:32.012 INFO  ProgressMeter -       chr22:10689177              0.2                 30000         172910.7
    16:09:42.322 INFO  ProgressMeter -       chr22:10776023              0.3                 45000         130308.9
    16:09:52.846 INFO  ProgressMeter -       chr22:11052506              0.5                 58000         111381.4
    16:10:02.899 INFO  ProgressMeter -       chr22:11229704              0.7                 67000          97341.3
    16:10:14.280 INFO  ProgressMeter -       chr22:11276244              0.9                 75000          85424.7
    16:10:24.517 INFO  ProgressMeter -       chr22:11338500              1.0                 81000          77247.1
    16:10:34.573 INFO  ProgressMeter -       chr22:11431694              1.2                 86000          70713.0
    16:10:44.915 INFO  ProgressMeter -       chr22:11443639              1.4                 91000          65535.2
    16:10:56.504 INFO  ProgressMeter -       chr22:11458651              1.6                 97000          61325.8
    16:11:07.571 INFO  ProgressMeter -       chr22:11470299              1.8                102000          57752.2
    16:11:20.149 INFO  ProgressMeter -       chr22:11482172              2.0                107000          54155.3
    16:11:31.110 INFO  ProgressMeter -       chr22:11491559              2.2                111000          51425.0
    16:11:43.576 INFO  ProgressMeter -       chr22:11563191              2.4                115000          48600.1
    16:11:56.867 INFO  ProgressMeter -       chr22:11613802              2.6                119000          45985.6
    16:12:07.494 INFO  ProgressMeter -       chr22:11685450              2.8                122000          44124.8
    16:12:19.238 INFO  ProgressMeter -       chr22:11698664              3.0                125000          42220.9
    16:12:31.388 INFO  ProgressMeter -       chr22:11708580              3.2                128000          40466.4
    16:12:43.845 INFO  ProgressMeter -       chr22:11775665              3.4                131000          38864.1
    16:12:57.331 INFO  ProgressMeter -       chr22:11787317              3.6                134000          37268.8
    16:13:11.312 INFO  ProgressMeter -       chr22:11794756              3.8                137000          35784.1
    16:13:21.338 INFO  ProgressMeter -       chr22:11800184              4.0                139000          34788.1
    16:13:31.624 INFO  ProgressMeter -       chr22:11806769              4.2                141000          33836.9
    16:13:41.925 INFO  ProgressMeter -       chr22:11812927              4.3                143000          32958.9
    16:13:52.382 INFO  ProgressMeter -       chr22:11817203              4.5                145000          32129.3
    16:14:03.482 INFO  ProgressMeter -       chr22:11821200              4.7                147000          31289.8
    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    I looked into this a bit more and found that when I exclude the "tree_score_cutoff" argument the job completes within a minute on the test sample. Can you confirm if you were running with the "tree_score_cutoff" and "annotations_to_keep_command" arguments?

    I ran this workflow:

    https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/joint_genotyping/reblocking/ReblockGVCF.wdl

    Here is the ReblockGVCF command I am running:

    ${gatk} --java-options "-Xms9000m -Xmx9000m" \
      ReblockGVCF \
      -R ${ref_fasta} \
      -V ${gvcf} \
      -do-qual-approx \
      --floor-blocks -GQB 20 -GQB 30 -GQB 40 \
      --annotations-to-keep ASSEMBLED_HAPS --annotations-to-keep FILTERED_HAPS \
      --tree-score-threshold-to-no-call 0.2 \
      -O ${output_vcf_filename}
    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Ah, so I can confirm that we do run with the tree_score_cutoff argument. But, since you mentioned this, I did notice that you don't have the TREE_SCORE annotation in the gVCFs that you sent earlier. I'm wondering if there's a particularly slow check happening when it tries to access that annotation if it isn't in the input gVCF.  I will make a ticket to look into this particular issue.

    For your use case though, do you intend to filter the VCF with a particular method? In our pipeline we run a random forest model on each single sample which puts a score in the TREE_SCORE annotation in the gVCF. This is generated in the VCF and then copied over into the gVCF. The beginning of where this part of the pipeline happens is here: https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/single_sample/ugwgs/UltimaGenomicsWholeGenomeGermline.wdl#L116

    When we joint call multiple samples, that TREE_SCORE is averaged over all the samples at each site. That AVERAGE_TREE_SCORE is then used in another filtering model. If you're intending filter a joint called VCF, then it's worth having this annotation in your gVCFs.

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Yes, we have the TREE_SCORE annotation in our final VCF. I see there is a MoveAnnotationsToGvcf task in your workflow which you are running before ReblockGVCF. So I will do the same and run the move annotation task then put the output of that through Reblock. 

     

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Perfect, that should fix it. I now see what made ReblockGVCFs so slow, so I'm going to add an error message when the input doesn't contain TREE_SCORE but that tree-score cutoff argument is used. Thanks for bringing this to our attention. Let me know if you run into any more issues!

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Unfortunately, more issues.

    Annotating with TREE_SCORE fixed the runtime issue with ReblockGVCF and the output has the RAW_GT_COUNT but it's missing the TREE_SCORE annotation. The other issue is GenotypeGVCF step does not output the RAW_GT_COUNT even though the input gVCF (from Reblock) contained the annotation. I initially tried running GenotypeGVCF with GenomicsDBImport output and had the same issue but just to confirm I directly ran GenotypeGVCF on the Reblock gVCF as well.

    Here is a warning that I noticed in the log when I ran GenotypeGVCF without the -A RawGtCount argument.

    10:14:18.495 WARN  ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr22:10510356 the annotation RAW_GT_COUNT=[0, 0, 1] was not a numerical value and was ignored

    Even if this process of reannotating the gVCF with TREE_SCORE had worked I am concerned there are going to be variants in the gVCF that will not have this annotation and would be filtered out and I think that might be too aggressive. 

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    When you run ReblockGVCFs you should also include these arguments: "--annotations-to-keep TREE_SCORE --annotations-to-keep ASSEMBLED_HAPS --annotations-to-keep FILTERED_HAPS" in order to keep those annotations in the output GVCF. You'll want the TREE_SCORE annotation for downstream filtering, so I'd recommend having it in the GVCFs before you run the JointGenotyping pipeline.

    You'll want to run GenotypeGVCFs with the arguments "--keep-combined-raw-annotations -A RawGtCount" in order to keep RAW_GT_COUNT in the output. 

    Let me know if you're still having this issue with those arguments. 

    To address your concern about the additional filtering: It is true that low TREE_SCORE variants get filtered out in the ReblockGVCF step. The variants that don't have a TREE_SCORE annotation should be very low quality since they were removed in the GVCF->VCF conversion (with the argument "-stand-call-conf 30"). We have gotten the best results by hard filtering this way since the single sample random forest model performs better without these low quality variants in the data. If you'd instead like to ensure that you keep all TREE_SCORE sites in ReblockGVCFs then you can run ReblockGVCF without the "--tree-score-threshold-to-no-call" argument. You'll still want to keep the annotation with "--annotations-to-keep TREE_SCORE", but that should be independent of actually using a threshold on TREE_SCORE to remove low quality sites.

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    I made the changes you suggested and that fixed the original issue and I was able to make considerable progress in running the Joint Genotyping workflow. Currently, I am blocked because I don't have access to this docker image, gcr.io/terra-project-249020/jukebox_vc:test_jc_optimize_3d7509 used in the UltimaGenomicsGermlineFilteringThreshold.wdl .

    This is unrelated to the original question so let me know if you want me to open another ticket. Thanks for all your help.

    https://github.com/broadinstitute/warp/blob/develop/tasks/broad/UltimaGenomicsGermlineFilteringThreshold.wdl#L146

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Great that you were able to get through some portions of the pipeline! That docker being private is definitely unintentional. I'll make a ticket in the WARP repo and look into that.

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Following up here that the docker is now public here: us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:UG_vc_0.8.1_399932 

    This is a newer version that includes a bug fix and also note that the output of the EvaluateResults task will be a .csv rather than a .txt. The PR with this change is here: https://github.com/broadinstitute/warp/pull/936. It has been merged, but not yet released.

    0
    Comment actions Permalink
  • Avatar
    Uday Evani

    Want to report back that workflow successfully completed on the three samples that I was using to test. Is a reference (HG001, HG002) sample always needed when running this workflow? Just want to know if there is a way to run without including a reference sample or if there are plans eventually to not have that requirement.

    0
    Comment actions Permalink
  • Avatar
    Megan Shand

    Unfortunately a reference sample is required. Ideally we would move away from this requirement, but the work is not prioritized at the moment since the thought was that a reference sample could be included for this tuning in future callsets since the data is available.

    Alternatively, there is an argument to ScoreVariantAnnotations that will attempt to use a cutoff for filtering based on the resource data you provide to that tool: 

    --snp-calibration-sensitivity-threshold

    --indel-calibration-sensitivity-threshold

    You can set those arguments to your desired sensitivity and the tool will choose a threshold to filter on. In practice we found that for Ultima data this didn't work quite as well as using a threshold that maximized F1 based on a reference sample, but it is absolutely an option you can play with. Perhaps there are also other cohort level QC metrics you could also use to try to choose this threshold.

    0
    Comment actions Permalink

Please sign in to leave a comment.

Powered by Zendesk