Missing RawGtCount annotation from GenotypeGVCFs output
The GenotypeGVCFs task in UltimaGenomicsJointGenotyping.wdl completes successfully but does not add the RawGtCount annotation to the VCF, which causes the next step, CalculateAverageAnnotations, to fail. Please find the version, command, and truncated log below. I also tried running with 4.3.0.0 and had the same issue.
REQUIRED for all errors and issues:
a) GATK version used:
4.2.6.1-35-gcaec3a9-SNAPSHOT
b) Exact command used:
/sw/gatk/gatk-4.2.6.1/gatk GenotypeGVCFs \
-R GRCh38_full_analysis_set_plus_decoy_hla.fa \
-O test_10_samples.0.vcf.gz \
-D All_20180418.vcf.gz \
-G StandardAnnotation -G AS_StandardAnnotation \
--only-output-calls-starting-in-intervals \
-V gendb://genomicsdb \
-L 0000-scattered.interval_list \
-A RawGtCount \
--allow-old-rms-mapping-quality-annotation-data \
--keep-combined-raw-annotations \
--merge-input-intervals
c) Entire program log:
23:02:12.534 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
23:02:12.729 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.2.6.1-35-gcaec3a9-SNAPSHOT-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:02:12.814 INFO GenotypeGVCFs - ------------------------------------------------------------
23:02:12.814 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.6.1-35-gcaec3a9-SNAPSHOT
23:02:12.815 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
23:02:12.815 INFO GenotypeGVCFs - Executing as root@2dda47b37bd9 on Linux v5.15.65+ amd64
23:02:12.815 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
23:02:12.815 INFO GenotypeGVCFs - Start Date/Time: January 13, 2023 11:02:12 PM GMT
23:02:12.815 INFO GenotypeGVCFs - ------------------------------------------------------------
23:02:12.815 INFO GenotypeGVCFs - ------------------------------------------------------------
23:02:12.816 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
23:02:12.816 INFO GenotypeGVCFs - Picard Version: 2.27.1
23:02:12.816 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
23:02:12.816 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:02:12.817 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:02:12.817 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:02:12.817 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:02:12.817 INFO GenotypeGVCFs - Deflater: IntelDeflater
23:02:12.817 INFO GenotypeGVCFs - Inflater: IntelInflater
23:02:12.817 INFO GenotypeGVCFs - GCS max retries/reopens: 20
23:02:12.817 INFO GenotypeGVCFs - Requester pays: disabled
23:02:12.818 INFO GenotypeGVCFs - Initializing engine
23:02:14.697 INFO FeatureManager - Using codec VCFCodec to read file gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
23:02:16.993 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
23:02:18.330 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field AS_HmerLength - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.330 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field AS_InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.330 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field AS_QD - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field HEC - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field SUSP_NOISY_ADJACENT_TP_VARIANT - the field will NOT be part of INFO fields in the generated VCF records
23:02:18.331 info NativeGenomicsDB - pid=19 tid=20 No valid combination operation found for INFO field XC - the field will NOT be part of INFO fields in the generated VCF records
23:02:20.519 INFO FeatureManager - Using codec IntervalListCodec to read file gs://nygc-comp-s-fd4e-workspace/UltimaGenomicsJointGenotyping/f1487eab-1cfe-488b-9664-199f04b80490/call-SplitIntervalList/glob-d928cd0f5fb17b6bd5e635f48c18ccfb/0000-scattered.interval_list
23:02:20.829 INFO IntervalArgumentCollection - Processing 385617781 bp from intervals
23:02:20.929 INFO GenotypeGVCFs - Done initializing engine
23:02:21.069 INFO ProgressMeter - Starting traversal
23:02:21.069 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
23:02:29.968 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr1:10439 the annotation HAPCOMP=[1, 0] was not a numerical value and was ignored
23:02:30.046 WARN InbreedingCoeff - InbreedingCoeff will not be calculated at position chr1:10439 and possibly subsequent; at least 10 samples must have called genotypes
23:02:31.141 INFO ProgressMeter - chr1:14234 0.2 1000 5957.7
23:02:41.229 INFO ProgressMeter - chr1:894811 0.3 187000 556547.6
23:02:51.230 INFO ProgressMeter - chr1:1286408 0.5 455000 905142.4
23:03:01.240 INFO ProgressMeter - chr1:1666733 0.7 707000 1055985.7
23:03:11.261 INFO ProgressMeter - chr1:2133606 0.8 996000 1190628.0
23:03:21.669 INFO ProgressMeter - chr1:2455682 1.0 1212000 1200019.8
23:03:31.790 INFO ProgressMeter - chr1:2967650 1.2 1512000 1282787.3
23:03:43.568 INFO ProgressMeter - chr1:3395489 1.4 1808000 1314925.0
...
...
00:58:56.284 INFO ProgressMeter - chr2:135074029 116.6 206962000 1775173.5
00:59:06.749 INFO ProgressMeter - chr2:135606666 116.8 207255000 1775031.1
00:59:16.764 INFO ProgressMeter - chr2:136206838 116.9 207587000 1775336.6
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1403.031613934034,Cpu time(s),1391.8374969511833
00:59:24.296 INFO ProgressMeter - chr2:136660496 117.1 207834445 1775546.6
00:59:24.296 INFO ProgressMeter - Traversal complete. Processed 207834445 total variants in 117.1 minutes.
00:59:24.354 INFO GenotypeGVCFs - Shutting down engine
[January 14, 2023 12:59:24 AM GMT] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 117.19 minutes.
Runtime.totalMemory()=8477212672
Using GATK jar /gatk/gatk-package-4.2.6.1-35-gcaec3a9-SNAPSHOT-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms8000m -Xmx25000m -jar /gatk/gatk-package-4.2.6.1-35-gcaec3a9-SNAPSHOT-local.jar GenotypeGVCFs -R /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta -O test_10_samples.0.vcf.gz -D gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf -G StandardAnnotation -G AS_StandardAnnotation --only-output-calls-starting-in-intervals -V gendb://genomicsdb -L gs://nygc-comp-s-fd4e-workspace/UltimaGenomicsJointGenotyping/f1487eab-1cfe-488b-9664-199f04b80490/call-SplitIntervalList/glob-d928cd0f5fb17b6bd5e635f48c18ccfb/0000-scattered.interval_list -A RawGtCount --allow-old-rms-mapping-quality-annotation-data --keep-combined-raw-annotations --merge-input-intervals
-
Hello,
I'd highly recommend using version 4.3.0.0 for this, since that is the release that added Ultima Genomics support. That said, I'm not sure why this annotation wouldn't be added when you use 4.3.0.0. Did you try using that version for all of the steps, including GenomicsDBImport? Would it be possible for you to share a very small snippet of your input GVCFs so I can try to reproduce the issue and debug?
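For the snippet, something like this should be enough to cut out a small region (the region and file names are just illustrative; this assumes bcftools is installed and the input gVCF is indexed):
# illustrative region and file names
bcftools view -r chr22:10000000-30000000 sample1.g.vcf.gz -Oz -o sample1.chr22.snippet.g.vcf.gz
bcftools index -t sample1.chr22.snippet.g.vcf.gz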
-
Hi Megan, I ran GenotypeGVCFs from version 4.3.0.0 and that failed to annotate as well. I uploaded a tar.gz file to your ftp with a couple of small gVCFs that you can use for testing. Please let me know if there is anything I can do on my end. Thanks!
ultima_jg_test.tar.gz
-
Thank you I will take a look!
-
Thanks for sharing your input data. I have discovered that in our pipeline the RAW_GT_COUNT annotation is added in the ReblockGVCFs step. This step is called here in our single sample pipeline. The task is here. So in our usual pipeline the GVCFs that are input to the Joint Genotyping pipeline already have this annotation, which is then propagated through GenotypeGVCFs to CalculateAverageAnnotations.
Would it be possible to run ReblockGVCFs before running the Joint Genotyping pipeline?
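One quick way to check whether your current input gVCFs already carry the annotation is to look for it in the header, for example (placeholder file name; assumes bcftools is available):
# placeholder file name; prints the header line if the annotation is declared
bcftools view -h sample1.g.vcf.gz | grep RAW_GT_COUNT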
-
Yes, I can run the ReblockGVCFs step on the input gVCFs and then run them through Joint Genotyping. Thanks for looking into it. I will get back to you on how that goes.
-
Hi Megan, I ran ReblockGVCF as you suggested, and that seems to have fixed the missing-annotation error I was getting at the CalculateAverageCombinedAnnotations step, but the step takes a long time to run. I ran it on one of the test files I shared with you, which had ~1.4M records in the VCF, and it took ~15 hrs. I had also kicked off the task on a single-sample gVCF and it has been running for > 9 days; based on the logs it's still on chr1. Is that the expected runtime? Our HaplotypeCaller command for generating the gVCF matches what you have in your Ultima WGS workflow, but I noticed you have the ReblockGVCF task in your workflow and was wondering whether you also see these long runtimes.
-
Hmm, that is definitely not expected. When we run on typical gVCFs with ~90M records it takes about 30 minutes. Did you notice any other tasks that seemed particularly slow? What kind of environment are you running this tool in? Is the gVCF local to where you are running ReblockGVCFs?
I took a look at the gVCF that you sent previously and ran ReblockGVCFs on it locally, and it processed roughly 4M records per minute, which is about how fast we see this tool run in our production WDL, so I don't think there's anything wrong with your input gVCF.
-
This is really weird. I am running this step on our on-prem cluster, the files are all local, and I haven't noticed any slowness with other jobs. As you can see in the log below, it starts processing at ~173K variants per minute, but the rate keeps dropping. I will try to do some more debugging.
16:09:20.766 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/sw/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:09:20.941 INFO ReblockGVCF - ------------------------------------------------------------
16:09:20.941 INFO ReblockGVCF - The Genome Analysis Toolkit (GATK) v4.3.0.0
16:09:20.941 INFO ReblockGVCF - For support and documentation go to https://software.broadinstitute.org/gatk/
16:09:20.941 INFO ReblockGVCF - Executing as usevani@pi3dc5-015.nygenome.org on Linux v3.10.0-1160.71.1.el7.x86_64 amd64
16:09:20.942 INFO ReblockGVCF - Java runtime: Java HotSpot(TM) 64-Bit Server VM v17.0.4+11-LTS-179
16:09:20.942 INFO ReblockGVCF - Start Date/Time: February 13, 2023 at 4:09:20 PM EST
16:09:20.942 INFO ReblockGVCF - ------------------------------------------------------------
16:09:20.942 INFO ReblockGVCF - ------------------------------------------------------------
16:09:20.942 INFO ReblockGVCF - HTSJDK Version: 3.0.1
16:09:20.943 INFO ReblockGVCF - Picard Version: 2.27.5
16:09:20.943 INFO ReblockGVCF - Built for Spark Version: 2.4.5
16:09:20.943 INFO ReblockGVCF - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:09:20.943 INFO ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:09:20.943 INFO ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:09:20.943 INFO ReblockGVCF - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:09:20.943 INFO ReblockGVCF - Deflater: IntelDeflater
16:09:20.944 INFO ReblockGVCF - Inflater: IntelInflater
16:09:20.944 INFO ReblockGVCF - GCS max retries/reopens: 20
16:09:20.945 INFO ReblockGVCF - Requester pays: disabled
16:09:20.945 INFO ReblockGVCF - Initializing engine
16:09:21.242 INFO FeatureManager - Using codec VCFCodec to read file file:///gpfs/commons/home/usevani/gatk4jg_ug/usevani-test/sample1.chr22.20M.g.vcf.gz
16:09:21.436 INFO ReblockGVCF - Done initializing engine
16:09:21.601 INFO ProgressMeter - Starting traversal
16:09:21.601 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
16:09:32.012 INFO ProgressMeter - chr22:10689177 0.2 30000 172910.7
16:09:42.322 INFO ProgressMeter - chr22:10776023 0.3 45000 130308.9
16:09:52.846 INFO ProgressMeter - chr22:11052506 0.5 58000 111381.4
16:10:02.899 INFO ProgressMeter - chr22:11229704 0.7 67000 97341.3
16:10:14.280 INFO ProgressMeter - chr22:11276244 0.9 75000 85424.7
16:10:24.517 INFO ProgressMeter - chr22:11338500 1.0 81000 77247.1
16:10:34.573 INFO ProgressMeter - chr22:11431694 1.2 86000 70713.0
16:10:44.915 INFO ProgressMeter - chr22:11443639 1.4 91000 65535.2
16:10:56.504 INFO ProgressMeter - chr22:11458651 1.6 97000 61325.8
16:11:07.571 INFO ProgressMeter - chr22:11470299 1.8 102000 57752.2
16:11:20.149 INFO ProgressMeter - chr22:11482172 2.0 107000 54155.3
16:11:31.110 INFO ProgressMeter - chr22:11491559 2.2 111000 51425.0
16:11:43.576 INFO ProgressMeter - chr22:11563191 2.4 115000 48600.1
16:11:56.867 INFO ProgressMeter - chr22:11613802 2.6 119000 45985.6
16:12:07.494 INFO ProgressMeter - chr22:11685450 2.8 122000 44124.8
16:12:19.238 INFO ProgressMeter - chr22:11698664 3.0 125000 42220.9
16:12:31.388 INFO ProgressMeter - chr22:11708580 3.2 128000 40466.4
16:12:43.845 INFO ProgressMeter - chr22:11775665 3.4 131000 38864.1
16:12:57.331 INFO ProgressMeter - chr22:11787317 3.6 134000 37268.8
16:13:11.312 INFO ProgressMeter - chr22:11794756 3.8 137000 35784.1
16:13:21.338 INFO ProgressMeter - chr22:11800184 4.0 139000 34788.1
16:13:31.624 INFO ProgressMeter - chr22:11806769 4.2 141000 33836.9
16:13:41.925 INFO ProgressMeter - chr22:11812927 4.3 143000 32958.9
16:13:52.382 INFO ProgressMeter - chr22:11817203 4.5 145000 32129.3
16:14:03.482 INFO ProgressMeter - chr22:11821200 4.7 147000 31289.8
-
I looked into this a bit more and found that when I exclude the "tree_score_cutoff" argument the job completes within a minute on the test sample. Can you confirm if you were running with the "tree_score_cutoff" and "annotations_to_keep_command" arguments?
I ran this workflow:
Here is the ReblockGVCF command I am running:
${gatk} --java-options "-Xms9000m -Xmx9000m" \
ReblockGVCF \
-R ${ref_fasta} \
-V ${gvcf} \
-do-qual-approx \
--floor-blocks -GQB 20 -GQB 30 -GQB 40 \
--annotations-to-keep ASSEMBLED_HAPS --annotations-to-keep FILTERED_HAPS \
--tree-score-threshold-to-no-call 0.2 \
-O ${output_vcf_filename}
-
Ah, so I can confirm that we do run with the tree_score_cutoff argument. But, since you mentioned this, I did notice that you don't have the TREE_SCORE annotation in the gVCFs that you sent earlier. I'm wondering if there's a particularly slow check happening when it tries to access that annotation if it isn't in the input gVCF. I will make a ticket to look into this particular issue.
For your use case though, do you intend to filter the VCF with a particular method? In our pipeline we run a random forest model on each single sample which puts a score in the TREE_SCORE annotation in the gVCF. This is generated in the VCF and then copied over into the gVCF. The beginning of where this part of the pipeline happens is here: https://github.com/broadinstitute/warp/blob/develop/pipelines/broad/dna_seq/germline/single_sample/ugwgs/UltimaGenomicsWholeGenomeGermline.wdl#L116
When we joint call multiple samples, that TREE_SCORE is averaged over all the samples at each site, and the resulting AVERAGE_TREE_SCORE is then used in another filtering model. If you intend to filter a joint-called VCF, it's worth having this annotation in your gVCFs.
-
Yes, we have the TREE_SCORE annotation in our final VCF. I see there is a MoveAnnotationsToGvcf task in your workflow which you are running before ReblockGVCF. So I will do the same and run the move annotation task then put the output of that through Reblock.
-
Perfect, that should fix it. I now see what made ReblockGVCFs so slow, so I'm going to add an error message when the input doesn't contain TREE_SCORE but that tree-score cutoff argument is used. Thanks for bringing this to our attention. Let me know if you run into any more issues!
-
Unfortunately, more issues.
Annotating with TREE_SCORE fixed the runtime issue with ReblockGVCF, and the output now has RAW_GT_COUNT, but it's missing the TREE_SCORE annotation. The other issue is that the GenotypeGVCFs step does not output RAW_GT_COUNT even though its input gVCF (from ReblockGVCF) contains the annotation. I initially ran GenotypeGVCFs on the GenomicsDBImport output and had this issue, but just to confirm, I also ran GenotypeGVCFs directly on the reblocked gVCF.
Here is a warning I noticed in the log when I ran GenotypeGVCFs without the -A RawGtCount argument:
10:14:18.495 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr22:10510356 the annotation RAW_GT_COUNT=[0, 0, 1] was not a numerical value and was ignored
Even if this process of re-annotating the gVCF with TREE_SCORE had worked, I am concerned that some variants in the gVCF will not have this annotation and would be filtered out, which I think might be too aggressive.
-
When you run ReblockGVCFs you should also include these arguments: "--annotations-to-keep TREE_SCORE --annotations-to-keep ASSEMBLED_HAPS --annotations-to-keep FILTERED_HAPS" in order to keep those annotations in the output GVCF. You'll want the TREE_SCORE annotation for downstream filtering, so I'd recommend having it in the GVCFs before you run the JointGenotyping pipeline.
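Relative to the command you posted above, that would look something like this (everything else unchanged):
${gatk} --java-options "-Xms9000m -Xmx9000m" \
ReblockGVCF \
-R ${ref_fasta} \
-V ${gvcf} \
-do-qual-approx \
--floor-blocks -GQB 20 -GQB 30 -GQB 40 \
--annotations-to-keep TREE_SCORE \
--annotations-to-keep ASSEMBLED_HAPS \
--annotations-to-keep FILTERED_HAPS \
--tree-score-threshold-to-no-call 0.2 \
-O ${output_vcf_filename}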
You'll want to run GenotypeGVCFs with the arguments "--keep-combined-raw-annotations -A RawGtCount" in order to keep RAW_GT_COUNT in the output.
Let me know if you're still having this issue with those arguments.
To address your concern about the additional filtering: it is true that low-TREE_SCORE variants get filtered out in the ReblockGVCF step. The variants that don't have a TREE_SCORE annotation should be very low quality, since they were removed in the GVCF->VCF conversion (with the argument "-stand-call-conf 30"). We have gotten the best results by hard filtering this way, since the single-sample random forest model performs better without these low-quality variants in the data. If you'd instead like to keep all sites regardless of their TREE_SCORE, you can run ReblockGVCF without the "--tree-score-threshold-to-no-call" argument. You'll still want to keep the annotation with "--annotations-to-keep TREE_SCORE"; that is independent of actually using a threshold on TREE_SCORE to remove low-quality sites.
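If you want to gauge how many sites in your single-sample VCF lack the annotation, you could count them with something like this (placeholder file name; assumes bcftools is available and TREE_SCORE is an INFO field, as in our pipeline):
# placeholder file name; counts records where INFO/TREE_SCORE is missing (printed as ".")
bcftools query -f '%INFO/TREE_SCORE\n' sample1.vcf.gz | grep -c '^\.$'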
-
I made the changes you suggested, which fixed the original issue, and I was able to make considerable progress running the Joint Genotyping workflow. Currently I am blocked because I don't have access to the docker image gcr.io/terra-project-249020/jukebox_vc:test_jc_optimize_3d7509 used in UltimaGenomicsGermlineFilteringThreshold.wdl.
This is unrelated to the original question so let me know if you want me to open another ticket. Thanks for all your help.
-
Great that you were able to get through some portions of the pipeline! That docker being private is definitely unintentional. I'll make a ticket in the WARP repo and look into that.
-
Following up here that the docker is now public here: us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:UG_vc_0.8.1_399932
This is a newer version that includes a bug fix. Also note that the output of the EvaluateResults task will be a .csv rather than a .txt. The PR with this change is here: https://github.com/broadinstitute/warp/pull/936. It has been merged, but not yet released.
-
I want to report back that the workflow completed successfully on the three samples I was using for testing. Is a reference sample (HG001, HG002) always needed when running this workflow? I just want to know whether there is a way to run without including a reference sample, or whether there are plans to eventually drop that requirement.
-
Unfortunately, a reference sample is required. Ideally we would move away from this requirement, but that work is not prioritized at the moment, since the thinking was that a reference sample could be included for this tuning in future callsets, given that the data is available.
Alternatively, there is an argument to ScoreVariantAnnotations that will attempt to use a cutoff for filtering based on the resource data you provide to that tool:
--snp-calibration-sensitivity-threshold
--indel-calibration-sensitivity-threshold
You can set those arguments to your desired sensitivity and the tool will choose a threshold to filter on. In practice we found that for Ultima data this didn't work quite as well as using a threshold that maximized F1 based on a reference sample, but it is absolutely an option you can play with. Perhaps there are other cohort-level QC metrics you could also use to choose this threshold.
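For example, adding those to the ScoreVariantAnnotations invocation in the filtering workflow would look roughly like this (the sensitivity values here are purely illustrative, and the tool's other arguments are left exactly as they are in the WDL):
gatk ScoreVariantAnnotations \
[existing arguments from the workflow, unchanged] \
--snp-calibration-sensitivity-threshold 0.97 \
--indel-calibration-sensitivity-threshold 0.97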