In general you should use Terra, which has all the major GATK workflows preloaded, is more scalable and makes it easier to share any work you do with external collaborators, since the portal is publicly accessible and you can grant anyone access to workspaces securely and conveniently.
However, there are a few Broad-internal resources that you can use if Terra (formerly FireCloud) is not yet a suitable option for you.
- Dotkits for running GATK CNV and ACNV
- GATK CNV Toolchain in Firehose
1. Dotkits for running GATK CNV and ACNV
The following dotkits should load all the necessary dependencies:
use .hdfview-2.9
use Java-1.8
use .r-3.1.3-gatk-only
If these don't work, move to a VM where the dotkits are not broken. If that still doesn't work, go to FireCloud.
2. GATK CNV Toolchain in Firehose
We make this available as a courtesy, but we will not be able to provide support for any Firehose-specific aspects. Note that Firehose was phased out in 2018, and work hosted there needed to move to Terra by that time. Rest assured we will provide support for the migration (phase-out calendar TBD).
We have put the GATK4 Somatic CNV Toolchain into Firehose. Please copy the below workflows from
Frequently asked questions:
Who do I contact with an issue?
First, make sure that your question is not here or in another forum post. If it is a Firehose issue or you are not sure, email firstname.lastname@example.org. If you are sure that it is an issue with GATK CNV, ACNV, or GetBayesianHetPulldown, post to the forum.
What is GATK CNV vs. ACNV and which are run in the workflows above?
- GATK CNV estimates total copy ratio and performs segmentation and (basic) event calling. This tool works very similarly to ReCapSeg (for now).
- GATK ACNV creates credible intervals for copy ratio and minor allelic fraction (MAF). Under the hood, this tool is very different from Allelic CapSeg, but it can produce a file that can be ingested by ABSOLUTE (i.e. the file is in the same format as that produced by Allelic CapSeg).
- Both GATK CNV and ACNV are in the workflows above.
Are the results (e.g. sensitivity and precision) better than ReCapSeg in the GATK CNV toolchain?
If you mean running without the allelic integration, then the results are equivalent. If you want more details, ask in the forum or invite us to talk to you -- we have a presentation or two about this topic.
Do I run these workflows on Pair Sets or Individual Sets?
What entity types do the tasks run on?
Samples and Pairs. I realize that the above question says to run the workflow on Individual Sets. This is to work around a Firehose issue.
What are the caveats around WGS?
- The total copy number tasks (similar to ReCapSeg) take about a tenth of the time that ReCapSeg takes, assuming good NFS performance. This is a good thing.
- The allelic tasks (GetBayesianHetPulldown and Allelic CNV) take a very long time to run. Over a day of runtime is not uncommon. We expect to address this issue in the next version of the GATK4 CNV Toolchain, but due to dispatch limitations, Firehose may not be able to fully capitalize on these improvements.
- In general, the runtimes are very sensitive to filesystem performance.
- The results still have the same oversegmentation issues that you will see in ReCapSeg. There is a GC correction tool, but this has not been integrated into the Firehose workflow.
- There is a bug in a third-party library that limits the size of a PoN. This is unlikely to be an issue for capture, but can become a problem for WGS. Some additional details:
Due to a shortcoming in the Java interface of the underlying storage format (HDF5), a single matrix cannot exceed a number of bytes equal to MAX_INT, which is 2147483647.
Therefore, S x T x 8 < 2147483647, where S is the number of samples, T is the number of targets, and 8 is the number of bytes per matrix entry.
In a simpler form: S x T <= 268435455
This is usually only an issue for WGS PoNs. In this case, we recommend setting a larger bin size (e.g. 5000) to avoid a very large number of targets (T).
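As a rough illustration of the size limit above, here is a minimal Python sketch that checks whether a PoN with a given number of samples and targets stays under the limit, using the inequality S x T x 8 < 2147483647 from the text. The function names are made up for this example and are not part of GATK. The genome length of about 3.1 billion bases, and the idea that WGS targets are fixed-size bins tiling the whole genome, are assumptions used only to get a ballpark target count.

```python
# Sketch of the PoN size limit described above: S x T x 8 < MAX_INT.
# Function names are illustrative only; they are not GATK APIs.

MAX_INT = 2_147_483_647   # limit quoted in the text
BYTES_PER_ENTRY = 8       # the factor of 8 in the inequality

# Assumption for illustration: a ~3.1 Gbp genome tiled by fixed-size bins.
ASSUMED_GENOME_LENGTH = 3_100_000_000


def pon_fits(num_samples, num_targets):
    """Return True if an S x T matrix stays under the byte limit."""
    return num_samples * num_targets * BYTES_PER_ENTRY < MAX_INT


def max_samples(num_targets):
    """Largest number of samples that still satisfies the strict inequality."""
    return (MAX_INT - 1) // (BYTES_PER_ENTRY * num_targets)


def approx_num_targets(bin_size, genome_length=ASSUMED_GENOME_LENGTH):
    """Ballpark number of bins (targets), rounding up."""
    return -(-genome_length // bin_size)


if __name__ == "__main__":
    targets = approx_num_targets(5000)
    print("targets:", targets)
    print("max samples in PoN:", max_samples(targets))
```

Under these assumptions, 5000 bp bins give about 620,000 targets, which leaves room for at most 432 samples in a single PoN. Smaller bins shrink that ceiling proportionally.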
What is the future of ReCapSeg?
We are phasing out ReCapSeg, for many reasons, everywhere -- not just Firehose. If you would like more details, post to the forum and we'll respond.
What is the future of Allelic CapSeg?
We have never supported (and never will support) Allelic CapSeg and cannot answer that question. We have some results comparing Allelic CapSeg and GATK ACNV. We can show you if you are interested (internal to Broad only).
Why are there fewer plots than in ReCapSeg?
We did not include plots that we did not believe were being used. If you would like to include additional plots, please post to the forum.
How is the GATK 4 CNV toolchain workflow better than the ReCapSeg workflow?
- Faster. On exome, ReCapSeg takes ~105 minutes per case sample. GATK CNV takes < 30 minutes. Both time estimates assume good performance of NFS filesystem.
- The workflows above include allelic integration results, from the tool GATK ACNV. These results are analogous to what Allelic CapSeg produces.
- The workflow above produces results compatible with ABSOLUTE and TITAN. I.e. the results can be used as input to ABSOLUTE or TITAN.
- All future improvements and bugfixes are going into GATK, not ReCapSeg. And many improvements are coming....
- The workflows produce germline heterozygous SNP call files.
- The ReCapSeg WGS workflow no longer works.
Are there new PoNs for these workflows?
Yes, but the PoN locations are already populated, if you run the workflows properly. You should not need to do any set up yourself.
Is the correct PoN automatically selected for ICE vs. Agilent samples?
Yes, if you run the workflow as provided.
Is there a PoN creation workflow in Firehose?
No. Never going to happen. Don't ask. See the forum for instructions to create PoNs.
Can I run ABSOLUTE from the output of GATK ACNV?
Yes. The annotations are
gatk4cnv_acnv_acs_seg_file_capture (capture) and
Can I run TITAN from the output of GATK ACNV?
Yes, though there has been little testing done on this. The annotations are
Do the workflows above include Oncotator gene lists?
These workflows include Picard Target Mapper. Isn't that going to cause me to have to rerun all of my jobs (e.g. MuTect)?
The workflows above will rerun Picard Target Mapper, but only new annotations are added. All previous output annotations of Picard Target Mapper should be populated with the same values. This will make mutation calling (MuTect) and other tasks look outdated, but rerunning them will be job-avoided.
Can I do the tumor-only GATK ACNV workflow?
For exome that is working well, but is not available in Firehose. If you would like to see evaluation data for tumor-only on exome, we can show you (internal to Broad only).
What are all of the annotations produced?
Where applicable, each of the list below also has a
- gatk4cnv_seg_file_capture -- seg file of GATK CNV. This file is analogous to the ReCapSeg seg file.
- gatk4cnv_tn_file_capture -- tangent normalized (denoised) target copy ratio estimates of GATK CNV. This file is analogous to the ReCapSeg tn file.
- gatk4cnv_pre_tn_file_capture -- coverage profile (i.e. target copy ratio estimates without denoising) of GATK CNV. This file is analogous to the ReCapSeg tn file.
- gatk4cnv_betahats_capture -- Tangent normalization coefficients used in the projection. This is in the weeds.
- gatk4cnv_called_seg_file_capture -- output called seg file of GATK CNV. This file is analogous to the ReCapSeg called seg file.
- gatk4cnv_oncotated_called_seg_file_capture -- gene list file generated from the GATK CNV segments
- gatk4cnv_dqc_capture (coming later) -- measure of noise reduction in the tangent normalization process. Lower is better.
- gatk4cnv_preqc_capture (coming later) -- measure of noise before tangent normalization
- gatk4cnv_postqc_capture (coming later) -- measure of noise after tangent normalization
- gatk4cnv_num_seg_capture (coming later) -- number of segments in the GATK CNV output
- gatk4cnv_case_het_file_capture -- het pulldown file for the tumor sample in the pair.
- gatk4cnv_control_het_file_capture -- het pulldown file for the normal sample in the pair.
- gatk4cnv_acnv_seg_file_capture -- ACNV seg file with confidence intervals for copy ratio and minor allelic fraction.
- gatk4cnv_acnv_acs_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by AllelicCapSeg. Any segments called as "balanced" will be pegged to a MAF of 0.5. This file is ready for ingestion by ABSOLUTE.
- gatk4cnv_acnv_cnv_seg_file_capture -- ACNV seg file in a format that looks as if it was produced by GATK CNV
- gatk4cnv_acnv_titan_het_file_capture -- het file in a format that can be ingested by TITAN
- gatk4cnv_acnv_titan_cr_file_capture -- target copy ratio estimates file in a format that can be ingested by TITAN
- gatk4cnv_acnv_cnloh_balanced_file_capture -- ACNV seg file with calls for whether a segment is balanced or CNLoH (or neither).
Do the workflows also run on the normals?
GATK CNV, yes.
GATK ACNV, no.
There is a het pulldown generated for the normal, as a side effect, when doing the het pulldown for the tumor.
What about array data?
The GATK4 CNV tools do not run on array data. Sequencing data only.
Do we still need separate PoNs if we want to run on X and Y?
Can I run both the ReCapSeg workflow and the GATK CNV toolchain workflow?
Yes. All results are written to separate annotations.
Are the new workflows part of my PrAn?
No, not yet. You will need to copy (and run) these manually from Algorithm_Commons before you begin analysis. As a reminder, copy into your analysis workspace.
Does GATK CNV require matched (tumor-normal) samples?
Does GATK ACNV require matched (tumor-normal) samples?
In Firehose, yes. Out of Firehose, no.
How do I modify the ABSOLUTE tasks in FH to accept the new GATK ACNV annotations?
There are two changes you need to make to the ABSOLUTE_v1.5_WES configuration to make it accept the new outputs.
- replace alleliccapseg_tsv with gatk4cnv_acnv_acs_seg_file_capture in the inputs
- replace alleliccapseg_skew with 0.9883274, and change the annotation type to "Literal" instead of "Simple Expression"
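To make the two changes concrete, here is a small Python sketch that applies them to a stand-in for the task inputs. This is purely illustrative: Firehose configurations are edited through its interface, and the dictionary layout, the input names (seg_file, skew) and the function name below are hypothetical, not an actual Firehose format or API. Only the annotation names, the literal value 0.9883274 and the two annotation types come from the answer above.

```python
# Illustrative only: a plain dict stands in for the ABSOLUTE_v1.5_WES inputs.
# The layout and the input names ("seg_file", "skew") are hypothetical.

def switch_absolute_to_acnv(inputs):
    """Return a copy of the inputs with the two changes described above."""
    updated = {name: dict(spec) for name, spec in inputs.items()}
    for spec in updated.values():
        if spec["value"] == "alleliccapseg_tsv":
            # Change 1: point the seg file input at the ACNV annotation.
            spec["value"] = "gatk4cnv_acnv_acs_seg_file_capture"
        elif spec["value"] == "alleliccapseg_skew":
            # Change 2: replace the skew annotation with a literal value.
            spec["value"] = "0.9883274"
            spec["type"] = "Literal"
    return updated


if __name__ == "__main__":
    before = {
        "seg_file": {"type": "Simple Expression", "value": "alleliccapseg_tsv"},
        "skew": {"type": "Simple Expression", "value": "alleliccapseg_skew"},
    }
    for name, spec in switch_absolute_to_acnv(before).items():
        print(name, spec)
```

The original dictionary is left untouched, so the before and after states can be compared side by side.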