Calling somatic CNVs over multiple intervals and joining them afterwards
REQUIRED for all errors and issues:
a) GATK version used: 4.5.0.0 (Docker)
b) Exact command used:
This snippet is from a Snakemake workflow I am currently writing:
"gatk --java-options '-Xmx{resources.mem_mb}m' CollectReadCounts "
"-R {input.genome} "
"-L {input.interval_list} "
"--input {input.bam} "
"--read-index {input.bai} "
"--format HDF5 "
"-O {output}"
Before this step, I split the intervals.
My "Problem"/"Question" is now:
Can I just combine the HDF5/TSV files subsequent to the calling without any downstream issue?
I could not find a GATK tool to do this.
Thank you for your advice.
-
Hi Daniel
You can still collect read counts in a single run and scatter the CNV calls over different parts of the original intervals. The final segmentation step will collect all scatters and generate a single call for each sample.
As for combining read count files per sample: if you collect them in TSV format, this can be done with a script. HDF5 files won't be as easy to combine as TSV files.
Regards.
-
Hi Gökalp,
Thank you for your help!
I have two follow-up questions.
When you talk about scattering CNV calls, how would one go about this outside the Terra/Cromwell universe? I did not see any thread parameters I could use, so how is it scattering?
- Or is the scattering done via a split interval list and multiple calls? In that case I do not see how I would do the final segmentation.
The second question is a follow-up on combining/concatenating the TSV files.
- Can these be concatenated losslessly?
- Do I need to take care of the headers in these files?
-
Or did I misunderstand, and you were referring to the final segmentation as the ModelSegments function, to which you then give all the scatters for your sample in a pseudo multi-sample mode?
-
Hi Daniel
Please ignore my previous comment. I got sidetracked by the scattering used in the Germline CNV workflow.
Unfortunately, we do not have any scattering option available for Somatic CNV calling. Unlike the Germline CNV workflow, the Somatic CNV calling workflow is not very resource intensive in terms of run duration, memory, and CPU requirements, so everything should be able to complete in a single run.
As for collecting read counts in split intervals and combining them, we don't have a ready-made tool for that, and you may need to script your way through it. The header section is required in the TSV output, so keep in mind that it has to stay intact and must contain the complete sequence dictionary. If you still wish to collect read counts using split intervals, make sure that the splits do not overlap; otherwise the read counts become difficult to combine correctly.
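A minimal sketch of such a script in Python follows. It is not a GATK tool, and the function names are made up for illustration. It assumes each per-split TSV consists of SAM-style header lines starting with "@" (including the @SQ sequence dictionary lines), then one column line (CONTIG, START, END, COUNT), then one row per interval. The script keeps the header of the first file, checks that all files share the same sequence dictionary, sorts the rows in dictionary order, and refuses overlapping intervals.

```python
import re
from pathlib import Path


def read_counts_tsv(path):
    """Split a counts TSV into (header lines, column line, data rows)."""
    header, columns, rows = [], None, []
    for line in Path(path).read_text().splitlines():
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)          # SAM-style header, incl. @SQ lines
        elif columns is None:
            columns = line               # assumed: CONTIG START END COUNT
        else:
            contig, start, end, count = line.split("\t")
            rows.append((contig, int(start), int(end), int(count)))
    return header, columns, rows


def merge_counts(paths, out_path):
    """Merge per-split counts files into one TSV with a single header."""
    header, columns, merged = None, None, []
    for path in paths:
        h, c, rows = read_counts_tsv(path)
        if header is None:
            header, columns = h, c       # keep the first header intact
        else:
            same_dict = ([l for l in h if l.startswith("@SQ")]
                         == [l for l in header if l.startswith("@SQ")])
            if not same_dict or c != columns:
                raise ValueError(f"{path}: header does not match first file")
        merged.extend(rows)

    # Contig order follows the @SQ lines of the sequence dictionary.
    order = {}
    for line in header:
        if line.startswith("@SQ"):
            name = re.search(r"\tSN:([^\t]+)", line).group(1)
            order[name] = len(order)
    merged.sort(key=lambda r: (order[r[0]], r[1], r[2]))

    # Intervals are treated as 1-based and inclusive; reject any overlap.
    for prev, cur in zip(merged, merged[1:]):
        if prev[0] == cur[0] and cur[1] <= prev[2]:
            raise ValueError(f"overlapping intervals: {prev[:3]} and {cur[:3]}")

    with open(out_path, "w") as out:
        out.write("\n".join(header + [columns]) + "\n")
        for row in merged:
            out.write("\t".join(map(str, row)) + "\n")
    return len(merged)
```

It could be called as merge_counts(["split_0.tsv", "split_1.tsv"], "sample.counts.tsv"), where the file names are placeholders. It is worth comparing the merged file against a single unsplit run on one sample before relying on it downstream.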
I hope this helps.