Well folks, this new year is off to a heck of a start. I don't know who needs to hear this but it's okay to be struggling. On our end, we're going to do whatever we can to support your work, so don't hesitate to reach out on the forum.
We've already heard from many of you that you'd like to get more insight into what's going on behind the scenes of GATK development. So this year, we're going to try something new on this blog: a monthly roundup of things the team is working on, in addition to the usual (and possibly more frequent) one-off posts about specific tools, features and events. It won't be a complete list, but hopefully we can give you a preview of some interesting things that you can look forward to.
So here's your first monthly recap of what's in the works, starting with a much-requested update on DRAGEN-GATK, then weaving through news about structural variation, long reads, liquid biopsy, CNV calling in Plasmodium, and identifying viral insertion sites with Pathseq.
The DRAGEN in the room
As mentioned previously, we've been working with the DRAGEN team at Illumina on a joint DRAGEN-GATK Best Practices pipeline for calling short germline variants that will replace the current Best Practices for that use case. In our most recent release estimate, we predicted that the new pipeline would be ready in November 2020. Obviously that hasn't happened, so here's what's going on.
The new pipeline will use DRAGMAP, which was originally implemented on the proprietary DRAGEN hardware, instead of BWA for genome mapping and alignment. Since a key goal of the DRAGEN-GATK collaboration is to provide a fully open-source version of the joint pipeline, the DRAGEN team has been working on reimplementing DRAGMAP as an open-source software project. It's taking a bit longer than we expected, which is not unusual for the very first implementation of a rather sophisticated tool, so the release has been delayed as a result.
At this time it's difficult to predict exactly how much longer it will be, but I'm confident the release will happen sometime in the first half of 2021. In the meantime, we're looking into whether it would make sense for folks to already proceed with evaluations of the downstream changes to the pipeline (mainly HaplotypeCaller's new DRAGEN mode) using data aligned with BWA. Stay tuned for more on that.
Structural variation galore
Where do I even start? We have a ton of SV-related work in flight. The project closest to fruition is a Terra workspace that we're building to showcase our single-sample SV calling pipeline, which the Broad's internal clinical analysis team has already used successfully on 117 clinical samples. We recently generated public resource data for this pipeline based on the new(ish) PCR-free genomes from the 1000 Genomes Project, and one of our collaborators from the Centers for Mendelian Genetics (CMG) has been “kicking the tires” using a small set of samples. You can already find a beta version of the pipeline on Dockstore; the goal of the Terra workspace is to consolidate resources, sample data and, of course, the code itself into a fully reproducible example that runs out of the box (read more about that here).
We also have a multi-sample or "cohort" version of the SV pipeline that is performing well, though it's fairly complex and we have work to do to make it readily usable by external teams. On that front, though, the good news is that we got some very positive feedback from collaborators at NIH who ran ~10,000 samples through the SV cohort pipeline on their own Cromwell server with minimal assistance. We take that as a sign that the pipeline may soon be ready to roll out to a wider audience, albeit in beta form, since we are still actively working on improving its performance and composition.
Speaking of which, we've completed prototyping for two of our long-running projects. One is a new SV filtering model that uses machine learning to replace some of the heuristics we previously used; we're satisfied with the prototype results and are now working to implement it as a proper GATK tool. The other is an updated approach to breakpoint assembly that no longer uses Spark and is more compatible with WDL pipelines and Cromwell. We're now looking at how best to use this in the SV pipeline, be it for all events, events lacking some types of evidence, or at the end of the pipeline as a final confirmation step.
Long reads, at long last
So apparently there are these other sequencing technologies that produce longer reads, which are really useful for things like de novo assembly and SV calling? Have you heard about those? Okay, yes, I'm joking; long reads are admittedly not new. I bring this up because for about a year now we've had a team working on projects involving long-read data (from both PacBio and Nanopore) and some of them are starting to come to fruition. There's enough material there for an entire standalone blog post though, so consider this a teaser pending that.
CLIA-approved liquid biopsy pipeline
Switching to somatic news, this is exciting: our pipeline for liquid biopsy — a diagnostic test for cancer based on a blood sample — has been approved for clinical use and applied successfully to a pilot cohort of clinical patient samples. We plan to make it available free and open-source with the rest of our pipelines.
CNV genotyping in Plasmodium, the malaria parasite
Image credit: Sanger Institute blog
In a very different kind of somatic analysis, we've been collaborating with a team at the Sanger Institute on methods for calling copy number variation in Plasmodium genes pfMDR1 and pfPM2/3, which is difficult to do using the depth signal alone. Together, we've developed a pipeline prototype that uses split-read and paired-end information in addition to the depth signal, and are now working on evaluating the performance of this new method. This project is part of the MalariaGEN initiative.
The Pathseq less traveled: identifying viral insertion sites
Moving further into the microbial world… If you're not familiar with Pathseq, it's a metagenomics-based tool/pipeline that can identify DNA contamination from other organisms in human genome data. Have you ever wondered what happens if you eat a nice juicy hamburger before giving a spit sample for DNA sequencing? Yep, cow DNA all up in your genome data. Pathseq can help clean that up for you. And, as we recently learned from a neat internal collaboration between our developers and support team, it turns out you can also use it to identify viral insertion sites, such as HPV in a human genome. You can read the full story on the Terra blog; not only is it a cool use of the Pathseq tool, it's also an excellent example of how you might tackle a computational problem from start to finish, leveraging synthetic data and a Terra workspace for provenance and shareability.
That's all for this update!
Let us know if this is helpful and if there's anything in particular you want to hear more about.
Please stay safe, wear a mask, be kind to one another.