How LatchBio Uses Flyte to Handle Biotech’s Growing Data Demands

The company

A 10-person startup founded in Berkeley, California, LatchBio builds software and data infrastructure for biotech organizations. 

LatchBio supports the cutting edge of biotech, which is progressing at a rate co-founder and CTO Kenny Workman compared to the way Moore’s Law described the advance of computing power: The price of data is dropping and driving new types of research. “The rate at which we can write or synthesize novel pieces of DNA is exceeding that at which we can read it,” Kenny said. “We can now write genetic code that did not have an origin in nature.”

This “tremendous thought shift” has given rise to synthetic biology, metabolic engineering, cell therapy, gene-editing technologies and other groundbreaking advances that involve harnessing engineering principles to create or modify biological products, including food and drugs, Kenny said.

The challenge

As the sophistication in biotech has increased, so have the demands for computational power to crunch huge amounts of data across systems. 

Working with massive quantities of biological data presents significant challenges. The human genome needs 3.3 gigabytes of storage, so the sheer size of the data is daunting. Another challenge comes from the heterogeneous nature of biological data: it arrives in several formats from many sources, and computational biologists often have to interpret short fragments of DNA or other molecules. That analysis can take days and hundreds of gigabytes of memory.

LatchBio’s platform — relying on Flyte’s Kubernetes-native workflow execution engine — provides this power at scale.

“We generate no-code interfaces for biologists to use directly, dynamically from a Flytekit wrapper, and by doing this we can expose a myriad of different bioinformatics tools and provide a really rich in-browser suite of visualizations and file manipulation tools,” Workman said. “And then on top of the platform, we're starting to tap into the bioinformatics [needed by] computational biologists and even software engineers at companies, and giving them their own toolkit to both write Flyte workflows and dynamically generate LatchBio interfaces.”

The solution

Flyte orchestration lets LatchBio schedule tasks to maximize computational horsepower.

The LatchBio platform features four main components:

  1. A managed, in-browser object store that allows users to drag and drop objects and see visualizations and metadata. The object store acts like a *nix filesystem (currently based on Amazon S3) and processes objects with semantic extensions using automated workflow hooks. The process is almost instantaneous, Kenny said. “If, for example, you upload a DNA file that we recognize, the instant it is uploaded to the platform, we run a Flyte workflow that automatically parses, runs quality control and generates visualizations.” Biologists interact with the platform through a console, while computational biologists interact directly with the server through an SDK. Users can use ‘/’ or ‘latch///’ to customize the workflow parameters.
  2. Its own network path and absolute path system that lets users pass objects or files to workflows directly as semantic paths. The Flytekit wrapper parses the information, making it much easier for users to determine where their data is coming from, where it's going and how they can manipulate it downstream.
  3. A “compiled” type-safe UI. The front-end code for each workflow is generated dynamically from the workflow parameter interface using Flyte’s interface description language (IDL). This is dynamically parsed and used to compile React interfaces from the typing information, Kenny said. The typing information is also used to construct components with in-browser HTML-native validation and rule-based regex validation. These multiple layers enrich the browser-side type checking process—and all of this happens before the workflow even hits the Kubernetes orchestration layer, Kenny said. “We want to make sure that workflows don’t fail before the programming layer gets executed in the container on the cluster.” The style of each parameter component can be modified in code.
  4. Serverless, scalable scheduling that manages file movement and workflow updates. This affords LatchBio a greater level of control over task scheduling to account for the possibility of entirely different needs for upstream or downstream tasks. There’s also the benefit of scalable scheduling, Kenny said: Leveraging spot instances and automatic retries combined with Kubernetes’ scaling capabilities enables greater speed, cost efficiency and scale than bare metal. On top of its core Kubernetes cluster, LatchBio maintains a proxy file system and proxy workflows for customers tied to drop-in Flyte deployments and LatchBio’s own scaffolding, as well as both virtual and managed virtual private clouds (VPCs). That means LatchBio’s proxy file system and proxy workflows don’t ever need to handle customer data or workflows directly.
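
To make the third component more concrete, here is a minimal sketch of the general idea: read the typed parameters of a workflow function, derive a form description from them, and validate user input before anything is submitted to the cluster. It is an illustration only, not LatchBio's implementation. It uses plain Python introspection rather than Flyte's IDL, and the names build_form_spec, validate, HTML_INPUT_TYPES and align_reads are all hypothetical.

```python
import inspect
import re
from typing import Any, Callable, Dict, List, Optional

# Hypothetical mapping from Python parameter types to HTML input types.
HTML_INPUT_TYPES = {int: "number", float: "number", str: "text", bool: "checkbox"}


def build_form_spec(
    fn: Callable, patterns: Optional[Dict[str, str]] = None
) -> List[Dict[str, Any]]:
    """Derive one form-field description per typed parameter of fn."""
    patterns = patterns or {}
    fields = []
    for name, param in inspect.signature(fn).parameters.items():
        if param.annotation not in HTML_INPUT_TYPES:
            raise TypeError(f"unsupported or missing type for parameter {name!r}")
        fields.append({
            "name": name,
            "py_type": param.annotation,
            "input_type": HTML_INPUT_TYPES[param.annotation],
            # A parameter without a default must be filled in by the user.
            "required": param.default is inspect.Parameter.empty,
            # Optional rule-based regex layered on top of the type check.
            "pattern": patterns.get(name),
        })
    return fields


def validate(fields: List[Dict[str, Any]], raw: Dict[str, str]) -> Dict[str, str]:
    """Return a mapping of field name to error message; empty means valid."""
    errors = {}
    for field in fields:
        name = field["name"]
        value = raw.get(name)
        if value is None or value == "":
            if field["required"]:
                errors[name] = "required"
            continue
        if field["py_type"] in (int, float):
            try:
                field["py_type"](value)  # type check on the raw string
            except ValueError:
                errors[name] = f"expected {field['py_type'].__name__}"
                continue
        if field["pattern"] and not re.fullmatch(field["pattern"], value):
            errors[name] = "does not match pattern"
    return errors


# Hypothetical workflow signature standing in for a real bioinformatics workflow.
def align_reads(sample_id: str, min_quality: int, threads: int = 4) -> str:
    return f"{sample_id}:{min_quality}:{threads}"


FIELDS = build_form_spec(align_reads, patterns={"sample_id": r"[A-Za-z0-9_-]+"})
```

Calling validate(FIELDS, {"sample_id": "S_01", "min_quality": "30"}) returns an empty dict, while a non-numeric min_quality or a sample_id containing spaces is rejected before any workflow would be launched. In the platform described above, the equivalent checks run in the browser against components generated from the workflow's typing information.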

Flyte allows language independence, too. This is critical since bioinformatics work is often conducted in several languages. (Python is the most common.)

The LatchBio team has made several contributions to the Flyte codebase, which Workman said demonstrates his company’s confidence in Flyte long term as “the de facto workflow orchestration engine.”

“I really think Flyte has got the model correct, absolutely correct with respect to architecting and deploying workflows as both code first and leveraging the Kubernetes scheduler to abstract away the scheduling on a per-task granularity.”