This is some text inside of a div block.

Meet the Databricks Integration

By Evan Sadler, Senior Machine Learning Engineer at HBOMax

TL;DR Integrating Databricks with Flyte comes with a lot of perks:

  • Flyte provides support for type checking, complex inputs and outputs, a component registry, local execution capabilities and end-to-end testing.
  • Databricks provides an auto-scaling optimized compute engine for Spark.

I am pleased to showcase the capabilities of Flyte's new Databricks plugin, which seamlessly combines the power of Databricks' optimized Spark compute engine with the ease of use of Flyte's workflow orchestration platform.

The Databricks plugin builds on Flyte's  robust integration with Kubernetes Spark to provide users with a seamless and straightforward process for incorporating the Databricks Jobs API. This integration represents a valuable addition to Flyte's suite of tools for streamlined data processing and management. It makes the development process more efficient and less time-consuming so developers can focus on delivering high-quality results.

Let’s get into it

Flyte simplifies the process of using Spark tasks for developers by presenting them in a familiar, vanilla Python format with enforced types. During local development and testing, Flyte employs a local Spark setup that lets developers execute Spark tasks on their local machines, thus facilitating tests and iterations within notebooks. 

When deploying a Flyte workflow or executing it remotely, Flyte takes care of the entire process of packaging and executing code, as well as seamlessly incorporating the results into the workflow. This effortless and integrated approach ensures smooth and efficient data processing, so users don’t have to handle the details of deployment. The result is an optimized workflow with improved data processing capabilities.

Copied to clipboard!
import pandas as pd
from flytekit import task
from flytekitplugins.spark.task import Databricks


@task(
    task_config=Databricks(
        databricks_conf={"run_name": "test databricks", "new_cluster": {...}}
    ),
)
def my_databricks_task(path: str) -> pd.DataFrame:
    # add pyspark code here
    ...

MovieLens example

Let's say we want to evaluate the performance of two recommendation algorithms. We will start by defining a Spark task for preprocessing the data and generating a few key features. For the sake of simplicity, we will forego the validation step in this scenario.

For demonstration purposes, I saved time on setup by employing an existing cluster that already had the necessary Python dependencies installed. In a production environment, you can choose to configure a job cluster using a custom Docker image based on Databricks' base build, ensuring that all necessary dependencies are present. 

Note the intricate outputs of this process, which include a Structured Dataset and a fitted pipeline. Flyte efficiently manages the transfer of these outputs to subsequent tasks that run on various machines in the cloud. This abstraction provides data scientists and ML engineers with a simplified and streamlined experience so they can concentrate on their core tasks instead of worrying about the underlying infrastructure. Flyte is also highly extendable so users can add even more options for customization and optimization.

In the next step, we aim to take the preprocessed data and feed it into two distinct models. To demonstrate the process, I’ve selected two different model types.

I employed a structured dataset to facilitate data transfer between frameworks. Preprocessing output can quickly be loaded into TensorFlow or PyTorch using the Hugging Face datasets plugin, providing a convenient and efficient solution for moving data between platforms.

Defining workflows in Flyte resembles regular Python programming, with dictionaries, loops and subworkflows. Remember that intermediate results are not directly accessible, but represented as promises. A promise is a placeholder for a value that hasn’t been materialized yet. For materialization, you need to send promises to tasks or workflows.

The final step is testing. Testing with Flyte is simple and straightforward: You can easily import Flyte tasks or workflows and run them within your test suite. Writing end-to-end tests is effortless with Flyte because the platform provides special mocks to aid in the process.

Remote execution

To deploy your workflow on the Flyte cluster, utilize the pyflyte CLI that ships with Flytekit. This tool provides a fast registration option when only code changes, making it an efficient and streamlined solution. During the registration process, `pyflyte register` traverses the folder structure and ensures that all related modules are also packaged.

Copied to clipboard!
pyflyte register \
  --image [image name] \
  --version [version name] \
  --destination-dir . \
  flyte_cookiecutter/workflows/databricks.py

You can also schedule workflows using a Launch Plan or kick them off on the Flyte UI.

Dive in

Check out the official instructions to get started with the Databricks integration.

Watch the recording for an in-depth explanation of the plugin.