How to Serve ML Models with Banana
Orchestrate your ML training pipeline and serve your model at scale in 10 minutes or less.
Training an ML model requires you to procure, prepare, and analyze the data you’ll use to build it. When new data arrives, you may need to repeat these steps multiple times, a process known as “retraining.” To ensure reproducibility, it may be necessary to version a retrained model. Additionally, to keep execution cost-effective, it may be necessary to cache the outputs of intermediate steps. To complete the loop, the model must be deployed so that end users can generate predictions.
I set out on a journey to build an application that would orchestrate a fine-tuning pipeline and enable on-demand, scalable model serving. My approach:
- Build a fine-tuning pipeline leveraging 🤗 Hugging Face transformers.
- Allocate additional memory for resource-intensive operations and a GPU to expedite model training.
- Cache the dataset to reduce the execution time.
- Upon completion of training, push the trained model to 🤗 Hub.
- Enable the user to approve or reject the model deployment.
- If approved, retrieve the model from the Hub and deploy it.
- Invoke the endpoint to generate predictions.
In addition, I wanted to ensure that the end-to-end pipeline maintained data lineage and versioning.
Note: Do you want to jump directly to the code? It’s on GitHub.
⛓️ The fine-tuning pipeline
- Downloading the dataset
- Tokenizing the dataset
- Splitting the dataset into training and evaluation subsets
- Training an ML model using the subsets
Let’s use Flytekit, Flyte’s Python SDK, to build the pipeline.
📥 Downloading the dataset
Download the `yelp_review_full` dataset and store it in a directory. Use FlyteDirectory, a custom Flyte type that facilitates smooth communication between Flyte tasks, to enable automatic uploading and downloading of directories. Configure the resources to guarantee that the task has adequate resources to acquire the dataset. Enable faster execution by caching the task output.
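The step above might look like the following Flytekit task. This is a minimal sketch: the resource sizes, cache version, and local path are assumptions, not the exact values from the repository.

```python
from datasets import load_dataset
from flytekit import Resources, task
from flytekit.types.directory import FlyteDirectory


@task(cache=True, cache_version="1.0", requests=Resources(mem="1Gi", cpu="1"))
def download_dataset() -> FlyteDirectory:
    # Download the Yelp reviews dataset and persist it locally; Flyte
    # uploads the returned FlyteDirectory to blob storage automatically.
    dataset = load_dataset("yelp_review_full")
    local_dir = "/tmp/yelp_data"
    dataset.save_to_disk(local_dir)
    return FlyteDirectory(path=local_dir)
```

Because `cache=True`, re-running the workflow with the same inputs and `cache_version` skips the download entirely.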
🧹 Tokenizing the data
Load a predefined `bert-base-uncased` tokenizer, apply it to the Yelp dataset, and store the resulting output in a directory. In addition, cache the tokenized data and allocate the necessary resources for the task. This will improve the task's execution speed and efficiency.
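A sketch of the tokenization task, assuming the same `FlyteDirectory` hand-off as above (resource sizes and paths are illustrative):

```python
from datasets import load_from_disk
from flytekit import Resources, task
from flytekit.types.directory import FlyteDirectory
from transformers import AutoTokenizer


@task(cache=True, cache_version="1.0", requests=Resources(mem="2Gi", cpu="2"))
def tokenize(dataset_dir: FlyteDirectory) -> FlyteDirectory:
    # Pull the raw dataset produced by the previous task to local disk.
    local_path = dataset_dir.download()
    dataset = load_from_disk(local_path)

    # Apply the bert-base-uncased tokenizer to every review.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokenized = dataset.map(
        lambda rows: tokenizer(rows["text"], padding="max_length", truncation=True),
        batched=True,
    )

    tokenized.save_to_disk("/tmp/yelp_tokenized")
    return FlyteDirectory(path="/tmp/yelp_tokenized")
```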
🖖 Splitting the dataset
Divide the dataset into two separate subsets for training and evaluation, each with the `datasets.Dataset` type, which is already defined in the `flytekitplugins-huggingface` library.
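For example (the subset sizes are illustrative, chosen to keep fine-tuning fast; the plugin import is what lets Flyte pass `datasets.Dataset` values between tasks):

```python
from typing import Tuple

from datasets import Dataset, load_from_disk
from flytekit import task
from flytekit.types.directory import FlyteDirectory

# Importing the plugin registers `datasets.Dataset` as a Flyte type.
import flytekitplugins.huggingface  # noqa: F401


@task(cache=True, cache_version="1.0")
def split_dataset(tokenized_dir: FlyteDirectory) -> Tuple[Dataset, Dataset]:
    local_path = tokenized_dir.download()
    dataset = load_from_disk(local_path)["train"].shuffle(seed=42)
    # Small train/eval subsets so a fine-tuning run finishes quickly.
    return dataset.select(range(1000)), dataset.select(range(1000, 1200))
```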
🤖 Fine-tuning BERT model
Allocate a GPU to the train task to accelerate fine-tuning. Initialize `secret_requests` to retrieve the 🤗 token. Load the pre-trained `bert-base-uncased` model, initialize `TrainingArguments`, and use `Trainer` to train the model on the tokenized data. Finally, publish the trained model to the 🤗 Hub, and return the commit SHA of the published model, the Flyte execution ID (to maintain lineage), and the repository name.
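A condensed sketch of the train task. The secret group/key, Hub repository name, and training hyperparameters are assumptions; `push_to_hub`'s return type also varies across transformers versions (URL string or `CommitInfo`), which the code hedges for:

```python
from typing import Dict

from datasets import Dataset
from flytekit import Resources, Secret, current_context, task
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments


@task(
    requests=Resources(gpu="1", mem="12Gi"),
    secret_requests=[Secret(group="hf", key="token")],
)
def train(train_ds: Dataset, eval_ds: Dataset) -> Dict[str, str]:
    # Fetch the 🤗 token from Flyte's secrets manager.
    hf_token = current_context().secrets.get("hf", "token")

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=5  # Yelp reviews use 5 star ratings
    )
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="/tmp/yelp_bert", num_train_epochs=1),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
    )
    trainer.train()

    # Publish the fine-tuned model to the 🤗 Hub.
    repo_name = "yelp-bert"  # illustrative repository name
    result = model.push_to_hub(repo_name, token=hf_token)

    return {
        "sha": getattr(result, "oid", str(result)),  # commit SHA of the push
        "execution_id": current_context().execution_id.name,  # lineage
        "repo": repo_name,
    }
```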
Note: To store secrets in Flyte, you can utilize a secrets management system, with Kubernetes secrets being the default. In the following section, we’ll see how to create a secret.
🗄️ Sending model metadata to GitHub
Because Banana deployments are driven from GitHub, all Banana deployment files must live in a GitHub repository. Once the Banana GitHub action is set up for the appropriate repository, every push event triggers a deployment.
Initiating a Banana deployment from within a Flyte workflow therefore requires triggering a push event; in this instance, we push the model metadata.
Take the model metadata retrieved from the `train` task and add it to a `model_metadata.json` file. When transmitting the data through the GitHub API, it must be converted to base64 encoding. Prior to generating the commit, obtain the most recent commit SHA to be used when sending the push event. Utilize the subprocess library to hit the `createCommitOnBranch` endpoint.
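The steps above can be sketched as follows. The repository, branch, and file names are hypothetical placeholders; the request is sent via `subprocess` (here through `curl`) to GitHub's GraphQL `createCommitOnBranch` mutation:

```python
import base64
import json
import subprocess


def encode_metadata(metadata: dict) -> str:
    """Serialize model metadata to JSON and base64-encode it for the GitHub API."""
    return base64.b64encode(json.dumps(metadata).encode("utf-8")).decode("utf-8")


def push_metadata(metadata: dict, token: str, head_sha: str) -> None:
    # head_sha is the latest commit SHA on the branch, fetched beforehand.
    mutation = (
        "mutation ($input: CreateCommitOnBranchInput!) {"
        "  createCommitOnBranch(input: $input) { commit { oid } } }"
    )
    variables = {
        "input": {
            "branch": {
                "repositoryNameWithOwner": "<your-username>/<your-repo>",
                "branchName": "main",
            },
            "message": {"headline": "Update model metadata"},
            "expectedHeadOid": head_sha,
            "fileChanges": {
                "additions": [
                    {
                        "path": "model_metadata.json",
                        "contents": encode_metadata(metadata),
                    }
                ]
            },
        }
    }
    # POST to the GitHub GraphQL API via subprocess.
    subprocess.run(
        [
            "curl", "-s", "https://api.github.com/graphql",
            "-H", f"Authorization: bearer {token}",
            "-d", json.dumps({"query": mutation, "variables": variables}),
        ],
        check=True,
    )
```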
Refer to the end-to-end pipeline available on the GitHub repository for a comprehensive overview.
▶️ Running the pipeline locally
In order to execute the pipeline locally, you must first store the secrets in local files.
- Obtain access tokens for Hugging Face and GitHub and store them in files within a secrets directory, as follows:
- Within a `.env` file, set the following two variables to enable local code execution:
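For example (the directory, file, and variable names below are illustrative; `FLYTE_SECRETS_DEFAULT_DIR` is the standard Flyte variable for resolving secrets from local files, and your repository's `.env` may define additional variables):

```shell
# Store the access tokens as plain files in a local secrets directory.
mkdir -p .secrets
echo "<your-huggingface-token>" > .secrets/token   # Hugging Face token
echo "<your-github-token>" > .secrets/gh-token     # GitHub token

# In the .env file, point Flyte at that directory:
# FLYTE_SECRETS_DEFAULT_DIR=.secrets
```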
Next, install the necessary requirements in a virtual environment.
To run the Flyte workflow, use the following command:
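For instance (the file and workflow names are placeholders; match them to the repository):

```shell
# Set up an isolated environment and install the dependencies.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Execute the workflow locally with pyflyte.
pyflyte run ml_pipeline.py ml_pipeline
```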
The pipeline retrieves the data, tokenizes it, and trains a model. It then waits for the user to approve the push event before pushing the model metadata to GitHub, which triggers a Banana deployment.
🍌 Serving on Banana
Banana is an ML inference platform that runs models on serverless GPUs.
🛠️ Setting it up
To activate the Banana deployment, you need to:
- Create an account.
- Configure your Banana account by creating a deployment and linking it to the forked GitHub repository. To do this, navigate to Team > Integrations > GitHub > Manage Repos and Deploy > Deploy from GitHub > your repository.
- That's all! Every push to the configured GitHub repository will now trigger a deployment.
🪚 Adding code for inference
The Banana inference code needs to be encapsulated in the `app.py` and `server.py` files.
In the `app.py` file, the following steps should be taken:
- Define an `init()` method. This should involve fetching the model from the 🤗 Hub and loading it onto a GPU.
- Define an `inference()` method. This should involve tokenizing the user-given prompt and generating a prediction.
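A minimal `app.py` along these lines might look like the following sketch. The Hub repository name is a placeholder, and the input/output payload shape is an assumption:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = None
tokenizer = None


def init():
    # Fetch the fine-tuned model from the 🤗 Hub and move it to the GPU.
    global model, tokenizer
    repo = "<your-username>/yelp-bert"  # hypothetical Hub repository
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForSequenceClassification.from_pretrained(repo)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()


def inference(model_inputs: dict) -> dict:
    # Tokenize the user-given prompt and return the predicted star rating.
    prompt = model_inputs.get("prompt")
    if prompt is None:
        return {"message": "No prompt provided"}
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return {"label": int(logits.argmax(dim=-1).item())}
```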
Refer to the `server.py` file code on GitHub.
🧪 Testing the endpoint
To run the Banana server locally, execute the command `python server.py`. Confirm that the Banana API endpoint is functional by running the following test case:
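Something like the following, assuming the server listens on port 8000 and expects a `prompt` field (adjust both to match your `server.py`):

```python
import requests

# Send a sample review to the locally running Banana server.
response = requests.post(
    "http://localhost:8000/",
    json={"prompt": "The food was absolutely wonderful!"},
)
print(response.json())
```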
Note: Install the Banana requirements prior to running the command.
📤 Deploying the model
To deploy the model on Banana, prepare a Dockerfile and place it at the root of your GitHub repository:
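A minimal sketch of such a Dockerfile; the base image, port, and entry point are assumptions, and Banana's own template Dockerfiles are a better starting point:

```dockerfile
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

WORKDIR /app

# Install the inference dependencies, then copy in app.py and server.py.
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

EXPOSE 8000
CMD ["python", "-u", "server.py"]
```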
Once the files are in place, run the Flyte pipeline locally again. Approve the model metadata push to GitHub, and the model should be built and deployed on Banana!
🫐 Running the pipeline on Flyte cluster
Create a Dockerfile that includes the necessary requirements to package and register your Flyte workflow.
Create Kubernetes secrets to store the GitHub and Hugging Face tokens as follows:
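With the default Kubernetes secrets manager, this might look like the commands below. The secret group/key names and the project-domain namespace are assumptions; they must match the `Secret(group=..., key=...)` requests in your tasks:

```shell
# Hugging Face token, requested by the train task.
kubectl create secret generic hf \
  --namespace flytesnacks-development \
  --from-literal=token=<your-huggingface-token>

# GitHub token, used to push the model metadata.
kubectl create secret generic gh \
  --namespace flytesnacks-development \
  --from-literal=token=<your-github-token>
```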
To run the Flyte workflow on an actual Flyte backend, set up a Flyte cluster. The simplest way to get started is by running the `flytectl demo start` command, which spins up a mini-replica of the Flyte deployment.
Register tasks and workflows using the following command, which can leverage the Docker registry included with the demo cluster for image pushing and pulling:
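For example (the file and image names are illustrative; `localhost:30000` is the registry bundled with the Flyte demo cluster):

```shell
pyflyte register --image localhost:30000/ml-pipeline:v1 ml_pipeline.py
```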
And then, launch the registered workflow on the UI. To deploy the retrained model on Banana, click "Approve." This action saves the model metadata in the GitHub repository, and the push action triggers a deployment on Banana.
🧪 Testing the Banana deployment
To test your deployment, retrieve the API and model keys. Store these keys in your local environment, and then execute the following test:
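Assuming the `banana-dev` Python client and environment variable names of your choosing, the test might look like this sketch:

```python
import os

import banana_dev as banana

# Retrieve the real values from the Banana dashboard; the env var names
# here are illustrative.
api_key = os.environ["BANANA_API_KEY"]
model_key = os.environ["BANANA_MODEL_KEY"]

# Call the deployed model with a sample review.
out = banana.run(api_key, model_key, {"prompt": "The service was painfully slow."})
print(out)
```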
Upon running the code, you should observe a predicted label being returned.
Hopefully, you’ve learned a bunch about Flyte and Banana! As demonstrated in this piece, orchestration and inference can go hand in hand: you can ensure your ML pipelines are versioned, cached, and reproducible, and at the same time, run online inference at scale on GPUs.
Here are some key takeaways from this application:
- Ensure coherence between retraining ML models and deployment
- Human-in-the-loop can power your deployment with Flyte orchestration
- GPU-powered serverless inference with Banana
- Comprehensive data lineage across the entire model development and deployment pipelines for easier debugging
- Every model is versioned through the 🤗 Hub
- Flyte versioning ensures versioned ML pipeline executions
Flyte and Banana offer the potential to create production-grade ML pipelines with ease. If this resonates with your needs, I encourage you to try these tools.