How Alectio Migrated from Airflow to Flyte to Build the First MLOps Pipeline for Data-Centric AI
By Jennifer Prendki — founder and CEO of Alectio, a company created to automate the cleaning and optimization of training datasets.
In 2019, the Alectio team embarked on a mission: to operationalize and automate data preparation for machine learning.
I began to envision this mission about three years earlier, in 2016, when I was working at WalmartLabs (the retail giant’s e-commerce operation).
My manager unexpectedly asked me to lead the team coordinating data annotation for all search-relevance initiatives with external, third-party vendors. I initially assumed this part of my job would be trivial, but within a few weeks I was asked to deliver annotations for 20x the original amount of data with barely any increase in budget, and I started to wonder whether I should have taken on the responsibility in the first place.
But there was a silver lining to this particular challenge: It led me to fall in love with active learning (AL), a data-centric AI technique that dynamically selects records worthy of annotation. It was a nice, elegant approach to training a model without breaking the bank. I enjoyed it so much, I eventually started Alectio to make AL more practical and accessible to data scientists around the world! (That took a few years of tergiversation, however.)
When people discuss the challenges of AL, they often talk about how tedious it is to select the most useful data dynamically. Indeed, it’s terribly difficult to get it right. However, they tend to overlook a very important point: Active learning is a sequential learning technique. It involves iterative retraining of the model, and it requires a much more complex MLOps pipeline than off-the-shelf solutions support.
Building such a workflow was Alectio’s first hurdle, as we knew from the beginning that everything else would depend on it.
The AL process itself is a cyclical repetition of four steps:

1. TRAIN a model on the data selected (and annotated) so far.
2. INFER on the records that have not yet been annotated.
3. SELECT the records most worthy of annotation, based on the outcome of the INFER step.
4. ANNOTATE the selected records, then loop back to TRAIN.
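To make the loop concrete, here is a toy, self-contained sketch using scikit-learn and a simple least-confidence querying strategy. It is only an illustration of the four steps above, not Alectio's implementation; the dataset, model, round count and budget are arbitrary placeholders.

```python
# Toy sketch of the four-step active-learning loop (not Alectio's code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(50))                  # start with a small labeled pool
unlabeled = list(range(50, len(X)))
budget_per_round = 50

for round_id in range(5):
    # 1. TRAIN on everything labeled so far
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # 2. INFER on the records that are still unlabeled
    probs = model.predict_proba(X[unlabeled])

    # 3. SELECT the least-confident records (one simple querying strategy)
    confidence = probs.max(axis=1)
    picked = np.argsort(confidence)[:budget_per_round]
    selected = [unlabeled[i] for i in picked]

    # 4. ANNOTATE: in this toy example, "annotation" just reveals the held-out labels
    labeled += selected
    unlabeled = [i for i in unlabeled if i not in set(selected)]
```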
At first, it felt like building such a pipeline would be a piece of cake for our awesome engineers. And so, like many MLOps teams back then, they started the implementation in Airflow, the default workflow execution engine on the market.
We built an initial version of an AL pipeline, and users started experimenting with it. But before long, processes started freezing, users were reporting irreproducible errors, and the engineering team got tired of it all. They felt that Airflow was not serving their purpose well.
That’s when our engineers went on a quest for the best alternative to Airflow. After a few days of frantic research, they became convinced that Flyte had everything they needed. The rest is history.
The engineering team came to me and demanded we migrate the entire system to Flyte. Any other feature release on our roadmap would have to wait. After all, they reasoned, there was no point in adding more DataPrepOps tools if the most fundamental functionality of the Alectio platform — the AL workflow — was slow, inefficient and utterly flawed.
Like many founders at an early-stage startup, I was sensitive to customers’ desire for new features. However, I concurred with the engineers’ assessment and decided to invest the necessary time to migrate from Airflow to Flyte. It’s not like I really had the choice anyway: Had I said no, I would have ended up with an engineering mutiny on my hands. (Believe me, that’s something no founder wants to have to deal with!)
To my great surprise, the migration to Flyte was as smooth and easy as the development of our initial AL pipeline in Airflow had been painful: It literally took just a few weeks to revamp our platform’s main pipeline entirely, to the delight of users and developers alike.
So what did the Alectio engineering team find so powerful and special about Flyte? Here’s a short list of key reasons why we at Alectio love Flyte, and why we would switch away from Airflow all over again:
Lack of workflow-versioning capabilities in Airflow Scheduler
From an ML perspective, the SELECT step of AL (known as the “querying strategy”) is the trickiest one, and there are countless ways to implement it. In its simplest form, the SELECT step takes the output of INFER as an input. But many querying strategies function differently. For example, Alectio has long experimented with using the time series of all past INFER steps as an input to the next SELECT task. In short, there are many flavors of AL, each supported by a slightly different workflow.
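As an illustration, here is a hypothetical querying strategy that consumes the time series of past INFER outputs rather than just the latest one: it selects the records whose predicted class has flipped the most across rounds. The function and the toy history below are made up for the example and are not Alectio's actual strategy.

```python
# Hypothetical SELECT step that uses the history of past INFER outputs.
import numpy as np

def select_by_instability(infer_history: np.ndarray, budget: int) -> np.ndarray:
    """infer_history: (n_rounds, n_records) array of predicted class labels,
    one row per past INFER step."""
    # Count how often each record's prediction changed between consecutive rounds
    flips = (np.diff(infer_history, axis=0) != 0).sum(axis=0)
    # Return the `budget` most unstable records
    return np.argsort(flips)[::-1][:budget]

# Toy example: 4 past INFER rounds over 6 records
history = np.array([
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
])
print(select_by_instability(history, budget=2))  # -> [1 4], the two least stable records
```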
Because Airflow does not offer workflow versioning, every single flavor of AL, and potentially every use case, might require its own workflow. What’s more, these variations couldn’t be reused or shared among developers. Every minor deviation required the creation of a whole new workflow, which was not acceptable given the number of different querying strategies studied by Alectio’s ML team.
Tasks could not be reused in Airflow
Making tasks reusable in Airflow requires creating a custom plugin. If you want a custom operator, executor or macro in Airflow, you have no choice but to package it in your own plugin. How crazy is that? By comparison, Flyte tasks are first-class citizens that can be shared and reused without a second thought.
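To illustrate the point, this is roughly what task reuse looks like in flytekit: a plain Python function decorated with @task can be imported into any workflow, with no plugin machinery. The task and workflow names here are made up, not Alectio's.

```python
# Sketch of task reuse in Flyte; names are illustrative only.
from typing import List
from flytekit import task, workflow

@task
def normalize(values: List[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values]

@workflow
def training_pipeline(values: List[float]) -> List[float]:
    return normalize(values=values)

@workflow
def evaluation_pipeline(values: List[float]) -> List[float]:
    # The exact same task, reused as-is in a second workflow
    return normalize(values=values)
```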
Airflow did not “understand” data flows
In Airflow, data is not “naturally” shared between tasks: whether the same data source is consumed by multiple tasks or the output of one task is fed to the next, each task must load that data separately. Flyte, however, uses its own strict type system, built on protobuf, which allows it to pass and reuse data smoothly from one task to another.
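As a rough sketch of what that data flow looks like, the typed output of one Flyte task can be handed directly to the next task inside a workflow. The task names and logic below are hypothetical.

```python
# Sketch of typed data flow between Flyte tasks; names and logic are hypothetical.
from typing import List
from flytekit import task, workflow

@task
def featurize(raw: List[float]) -> List[float]:
    return [x * x for x in raw]

@task
def score(features: List[float]) -> float:
    return sum(features) / len(features)

@workflow
def pipeline(raw: List[float]) -> float:
    features = featurize(raw=raw)    # typed output of one task...
    return score(features=features)  # ...fed directly to the next
```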
We didn’t like that Airflow was not Kubernetes-native
Like most ML engineering teams, we love running things on Kubernetes. The fact that Flyte is Kubernetes-native spares us a lot of hassle and headaches when configuring our Kubernetes cluster.
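One small example of what being Kubernetes-native buys you: because Flyte tasks run as Kubernetes pods, resource requests and limits can be declared right on the task. The values below are arbitrary placeholders, not our actual settings.

```python
# Sketch: per-task Kubernetes resource requests in flytekit (placeholder values).
from flytekit import Resources, task

@task(requests=Resources(cpu="2", mem="4Gi"), limits=Resources(cpu="4", mem="8Gi"))
def train_step(epochs: int) -> float:
    # ... training logic would go here ...
    return 0.0
```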
Airflow adopted a multi-tenant architecture very late
As Alectio started growing as an organization, we were running more and more AL processes simultaneously. A lot of them used the exact same setup, hence the exact same workflow. A monolithic architecture would require us to duplicate the workflow for every customer and every one of their experiments. We weren’t able to scale horizontally with Airflow, and customers noticed the added latency. Flyte’s multi-tenant architecture was exactly what we needed from the beginning.
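In practice, multi-tenancy means the same registered workflow can be executed under per-customer projects and domains instead of being duplicated per tenant. Something along these lines, where the project, domain, workflow name and inputs are all made up for illustration:

```python
# Sketch: executing one registered workflow under a per-customer project/domain.
# Project, domain, workflow name, version and inputs are made-up placeholders.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    Config.auto(),
    default_project="customer-a",
    default_domain="production",
)
wf = remote.fetch_workflow(name="al_pipeline", version="v1")
execution = remote.execute(wf, inputs={"budget": 500})
```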
Airflow was much harder to distribute
With Flyte, the execution engine and scheduler are entirely decoupled. This means the user can choose to have every single workflow independently executed on a different execution engine. It makes a huge difference when many simultaneous projects run on the Alectio platform. This feature allows users to distribute workflows and scale easily based on traffic.
Plugins can be a pain to manage in Airflow
The Alectio engineers were not the first ones (and certainly not the last) to complain about it, but oftentimes, plugins just wouldn’t refresh in Airflow. Sometimes, the only resolution to the problem was simply to restart the scheduler. Needless to say, this type of issue was hard to reproduce and debug. No engineer wants to have to tell a customer to “just restart a job,” especially more than halfway through an AL process. Switching to Flyte solved this problem for good.
Though there’s more, I believe this list of Flyte’s advantages over Airflow is enough to explain why Flyte was the clear winner for us!
Our mission to deliver intuitive and off-the-shelf AL and data-centric AI solutions to the ML community is certainly not an easy one, and we keep working hard every single day to meet the demand, but one thing is for sure: Flyte, as well as the Flyte team and its outstanding support, has made that mission easier — and we’re grateful for that.