David Espejo

Declarative infrastructure: The power of unmatched reliability

Machine learning pipelines fail — that’s understood. It’s not always trivial, though, to identify where a particular failure came from, especially considering how complex an ML system can be. 

During the USENIX OpML 2020 conference, two engineers from Google shared the results of a study on postmortem metadata for one of the largest ML pipelines at Google, one that has been working (and sometimes failing) for more than 15 years. They found that 62% of failures came from outside the ML system; of that group, “the majority (60%) at least partially resembled problems that are characteristic of distributed systems.” According to the taxonomy they developed for the study, those problems refer mainly to “systems orchestration: which process to run where” and also resource management.

While these are the conclusions for one particular system, the study sheds light on what could be a more general pattern. In the words of Chip Huyen:

A reason for the prevalence of software system failures is that because ML adoption in the industry is still nascent, tooling around ML production is limited and best practices are not yet well-developed or standardized.1

Indeed, software engineering patterns have been around for a while, maturing and growing in adoption. Therefore, it could be useful for ML workloads to adopt some of them, reaping their potential benefits including resilience; scalability; and, in sum, agility. 

In this post, we’ll cover an approach to designing and operating infrastructure that consistently confers those advantages to applications being managed by modern orchestration systems.

The reconciler pattern

At its core, the idea is simple: The user declares their intent, and a mechanism reads it and makes it happen. The intent here is the “declared state,” the representation of the expected state of the infrastructure. The mechanism is a “controller” that implements the reconciliation logic: reading the declared state, comparing it with the actual state and reconciling the two representations. If there is any subsequent change (a mutation) in the declared state, the controller should promptly mutate the necessary resources in the infrastructure so both representations match again.

The code implementation of the idea is also simple enough. This is what a reconciler could look like in Go2:

type Reconciler interface {
  GetActual() (*Api, error)                            // read the actual state of the infrastructure
  GetExpected() (*Api, error)                          // read the declared (expected) state
  Reconcile(actualApi, expectedApi *Api) (*Api, error) // converge the actual state toward the expected one
}
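
To make the control loop explicit, here’s a minimal, hypothetical sketch of how a controller could drive that interface. The Api placeholder, the polling interval and the logging are illustrative assumptions, not an actual implementation:

package reconcile

import (
  "log"
  "time"
)

// Api is a hypothetical placeholder for the object being managed
// (a cluster, a workflow, a pool of compute resources).
type Api struct{}

type Reconciler interface {
  GetActual() (*Api, error)
  GetExpected() (*Api, error)
  Reconcile(actualApi, expectedApi *Api) (*Api, error)
}

// Run polls both representations and converges them on every tick:
// the user gets what was declared, or an error.
func Run(r Reconciler, interval time.Duration) {
  for {
    actual, err := r.GetActual()
    if err != nil {
      log.Println("reading actual state:", err)
      time.Sleep(interval)
      continue
    }
    expected, err := r.GetExpected()
    if err != nil {
      log.Println("reading expected state:", err)
      time.Sleep(interval)
      continue
    }
    if _, err := r.Reconcile(actual, expected); err != nil {
      log.Println("reconciliation failed:", err)
    }
    time.Sleep(interval)
  }
}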

A logical overview of the pattern is also shown in the diagram:

The power of this pattern comes from its simplicity and, especially, from the guarantee it gives to the user: they will get what they declared, or an error. This not only sets a baseline for reliable infrastructure; it also shifts the paradigm for platform engineers, who now manage infrastructure by managing an app.

Nevertheless, some concerns may arise: How would such a controller behave in the presence of failures? How would it keep the two representations, declared and actual, consistent? To answer these questions, it helps to borrow a concept from electrical engineering.

Level-triggered vs edge-triggered systems

Consider an input signal (or a set of states for a particular resource) that can only be either ON or OFF. An edge-triggered system reacts only at the moment the signal transitions from one state to the other: if it misses that transition, the event is lost. A level-triggered system, instead, acts on the signal’s current level: whenever it observes the signal, it compares that level with the last observed state and reacts if they differ.

Under ideal conditions (that is, no failures), it’s hard to see why either of the two options would be preferable.

Consider, then, a scenario where two network partitions occur right at the moments when the input signal changes:

Diagrams inspired by hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d 

An edge-triggered controller misses the transitions that occur during the partitions: it can neither detect them nor restore the last observed state. As a result, it renders an inconsistent representation of the input signal.

A level-triggered controller, though, will store the last observed state even during partitions or failures. When it is able to detect the change, it updates the observed state accordingly. As a consequence, the rendered signal is a more accurate representation of the input signal. Yes, there will be some delay, but the system will be eventually consistent. That delay is the price of prioritizing Availability over Consistency in the face of Partitions (see the CAP theorem).
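
To make the distinction concrete, here’s a rough Go sketch of the two behaviors. The Signal interface and the type names are made up for illustration:

package trigger

// Signal is the observable input: true means ON, false means OFF.
type Signal interface {
  Current() bool
}

// EdgeTriggered only updates when it is explicitly notified of a
// transition. If that notification is lost (say, during a network
// partition), the change is lost with it.
type EdgeTriggered struct {
  state bool
}

func (e *EdgeTriggered) OnChange(newState bool) {
  e.state = newState // no event delivered, no update
}

// LevelTriggered re-reads the signal on every tick and converges on
// whatever level it observes, so a missed transition is repaired on
// the next successful observation (eventual consistency).
type LevelTriggered struct {
  state bool
}

func (l *LevelTriggered) Tick(s Signal) {
  if observed := s.Current(); observed != l.state {
    l.state = observed // reconcile toward the observed level
  }
}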

By now, you may be wondering how all this theory applies to ML/Data workloads. Next, we’ll explore a particular implementation of level triggering and the reconciler pattern that addresses exactly that.

Flyte Propeller: A controller for ML/Data workloads

One of the core components of the Flyte platform is Propeller, a Kubernetes controller that implements the reconciler pattern. Propeller reads the declared state (the workflows, tasks and integrations a Flyte user expects to run), and it guarantees that, even in the presence of failures or partitions, it will eventually reconcile the observed state with the desired state.

Summarized view of Flyte’s logical architecture. See the documentation for more details.

The fact that Propeller interacts with the Kubernetes API, which is essentially a level-triggered system, provides multiple guarantees to users:

A consistent data structure for all inputs and outputs

There’s a common characteristic of general-purpose orchestrators (including Kubernetes): They avoid loops in the object representation by making the reconciler iterate through resources linearly, using sets instead of graphs. For ML/Data workloads this is a limitation, especially considering the iterative nature of most ML use cases (e.g., batch inference) where loops for retraining are necessary. In Flyte, workflows are represented as Directed Acyclic Graphs (DAGs), a data structure Propeller understands natively. Flyteadmin translates the DAG definition into a FlyteWorkflow custom resource, an object representation the Kubernetes API can handle, while the execution logic (ordering, branching and so on) remains under Propeller’s control.
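
As an illustration only, and not Flyte’s actual types, a minimal representation of such a graph that a controller could walk might look like this:

package dag

// Node is a single task in a workflow graph; edges point to the tasks
// that depend on it.
type Node struct {
  Name       string
  Downstream []*Node
}

// Walk visits every reachable node exactly once, which is all a
// controller needs when the graph is guaranteed to be acyclic.
func Walk(start *Node, visited map[string]bool, visit func(*Node)) {
  if visited[start.Name] {
    return // already processed through another path
  }
  visited[start.Name] = true
  visit(start)
  for _, next := range start.Downstream {
    Walk(next, visited, visit)
  }
}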

Immutable resources (workflow executions)

According to the Flyte documentation:

“Each execution of a workflow definition results in the creation of a new FlyteWorkflow CR (Custom Resource) which maintains a state for the entirety of processing.”

A resource that “maintains a state for the entirety of processing” is a good description of a level-triggered system, and it also points to another reason to trust the reconciler: it enforces immutability by creating a new resource every time the expected state changes (that is, every time a workflow is executed). This gives users a strong guarantee about the reliability of experiment results, because past workflow executions can never be mutated.

Not only that, but Propeller will memoize task executions using the DataCatalog component, adding the ability to trace back what’s been executed and rerun it, enabling the reproducibility and traceability that most ML workloads need today.
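
The idea behind that memoization can be sketched roughly as a cache keyed by the task’s identity and a hash of its inputs. This is a simplified stand-in, not DataCatalog’s actual API:

package memo

import (
  "crypto/sha256"
  "encoding/hex"
)

// Catalog is a simplified stand-in for an artifact store.
type Catalog struct {
  entries map[string][]byte
}

func NewCatalog() *Catalog {
  return &Catalog{entries: map[string][]byte{}}
}

// key derives a cache key from the task identity and its serialized
// inputs: an identical task run with identical inputs maps to the
// same entry.
func key(taskID string, inputs []byte) string {
  sum := sha256.Sum256(append([]byte(taskID+"/"), inputs...))
  return hex.EncodeToString(sum[:])
}

// Execute returns the cached output when the same task and inputs have
// already been run; otherwise it runs the task and records the result
// so future executions can reuse it.
func (c *Catalog) Execute(taskID string, inputs []byte, run func([]byte) []byte) []byte {
  k := key(taskID, inputs)
  if out, ok := c.entries[k]; ok {
    return out // cache hit: skip re-execution
  }
  out := run(inputs)
  c.entries[k] = out
  return out
}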

What does this all mean for ML teams?

In the quest to productionize ML workflows, which is still a challenge for many, adopting software engineering patterns can produce outcomes like Reproducibility; Traceability; and, in sum, Reliability.

Flyte takes an opinionated approach to closing the gap between ML and Software Engineering by tapping into modern orchestration patterns that work under the hood, without requiring code changes or major alterations to the development process. As a result, Flyte shifts the paradigm once more, enabling ML/Data engineers to define the infrastructure they need by defining an app.

References

  1.  Huyen, C. “Designing Machine Learning Systems,” p. 228. O’Reilly, 2022.
  2.  A more detailed and commented implementation is shown in the book “Cloud Native Infrastructure” by Justin Garrison and Kris Nova, 2018. Freely available at: https://www.vmware.com/content/dam/learn/en/cloud/pdf/Cloud_Native_Infrastructure_eBook.pdf