David Espejo

Speed up time to production with Flyte on EKS: from automation to multicloud

Introduction

As machine learning workflows become increasingly complex, organizations need robust orchestration platforms to handle resource-demanding workloads efficiently. In this post, we'll explore how to deploy Flyte, the highly scalable, open-source AI orchestration platform, on Amazon Elastic Kubernetes Service (EKS) using Terraform, examining both the benefits and the challenges.

<div class="text-align-center heading-style-h3">Union is a managed, optimized version of Flyte.</div>
<div class="padding-bottom padding-small"></div>
<div class="button-group is-center"><a class="button" href="https://www.union.ai/demo" target="_blank">Book a demo to learn more</a></div>

Understanding the components

Thanks to its native reliability and efficiency features, Flyte can orchestrate workflows on compute environments with diverse resource profiles, including edge devices. Nevertheless, enterprises often require the observability, high availability, and resource elasticity cloud providers offer.

Union maintains a Terraform codebase for highly automated, production-grade Flyte deployments. In this post, we’ll explore which components get deployed, and why, when you use those scripts.

General overview of a Flyte deployment on EKS performed with the Terraform scripts

Database

Reproducibility, or the system's ability to reproduce the results of a particular experiment given the same original conditions, is a key design principle for reliable ML pipelines. Flyte stores the inventory of artifacts produced during the lifecycle of a workflow, including executions, launch plans, projects, and resources, in a database. To this end, the Terraform scripts deploy an Amazon Aurora PostgreSQL 14.9 database engine, configured with 2 vCPU, 4 GiB of memory, and randomly generated access credentials. The connection between Flyte and the database is handled automatically by Terraform, with the access credentials stored in a Kubernetes Secret, making this a zero-touch configuration step.

Compute

Flyte runs as a Deployment in a Kubernetes cluster. This ensures high availability and access to resource pools using native Kubernetes abstractions.

The Terraform scripts deploy an EKS cluster with a 3-node compute pool that can automatically scale up to 5 nodes. You can update the limits here.

The default instance type, `m7i.xlarge`, provides 4 vCPU and 16 GiB of RAM and is controlled by the `locals.instance_type` parameter.
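As a point of reference, the resource requests of a task must fit within a single node's allocatable capacity, or its Pod will remain pending. A minimal sketch, with an illustrative task name and request values:

from flytekit import task, Resources

# Requests and limits sized to fit comfortably on an m7i.xlarge (4 vCPU / 16 GiB) node
@task(requests=Resources(cpu="2", mem="4Gi"), limits=Resources(cpu="3", mem="8Gi"))
def preprocess_data() -> None:
    ...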

Configuring GPUs

The scripts also automate the configuration of a GPU-powered node pool. To enable it, you need to:

  • Set `locals.gpu_count` to a number greater than 0
  • If you plan to request specific accelerators, indicate the model here

With the above configuration in place, Terraform will automatically set up the node pool with the labels and taints necessary so you can request the accelerator(s) from the task decorator as shown in the example:

from flytekit import task, Resources
from flytekit.extras.accelerators import V100

@task(requests=Resources(gpu="1"), accelerator=V100)  # request one NVIDIA Tesla V100
def train_model() -> None:  # illustrative task name
    ...

The Kubernetes scheduler will then place each Pod on a node with available GPU devices that match the request. Learn more about how Flyte orchestrates access to accelerators.
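Beyond whole devices, `flytekit` also exposes constants for partitioned (multi-instance) GPUs. A hedged sketch, with an illustrative task name:

from flytekit import task, Resources
from flytekit.extras.accelerators import A100

# Request a 2g.10gb MIG slice of an NVIDIA A100 instead of a full device
@task(requests=Resources(gpu="1"), accelerator=A100.partition_2g_10gb)
def fine_tune() -> None:
    ...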

Spot instances

Flyte supports the efficient use of spot instances with interruptible tasks and intratask checkpoints, which the system uses to resume executions when the compute instance is preempted. By default, this is disabled in the reference implementation, but you can set `locals.spot` to `true` to let Terraform automate the configuration of a node pool with spot instances.
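From the user's perspective, this amounts to two task-level settings. The sketch below is illustrative (the loop body and checkpoint payload are placeholders), but it shows the `interruptible` flag and the intratask checkpoint API mentioned above:

from flytekit import task, current_context

@task(interruptible=True, retries=3)  # eligible for spot nodes; retried if the node is preempted
def train(n_epochs: int) -> int:
    cp = current_context().checkpoint
    prev = cp.read()  # bytes from the last saved checkpoint, or None on the first attempt
    start = int(prev.decode()) if prev else 0
    for epoch in range(start, n_epochs):
        # ... one epoch of training would run here ...
        cp.write(str(epoch + 1).encode())  # persist progress so a retry resumes from here
    return n_epochs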

Blob storage

When you register a workflow with Flyte, your code is compiled into a language-independent representation that carries the workflow definition and its input and output types. This representation is then packaged and stored durably in blob storage, from where `flytepropeller`, Flyte’s execution engine, retrieves it to instruct the respective compute framework (e.g., Kubernetes, Spark, etc.) to run the workflow(s) and report status. The raw input data used to train and validate the model is also stored in S3.

The Terraform scripts create a single S3 bucket with default policies to store metadata and raw data. 
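Users rarely touch this bucket directly: values returned by tasks that are too large to pass inline (files, dataframes, models) are offloaded to it automatically. A minimal sketch with illustrative names:

from flytekit import task
from flytekit.types.file import FlyteFile

@task
def export_report() -> FlyteFile:
    # The local file is uploaded to the configured S3 bucket when the task completes;
    # downstream tasks receive a reference and download it on demand.
    path = "/tmp/report.csv"
    with open(path, "w") as f:
        f.write("column_a,column_b\n1,2\n")
    return FlyteFile(path)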

Monitoring

This reference implementation includes the Amazon CloudWatch Observability stack, automatically configured to capture the logs of the containers in the cluster. Flyte uses this configuration to build logging links and display them in the right pane of the UI for each execution. Each logging link points to the AWS Console and the specific log stream of that particular execution, accelerating troubleshooting when a failure occurs.

Security and access control

Using IAM Roles for Service Accounts (IRSA), Flyte can map identity between the Kubernetes resources and the AWS services it depends on. These configurations apply only to the backend and do not interfere with user-to-control-plane communication; hence, they do not perform RBAC or user access control.

The following figure illustrates how the Flyte reference implementation secures access to AWS resources:

IAM Roles for Service Accounts connects Kubernetes Service Accounts to IAM Roles to enable secure access to AWS resources

Of the components that make up a Flyte deployment, only `flyteadmin` (control plane), `flytepropeller` (execution engine), and `datacatalog` (data memoization service) interact with platform resources directly. In this architecture, each Pod mounts a specific Kubernetes Service Account annotated with the ARN of an IAM Role:

# kubectl describe sa default -n flytesnacks-development
Name:                default
Namespace:           flytesnacks-development
Labels:              <none>
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/flyte-production-flyte-worker
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

Example output of the `default` Service Account that every Flyte execution uses, annotated with the IAM role

An IAM Role binding completes the model by associating the IAM Role with a Service Account and a Policy that specifies the actions the workloads are allowed to perform once authenticated:

{
    "Statement": [
        {
            "Action": [
                "s3:PutObject*",
                "s3:ListBucket",
                "s3:GetObject*",
                "s3:DeleteObject*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<ACCOUNT_ID>-flyte-production-data/*",
                "arn:aws:s3:::<ACCOUNT_ID>-flyte-production-data"
            ]
        }
    ],
    "Version": "2012-10-17"
}

This policy means that the Flyte “workers”, the Pods spawned for each execution, can perform only a limited set of actions, and only on the S3 bucket the system uses to retrieve inputs and store metadata.

Networking

Flyte’s networking requirements are basic:

  • Connectivity between the control and data plane (`flyteadmin` and `flytepropeller` essentially)
  • Ability to expose both `http` and `grpc` endpoints to the client

The first requirement is met using Kubernetes Services, which enable intra-cluster communication with a stable DNS name. Every time `flytepropeller` needs to interact with `flyteadmin` (for example, to send execution events), it simply connects to `flyteadmin.flyte.svc.cluster.local`:

kubectl get svc -n flyte
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                          
datacatalog         NodePort    172.20.50.40     <none>        88:31167/TCP,89:31463/TCP        
flyte-pod-webhook   ClusterIP   172.20.135.102   <none>        443/TCP                          
flyteadmin          ClusterIP   172.20.151.125   <none>        80/TCP,81/TCP,87/TCP,10254/TCP
flyteconsole        ClusterIP   172.20.101.44    <none>        80/TCP    

This principle also applies to multi-cluster deployments, where multiple Kubernetes clusters, acting as data planes only, interact with the control plane through the `flyteadmin` Service.

The Container Network Interface (CNI) plugin that comes with the AWS VPC allocates one IP per Pod. While this is manageable for early evaluations, as your platform starts to scale, a small subnet can limit the number of Pods that can be spun up, and resizing CIDR ranges after the fact can be cumbersome. This implementation uses a /16 CIDR range for the VPC, /16 for private subnets, and /24 for public subnets, giving considerable room for growth.

The reference implementation includes the configuration of an Application Load Balancer managed by the AWS Load Balancer Controller. In the EKS cluster, this materializes as an Ingress resource that routes requests to the Flyte Services according to the paths that ship with Flyte’s configuration:

Network components deployed by Terraform to securely expose a Flyte deployment

These mechanisms are used to expose the `grpc` and `http` endpoints to the client either for programmatic access or direct interaction using the CLI or UI.

To secure these connections, the reference implementation configures a TLS certificate and terminates the secure connection at the Ingress layer. The certificate is validated with a Route53 DNS record created and managed by `external-dns`, a controller that automates the configuration of DNS records for Ingress and other Kubernetes network resources, eliminating the need to directly manage and operate DNS records. You just need to go to the `locals.tf` file and provide the name of a DNS zone managed by your organization.

Flyte installation

In previous versions of the reference implementation, the user was asked to update the values file and install the Helm chart manually, but the process is now completely automated: Terraform takes the outputs from all previous steps and feeds them into the values file. Terraform also adds the new EKS cluster to the contexts in your `kubeconfig` file and leaves your CLI connected to the cluster. The only output generated by the deployment process is the DNS entry assigned to the Load Balancer, which becomes the entry point of your Flyte cluster. From that point on, your cluster is ready to run workflows!
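For example, you can point `FlyteRemote` at that DNS entry and start executions programmatically. A minimal sketch, assuming a hypothetical endpoint name and an already registered workflow:

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Hypothetical endpoint: use the Load Balancer DNS name produced by Terraform
remote = FlyteRemote(
    Config.for_endpoint("flyte.example.com"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch a registered workflow (latest version) and launch an execution
wf = remote.fetch_workflow(name="workflows.example.wf")
execution = remote.execute(wf, inputs={})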

Scaling your Flyte deployment

As you start using Flyte, and especially as the resource requests of your tasks grow, you may need to scale your deployment. You can scale out the node groups to add capacity, or incorporate additional EKS clusters as data-plane-only resources (learn how Flyte supports multicluster).

Extending your infrastructure beyond the limits of a single cloud region, or even to multiple cloud providers, introduces challenges around data locality, networking, security, and more.

This goes far beyond the scope of Flyte, but it’s a baseline feature of the Union platform, where you can map a specific project or domain to a particular AWS subaccount while another project or domain is mapped for execution on a GCP account. Union also handles data and metadata isolation depending on your needs to maximize efficiency and security.

Learn more about multicloud in Union and sign up for a demo.

Conclusion

The Terraform/OpenTofu scripts that Union maintains make it easy to get started with Flyte quickly. To take advantage of multicluster, multicloud, cost allocation, and other features that help you scale your ML infrastructure without the headaches, consider Union.

<div class="text-align-center heading-style-h3">Union is a managed, optimized version of Flyte.</div>
<div class="padding-bottom padding-small"></div>
<div class="button-group is-center"><a class="button" href="https://www.union.ai/demo" target="_blank">Book a demo to learn more</a></div>