Simple is beautiful: Revolutionizing GNN Training Infrastructure at LinkedIn
During December's Flyte Community Sync, Shuying Liang shared how LinkedIn's AI platform team has developed a groundbreaking approach to managing distributed deep learning infrastructure, specifically for Graph Neural Network (GNN) training. The solution tackles complex challenges in data service orchestration through the Flyte Agent Framework.
The Challenge
Graph Neural Networks are critical for understanding complex relationships across LinkedIn's professional network. However, training these models at scale requires intricate data loading, sampling, and processing across multiple nodes and GPUs, typically handled by dedicated data services. The missing piece was the infrastructure to decide how and where to run these data services on Kubernetes, and to keep them scalable and reliable alongside the training and inference processes.
The LinkedIn Approach
LinkedIn has been migrating all their training pipelines to Flyte, leveraging the existing Flyte integrations to build a complete production AI platform.
To streamline the provisioning and scaling of these data services, LinkedIn first explored developing a Flyte backend plugin. However, the team found significant advantages in the Flyte Agent framework, which let them extend their platform with simpler testing, integration, and configuration rollout processes.
LinkedIn developed a custom Flyte Agent that:
- Decouples business logic from infrastructure complexities
- Simplifies Kubernetes orchestration
- Enables scalable, reliable data services for deep learning
- Works across multiple LinkedIn data centers in Texas and Virginia
Key Technical Approach
Rather than taking on the complexity of Flyte backend plugin development paired with a custom Kubernetes operator, LinkedIn used the Flyte Agent Framework to:
- Harness Kubernetes-native constructs such as StatefulSets and Services to reliably provision and scale data services (see the sketch after this list).
- Offer flexible task definitions for users, enhancing adaptability.
- Simplify CI/CD processes for custom agent deployment.
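To make the first point concrete: a StatefulSet fronted by a headless Service gives each replica a stable DNS name, which is what makes data-service endpoints predictable. The sketch below simply constructs those endpoints from the standard Kubernetes naming convention; the resource names, namespace, and port are hypothetical, not LinkedIn's actual values.

```python
# Hypothetical names for illustration; LinkedIn's actual resource names are not public.
STATEFULSET_NAME = "deepgnn-data"      # StatefulSet backing the graph data service
HEADLESS_SERVICE = "deepgnn-data-svc"  # headless Service (clusterIP: None) in front of it
NAMESPACE = "gnn-training"
REPLICAS = 4
PORT = 8080


def data_service_endpoints() -> list[str]:
    """Return the stable per-replica DNS names Kubernetes guarantees for a StatefulSet
    behind a headless Service: <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local."""
    return [
        f"{STATEFULSET_NAME}-{i}.{HEADLESS_SERVICE}.{NAMESPACE}.svc.cluster.local:{PORT}"
        for i in range(REPLICAS)
    ]


if __name__ == "__main__":
    for endpoint in data_service_endpoints():
        print(endpoint)
```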
This approach centers on simple APIs and streamlined workflow management, improving the end-to-end user experience through a few key elements:
Data Service Configuration
- DataServiceTask is a DSL LinkedIn created that lets users define infrastructure parameters such as resource requirements, number of replicas, and node execution commands (a hypothetical sketch follows this list)
- Uses DeepGNN for graph data services and feature aggregations
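LinkedIn has not yet published the DataServiceTask API, so the snippet below is only a sketch of what such a configuration could look like, assuming a dataclass-style definition with hypothetical field names (replicas, cpu, memory, port, command).

```python
from dataclasses import dataclass, field


# Hypothetical shape of a DataServiceTask configuration; the real DSL is not yet
# open-sourced, so the class and field names here are illustrative only.
@dataclass
class DataServiceConfig:
    name: str                  # logical name of the data service
    replicas: int              # number of StatefulSet replicas to provision
    cpu: str                   # CPU request per replica, e.g. "8"
    memory: str                # memory request per replica, e.g. "32Gi"
    port: int                  # port each replica serves on
    command: list[str] = field(default_factory=list)  # command run on each node


# Example: a DeepGNN-style graph data service with four replicas. The launcher
# script name is hypothetical. In the real system, a config like this would be
# attached to a DataServiceTask node in a Flyte workflow.
graph_service = DataServiceConfig(
    name="deepgnn-graph",
    replicas=4,
    cpu="8",
    memory="32Gi",
    port=8080,
    command=["python", "serve_graph_engine.py", "--port", "8080"],
)
```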
Training Job Execution
- No changes to existing training job processes
- Environment variables are used to specify data service endpoints in training jobs (see the sketch after this list)
- Allows reusing data services across different workflows and training runs
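Because the training code itself does not change, wiring reduces to reading an endpoint list from the environment. A minimal sketch, assuming a hypothetical variable name DATA_SERVICE_ENDPOINTS injected by the platform:

```python
import os


def get_data_service_endpoints() -> list[str]:
    """Read the comma-separated data-service endpoints injected by the platform.

    DATA_SERVICE_ENDPOINTS is a hypothetical variable name; the actual name used
    by LinkedIn's platform is not public.
    """
    raw = os.environ["DATA_SERVICE_ENDPOINTS"]
    return [endpoint.strip() for endpoint in raw.split(",") if endpoint.strip()]


def train() -> None:
    endpoints = get_data_service_endpoints()
    # ... point the graph/feature clients at `endpoints` and run the existing
    # training loop; nothing else in the job needs to change.
    print(f"Training against {len(endpoints)} data-service replicas")


if __name__ == "__main__":
    train()
```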
System Architecture
- Agent framework dispatches task requests to custom agents
- Agents implement callback functions to create, retrieve, and delete Kubernetes-native constructs such as StatefulSets and Services (a conceptual sketch follows this list)
- Backend plugin (integrated with Kubeflow) manages training pods
- Provides stable and predictable endpoints for communication between training and data services
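The agent's work boils down to three callbacks over Kubernetes objects. The sketch below shows that shape using the official kubernetes Python client; it is a conceptual outline rather than LinkedIn's implementation, and in a real agent these methods would be wired into flytekit's agent base class, whose exact interface varies by flytekit version.

```python
from kubernetes import client, config


class DataServiceAgentSketch:
    """Conceptual outline of the create/get/delete callbacks a data-service agent
    implements. Not LinkedIn's code; the flytekit agent plumbing is omitted."""

    def __init__(self, namespace: str = "gnn-training"):
        config.load_incluster_config()   # assumes the agent runs inside the cluster
        self.apps = client.AppsV1Api()
        self.core = client.CoreV1Api()
        self.namespace = namespace

    def create(self, statefulset: client.V1StatefulSet, service: client.V1Service) -> str:
        # Provision the data service: a StatefulSet for the replicas plus a
        # (typically headless) Service that gives them stable DNS endpoints.
        self.apps.create_namespaced_stateful_set(self.namespace, statefulset)
        self.core.create_namespaced_service(self.namespace, service)
        return statefulset.metadata.name  # returned as the task's resource handle

    def get(self, name: str) -> str:
        # Report a phase back to the agent framework based on ready replicas.
        sts = self.apps.read_namespaced_stateful_set(name, self.namespace)
        ready = sts.status.ready_replicas or 0
        return "RUNNING" if ready == sts.spec.replicas else "PENDING"

    def delete(self, name: str, service_name: str) -> None:
        # Tear down both constructs once the workflow no longer needs the service.
        self.apps.delete_namespaced_stateful_set(name, self.namespace)
        self.core.delete_namespaced_service(service_name, self.namespace)
```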
With these elements in place, LinkedIn provides a flexible, reliable platform for productionizing machine learning training environments, with easy-to-use APIs and efficient resource management.
Future Outlook
LinkedIn plans to open-source its data service solution, contributing back to the Flyte open-source community and helping other organizations streamline their deep learning infrastructure.
Stay tuned for more details as LinkedIn prepares to share its implementation with the world.