Samhita Alla

Flyte 1.6: Runtime Metrics UI, ImageSpec, External Backend Plugins and More

We’re excited to announce the release of Flyte 1.6! This latest version brings a plethora of exciting new features and notable improvements that showcase the remarkable contributions of the Flyte community. Let's dive into the highlights of what Flyte 1.6 has to offer.

Runtime metrics UI

While the timeline view provided some insights into performance debugging, there was a clear need for more detailed and comprehensive information. The integration of the runtime metrics breakdown marks a significant step forward in addressing this need.

The runtime metrics breakdown enhances the timeline view by categorizing node executions into a collection of time-series data, providing a more granular understanding of workflow performance. Inspired by systems like Jaeger, which employ time-series telemetry data with events, this solution enhances the timeline view with tick marks that display explanatory messages when you hover over them.

Human-readable messages provide valuable insights into the reasons behind reported execution statuses. For example, an execution may incur significant overhead in front-end plugins, causing a delay between Flyte starting the task and the back-end service signaling that it has begun. From that feedback, users can infer potential issues such as scheduling contention or prolonged image pull times. Such information, previously accessible only to Kubernetes experts through FlytePropeller, is now readily available to all users.

Runtime metrics in the timeline view

`ImageSpec`: Build images without a Dockerfile

Flyte 1.6 brings an exciting new feature called `ImageSpec`. Building container images traditionally involved crafting intricate Dockerfiles to specify the desired configurations, dependencies and environment setup. This approach required in-depth knowledge of Docker and added an extra layer of complexity to the development workflow. However, `ImageSpec` changes the game entirely.

Using `ImageSpec`, you can define and build container images for your Flyte tasks and workflows effortlessly. It provides a straightforward and intuitive way to specify the necessary components without the need for a Dockerfile. You can define essential elements like Python packages, APT packages and environment variables directly within the `ImageSpec` configuration.

from typing import Tuple

import pandas as pd
from flytekit import ImageSpec, Resources, task

# Image with pandas and numpy installed on top of the flytekit base image
pandas_image_spec = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.8-1.6.0",
    packages=["pandas", "numpy"],
    python_version="3.9",
    apt_packages=["git"],
    env={"Debug": "True"},
)

# A leaner image that only adds scikit-learn
sklearn_image_spec = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.8-1.6.0",
    packages=["scikit-learn"],
)

# Import scikit-learn only when running inside the image that actually ships it
if sklearn_image_spec.is_container():
    from sklearn.linear_model import LogisticRegression


@task(container_image=pandas_image_spec)
def get_pandas_dataframe() -> Tuple[pd.DataFrame, pd.Series]:
    df = pd.read_csv(
        "https://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
    )
    print(df.head())
    return df[["age", "thalach", "trestbps", "chol", "oldpeak"]], df.pop("target")


@task(container_image=sklearn_image_spec, requests=Resources(cpu="1", mem="1Gi"))
def train_model(
    model: LogisticRegression, feature: pd.DataFrame, target: pd.Series
) -> LogisticRegression:
    model.fit(feature, target)
    return model

`ImageSpec` lets you customize and optimize your container images for specific use cases. You can easily specify the desired image composition and include only the necessary packages and dependencies, reducing image size and minimizing unnecessary overhead.

Detailed documentation and a recorded video demonstration of this feature will be available soon.

External backend plugins

Flyte 1.6 introduces an external plugin system, a new component designed to address the challenges of implementing backend plugins within the Flyte ecosystem. This feature aims to simplify authoring, testing and deploying plugins, particularly for data scientists and ML engineers who may not have expertise in Golang.

Key goals of the external plugin system include:

  1. Easy plugin authoring: Users can author plugins without code generation or unfamiliar tools, empowering MLEs and data scientists.
  2. Support for communication with external services: The emphasis is on plugins that seamlessly interact with external services.
  3. Independent testing and private deployment: Test plugins independently and deploy them privately, for flexibility and control over plugin development.
  4. Backend plugin execution in local environments: Users, especially in Flytekit and UnionML, can leverage backend plugins for local development, streamlining the development process.
  5. Language-agnostic plugin authoring: Author plugins in any programming language; users can work with their preferred language and tools.
  6. Scalability: Plugins are designed to be scalable to handle workloads of any size effectively.
  7. Simple API: Plugins expose a straightforward API, making integration and usage easy for developers.
  8. UI integration: Plugins visible in the Flyte UI provide additional details and enhance the user experience.

External plugin system architecture

The detailed documentation of this feature will be available soon.
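
In the meantime, here is a purely illustrative sketch of the shape such a plugin tends to take; the class and method names below are hypothetical and not Flyte's actual API. The core idea is that a plugin for an external service only needs to know how to create a remote job, poll its status and delete it:

# Hypothetical sketch of an external backend plugin for a remote service.
# The class and method names are illustrative only, not Flyte's actual API.
from dataclasses import dataclass


@dataclass
class JobStatus:
    phase: str          # e.g. "RUNNING", "SUCCEEDED", "FAILED"
    message: str = ""


class ExternalServicePlugin:
    """Lifecycle of a task that runs on an external service: create, poll, delete."""

    def create(self, task_spec: dict) -> str:
        # Submit the work to the external service and return its job ID.
        ...

    def get(self, job_id: str) -> JobStatus:
        # Poll the external service and translate its state into a Flyte-style phase.
        ...

    def delete(self, job_id: str) -> None:
        # Abort the remote job when the Flyte execution is terminated.
        ...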

Pyflyte prettified

The pyflyte CLI is a versatile tool that caters to a wide range of tasks, including local code testing, remote execution, and execution monitoring. To enhance the user experience, we’ve incorporated the rich-click library to prettify the logs generated by the pyflyte CLI.

With the integration of rich-click, the pyflyte CLI now offers improved and visually appealing output. The logs are presented in a more readable and structured format, making it easier for users to understand and debug their code. The enhanced log presentation significantly streamlines the development process, allowing for more efficient troubleshooting and error identification.

Prettified stack trace

Enhanced task execution insights

We’re excited to introduce enhanced execution insights that provide you with detailed information about the execution time of various components within a task, including both Flytekit and user-defined parts. This feature offers a valuable tool for optimizing your workflows and improving overall performance.

A standout aspect of this enhancement is the ability to visualize the execution timeline. By simply setting the `disable_deck` parameter to `False`, you can generate a comprehensive timeline graph that showcases the duration of different components involved in task execution. 

from flytekit import task, Resources, workflow
from flytekit.core.utils import timeit


@task(
    disable_deck=False,
    limits=Resources(mem="4Gi", cpu="1"),
)
def t1():
    import time

    # Test timeit used as a decorator
    @timeit("Download data from s3")
    def download_data():
        time.sleep(1)

    download_data()

    # Test timeit used as a context manager
    # Simulate a user using a very long name for a time measurement
    with timeit("Test long string " * 20):
        time.sleep(1)

    # Simulate multiple time measurements
    for i in range(10):
        with timeit(f"Run small tasks {i}"):
            # The 'print' function will execute quickly compared to 'time.sleep(1)'.
            # This is used to test if the timeline graph can accurately represent
            # measurements with significantly varying execution times.
            print("hello world")


@workflow
def wf():
    t1()


if __name__ == "__main__":
    wf()

Explore and analyze the time taken by each component in a task with FlyteDecks

Lazy loading of Flytekit dependencies

Traditionally, when working with Flytekit, all the dependencies specified in your code are loaded upfront, regardless of whether they are needed or not. This approach can lead to unnecessary resource consumption and longer initialization times, particularly when dealing with large or rarely used dependencies.

With the introduction of lazy loading, Flytekit now handles dependency loading more efficiently and dynamically. Instead of loading all dependencies up front, Flytekit loads only the modules that are actually accessed through attributes. This reduces memory usage and improves overall performance: for a simple workflow, import time drops from 6-8 seconds to roughly 0.25-0.30 seconds, an improvement of up to ~30x!
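
To illustrate the general pattern (this is a minimal sketch, not necessarily how Flytekit implements it internally), Python's standard library supports attribute-triggered lazy imports via `importlib.util.LazyLoader`; the `pandas` module below is just an example of a heavy dependency:

import importlib.util
import sys


def lazy_module(name: str):
    """Return a module object whose actual import is deferred until an attribute is accessed."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up deferred loading; nothing heavy runs yet
    return module


pd = lazy_module("pandas")        # cheap: the module body has not executed
df = pd.DataFrame({"a": [1, 2]})  # first attribute access triggers the real import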

PyTorch elastic training (torchrun)

Flyte now offers seamless support for distributed training using PyTorch elastic (torchrun). With this powerful capability, you can easily configure and execute elastic training tasks to scale your training workload efficiently. Let's dive into the details and explore how you can leverage this feature in Flyte.

To begin, let's consider a scenario where you want to perform elastic training on a single node with a local worker group of size 4. With torchrun, this is achieved using the command `torchrun --nproc-per-node=4 --nnodes=1 ...`. With Flytekit, you can configure the same elastic training setup effortlessly:

from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(
        nnodes=1,
        nproc_per_node=4,
    )
)
def train():
    # Task implementation goes here
    ...

By specifying the `Elastic` task configuration, you can initiate 4 worker processes on a single node, both during local development and remote execution within a Kubernetes pod in a Flyte cluster. This capability enables you to take full advantage of distributed training resources, resulting in faster training times and improved model performance.

But what if you need to perform distributed elastic training on multiple nodes? Flyte has got you covered. Simply modify the `Elastic` task configuration to suit your requirements. Here's an example:

from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=4,
    ),
)
def train():
    ...

In this configuration, distributed training is performed on two nodes, with each node hosting four worker processes. This capability becomes particularly valuable when training large language models on Flyte, allowing you to efficiently harness distributed resources and tackle complex machine learning tasks.
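
To make this concrete, here is a minimal sketch of what the body of such a task might look like. It relies only on standard torchrun conventions (the `RANK` and `WORLD_SIZE` environment variables and `torch.distributed`); the model and data are placeholders, not part of the Flyte example above:

import os

import torch
import torch.distributed as dist
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=4,
    ),
)
def train() -> float:
    # torchrun sets these environment variables for every worker process.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Initialize the default process group; use "nccl" when training on GPUs.
    dist.init_process_group(backend="gloo")

    # Placeholder model and data; replace with your own training logic.
    model = torch.nn.Linear(10, 1)
    data = torch.randn(32, 10)
    target = torch.randn(32, 1)

    loss = torch.nn.functional.mse_loss(model(data), target)
    loss.backward()

    print(f"worker {rank}/{world_size} computed loss {loss.item():.4f}")
    dist.destroy_process_group()
    return loss.item()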

Other enhancements

  • Full-list log output now available in the execution detail panel, providing comprehensive visibility into task execution logs.

1.6 Contributors

We would like to extend our sincere gratitude to all the contributors who have made valuable contributions to Flyte 1.6. Your efforts in providing code, documentation, bug fixes and feedback have been instrumental in the continuous improvement and enhancement of Flyte.

{{contributors-1-6="/blog-component-assets"}}

We highly value the feedback of our users and community members, which helps us to improve our product continuously. To connect with other users and get support from our team, we encourage you to join our Slack channel. For updates on product development, community events, and announcements, follow us on Twitter to join the conversation and share your thoughts.

If you encounter any issues or bugs, please let us know by creating a GitHub issue. And if you find Flyte useful, don't forget to star us on GitHub.