Flyte 1.8: Enhanced ImageSpec, Integrations and More
We're delighted to announce the availability of Flyte 1.8! This update features a wide range of improvements and bug fixes. Let's dive into the key highlights!
In the Flyte 1.6 release, we introduced a powerful feature called ImageSpec, which enables you to build images without relying on a Dockerfile. With ImageSpec, you can easily define important components such as Python packages, APT packages and environment variables directly within the ImageSpec configuration.
In this latest release, we have made significant enhancements to ImageSpec, offering an even better user experience. One notable improvement is the ability to configure a custom pip index URL, allowing you to customize your package installations.
To make use of this feature, you can take advantage of the `pip_index` parameter when configuring your `ImageSpec`:
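A minimal sketch of how the parameter slots in — the registry, package list, and index URL below are placeholders, and `ImageSpec` is imported here from `flytekit.image_spec`:

```python
from flytekit import task
from flytekit.image_spec import ImageSpec

# Placeholder registry, packages, and index URL for illustration.
image = ImageSpec(
    name="my-image",
    registry="ghcr.io/my-org",
    packages=["pandas"],
    pip_index="https://pypi.my-org.com/simple",  # custom pip index URL
)

@task(container_image=image)
def my_task() -> str:
    return "built with a custom pip index"
```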
We have also added support for CUDA in ImageSpec. Now you can easily include CUDA in your image configuration.
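A sketch of the CUDA support, assuming the `cuda` and `cudnn` parameters take version strings (the versions and registry below are illustrative):

```python
from flytekit.image_spec import ImageSpec

# Illustrative versions; pick the CUDA/cuDNN combination your workload needs.
gpu_image = ImageSpec(
    name="my-gpu-image",
    registry="ghcr.io/my-org",
    packages=["torch"],
    cuda="11.8",   # CUDA toolkit version baked into the image
    cudnn="8",     # matching cuDNN version
)
```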
To avoid unnecessary image rebuilds, flytekit now caches the names of built images locally. This prevents the same image from being rebuilt, which is especially helpful when working with registries that lack strong consistency guarantees.
Additionally, we have introduced the capability to send your `requirements.txt` file directly to ImageSpec. This is particularly useful in maintaining a single source of truth for your dependencies. To accomplish this, simply include the `requirements` parameter when configuring your `ImageSpec`:
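For example, assuming a `requirements.txt` sits next to your workflow code (registry and file path are placeholders):

```python
from flytekit.image_spec import ImageSpec

image = ImageSpec(
    name="my-image",
    registry="ghcr.io/my-org",
    # Reuse the same dependency file you use for local development.
    requirements="requirements.txt",
)
```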
With these new enhancements, we aim to provide you with greater flexibility and efficiency when working with ImageSpec in Flyte.
Improved PyTorch elastic integration
PyTorch elastic integration was introduced in Flyte 1.6, enabling effortless configuration and execution of elastic training tasks using `torchrun`. This integration empowers you to scale your training workload efficiently.
In this latest release, we have made the following improvements to PyTorch elastic integration:
- Warning for local torch elastic training with `nnodes` > 1: We now emit a warning when a workflow is executed locally with `Elastic(nnodes=2)` or any other value greater than 1. In that case the worker rendezvous may time out, because the local workers wait for peers from a second node that does not exist in a local execution.
- Fixed configuration of user-facing execution parameters in spawned elastic tasks: Previously, when using `@task(task_config=flytekitplugins.kfpytorch.Elastic())`, the task function was launched in a number of worker processes via torch's `elastic_launch` (`torchrun`). When the `Elastic(start_method=...)` argument was used, however, the Flyte context and execution parameters were not correctly transferred to the child processes. This release fixes the issue by setting up the Flyte context properly in the spawned worker processes.
- Exclusion of master replica log link during elastic PyTorch training: Unlike non-elastic PyTorch distributed training, torch elastic training has no concept of a "master replica" in the resulting `PyTorchJob`. However, flyteplugins still generated a log link for this non-existent master replica during elastic training. This release excludes that log link in the elastic case.
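Putting the plugin to use looks roughly like this sketch — the `nnodes`, `nproc_per_node`, and `start_method` values are illustrative:

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic

# Illustrative settings: 2 nodes, 4 workers per node, spawned subprocesses.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4, start_method="spawn"))
def train_model() -> None:
    # torchrun launches this function in nproc_per_node worker processes
    # per node; with this release the Flyte context is set up correctly
    # in each spawned worker.
    ...
```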
Pandas to CSV and vice versa
We have introduced a convenient feature to facilitate the conversion between Pandas DataFrames and CSV format. The StructuredDataset now comes equipped with a built-in Pandas to CSV encoder and a CSV to Pandas decoder, enabling seamless conversion in both directions.
Take a look at the following example tasks:
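A hedged sketch of what such tasks could look like — it assumes the `CSV` format marker lives alongside `PARQUET` in `flytekit.types.structured.structured_dataset`:

```python
from typing import Annotated

import pandas as pd
from flytekit import task
from flytekit.types.structured import StructuredDataset
# Assumption: CSV is exposed next to PARQUET in this module.
from flytekit.types.structured.structured_dataset import CSV

@task
def write_csv(df: pd.DataFrame) -> Annotated[StructuredDataset, CSV]:
    # The built-in encoder serializes the DataFrame to CSV.
    return StructuredDataset(dataframe=df)

@task
def read_csv(sd: Annotated[StructuredDataset, CSV]) -> pd.DataFrame:
    # The built-in decoder reads the CSV back into a DataFrame.
    return sd.open(pd.DataFrame).all()
```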
Documentation improvements

- Introduction of `functools.partial` in the tasks and map tasks guides: The addition of `functools.partial` allows you to set default arguments in tasks and partially bind values in map tasks.
- Streamlined documentation structure: We have reorganized and revamped the Flyte documentation to ensure a seamless experience for contributors.
- Inclusion of gang scheduling in PyTorch plugin setup: The new section on gang scheduling in the Kubeflow PyTorch operator provides valuable insights to avoid timeout errors in distributed training caused by worker start-time variations due to resource limitations.
- Updated contribution guide: The revised contribution guide makes it easier for new contributors to get started.
Other improvements

- Improved error message for `pyflyte-fast-execute`: Error handling for `pyflyte-fast-execute` now surfaces only the user-facing error, omitting the details of the underlying subprocess error. This yields clearer, more concise messages and makes troubleshooting easier.
- Cross-project secrets for GCP: We have introduced support for cross-project secrets in Google Cloud Platform (GCP), so you can now securely access secrets stored in a different GCP project.
- We have addressed an issue related to cache misses that occurred during subsequent runs of map tasks. This fix ensures improved caching behavior and enhances the performance of map tasks.
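The `functools.partial` pattern mentioned above can be sketched as follows, assuming a simple task whose `factor` argument is bound before mapping:

```python
import functools
from typing import List

from flytekit import map_task, task, workflow

@task
def scale(x: int, factor: int) -> int:
    return x * factor

@workflow
def scale_all(xs: List[int]) -> List[int]:
    # Bind `factor` up front; the map task then iterates only over `x`.
    return map_task(functools.partial(scale, factor=2))(x=xs)
```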
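Requesting a cross-project GCP secret might look like the sketch below. The group/key naming is an assumption — the exact mapping onto GCP Secret Manager depends on how your Flyte deployment is configured, and all identifiers here are placeholders:

```python
from flytekit import Secret, current_context, task

# Placeholder identifiers for a secret living in a different GCP project.
SECRET_GROUP = "projects/other-project/secrets/my-api-key"
SECRET_KEY = "latest"

@task(secret_requests=[Secret(group=SECRET_GROUP, key=SECRET_KEY)])
def use_cross_project_secret() -> None:
    # Retrieve the secret value at runtime inside the task.
    token = current_context().secrets.get(SECRET_GROUP, SECRET_KEY)
```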
We extend our heartfelt gratitude to all the contributors who have made invaluable contributions to Flyte 1.8. Thank you for your dedication and support!
We highly value feedback from our users and community, which helps us improve the product continuously. To connect with other users and get support from our team, join our Slack channel. For updates on product development, community events, and announcements, follow us on Twitter and join the conversation.