
Scalable ML Workflows using PyTorch on Kubeflow Pipelines and Vertex Pipelines

Source: https://cloud.google.com/blog/topics/developers-practitioners/scalable-ml-workflows-using-pytorch-kubeflow-pipelines-and-vertex-pipelines

Introduction

ML Ops is an ML engineering culture and practice that aims to unify ML system development and ML system operation. An important ML Ops design pattern is the ability to formalize ML workflows, so that they can be reproduced, tracked, analyzed, and shared.

Pipelines frameworks support this pattern, and are the backbone of an ML Ops story. These frameworks help you to automate, monitor, and govern your ML systems by orchestrating your ML workflows. 

In this post, we’ll show examples of PyTorch-based ML workflows on two pipelines frameworks: OSS Kubeflow Pipelines, part of the Kubeflow project; and Vertex Pipelines. We are also excited to share some new PyTorch components that have been added to the Kubeflow Pipelines repo. 

In addition, we’ll show how the Vertex Pipelines examples, which require v2 of the KFP SDK, can now also be run on an OSS Kubeflow Pipelines installation using the KFP v2 ‘compatibility mode’.

PyTorch on Google Cloud Platform

PyTorch continues to evolve rapidly, with more complex ML workflows being deployed at scale. Companies are using PyTorch in innovative ways for AI-powered solutions ranging from autonomous driving to drug discovery, surgical intelligence, and even agriculture. Managing ML Ops and the end-to-end lifecycle for these real-world solutions, running at large scale, continues to be a challenge.

The recently-launched Vertex AI is a unified ML Ops platform to help data scientists and ML engineers increase their rate of experimentation, deploy models faster, and manage models more effectively. It brings AutoML and AI Platform together, with some new ML Ops-focused products, into a unified API, client library, and user interface.

Google Cloud Platform and Vertex AI are a great fit for PyTorch, with PyTorch support for Vertex AI training and serving, and PyTorch-based Deep Learning VM images and containers, including PyTorch XLA support.

All the examples use the open-source Python KFP (Kubeflow Pipelines) SDK, which makes it straightforward to define and use PyTorch components.
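For instance, here is a minimal sketch of a PyTorch training step defined as a lightweight Python component, together with a one-step pipeline, using the KFP v2 SDK. This is not taken from the post's example repos; the toy model, data, and parameter names are purely illustrative.

```python
# A minimal sketch (not from the post's example repos) of a PyTorch
# training step as a lightweight KFP v2 component. Model and data are toys.
from kfp.v2 import dsl

@dsl.component(base_image="pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime")
def train(epochs: int, lr: float) -> float:
    import torch

    # Toy regression problem standing in for a real dataset.
    x = torch.randn(256, 4)
    y = x.sum(dim=1, keepdim=True)

    model = torch.nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Return the final loss so downstream steps can consume it.
    return loss_fn(model(x), y).item()

@dsl.pipeline(name="pytorch-demo-pipeline")
def pipeline(epochs: int = 5, lr: float = 0.01):
    train(epochs=epochs, lr=lr)
```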

Both pipelines frameworks provide sets of prebuilt components for ML-related tasks; support easy component (pipeline step) authoring and provide pipeline control flow like loops and conditionals; automatically log metadata during pipeline execution; support step execution caching; and more.
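As a hedged illustration of that control flow, the sketch below gates a hypothetical deploy step on a metric produced by an upstream step; both component bodies are placeholders rather than anything from the post's repos.

```python
# A sketch of KFP control flow: run a (hypothetical) deploy step only when
# the metric returned by an upstream step clears a threshold.
from kfp.v2 import dsl

@dsl.component
def evaluate() -> float:
    # Placeholder: a real component would load a model and compute a metric.
    return 0.92

@dsl.component
def deploy(accuracy: float):
    # Placeholder: a real component would push the model to a serving stack.
    print(f"Deploying model with accuracy {accuracy:.2f}")

@dsl.pipeline(name="conditional-deploy")
def pipeline():
    eval_task = evaluate()
    # dsl.Condition gates downstream steps on a runtime value.
    with dsl.Condition(eval_task.output >= 0.9):
        deploy(accuracy=eval_task.output)
```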

Both of these frameworks make it straightforward to build and use PyTorch-based pipeline components, and to create and run PyTorch-based workflows. 

Kubeflow Pipelines

The Kubeflow open-source project includes Kubeflow Pipelines (KFP), a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. The open-source Kubeflow Pipelines backend runs on a Kubernetes cluster, such as GKE, Google’s hosted Kubernetes. You can install the KFP backend ‘standalone’, via the CLI or the GCP Marketplace, if you don’t need the other parts of Kubeflow.

The OSS KFP examples highlighted in this post show several different workflows and include some newly contributed components now in the Kubeflow Pipelines GitHub repo. These examples show how to leverage the underlying Kubernetes cluster for distributed training; use a TensorBoard server for monitoring and profiling; and more. 
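As a hedged illustration of the TensorBoard piece, a training step might write scalar logs with PyTorch's SummaryWriter so that a TensorBoard server can visualize them. The log directory below is a placeholder; in a pipeline it would typically point at shared storage that the TensorBoard server can also read.

```python
# A sketch of emitting TensorBoard logs from a training step, using
# PyTorch's standard SummaryWriter. The toy model and log path are placeholders.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tensorboard-logs")  # placeholder path

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x = torch.randn(64, 4)
y = x.sum(dim=1, keepdim=True)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    # One scalar point per training step, visible in the TensorBoard UI.
    writer.add_scalar("train/loss", loss.item(), step)

writer.close()
```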

Vertex Pipelines

Vertex Pipelines is part of Vertex AI, and uses a different backend from open-source KFP. It is automated, scalable, serverless, and cost-effective: you pay only for what you use. Vertex Pipelines is the backbone of the Vertex AI ML Ops story, and makes it easy to build and run ML workflows using any ML framework. Because it is serverless, and has seamless integration with GCP and Vertex AI tools and services, you can focus on building and running your pipelines without dealing with infrastructure or cluster maintenance.

Vertex Pipelines automatically logs metadata to track artifacts, lineage, metrics, and execution across your ML workflows, and provides support for enterprise security controls like Cloud IAM, VPC-SC, and CMEK.
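As a minimal sketch of that workflow, the snippet below compiles a pipeline function (such as the one sketched earlier) and submits it to Vertex Pipelines with the google-cloud-aiplatform client library. The project, region, and bucket names are placeholders for your own.

```python
# A sketch of compiling a KFP v2 pipeline and submitting it to Vertex
# Pipelines. Assumes `pipeline` is a pipeline function like the one above;
# project, region, and bucket names are placeholders.
from kfp.v2 import compiler
from google.cloud import aiplatform

# Compile to the v2 pipeline spec that Vertex Pipelines consumes.
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="pytorch_pipeline.json",
)

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="pytorch-demo",
    template_path="pytorch_pipeline.json",
    pipeline_root="gs://my-bucket/pipeline-root",
)
job.run()  # blocks until completion; use submit() to fire and forget
```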

The example Vertex pipelines highlighted in this post share some underlying PyTorch modules with the OSS KFP examples, and include use of the prebuilt Google Cloud Pipeline Components, which make it easy to access Vertex AI services. Vertex Pipelines requires v2 of the KFP SDK. It is now possible to use the KFP v2 ‘compatibility mode’ to run KFP v2 examples on an OSS KFP installation, and we’ll show how to do that as well.
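For example, a v2-style pipeline can be submitted to an OSS KFP installation roughly as follows. This is a sketch; the host URL is a placeholder for your own KFP endpoint, and `pipeline` is a pipeline function like the one sketched earlier.

```python
# A sketch of running a v2-style pipeline on an OSS Kubeflow Pipelines
# installation via the KFP SDK's v2 compatibility mode.
import kfp
from kfp import dsl

client = kfp.Client(host="https://my-kfp-endpoint.example.com")  # placeholder
client.create_run_from_pipeline_func(
    pipeline,  # the pipeline function sketched earlier
    arguments={"epochs": 5, "lr": 0.01},
    mode=dsl.PipelineExecutionMode.V2_COMPATIBLE,
)
```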

PyTorch on Kubeflow Pipelines: PyTorch KFP Components SDK

In a collaboration between Google and Facebook, we are announcing a number of technical contributions to enable large-scale ML workflows on Kubeflow Pipelines with PyTorch, including the new PyTorch Kubeflow Pipelines components SDK; a sketch of loading such a component appears after the list below.

Computer vision and NLP workflows are available for:

  • Open-source Kubeflow Pipelines, deployed on any cloud or on-prem
  • Google Cloud Vertex AI Pipelines, a serverless pipelines solution
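As a hedged sketch of how such contributed components are typically consumed, the snippet below loads a component specification from the Kubeflow Pipelines repo with the KFP SDK's generic component loader. The component URL is a deliberate placeholder (substitute the raw component.yaml path for the component you want from https://github.com/kubeflow/pipelines), and the component's actual inputs depend on its component.yaml.

```python
# A sketch of reusing a contributed component from the Kubeflow Pipelines
# repo. The URL below is a placeholder, not a real component path.
import kfp
from kfp import components

train_op = components.load_component_from_url(
    "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/.../component.yaml"  # placeholder
)

@kfp.dsl.pipeline(name="reuse-contributed-component")
def pipeline():
    train_op()  # arguments depend on the component's declared inputs
```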
