Ray Serve

Ray Serve Overview

Ray Serve is a powerful component of the Ray ecosystem designed for efficient and scalable model serving.

Introduction to Ray and its Ecosystem

Ray offers a scalable and flexible runtime for the entire ML lifecycle through the Ray AI Runtime (AIR) initiative. The project integrates seamlessly with other top-tier libraries to form a best-of-breed MLOps ecosystem. Ray Serve, one component of this ecosystem, bridges the divide between backend and ML engineering by providing a clear abstraction between development and deployment.

Ray Ecosystem

Defining Ray Serve

Ray Serve is an advanced model serving library built for crafting online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic. It also includes several features and performance optimizations for serving large language models (LLMs), such as response streaming, dynamic request batching, and multi-node/multi-GPU serving.

Key features include:

  • Scalability: Horizontally scale across hundreds of processes or machines while keeping the added overhead in the single-digit milliseconds.
  • Multi-model composition: Easily compose multiple models, mix model serving with business logic, and independently scale components, without building complex microservices.
  • Batching: Native support for batching requests to better utilize hardware and improve throughput.
  • FastAPI integration: Scale an existing FastAPI server easily or define an HTTP interface for your model using its simple, elegant API.
  • Framework-agnostic: Use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to Scikit-Learn models, to arbitrary Python business logic (see the sketch after this list).
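
To make these features concrete, here is a minimal sketch of a deployment that combines FastAPI ingress, a fixed replica count, and request batching. The SentimentService class, its /predict route, and the rule-based "model" are hypothetical stand-ins; @serve.deployment, @serve.ingress, @serve.batch, and serve.run are Ray Serve's public Python API (Ray 2.x assumed).

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(num_replicas=2)   # scalability: run two replica processes
@serve.ingress(app)                 # FastAPI integration: expose the routes below over HTTP
class SentimentService:
    def __init__(self):
        # Any framework's model could be loaded here (PyTorch, TensorFlow, Scikit-Learn, ...);
        # a trivial rule-based stand-in keeps the sketch self-contained.
        self.positive_words = {"good", "great", "excellent"}

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def score_batch(self, texts: list[str]) -> list[int]:
        # Callers pass single items; Serve groups concurrent calls into one batch.
        return [sum(w in self.positive_words for w in t.lower().split()) for t in texts]

    @app.get("/predict")
    async def predict(self, text: str) -> dict:
        return {"text": text, "positive_hits": await self.score_batch(text)}

# Deploys the application into the local Ray cluster; the HTTP proxy listens on port 8000.
serve.run(SentimentService.bind())
```

Querying http://localhost:8000/predict?text=great would then return the toy score; swapping the stand-in for a real model does not change the serving code.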

Ray Serve operates on the Ray distributed computing platform, leveraging its capabilities to efficiently scale across numerous machines within datacenters or cloud environments.
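
Scaling a deployment is a configuration detail rather than an infrastructure project: instead of a fixed replica count, a deployment can declare an autoscaling range and Serve adds or removes replicas with request load. A minimal, hypothetical sketch (the Embedder model is a stand-in):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,    # scale down to a single replica when idle
        "max_replicas": 20,   # scale out under load, across machines if the cluster has them
    }
)
class Embedder:
    async def __call__(self, request) -> list[float]:
        text = (await request.body()).decode()
        return [float(len(text))]  # stand-in for a real embedding model

embedder_app = Embedder.bind()
```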

Dive into the Ray Serve Quickstart to kickstart your journey.

Ray Serve in the ML Serving Landscape

Ray Serve Position

The ML serving realm often presents a dichotomy between development ease and readiness for production:

  • Web Frameworks: While simple systems like Flask or FastAPI provide ease of use, they may not scale efficiently.
  • Custom Tooling: Transitioning to custom solutions might address scalability but can be challenging to develop, deploy, and manage.
  • Specialized Systems: These systems are excellent for ML management and deployment but can be rigid and have steeper learning curves.
  • Ray Serve: Positioned as a specialized web framework for ML serving, Ray Serve offers ease of use, seamless deployment, and production readiness.

What Makes Ray Serve Different?

Ray Serve Differentiators

There are many tools for training and serving a single model, and they do that one job well. The problem is that machine learning in real life is usually not that simple. In a production setting, you can encounter problems like:

  1. Wrangling with infrastructure to scale beyond one copy of a model.
  2. Working through complex YAML configuration files, learning custom tooling, and developing MLOps expertise.
  3. Hitting scalability or performance issues that make it hard to meet business SLA objectives.
  4. Paying for costly tools that often leave resources underutilized.

Scaling out a single model is hard enough. Many production ML use cases go further: complex workloads require composing many different models together. Ray Serve is natively built for this kind of use case, with many models spanning multiple nodes.
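
As an illustration, the sketch below composes two hypothetical stand-in models (Detector and Classifier) behind a single Pipeline deployment using Serve's deployment handles (the Ray 2.7+ handle API is assumed); each deployment can be replicated and resourced independently, without a separate microservice per model.

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Detector:
    def detect(self, image_bytes: bytes) -> list[str]:
        return ["face"]  # stand-in for an object-detection model

@serve.deployment
class Classifier:
    def classify(self, region: str) -> str:
        return f"{region}:authentic"  # stand-in for a downstream classifier

@serve.deployment
class Pipeline:
    def __init__(self, detector: DeploymentHandle, classifier: DeploymentHandle):
        # Handles to the other deployments are injected when the graph is bound below.
        self.detector = detector
        self.classifier = classifier

    async def __call__(self, http_request) -> list[str]:
        image = await http_request.body()
        regions = await self.detector.detect.remote(image)
        # Business logic mixes freely with model calls; each call is a plain await.
        return [await self.classifier.classify.remote(r) for r in regions]

# Bind the graph: Detector and Classifier handles are passed into Pipeline.
composed_app = Pipeline.bind(Detector.bind(), Classifier.bind())
serve.run(composed_app)
```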

Ray Serve in Practice

Ray Serve stands out for its efficiency, scalability, composability, and adaptability. Built atop Ray, it supports stateful and asynchronous messaging, which eliminates the need for complex integrations across services via Redis or Kafka, for example.

Using Ray Serve, you can build your real-time pipeline via the Deployment Graph API. With this, you can harness Python's expressiveness, test on local setups, and deploy on clusters without code alterations.
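
For instance, a bound application can be smoke-tested on a laptop with nothing more than serve.run and an HTTP client; the Echo deployment below is a hypothetical stand-in for a real pipeline.

```python
import requests
from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request) -> str:
        return (await request.body()).decode()

serve.run(Echo.bind())  # starts Serve in the local Ray session; proxy listens on port 8000

print(requests.post("http://localhost:8000/", data=b"hello").text)  # -> hello
```

The same bound application object can later be deployed onto a Ray cluster, for example through a generated config file, without touching the deployment code.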

Ray Deployment API

Ray automatically schedules and allocates your tasks efficiently across the many worker nodes in the cluster. In Ray, you can define heterogeneous worker node types, allowing you to take advantage of spot instance capacity. You can also use fractional resource allocation (ray_actor_options={"num_gpus": 0.5}), which lets you overprovision tasks/actors on the same host and maximize premium resources such as GPUs and other AI accelerators, provided they are not using all of the resources when run concurrently.
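
For example, with a hypothetical GPU-backed deployment, two replicas that each request half a GPU can be packed onto a single device:

```python
from ray import serve

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 0.5},  # each replica is scheduled with half a GPU
)
class GpuModel:
    async def __call__(self, request) -> str:
        return "ok"  # stand-in for GPU-backed inference

gpu_app = GpuModel.bind()
```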

Post-development, Ray Serve lets you generate a YAML configuration for operationalizing your pipelines. YAML has become the format of choice for deployment resources and workloads, especially on Kubernetes, and lets you standardize on well-known deployment patterns such as canary, blue/green, and rollback.

Deployment Strategy

Companies like Widas have leveraged Ray Serve to simplify their technical stack, replacing a setup built on microservices, Redis, and Kafka, to implement a complex online video authentication service. Moreover, other Ray libraries, like Ray Datasets, Ray Train, and Ray Tune, further simplify the MLOps cycle across major ML frameworks by unifying and consolidating scalable data preprocessing, tuning, training, and inference. Uber, for instance, employs Ray to optimize its end-to-end ML workflows for better cost performance.

In the MLOps landscape, Ray Serve alleviates challenges related to infrastructure provisioning, tech stack complexity, and team collaboration. By unifying multiple frameworks, Ray eases the entire ML process: you no longer need a workflow orchestrator to tie all the steps together; a single Python script suffices. Ray Serve also accelerates last-mile deployment: ML engineers can design real-time pipelines and take them from development to production on a cluster without altering code.

