Serving ML Models in Production

Deploying ML Models in Production Environments

An example step-by-step workflow for containerizing an ML application:

Containerization Workflow for ML Applications

When you're dealing with a straightforward application - one with a single model behind an endpoint - containerization is a fairly simple workflow. But machine learning in a production setting often demands specialized tooling for data storage, logging, and monitoring, because it relies on metrics and datasets unique to ML.

Common Strategies for Serving ML Models

Here, we discuss four prevalent approaches to serving ML models in production: pipelines, ensembles, business logic integration, and online learning.

Common Production Patterns for ML

The Pipeline Pattern

Computer Vision Pipeline Example

The pipeline pattern often involves multiple steps and models to perform a specific task. For example, a typical computer vision pipeline for object captioning might look like this:

  1. Preprocessing of the raw image, such as decoding, augmentation, and clipping.
  2. Detection and classification of the object within a bounding box.
  3. Keypoint detection to identify the object's posture.
  4. Natural Language Processing to generate a descriptive caption.
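
The captioning steps above can be sketched as a chain of stage functions, each consuming the previous stage's output. This is only an illustrative sketch: the stage bodies are toy placeholders, not a real vision stack.

```python
def preprocess(raw_image):
    # Decode, augment, and clip the raw bytes into a normalized structure.
    return {"image": raw_image, "normalized": True}

def detect(inputs):
    # Detect and classify the object, attaching a bounding box and label.
    return {**inputs, "bbox": (0, 0, 64, 64), "label": "cat"}

def keypoints(inputs):
    # Identify keypoints describing the object's posture.
    return {**inputs, "pose": "sitting"}

def caption(inputs):
    # Generate a descriptive caption from the detected attributes.
    return f"A {inputs['pose']} {inputs['label']}"

def run_pipeline(raw_image, stages=(preprocess, detect, keypoints, caption)):
    # Feed each stage's output into the next stage.
    result = raw_image
    for stage in stages:
        result = stage(result)
    return result

print(run_pipeline(b"raw-bytes"))  # "A sitting cat"
```

The essential property of the pattern is that each stage only needs to agree with its neighbors on an interface, so individual stages (and models) can be swapped out independently.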

Examples of Pipelines

  • Scikit-Learn Pipeline

    Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

    Scikit-learn's pipeline helps to chain various models and processing steps.

  • Recommendation Systems Pipeline

    [EmbeddingLookup(), FeatureInteraction(), NearestNeighbors(), Ranking()]

    Recommendation systems, like those at Amazon and YouTube, often employ a series of stages such as embedding lookup and ranking.

  • Preprocessing Pipelines

    [HeavyWeightMLMegaModel(), DecisionTree()/BoostingModel()]

    Some use cases involve heavyweight models for common preprocessing tasks. For example, at Facebook, ML research groups at FAIR create state-of-the-art heavyweight models for vision and text. Different product groups then build smaller downstream models, such as random forests or boosted trees, on top of those outputs to tackle their business use cases (e.g. suicide prevention).
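
That two-stage shape, a heavyweight feature extractor feeding a lightweight downstream model, can be sketched as follows. Both functions are toy stand-ins, not anyone's actual models.

```python
def heavy_feature_model(text):
    # Stand-in for a large pretrained model that produces an embedding.
    return [float(len(text)), float(text.count(" ") + 1)]

def light_downstream_model(embedding):
    # Stand-in for a small product-specific model (e.g. a tree or boosting model).
    return "long" if embedding[0] > 10 else "short"

def classify(text):
    # The heavyweight model runs once; the light model makes the final call.
    return light_downstream_model(heavy_feature_model(text))

print(classify("hi"))                          # "short"
print(classify("a much longer sentence here")) # "long"
```

In practice the heavyweight stage is often shared across many product teams, while each team owns only its small downstream model.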

Pipeline Pattern Implementation Options


You can either wrap your models in a web server or employ specialized microservices. Each has its pros and cons in terms of scalability and performance.

  • Wrap Models in a Web Server: Whenever a request comes in, models are loaded (they can also be cached) and run through the pipeline. While this is simple and easy to implement, a major flaw is that it is hard to scale and not performant, because each request is handled sequentially.
  • Many Specialized Microservices: You essentially build and deploy one microservice per model. These microservices can be native ML platforms, Kubeflow, or even hosted services like AWS SageMaker. However, as the number of models grows, the complexity and operational cost increase drastically.

Another option is implementing Pipelines in Ray Serve. For example, take a look at this pseudocode showing how Ray Serve allows deployments to call other deployments:

@serve.deployment
class Featurizer: ...

@serve.deployment
class Predictor: ...

@serve.deployment
class Orchestrator:
    def __init__(self):
        self.featurizer = Featurizer.get_handle()
        self.predictor = Predictor.get_handle()

    async def __call__(self, inp):
        feat = await self.featurizer.remote(inp)
        predicted = await self.predictor.remote(feat)
        return predicted

In Ray Serve, you can directly call other deployments within your deployment. In this code above, there are three deployments. Featurizer and Predictor are just regular deployments containing the models. The Orchestrator receives the web input, passes it to the featurizer process via the featurizer handle, and then passes the computed feature to the predictor process. The interface is just Python and you don’t need to learn any new framework or domain-specific language.

The Ensemble Pattern


One limitation of pipelines is that there can often be many upstream models for a given downstream model. This is where ensembles are useful.

Ensemble Use Case

Ensemble patterns involve mixing output from one or more models. They are also called model stacking in some cases.

Ensemble Techniques

  • Model Update

    In scenarios where a model is trained on live online traffic, there are always new versions of the model in production. To validate the newer model, you may choose to continue to select the output from the known good model for comparison.

  • Aggregation

    Multiple model outputs are combined to form a single, more accurate output: for regression models, the outputs of the individual models are averaged; for classification models, a majority vote is taken over the models' outputs.

  • Dynamic Model Selection

    The choice of model can be determined dynamically based on the input. For example, if the input contains a cat, model A will be used because it is specialized for cats. If the input contains a dog, model B will be used because it is specialized for dogs.
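
The aggregation and dynamic-selection techniques above can be sketched in a few lines of plain Python. The model functions and the `detected_species` routing signal are hypothetical stand-ins.

```python
from collections import Counter

def aggregate_regression(outputs):
    # Average the predictions of several regression models.
    return sum(outputs) / len(outputs)

def aggregate_classification(outputs):
    # Take a majority vote over the classifiers' outputs.
    return Counter(outputs).most_common(1)[0][0]

def cat_model(image):
    # Stand-in for a model specialized for cats.
    return "cat prediction"

def dog_model(image):
    # Stand-in for a model specialized for dogs.
    return "dog prediction"

SPECIALIZED_MODELS = {"cat": cat_model, "dog": dog_model}

def dynamic_select(image, detected_species):
    # Route the input to the model specialized for what it contains.
    return SPECIALIZED_MODELS[detected_species](image)

print(aggregate_regression([2.0, 4.0, 6.0]))            # 4.0
print(aggregate_classification(["cat", "dog", "cat"]))  # cat
print(dynamic_select("img", "dog"))                     # dog prediction
```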

Ensemble Implementation Options

Ensemble implementations suffer the same sort of issues as pipelines. It is simple to wrap models in a web server, but it is not performant. When you use specialized microservices, you end up having a lot of operational overhead as the number of microservices scale with the number of models.

With Ray Serve, this kind of pattern is simple to express. You can look at the 2020 Anyscale demo to see how to use Ray Serve's handle mechanism to perform dynamic model selection.

The Business Logic Pattern

In any production-grade machine learning system, business logic is inevitable: it is everything a common ML task involves other than the model inference itself. This includes:

  • Database Operations: This involves relational record lookups.
  • API Calls: Often, we need to fetch data from external services.
  • Feature Store Access: For retrieving pre-computed feature vectors.
  • Feature Engineering: Data validation, encoding, and decoding are quintessential.

Business Logic Implementation Options

def prediction_handler(raw_input):
   model = load_model()  # e.g. fetched from S3
   inputs = [
      validate_input(raw_input),   # illustrative: check the input against a database
      fetch_features(raw_input),   # illustrative: look up pre-computed features
   ]
   output = model.predict(inputs)
   return output

The pseudocode for the web handler above does the following things:

  1. Retrieves the model (from, say, S3).
  2. Validates the input via a database.
  3. Accesses pre-computed features.

However, such a system faces challenges: balancing network-intensive tasks (like model loading and database access) and compute-intensive tasks (like model inference). This duality often results in resource inefficiencies and scaling complexities.

Consider the two common strategies:

Comparing Strategies

  1. Web Handler Strategy (Left): All-in-one approach, prone to resource inefficiencies.
  2. Microservices Strategy (Right): Although it solves some problems, it can complicate the interface between business logic and the model, especially with strict input types like tensors.

Both strategies have their pitfalls.

Advantages of Ray Serve for Business Logic

Ray Serve Logic

Ray Serve offers a middle ground. It suggests offloading computations and using ServeHandle to wrap models, enabling developers to pass regular Python types without the fuss of "tensor-in, tensor-out" logic. Plus, the close proximity of the model deployment class with the prediction handler fosters code comprehensibility.

Ray Serve promotes a division between I/O-bound and compute-bound tasks, thereby allowing for efficient scaling while retaining the ease of deployment.
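
The division between I/O-bound and compute-bound work can be illustrated with stdlib concurrency alone: the network-bound steps overlap, and only then does the compute-bound inference run. The helper functions below are stand-ins for the business-logic calls, not Ray Serve APIs.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_features(user_id):
    # Stand-in for a feature-store lookup (I/O-bound).
    time.sleep(0.05)
    return [1.0, 2.0]

def validate_input(user_id):
    # Stand-in for a database validation call (I/O-bound).
    time.sleep(0.05)
    return True

def predict(features):
    # Stand-in for compute-bound model inference.
    return sum(features)

def handle(user_id):
    # Run the two I/O-bound calls concurrently, then do the compute step.
    with ThreadPoolExecutor() as pool:
        feat_future = pool.submit(fetch_features, user_id)
        ok_future = pool.submit(validate_input, user_id)
        if not ok_future.result():
            raise ValueError("invalid input")
        features = feat_future.result()
    return predict(features)

print(handle("user-42"))  # 3.0
```

In Ray Serve the same separation falls out naturally, because the I/O-heavy handler and the compute-heavy model deployment can be scaled independently.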

Leveraging FastAPI with Ray Serve


Incorporating authentication and input validation is paramount. Ray Serve natively integrates with FastAPI, a web framework known for type safety and ergonomic design. FastAPI offers automatic dependency injection, type checking and validation, and OpenAPI doc generation.

With Ray Serve, you can directly pass the FastAPI app object into it with @serve.ingress. This decorator makes sure that all existing FastAPI routes still work and that you can attach new routes with the deployment class so states like loaded models, and networked database connections can easily be managed.
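
As illustrative pseudocode (the decorator names follow Ray Serve's documented `@serve.deployment`/`@serve.ingress` pattern; exact details depend on your Ray Serve version, and the model here is a placeholder):

    from fastapi import FastAPI
    from ray import serve

    app = FastAPI()

    @serve.deployment
    @serve.ingress(app)
    class ModelServer:
        def __init__(self):
            # State such as a loaded model or a database connection lives on the class.
            self.model = lambda q: q.upper()  # placeholder model

        @app.get("/predict")
        def predict(self, q: str):
            # FastAPI validates `q` as a string before this method runs.
            return {"result": self.model(q)}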

The Online Learning Pattern

Online learning is a growing trend in machine learning where models are continually updated, trained, validated, and deployed in real time. This approach suits various applications, as illustrated below.

Adapting Model Weights in Real-Time

Some applications benefit from real-time adjustment of model weights. This dynamic learning ensures that as user interactions evolve, the model can offer personalized experiences tailored to individual users or user groups.
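
A minimal illustration of real-time weight adjustment is a single online-SGD step on a linear model. This is a pure-Python sketch of the idea, not a production online-learning system.

```python
def sgd_step(weights, x, y_true, lr=0.1):
    # One online update: predict, measure the error, nudge the weights.
    y_pred = sum(w * xi for w, xi in zip(weights, x))
    error = y_pred - y_true
    return [w - lr * error * xi for w, xi in zip(weights, x)]

# Each incoming observation immediately updates the served weights.
w = [0.0, 0.0]
for x, y in [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)] * 50:
    w = sgd_step(w, x, y)

print([round(v, 2) for v in w])  # approaches [2.0, 3.0]
```

The operational challenge in production is everything around this loop: streaming in fresh labels, validating the updated model, and swapping it into the serving path without downtime.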

Real-time Weight Adjustments

A glance at Online Learning as employed by Ant Group (Image credit: Ant Group)

For instance, Ant Group has adopted online learning for a real-time resource allocation solution. The model, initially trained on offline data, is enriched with live streaming data and then serves live traffic. Dynamic learning setups are significantly more intricate than conventional static serving systems; merely hosting these models on web servers or dividing them among microservices doesn't suffice for an efficient implementation.

Parameter Learning for Model Orchestration

In addition to weight adjustments, dynamic learning is also pivotal for determining how to effectively orchestrate or combine models. An example of this is determining a user's preferred model, which often surfaces in model selection contexts or with contextual bandit algorithms.
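
Learning a preferred model online can be sketched as an epsilon-greedy bandit over the candidate models. This is an illustrative sketch only; real systems such as contextual bandits also condition the choice on user features.

```python
import random

class EpsilonGreedySelector:
    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {m: 0 for m in model_names}
        self.rewards = {m: 0.0 for m in model_names}

    def select(self):
        # Occasionally explore a random model; otherwise exploit the best so far.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(self.counts,
                   key=lambda m: self.rewards[m] / max(self.counts[m], 1))

    def update(self, model, reward):
        # Fold user feedback (e.g. a click) back into the running totals.
        self.counts[model] += 1
        self.rewards[model] += reward
```

Each served request picks a model with `select()`, and the observed feedback flows back through `update()`, so the routing itself improves in real time.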

Reinforcement Learning: Agents in Action

Reinforcement learning stands as a distinctive subset of machine learning where agents learn by interacting with their environment. This "environment" could be a tangible, physical space or a digital simulation. Dive deeper into reinforcement learning with this introduction and discover how to deploy a reinforcement learning model via Ray Serve here.

Scaling with Ray Serve

To illustrate the issues briefly touched on here, consider a real-time processing chain for understanding and tagging a user-uploaded image:

Ensemble of Models for Image Processing

Such a procedure entails pre-processing, followed by multiple model evaluations (on CPU and/or GPU) and subsequent post-processing. This kind of workflow is typical when you wish to classify products in an e-commerce setup, for instance.

This raises an essential inquiry:

How do we transform a multi-model ensemble into microservices?

A preliminary, simplistic approach could be to encapsulate the entire pipeline within one application:

Encapsulating Entire Pipeline

However, this method, while feasible, is riddled with limitations:

  • The application becomes monolithic, necessitating updates for the entire pipeline whenever needed.
  • The sizing of the application must cater to the collective computation needs, leading to sub-optimal memory usage, as not all resources are active concurrently.
  • Coarse-grained autoscaling is mandated, which is inefficient.

Such a strategy results in heightened user latency and surging costs due to imprecise resource allocation.

An alternative tactic would be considering each task as an individual service:

Each Task as a Distinct Service

This resolves the inefficiencies of the monolithic structure, yet brings in its own set of challenges:

  • Defining clear distinctions between development and deployment becomes tricky.
  • Containerization gets cumbersome with a plethora of tasks, requiring intricate configurations.
  • Business logic gets entangled within both code and configuration files.
  • To make this method truly efficient, you'd need additional systems (like microservices, Redis, Kafka, and more). Managing asynchronous message passing and task combinations further complicates the setup.

Both these approaches are sub-optimal. Next we will look into **Ray Serve**, which provides an innate capability to scale and tackle intricate architectures. As "ML in production" often translates to deploying multiple models, Ray Serve is architected with this in mind, building on the distributed foundation of Ray.

The domain of ML serving often prompts a choice between the ease of development and being primed for production. Ray Serve bridges this gap, being tailored for simplicity in development without compromising on its production readiness. It presents a scalable and versatile serving framework to efficiently scale both your microservices and ML models when they're live.

Harnessing Ray Serve, you might find the need to stack your models, employ ensembles, embed business logic, engage with a feature store for real-time data, or even log data - all in a bid to serve a resilient and robust ML solution to your users. This is what Ray Serve was built to address.