Comparing Web Servers and ML Serving: Ray Serve & FastAPI
In a world of rapidly maturing ML tools, why choose between a generic web server and a specialized ML server? Can't you have the best of both?
Generic Web Servers: A Look at FastAPI
FastAPI, a shining star among Python web frameworks, emerged from the creator's dissatisfaction with existing tools. Not just another web framework, FastAPI revolutionizes the developer experience:
- High-Performance: Rivals even NodeJS and Go in terms of speed.
- Developer Efficiency: Speed up feature development by 200% to 300%.
- Reliability: Fewer bugs thanks to reduced developer errors.
- User-Friendly: Enhanced editor support, intuitive auto-completions, and reduced debugging time.
- Learnability: Designed for ease of use; spend less time buried in documentation.
- Code Efficiency: Reduce code redundancy and potential bug sites.
- Production Ready: Incorporates automatic interactive documentation.
- Standards Compliant: Embraces open standards for APIs such as OpenAPI and JSON Schema.
FastAPI is the modern developer's toolkit for building microservices. However, it's essential to remember that it wasn't built for ML model serving.
Specialized ML Serving
In the age of colossal ML models and AI accelerators like GPUs, TPUs, and AWS Inferentia, specialized ML serving frameworks like Seldon Core, KServe, and TorchServe have sprung up. These frameworks:
- Prioritize high throughput without sacrificing latency.
- Employ model optimization techniques like pruning and quantization to help reduce the model size without sacrificing too much accuracy (a minimal quantization sketch follows this list).
  - This reduces inference time and the overall memory footprint.
- Utilize methods like microbatching to maximize AI accelerator utilization and increase throughput without sacrificing latency.
- Implement bin packing of models to optimally allocate resources.
  - This allows you to co-locate multiple models on a host and share its resources when those models are not invoked at the same time, reducing idle time on the host.
- Incorporate "scale to zero" to optimize resource utilization.
  - This allows you to release your resources when there is no traffic.
  - The tradeoff is typically a cold start penalty the first time the model is invoked.
- Provide autoscaling based on diverse metrics.
  - You may want to set your policies based on a CPU/GPU utilization threshold, for example.
- Include handlers for complex data types, from text and audio to video.
In essence, they're tailor-made for ML professionals to optimize model serving.
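To make the optimization point above concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. The model is a throwaway placeholder; specialized serving frameworks typically wrap or automate steps like this.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for a real trained network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly, shrinking the model and speeding up CPU inference
# at a small accuracy cost.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)  # Linear layers are now dynamically quantized.
```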
Ray Serve & FastAPI: Best of Both Worlds
Ray Serve merges the capabilities of a robust web server with a specialized ML server. With Ray Serve, you can seamlessly integrate FastAPI using the `@serve.ingress` decorator.
Here's a sneak peek:
```python
import requests
from fastapi import FastAPI
from ray import serve

# 1: Define a FastAPI app and wrap it in a deployment with a route handler.
app = FastAPI()


@serve.deployment(route_prefix="/")
@serve.ingress(app)
class FastAPIDeployment:
    # FastAPI will automatically parse the HTTP request for us.
    @app.get("/hello")
    def say_hello(self, name: str):
        return f"Hello {name}!"


# 2: Deploy the deployment.
serve.start()
FastAPIDeployment.deploy()

# 3: Query the deployment and print the result.
print(requests.get("http://localhost:8000/hello", params={"name": "Shav"}).json())
# "Hello Shav!"
```
This synergy lets you harness FastAPI's rich feature set alongside Ray Serve's specialized ML serving capabilities. As previously explored here, Ray Serve facilitates the creation of intricate multi-model inference pipelines. Each pipeline stage can be independently scaled across diverse hardware platforms.
Remember the product tagging/content understanding pipeline?
If we were to implement the above use case using the Ray Serve deployment graph it would look like this:
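As a rough sketch (the stage names and logic below are hypothetical stand-ins rather than the actual pipeline), a two-stage graph built with the alpha Deployment Graph API might look something like this:

```python
from ray import serve
from ray.serve.deployment_graph import InputNode
from ray.serve.drivers import DAGDriver
from ray.serve.http_adapters import json_request


# Hypothetical first stage: featurize the incoming product image/text.
@serve.deployment(ray_actor_options={"num_cpus": 0.5})
class Featurizer:
    def featurize(self, payload: dict) -> dict:
        return {"features": payload}  # placeholder logic


# Hypothetical second stage: produce tags from the features.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.5})
class Tagger:
    def tag(self, features: dict) -> dict:
        return {"tags": ["example-tag"], **features}  # placeholder logic


# Wire the stages into a DAG; each bound node can scale independently.
with InputNode() as request_payload:
    featurizer = Featurizer.bind()
    tagger = Tagger.bind()
    features = featurizer.featurize.bind(request_payload)
    dag = tagger.tag.bind(features)

# Expose the graph over HTTP and deploy it.
serve.run(DAGDriver.bind(dag, http_adapter=json_request))
```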
Note that each of those tasks/actors can have fine-grained resource allocation, allowing for more efficient utilization of each host without sacrificing latency (e.g., `ray_actor_options={"num_cpus": 0.5}`).
Ray Serve allows you to configure the number of replicas at each step of your pipeline, so you can autoscale your ML serving application granularly and in milliseconds.
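For example, a fixed replica count or autoscaling bounds can be set per deployment; the classes and numbers below are illustrative:

```python
from ray import serve


# Pin a stage to a fixed number of replicas.
@serve.deployment(num_replicas=3, ray_actor_options={"num_cpus": 0.5})
class Preprocessor:
    def __call__(self, request):
        return request  # placeholder logic


# Or let Ray Serve scale a stage between bounds based on load.
@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 8})
class Classifier:
    def __call__(self, request):
        return {"label": "example"}  # placeholder logic
```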
Microbatching of requests can also be implemented using the `@serve.batch` decorator. This not only gives the developer a reusable abstraction, but also more flexibility and customization opportunities in their batching logic.
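A minimal sketch of how that can look (the processing logic is a stand-in):

```python
from typing import List

from ray import serve


@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.01)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        # Ray Serve gathers up to max_batch_size pending requests (or whatever
        # arrives within the timeout) and calls this method once with the list.
        return [f"processed: {text}" for text in inputs]

    async def __call__(self, http_request) -> str:
        payload = await http_request.json()
        # Each request enqueues one item; the framework fans results back out.
        return await self.handle_batch(payload["text"])
```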
On the roadmap, Ray Serve will further optimize compute for serving your models with zero-copy load and model caching, which allows loading large models in milliseconds, or 340x faster using Ray. Model caching will allow you to keep a pool of models in Ray's internal memory and hot-swap the model behind a given endpoint. This lets you have many more models available than your host can hold at once, and lets you optimize resources on your endpoint host depending on traffic, demand, or business rules.
For developers, the combination of a proven web server framework like FastAPI and the advanced ML serving capabilities of Ray Serve is a game-changer.
The Power of Ray Serve Deployment Graphs
Ray Serve is engineered to cater to intricate pipeline, ensemble, and business logic patterns. Its Deployment Graph API, currently in alpha, allows developers to construct scalable and flexible inference serving pipelines as DAGs that take advantage of Ray's compute for scaling.
Ray Serve stands apart with its:
- Flexibility in scheduling.
- Efficient communication.
- Resource optimization through fractional allocation and shared memory.
- Unified DAG API that streamlines the transition from local development to production.
In contrast to other frameworks that rely heavily on hand-written YAML, Ray Serve maintains a pythonic ethos with user-friendly API abstractions.
In Conclusion
Why settle when you can have the best of both worlds? Ray Serve and FastAPI together offer a compelling proposition for those keen on optimizing both their web services and ML serving capabilities.