
What is Ray?

ℹ️

Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads.

Ray provides the compute layer to scale applications without becoming a distributed systems expert; it is a simplified and flexible framework for distributed computation. These are some key processes that Ray automatically handles:

  • Orchestration. Managing the various components of a distributed system.
  • Scheduling. Coordinating when and where tasks are executed.
  • Fault tolerance. Ensuring tasks complete regardless of inevitable points of failure.
  • Auto-scaling. Adjusting the resources allocated to the cluster to match dynamic demand.

To lower the effort needed to scale compute-intensive workloads, Ray takes a Python-first approach and integrates with many common data science tools. This allows ML practitioners to parallelize Python applications from a laptop to a cluster with minimal code changes.
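For a quick taste of what this looks like in practice, here is a minimal sketch (assuming a local Ray installation; the preprocess function and its workload are purely illustrative). Scheduling requirements and retry behavior are declared directly on an ordinary Python function, and Ray takes care of placing and re-running the work:

import ray

ray.init()  # start the Ray runtime locally; on a cluster this connects to it instead

# Ray schedules this task onto any worker with a free CPU and transparently
# re-runs it (up to max_retries times) if the worker process dies.
@ray.remote(num_cpus=1, max_retries=3)
def preprocess(batch):  # illustrative workload
    return [x * 2 for x in batch]

futures = [preprocess.remote(list(range(i, i + 10))) for i in range(0, 100, 10)]
results = ray.get(futures)  # blocks until every task has finished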

Key Ray characteristics

Ray has several key characteristics that make it a powerful tool for scaling Python applications:

  1. Python first approach
  2. Simple and flexible API
  3. Scalability
  4. Support for heterogeneous hardware

Python first approach

Ray allows you to flexibly compose distributed applications with easy-to-use primitives in native Python code. This way, you can scale your existing workloads with minimal code changes. Getting started with Ray Core involves just a few key abstractions:

  1. Tasks. Remote, stateless Python functions
  2. Actors. Remote, stateful Python classes
  3. Objects. Tasks and actors create and compute on objects that can be stored and accessed anywhere in the cluster; cached in Ray's distributed shared-memory object store

You can learn more about these abstractions in the Ray Core tutorials.
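To make the three abstractions concrete, here is a minimal sketch (the double function and the Counter class are illustrative examples, not part of the Ray API):

import ray

ray.init()

# Task: a remote, stateless function
@ray.remote
def double(x):
    return 2 * x

# Actor: a remote, stateful class
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

# Objects: values created by tasks/actors, or placed explicitly with ray.put,
# live in the distributed object store and are referenced by object references.
data_ref = ray.put([1, 2, 3])
result_ref = double.remote(10)

counter = Counter.remote()
counter.increment.remote()

print(ray.get(result_ref))                   # 20
print(ray.get(data_ref))                     # [1, 2, 3]
print(ray.get(counter.increment.remote()))   # 2 (the actor kept its state)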

Simple and flexible API

Ray Core

ℹ️

Ray Core is an open-source, general-purpose, distributed computing library for Python that enables ML engineers and Python developers to scale Python applications and accelerate machine learning workloads.

Acting as the foundational library for the whole ecosystem, Ray Core provides a minimalist API that enables distributed computing. With just a few methods, you can start building distributed apps:

  • ray.init()
    Start Ray runtime and connect to the Ray cluster.
  • @ray.remote
    Decorator that specifies a Python function or class to be executed as a task (remote function) or actor (remote class) in a different process.
  • .remote
    Postfix used to invoke remote functions and classes; remote operations are asynchronous and immediately return an object reference.
  • ray.put()
    Put an object in the in-memory object store; returns an object reference used to pass the object to any remote function or method call.
  • ray.get()
    Get a remote object(s) from the object store by specifying the object reference(s).

Ray is designed to be both simple and flexible. It allows for easy distribution of functions and classes via straightforward annotations (e.g., @ray.remote). Unlike traditional batch models, Ray employs the actor model to facilitate dynamic creation of tasks and actors, enabling a more versatile computational topology, as the sketch below illustrates.
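The sketch below shows that dynamism: tasks can launch other tasks at runtime, so the computation graph is built on the fly rather than declared up front (square and sum_of_squares are illustrative names):

import ray

ray.init()

@ray.remote
def square(x):
    return x * x

# A task can fan out more tasks from inside its own body.
@ray.remote
def sum_of_squares(n):
    refs = [square.remote(i) for i in range(n)]
    return sum(ray.get(refs))

print(ray.get(sum_of_squares.remote(10)))  # 285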

Ray AI Runtime (AIR)

ℹ️

Ray AI Runtime (AIR) is an open-source, Python, domain-specific set of libraries that equips ML engineers, data scientists, and researchers with a scalable and unified toolkit for ML applications.

Ray AI Runtime (AIR), sometimes referred to as the native libraries and ecosystem of integrations, provides higher-level APIs that cater to more domain-specific use cases. Ray AIR enables data scientists and ML engineers to scale individual workloads, end-to-end workflows, and popular ecosystem frameworks, all in Python.
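For a flavor of these higher-level APIs, here is a minimal sketch using Ray Data, one of the native libraries (the dataset contents and the transformation are illustrative):

import ray

ray.init()

# Build a distributed dataset from in-memory items and apply a row-wise
# transformation that Ray executes in parallel across the cluster.
ds = ray.data.from_items([{"feature": i} for i in range(1000)])
ds = ds.map(lambda row: {"feature": row["feature"], "label": row["feature"] % 2})

print(ds.take(3))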

Scalability

Ray allows users to utilize large compute clusters in an easy, productive, and resource-efficient way. Fundamentally, Ray treats the entire cluster as a single, unified pool of resources and takes care of optimally mapping compute workloads to the pool. By doing so, Ray largely eliminates non-scalable factors in the system.

A notable strength is Ray's autoscaler, which implements automatic scaling of Ray clusters based on the resource demands of an application. The autoscaler adds worker nodes when the Ray workload exceeds the cluster's capacity and scales them back down whenever they sit idle. Examples of successful user stories, such as Uber and Robovision, appear later in this section.
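As a hedged sketch of how this surfaces in application code (assuming a cluster that was launched with autoscaling enabled; heavy_step is an illustrative workload), nothing autoscaler-specific appears in the program itself:

import ray

ray.init(address="auto")  # connect to an existing, autoscaling Ray cluster

@ray.remote(num_cpus=4)
def heavy_step(shard):
    ...  # some CPU-bound work
    return shard

# Submitting more work than the current nodes can satisfy creates pending
# resource demand; the autoscaler responds by adding worker nodes and
# removes them again once they fall idle.
refs = [heavy_step.remote(i) for i in range(200)]
results = ray.get(refs)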

Support for heterogeneous hardware

Heterogeneous systems present new challenges to distribution because each compute unit has its own programming model. Ray natively supports heterogeneous hardware to achieve load balancing, coherency, and consistency under the hood. All you need to do is specify hardware when initializing a task or actor. For example, a developer can specify in the same application that one task needs 128 CPUs, another task only requires 0.5 GPUs, and an actor requires 36 CPUs and 8 GPUs.

# specify resources when starting a cluster
ray.init(num_cpus=128)

# specify resources for a compute task
@ray.remote(num_cpus=36, num_gpus=8)
def train_and_score_model():
    RoBERTa_pretrained = ...
    ...
    return score

# specify fractional GPUs
@ray.remote(num_gpus=0.5)
def tiny_ml():
    EfficientNet = load()
    ...
    return acc

# specify custom resources
@ray.remote(resources={"custom": 1})
def train_optimized():
    load_data = ...
    ...
    return model_val

Here, we easily specify the amount of resources needed by using num_cpus and num_gpus. An illustrative example is the production deep learning pipeline at Uber. A heterogeneous hardware setup of 8 GPU nodes and 9 CPU nodes improves the pipeline throughput by 50%, while substantially saving capital cost, compared with the legacy setup of 16 GPU nodes.

Deep learning pipeline at Uber using a heterogeneous hardware setup with Ray

Further, Uber, in a recent article, unveiled how they envision the third generation of their Michelangelo architecture using Ray in Ludwig 0.4.

Uber's Ludwig Architecture Illustration

Building Uber's 3rd Generation Michelangelo Architecture (Source: Uber's 2021 Blog Post)

Uber highlighted that Ray enabled them to treat their training pipeline as a single, intuitive script. Notably, the emphasis was on the necessity for a standardized approach to crafting ML libraries, both within Uber and industry-wide.

Robovision & Ray

For a tangible example of Ray being used, consider Robovision, a company that aimed to implement vehicle detection using five layered ML models. While they initially could only house their models within a single GPU or machine, Ray's introduction led to a threefold performance boost on the same hardware, as depicted below.

Robovision's Pipeline

With Ray, Robovision wrapped various tasks using Ray actors, leading to enhanced flexibility and direct GPU access. Implementing such a multi-layered model becomes straightforward with Ray.

@ray.remote
class Model:
    def __init__(self, next_actor=None):
        self.next = next_actor

    def runmodel(self, inp):
        out = process(inp)  # different for each stage
        if self.next is not None:
            self.next.runmodel.remote(out)  # pass the output to the next stage

# input_stream -> object_detector ->
# object_tracker -> speed_calculator -> result_collector

result_collector = Model.remote()
speed_calculator = Model.remote(next_actor=result_collector)
object_tracker = Model.remote(next_actor=speed_calculator)
object_detector = Model.remote(next_actor=object_tracker)

for inp in input_stream:
    object_detector.runmodel.remote(inp)

The code above illustrates the implementation of a sophisticated model using Ray, demonstrating how one can easily transition from a local machine to a distributed setting with Ray.

Advancements in ML Tools

The ML domain has seen rapid growth. Today, we have an array of tools, best practices, and established processes, including feature store, ML monitoring, responsible AI, and workflow orchestration. Yet, many organizations grapple with creating tailored ML CI/CD pipelines to deploy their models.

Specialized ML serving solutions are also seeing evolution. According to surveys by Anyscale and O'Reilly, we see almost an even split between users opting for Python-based web servers such as FastAPI and those preferring specialized ML serving solutions.

Survey results from Anyscale

A web server like FastAPI might be the go-to for infrastructure engineers, while ML engineers might favor specialized ML serving frameworks. One underlying reason for these varied preferences might be the question of "Who handles the end-stage deployment?"

Several questions arise:

  • Should ML experts be tasked with mastering Docker and Kubernetes?
  • Is it fair to expect backend engineers to delve into ML model deployment intricacies?
  • Is there a need for all-encompassing ML engineer 'rockstars' for production deployment?
  • How does one translate ensemble models into microservices?
  • Can ML and infrastructure engineers collaboratively code atop YAML files?

Though AI is becoming pivotal for businesses, MLOps and production model deployment remain challenges.

However, Ray promises to streamline your MLOps journey. Ray Serve also delineates the boundary between crafting a real-time multi-model pipeline and its deployment. Later, we'll go into how Ray Serve and FastAPI harmoniously meld the two realms: the conventional web server and specialized ML serving solutions.
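As a small preview of that integration, here is a minimal sketch combining Ray Serve with FastAPI (the SentimentModel deployment, its route, and the placeholder scoring logic are illustrative, not a prescribed pattern):

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class SentimentModel:
    def __init__(self):
        self.model = ...  # load a model here

    @app.get("/predict")
    def predict(self, text: str) -> dict:
        # placeholder scoring logic
        return {"text": text, "score": 0.5}

# Deploy the model behind an HTTP endpoint served by Ray Serve.
serve.run(SentimentModel.bind())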