Levels of Ray Integration

Ray's integration with various libraries can be categorized into different levels or patterns, each solving unique challenges for both library authors and end-users.

Level 1: Scheduling Only, Out-of-Band Communication

Libraries at this level, such as Horovod and PyTorch Lightning, use Ray for high-level scheduling but handle communication themselves, out of band. This is often the right choice for libraries that already have a mature communication stack but need additional capabilities such as fine-grained control over worker scheduling and fault tolerance.

# Example: out-of-band communication in Ray.
# Initialize the actors; Ray schedules them, but they communicate
# among themselves (e.g., over MPI) rather than through Ray.
workers = [MPIActor.remote(...) for _ in range(10)]
# ...
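A slightly fuller sketch of the same pattern is shown below. It assumes a hypothetical TrainingWorker actor: Ray handles placement, restarts, and lifecycle, while the collective communication (e.g., torch.distributed, MPI, or NCCL) is set up entirely out of band.

import ray

ray.init()

# Hypothetical worker actor: Ray schedules it and restarts it on failure,
# but gradient exchange happens over an out-of-band channel, not through Ray.
@ray.remote(num_cpus=1, max_restarts=2)
class TrainingWorker:
    def setup(self, rank, world_size, master_addr):
        # e.g., torch.distributed.init_process_group(...) or MPI init;
        # Ray is not involved in this communication path.
        self.rank = rank

    def train(self):
        # All-reduces / gradient exchange happen out of band (NCCL/MPI/Gloo).
        return f"worker {self.rank} finished"

workers = [TrainingWorker.remote() for _ in range(4)]
ray.get([w.setup.remote(i, 4, "10.0.0.1") for i, w in enumerate(workers)])
print(ray.get([w.train.remote() for w in workers]))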

Level 2: Scheduling and Communication via Ray

Libraries like Hugging Face and scikit-learn rely on Ray for both distributed scheduling and communication, i.e., Ray task and actor calls. This level is ideal for libraries that need low-latency communication, parallel computation, and the ability to coordinate a complex topology of actors or tasks (e.g., for implementing reinforcement learning, an online decision system, or a model-serving pipeline).

# Task example: fan out stateless tasks and gather their results.
results = [evaluate.remote(arg, latest_params) for arg in work_to_do]
ray.get(results)

# Actor example: stateful workers updated and queried via remote calls.
workers = [Actor.remote() for _ in range(5)]
for w in workers:
    w.update.remote(latest_params)
results = [w.evaluate.remote() for w in workers]
ray.get(results)

Under the hood, Ray translates these task and actor invocations into low-level gRPC calls.
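For completeness, here is a self-contained version of the snippet above. The names evaluate, Actor, work_to_do, and latest_params are hypothetical stand-ins, so the logic is illustrative rather than taken from any particular library.

import ray

ray.init()

latest_params = {"weights": [0.1, 0.2, 0.3]}
work_to_do = list(range(8))

# Stateless task: arguments and results travel over Ray's RPC layer.
@ray.remote
def evaluate(item, params):
    return item * len(params["weights"])

# Stateful actor: holds parameters between calls.
@ray.remote
class Actor:
    def __init__(self):
        self.params = None

    def update(self, params):
        self.params = params

    def evaluate(self):
        return len(self.params["weights"])

# Task example.
results = ray.get([evaluate.remote(item, latest_params) for item in work_to_do])

# Actor example.
workers = [Actor.remote() for _ in range(5)]
ray.get([w.update.remote(latest_params) for w in workers])
results = ray.get([w.evaluate.remote() for w in workers])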

Level 3: Scheduling, Communication, and Distributed Memory

XGBoost, Modin, Dask, and other libraries at this level integrate fully with Ray, including its distributed object store: first-class object references, object spilling for large-scale data processing workloads, and shared-memory support so that large objects can be shared by multiple workers on the same machine without any copies.

# Store a large dataset in the object store.
data_R = [ray.put(block) for block in large_data_blocks]
 
# Store a small dataset in the object store.
data_S = ray.put(small_data)
 
# Example of implementing broadcast join between R and S using tasks.
joined_R_S = [join.remote(R_i, data_S) for R_i in data_R]
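To make the broadcast join above concrete, here is a minimal sketch assuming NumPy arrays as the data blocks and a hypothetical join task that filters each block of R against S. Because ray.put places objects in the shared-memory object store, tasks on the same node read R_i and S without copying them.

import numpy as np
import ray

ray.init()

# Hypothetical data: R is large and partitioned into blocks, S is small.
large_data_blocks = [np.random.randint(0, 100, size=1_000_000) for _ in range(4)]
small_data = np.arange(50)

@ray.remote
def join(r_block, s):
    # Both arguments are read directly from the node's shared-memory object
    # store (zero-copy for NumPy arrays); no per-worker copies are made.
    return r_block[np.isin(r_block, s)]

data_R = [ray.put(block) for block in large_data_blocks]
data_S = ray.put(small_data)
joined_R_S = ray.get([join.remote(R_i, data_S) for R_i in data_R])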

Ray's built-in libraries, such as Tune, RLlib, and Serve, are also examples of Level 3 libraries, leveraging the object store to provide best-in-class performance and flexibility.
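As a small illustration, here is a minimal Tune sketch using the Ray 2.x Tuner API with a hypothetical objective function; Tune runs each trial as a Ray task or actor and moves configurations and results through the object store under the hood.

from ray import tune

# Hypothetical objective; returning a dict reports the trial's final metrics.
def objective(config):
    return {"score": config["x"] ** 2}

tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3])})
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)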

Choosing the Right Level of Integration

While it might seem that "more integration is better," the rule of least power suggests opting for the minimal level of integration that satisfies the library's needs.

