Levels of Ray Integration

Ray's integration with various libraries can be categorized into different levels or patterns, each solving unique challenges for both library authors and end-users.

Level 1: Scheduling Only, Out-of-Band Communication

Libraries at this level, such as Horovod and PyTorch Lightning, use Ray for high-level scheduling but handle communication themselves, out of band. This is often the right choice for libraries that already have a mature communication stack but need additional capabilities such as fine-grained control over worker scheduling and fault tolerance.

# Example: out-of-band communication in Ray.
# Initialize the actors; Ray schedules them, but they communicate
# among themselves (e.g., over MPI) rather than through Ray.
workers = [MPIActor.remote(...) for _ in range(10)]
# ...
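A slightly fuller sketch of the same pattern is shown below. It assumes a hypothetical TrainingWorker actor: Ray handles placement, restarts, and lifecycle, while the collective communication (e.g., torch.distributed, MPI, or NCCL) is set up entirely out of band.

import ray

ray.init()

# Hypothetical worker actor: Ray schedules it and restarts it on failure,
# but gradient exchange happens over an out-of-band channel, not through Ray.
@ray.remote(num_cpus=1, max_restarts=2)
class TrainingWorker:
    def setup(self, rank, world_size, master_addr):
        # e.g., torch.distributed.init_process_group(...) or MPI init;
        # Ray is not involved in this communication path.
        self.rank = rank

    def train(self):
        # All-reduces / gradient exchange happen out of band (NCCL/MPI/Gloo).
        return f"worker {self.rank} finished"

workers = [TrainingWorker.remote() for _ in range(4)]
ray.get([w.setup.remote(i, 4, "10.0.0.1") for i, w in enumerate(workers)])
print(ray.get([w.train.remote() for w in workers]))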

Level 2: Scheduling and Communication via Ray

Libraries like Hugging Face and scikit-learn rely on Ray for both distributed scheduling and communication, i.e., Ray task and actor calls. This level is ideal for libraries that need low-latency communication, parallel computation, and the ability to coordinate a complex topology of actors or tasks (e.g., for implementing reinforcement learning, an online decision system, or a model-serving pipeline).

# Task example: fan out stateless tasks and gather their results.
results = [evaluate.remote(arg, latest_params) for arg in work_to_do]
ray.get(results)

# Actor example: stateful workers updated and queried via remote calls.
workers = [Actor.remote() for _ in range(5)]
for w in workers:
    w.update.remote(latest_params)
results = [w.evaluate.remote() for w in workers]
ray.get(results)

Under the hood, Ray translates these task and actor invocations into low-level gRPC calls.
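For completeness, here is a self-contained version of the snippet above. The names evaluate, Actor, work_to_do, and latest_params are hypothetical stand-ins, so the logic is illustrative rather than taken from any particular library.

import ray

ray.init()

latest_params = {"weights": [0.1, 0.2, 0.3]}
work_to_do = list(range(8))

# Stateless task: arguments and results travel over Ray's RPC layer.
@ray.remote
def evaluate(item, params):
    return item * len(params["weights"])

# Stateful actor: holds parameters between calls.
@ray.remote
class Actor:
    def __init__(self):
        self.params = None

    def update(self, params):
        self.params = params

    def evaluate(self):
        return len(self.params["weights"])

# Task example.
results = ray.get([evaluate.remote(item, latest_params) for item in work_to_do])

# Actor example.
workers = [Actor.remote() for _ in range(5)]
ray.get([w.update.remote(latest_params) for w in workers])
results = ray.get([w.evaluate.remote() for w in workers])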

Level 3: Scheduling, Communication, and Distributed Memory

XGBoost, Modin, Dask, and other libraries at this level integrate fully with Ray, including its distributed object store: first-class object references, object spilling for large-scale data processing workloads, and shared-memory support so that large objects can be shared by multiple workers on the same machine without any copies.

# Store a large dataset in the object store.
data_R = [ray.put(block) for block in large_data_blocks]
 
# Store a small dataset in the object store.
data_S = ray.put(small_data)
 
# Example of implementing broadcast join between R and S using tasks.
joined_R_S = [join.remote(R_i, data_S) for R_i in data_R]
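To make the broadcast join above concrete, here is a minimal sketch assuming NumPy arrays as the data blocks and a hypothetical join task that filters each block of R against S. Because ray.put places objects in the shared-memory object store, tasks on the same node read R_i and S without copying them.

import numpy as np
import ray

ray.init()

# Hypothetical data: R is large and partitioned into blocks, S is small.
large_data_blocks = [np.random.randint(0, 100, size=1_000_000) for _ in range(4)]
small_data = np.arange(50)

@ray.remote
def join(r_block, s):
    # Both arguments are read directly from the node's shared-memory object
    # store (zero-copy for NumPy arrays); no per-worker copies are made.
    return r_block[np.isin(r_block, s)]

data_R = [ray.put(block) for block in large_data_blocks]
data_S = ray.put(small_data)
joined_R_S = ray.get([join.remote(R_i, data_S) for R_i in data_R])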

Ray's built-in libraries, such as Tune, RLlib, and Serve, are also examples of Level 3 libraries, leveraging the object store to provide best-in-class performance and flexibility.
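As a small illustration, here is a minimal Tune sketch using the Ray 2.x Tuner API with a hypothetical objective function; Tune runs each trial as a Ray task or actor and moves configurations and results through the object store under the hood.

from ray import tune

# Hypothetical objective; returning a dict reports the trial's final metrics.
def objective(config):
    return {"score": config["x"] ** 2}

tuner = tune.Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3])})
results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)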

Choosing the Right Level of Integration

While it might seem that "more integration is better," the rule of least power suggests opting for the minimal level of integration that satisfies the library's needs.

