Optimizing Performance Bottlenecks

Optimizing

In the context of Ray, optimization means improving the speed and efficiency of an application. Distributed systems can be costly to run and maintain because multiple interconnected components must work together consistently and reliably. By optimizing performance, organizations can achieve better results at lower cost.

Ray has two common types of optimization issues: resource constraints and design anti-patterns.

  1. Resource constraints
  2. Design anti-patterns
    • Small tasks: Ray introduces overhead for managing each task, so if the task takes less than that overhead (around 10 milliseconds), you are likely to see worse performance.
    • Variable durations: Calling ray.get on a batch of tasks with varying durations limits performance to the slowest task.
    • Multi-threaded libraries: When tasks that use multi-threaded libraries all compete for the same CPU cores, the resulting contention prevents runtime improvements.
    • And much more.

Each use case comes with its own unique set of challenges to troubleshoot. In the following section, you will walk through a known design anti-pattern and practice using the relevant observability tools to profile the bottleneck.

Anti-pattern using ray.get()

Consider a scenario where you have a batch of independent tasks that are submitted at the same time. Each task can take a variable amount of time to complete. That is, one task may be completed quickly while another may take a long time.

In this anti-pattern, calling ray.get() on the entire batch limits your performance to the longest-running task.

The corresponding anti-pattern code is shown below. Pay special attention to the progress bar for tasks as well as the timeline view:

import random
import time

import ray


@ray.remote
def sleep_task(i: int) -> int:
    time.sleep(i)
    return i


def post_processing_step(new_val: int) -> None:
    time.sleep(0.5)


big_sleep_times = [25]
small_sleep_times = [random.random() for _ in range(20)]
SLEEP_TIMES = big_sleep_times + small_sleep_times

# Launch remote tasks.
refs = [sleep_task.remote(i) for i in SLEEP_TIMES]
for ref in refs:
    # Blocks until this ObjectRef is ready.
    result = ray.get(ref)  # Retrieve results in submission order.
    post_processing_step(result)  # Process the result.

Design pattern using ray.wait()

Instead of calling ray.get() on a batch of variable-duration tasks, use ray.wait() to obtain the first object reference that is ready, then call ray.get() to retrieve its result. This way, you can apply the post-processing function as soon as each result becomes available, instead of letting the submission order slow down the pipeline.

A solution to the anti-pattern is shown below.

# Launch remote tasks.
refs = [sleep_task.remote(i) for i in SLEEP_TIMES]
unfinished = refs
while unfinished:
    # Wait until one ObjectRef is ready; returns (finished, unfinished) lists.
    finished, unfinished = ray.wait(unfinished, num_returns=1)
    # Retrieve the first ready result.
    result = ray.get(finished[0])
    # Process the result.
    post_processing_step(result)

Summary

The Ray Dashboard offers a range of tools for identifying and resolving performance bottlenecks. It provides a timeline view, per-task progress bars, and other profiling tools to help you improve the speed and efficiency of your application. With these tools, you can analyze different aspects of your application, pinpoint areas for improvement, and resolve performance issues, ultimately reducing costs.