Debugging Failures

Debugging

Within the context of Ray, debugging an application means diagnosing failures that emerge in remote processes. This involves the interplay of two main APIs, whose error-handling model is very similar to standard Python future APIs (a minimal sketch follows the list below):

  • .remote - Creates a task or actor and starts a remote process; will return an exception if the remote process fails.
  • .get - Retrieves the result from an object reference; raises an exception if the remote process failed.
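
To make the model concrete, here is a minimal sketch (the task is purely illustrative) of an exception raised in user code surfacing when the result is requested:

import ray

# The exception raised inside the remote process is stored with the result
# and re-raised on the caller's side when .get is called.
@ray.remote
def faulty_task():
    raise ValueError("error in user code")

try:
    ray.get(faulty_task.remote())
except ray.exceptions.RayTaskError as e:
    print(f"Remote task failed: {e}")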

Ray's exception APIs can be grouped into three primary failure modes:

  1. Application failures - These occur when a remote task or actor fails due to errors in user code. Exceptions thrown include RayTaskError and RayActorError.

  2. Intentional system failures - These indicate that a remote process was terminated deliberately rather than crashing on its own, most commonly through cancellation APIs such as ray.cancel for tasks or ray.kill for actors (see the sketch after this list).

  3. Unintended system failures - These arise when a remote process fails due to unforeseen system issues, such as process crashes or node failures. Typical cases include:
    • The out of memory killer terminates processes.
    • The machine is terminated, particularly in the case of spot instances.
    • The system is highly overloaded or stressed, leading to failure.
    • Bugs within Ray Core (relatively infrequent).
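
To illustrate the second failure mode, the sketch below cancels an in-flight task and catches the resulting exception (the exact exception type depends on whether the task had started running when the cancellation landed):

import time

import ray

@ray.remote
def slow_task():
    time.sleep(60)

ref = slow_task.remote()
ray.cancel(ref)  # Deliberately stop the remote process.

try:
    ray.get(ref)
except (ray.exceptions.TaskCancelledError, ray.exceptions.RayTaskError) as e:
    # A pending task surfaces as TaskCancelledError; a running one may surface
    # as a RayTaskError wrapping the interrupt.
    print(f"Task was stopped on purpose: {type(e).__name__}")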

Each of these failure modes calls for a specialized approach. In the following section, we focus on an out of memory (OOM) example. Feel free to refer to the debugging user guides for further information on the other types of failures.

Example: Out of Memory Errors

In the context of distributed computing, an out of memory (OOM) error occurs when a node in a cluster tries to use more memory than is available, causing processes to be killed or the node itself to fail. Ray applications use memory in two main ways:

  1. System memory: Memory used internally by Ray to manage resources and processes.
  2. Application memory: Memory used by your application, including objects created in the object store with ray.put and read back with ray.get; objects spill to disk if the store fills up (a small sketch follows this list).
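
As a small illustration of application memory (the array size here is an arbitrary choice), the sketch below stores an object in the object store and reads it back:

import numpy as np
import ray

# Place an ~8 MB array in the shared object store and retrieve it by reference.
array = np.zeros((1_000, 1_000), dtype=np.float64)
ref = ray.put(array)       # Application memory: the array now lives in the object store.
restored = ray.get(ref)    # Numpy arrays are typically read back as zero-copy views.
assert restored.shape == (1_000, 1_000)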

Consider the following example which continuously leaks memory by appending gigabyte arrays of zeros until the node runs out of memory:

Warning: This script is designed to cause a memory leak and lead to an out of memory (OOM) error. Use caution when executing this code and only run it in a controlled environment.

import time

import ray

running = False  # Safety switch: set to True to actually run the memory leaker.

@ray.remote(max_retries=0)
def memory_leaker():
    chunks = []
    bytes_per_chunk = 1024 * 1024 * 1024  # 1 gigabyte.
    while running:
        chunks.append([0] * bytes_per_chunk)  # Hold on to every chunk so nothing is freed.
        time.sleep(5)  # Delay to observe the leak.

ray.get(memory_leaker.remote())

In this script, I define a Ray Task that leaks memory by continuously appending one-gigabyte arrays of zeros to a list. Depending on your resource configuration, memory usage will keep climbing until the OOM killer terminates the task with the following error message:

ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.

Note: By default, if a worker dies unexpectedly, Ray will retry the task up to 3 times. You can control this with max_retries in the ray.remote decorator (e.g. 0 to disable retries, -1 for infinite retries).
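
The retry policy can also be overridden per call with .options; the task below is a hypothetical stand-in:

import ray

@ray.remote(max_retries=0)  # Fail fast by default: never retry this task.
def fragile_task():
    return "ok"

# Override the policy for this particular call: retry up to 3 times if the
# worker process dies unexpectedly (0 disables retries, -1 retries forever).
ref = fragile_task.options(max_retries=3).remote()
print(ray.get(ref))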

Effective use of observability tools is essential for preventing OOM incidents. While you may be familiar with using htop or free -h to take snapshots of a system's running processes, these provide limited utility in distributed systems. Ideally, you want a tool that summarizes resource utilization per node and per worker without having to inspect each one individually. Through the Ray Dashboard, you can view a live feed of vital information about your cluster and application during OOM events:

  • Nodes: Offers a snapshot of memory usage per node and worker.
  • Metrics: Shows historical usage on an active cluster via Prometheus and Grafana; comes built-in with the Anyscale console.
  • Logs: Accessible, searchable, and filterable via the Nodes and Logs views.

Example: Hanging Errors

Another common issue Ray users encounter is a hang, which occurs when an object reference created by .remote() cannot be retrieved with .get(). Hangs can cause long delays or outright failure. You can detect these issues in the following ways (a programmatic sketch follows the list):

  1. State API
    • ray status - Pending tasks and actors will surface in the "Demands" summary.
  2. Ray Dashboard
    • Progress Bar - Check the status of tasks and actors.
    • Metrics - Visualize the state of tasks and actors over time.
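
Beyond the ray status CLI, the same information can be queried from Python. The following is a small sketch, assuming a Ray 2.x cluster is already running and its state API is reachable:

from ray.util.state import list_tasks

# List tasks that are still waiting for their upstream arguments to become
# available; a large or stagnant count here usually points at a hang.
pending = list_tasks(filters=[("state", "=", "PENDING_ARGS_AVAIL")])
print(f"{len(pending)} tasks are waiting on upstream dependencies")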

Consider the following example that reproduces a lightweight hang (not an indefinite, failing hang error) due to the nature of the dependencies:

import random
import time

import ray

@ray.remote
def long_running_task():
    time.sleep(random.randint(10, 60))  # Each dependency takes 10-60 seconds.

@ray.remote
def dependent_task(dependencies: list[ray.ObjectRef]):
    ray.get(dependencies)  # Blocks until every upstream task has finished.

dependencies = [long_running_task.remote() for _ in range(100)]
dependent_task.remote(dependencies)

This materializes as follows in the Ray Dashboard:

[Figure: the Ray Dashboard showing dependent_task hanging while its dependencies run]

Here, the dependent_task must wait until all of its dependencies (100 long_running_tasks) return before it can resolve. Using the State API and Ray Dashboard, you can track the progress of each task and subtask. While this example is an engineered hang, in practice hangs typically come down to a handful of common causes (see the sketch after this list for one way to guard against them):

  1. Application bugs
  2. Resource constraints
    • Waiting for available resources, which can be complicated by placement groups.
  3. Insufficient object store memory
    • There's not enough memory to pull objects efficiently to the local node.
  4. Pending upstream dependencies
    • Dependencies may not be scheduled yet or they are still running.
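
One way to keep a hang from blocking the driver indefinitely is to poll with ray.wait instead of calling ray.get directly. The following is a small sketch along the lines of the example above (the 30-second timeout is an arbitrary choice):

import random
import time

import ray

@ray.remote
def long_running_task():
    time.sleep(random.randint(10, 60))  # Simulate a slow upstream dependency.

refs = [long_running_task.remote() for _ in range(100)]

# Wait up to 30 seconds for everything to finish instead of blocking forever.
ready, not_ready = ray.wait(refs, num_returns=len(refs), timeout=30)
if not_ready:
    print(f"{len(not_ready)} tasks are still pending; investigate before calling ray.get")
else:
    results = ray.get(ready)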

Summary

Debugging Ray applications involves detecting errors, taking snapshots of current usage, and analyzing historical data to home in on the issue. These workflows are made easier by observability tools, especially the Ray Dashboard, which acts as a central repository for metrics and logs about a Ray cluster.