Ray Observability

Effective debugging and optimization are crucial in any system, but they become even more vital in distributed settings. With a high number of interconnected, heterogeneous components, updates can cause opaque failures, making it difficult to identify and fix issues. The goal of observability is to enable monitoring and understanding of the internal state of the system in real time to ensure its stability and performance.

Within the context of Ray, observability refers to the extent of visibility into distributed applications along with the available tools for inspecting and aggregating performance data.

Why is observability challenging in distributed systems?

A Ray cluster consists of a head node that manages a large number of worker nodes which execute the code of an application. As the scale of the cluster increases, so does the number of tasks, actors, and objects that are concurrently executed and stored across heterogeneous machines.

Actor failure in a Ray cluster with millions of concurrent tasks among thousands of worker nodes

Debugging issues in this environment can be challenging. Among thousands of nodes and tens of thousands of actors, performance bottlenecks, failures, and unpredictable behavior are inevitable. For example, diagnosing a cluster with thousands of actors and millions of processes involves any number of non-trivial pitfalls:

  • How do I know if an actor has failed, especially if the actor failed to initialize in the first place?
    • Some processes may become stuck indefinitely due to incomplete scheduling, known as "hang."
  • When I become aware of actor failure(s), how do I know which one(s) caused the issue?
    • In a set-up of tens of thousands of actors, how do I begin to identify the culprit?
  • Once I know which actor(s) to inspect, how can I find the log and fix the bug?
    • Filtering through logs and tracing failures is impossible without robust tooling.

The Ray runtime manages much of the low-level system behavior in a Ray application, which poses a unique opportunity to offer built-in performance data. By providing the right tools to successfully debug, optimize, and monitor Ray applications, developers can troubleshoot issues and improve a system's overall reliability and efficiency.

Ray observability toolbox

Ray offers observability tooling and third-party integrations at each layer of development that allow you to understand the Ray cluster, Ray application, and ML application. Ray's observability stack is as follows:

Ray observability stack

| Layer | Tooling | Description | Example questions |
|---|---|---|---|
| Ray Cluster | State API, Ray Dashboard | Infrastructure observability, like htop for a Ray cluster. Monitor the status and utilization including hardware (CPUs, GPUs, TPUs), network, and memory. | Which nodes in my Ray cluster are experiencing high CPU or memory usage so that I can optimize consumption and reduce costs? |
| Ray Application (Core and AIR) | State API, Ray Dashboard, Profiler, Debugger | Debug, optimize, and monitor Ray applications including the status of tasks, actors, objects, placement groups, jobs, and more. | Are my thousands of in-flight tasks and actors progressing normally, or are some failing or hanging in unintended ways? If so, which ones, and how can I access the logs easily for any given node? |
| ML Application | Interactive development: Weights & Biases, MLflow, Comet. Production: Ray Dashboard, Arize, WhyLabs | Monitor ML models in interactive development and production through third-party integrations. | Which hyperparameters are optimal for my model? How does model performance vary over time or across different input data? Are there any anomalies in the model's behavior in production? |

The two main observability tools in Ray are:

  1. The State API
  2. The Dashboard UI

The State API

State APIs allow users to access the current state of Ray resources through the CLI or the Python SDK.

  • Resources include Ray tasks, actors, objects, placement groups, and more.
  • States refer to the immutable metadata (e.g. actor's name) and mutable states (e.g. actor's scheduling state or pid) of Ray resources.

There are three main APIs that allow you to inspect cluster resources with varying levels of granularity:

  • summary returns a summarized view of a given resource (i.e. tasks, actors, objects).
  • list returns a list of resources filterable by type and state.
  • get returns information about a specific resource in detail.
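To make the three granularity levels concrete, the sketch below mimics `summary`, `list`, and `get` style views over plain Python dicts shaped like the `ray list actors` output shown later. The records here are made up for illustration; real data would come from the CLI or the Python SDK:

```python
from collections import Counter

# Hypothetical actor records, shaped like the `ray list actors` output
# shown below; the ids and class names are invented for illustration.
actors = [
    {"actor_id": "8d36...01", "class_name": "Worker", "state": "ALIVE"},
    {"actor_id": "9f01...02", "class_name": "Worker", "state": "DEAD"},
    {"actor_id": "aa12...03", "class_name": "Driver", "state": "ALIVE"},
]

# `summary`-style view: aggregate counts per (class, state).
summary = Counter((a["class_name"], a["state"]) for a in actors)

# `list`-style view: filter resources by type and state.
dead = [a for a in actors if a["state"] == "DEAD"]

# `get`-style view: one resource in full detail, looked up by id.
def get_actor(actor_id):
    return next(a for a in actors if a["actor_id"] == actor_id)
```

The same progression (aggregate, filter, drill down) is the usual debugging loop: `summary` to spot an anomaly, `list` to narrow it to specific resources, `get` to inspect the culprit.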

In addition, you can also easily retrieve and filter through Ray logs:

  • logs returns the logs of tasks, actors, workers, or system log files.
```shell
!ray summary tasks
```

Will return:

```
======== Tasks Summary: 2023-09-18 10:39:37.407067 ========
total_actor_scheduled: 0
total_actor_tasks: 0
total_tasks: 1

Table (group by func_name):
0   task                  RUNNING: 1      NORMAL_TASK
```
```shell
!ray list actors
```

Will return:

```
======== List: 2023-09-18 10:39:46.320557 ========
Total: 1

    ACTOR_ID                          CLASS_NAME    STATE      JOB_ID  NAME    NODE_ID                                                     PID  RAY_NAMESPACE
 0  8d361f9d3aa156f33bce4efc01000000  Actor         ALIVE    01000000          2d36ebc49acb252f8812095b272c2f354f36e73144d5e0d813be9fe4  27288  b7d3a49f-83e5-4191-8a8b-0fc068ff3384
```

Fetching Cluster Information

Several Ray Core methods return information about the cluster:

| Method | Brief Description |
|---|---|
| ray.get_gpu_ids() | GPUs |
| ray.nodes() | Cluster nodes |
| ray.cluster_resources() | All the available resources, used or not |
| ray.available_resources() | Resources not in use |

You can see the full list of methods in the Ray Core API documentation.

```python
print(f"ray.get_gpu_ids():          {ray.get_gpu_ids()}")
print(f"ray.nodes():                {ray.nodes()}")
print(f"ray.cluster_resources():    {ray.cluster_resources()}")
print(f"ray.available_resources():  {ray.available_resources()}")
```


```
ray.get_gpu_ids():          []
ray.nodes():                [{'NodeID': '7e751b18012c59880d7a04e848abdff4701df8bd06485a5329a8e4bd', 'Alive': True, 'NodeManagerAddress': '', 'NodeManagerHostname': 'DESKTOP-XXX', 'NodeManagerPort': 56193, 'ObjectManagerPort': 56191, 'ObjectStoreSocketName': 'tcp://', 'RayletSocketName': 'tcp://', 'MetricsExportPort': 61928, 'NodeName': '', 'alive': True, 'Resources': {'node:__internal_head__': 1.0, 'object_store_memory': 5619240960.0, 'memory': 11238481920.0, 'node:': 1.0, 'CPU': 16.0}, 'Labels': {}}]
ray.cluster_resources():    {'CPU': 16.0, 'object_store_memory': 5619240960.0, 'memory': 11238481920.0, 'node:': 1.0, 'node:__internal_head__': 1.0}
ray.available_resources():  {'CPU': 16.0, 'object_store_memory': 5619240960.0, 'node:': 1.0, 'memory': 11238481920.0, 'node:__internal_head__': 1.0}
```
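Since `ray.cluster_resources()` and `ray.available_resources()` both return plain dicts keyed by resource name, per-resource utilization can be derived by subtracting one from the other. A minimal sketch, using numbers shaped like the output above (the available CPU figure is invented here to show a partially used cluster):

```python
def utilization(cluster, available):
    """Fraction of each resource currently in use, computed from the
    totals and the free amounts."""
    used = {}
    for name, total in cluster.items():
        free = available.get(name, 0.0)
        used[name] = (total - free) / total if total else 0.0
    return used

# Numbers shaped like the example output above; the available CPU
# value is made up for illustration.
cluster = {"CPU": 16.0, "memory": 11238481920.0}
available = {"CPU": 12.0, "memory": 11238481920.0}
u = utilization(cluster, available)  # CPU 25% used, memory 0% used
```

This is the same arithmetic the Dashboard's utilization views perform for you; doing it by hand is mainly useful in scripts that alert on high consumption.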

Ray Dashboard

Ray records and emits time-series metrics using the Prometheus format. Ray does not provide a native storage solution for metrics. Users need to manage the lifecycle of the metrics by themselves. This page provides instructions on how to collect and monitor metrics from Ray Clusters.

Embedding Grafana visualizations into Ray Dashboard

To view embedded time-series visualizations in Ray Dashboard, the following must be set up:

  1. The head node of the cluster is able to access Prometheus and Grafana.
  2. The browser of the dashboard user is able to access Grafana.

Configure these settings using the RAY_GRAFANA_HOST, RAY_PROMETHEUS_HOST, and RAY_GRAFANA_IFRAME_HOST environment variables when you start the Ray cluster.

  • Set RAY_GRAFANA_HOST to an address that the head node can use to access Grafana. The head node performs health checks on Grafana on the backend.
  • Set RAY_PROMETHEUS_HOST to an address that the head node can use to access Prometheus.
  • Set RAY_GRAFANA_IFRAME_HOST to an address that the user's browser can use to access Grafana and embed visualizations. If RAY_GRAFANA_IFRAME_HOST is not set, Ray Dashboard uses the value of RAY_GRAFANA_HOST.
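The fallback described in the last bullet can be sketched as a small resolver; the function name and the host addresses below are hypothetical placeholders for illustration, not defaults Ray ships with:

```python
import os

def resolve_iframe_host(env=None):
    """Address the browser uses for embedded Grafana panels:
    RAY_GRAFANA_IFRAME_HOST if set, otherwise RAY_GRAFANA_HOST."""
    env = os.environ if env is None else env
    return env.get("RAY_GRAFANA_IFRAME_HOST") or env.get("RAY_GRAFANA_HOST")

# Hypothetical placeholder addresses, for illustration only.
only_host = {"RAY_GRAFANA_HOST": "http://localhost:3000"}
both = {"RAY_GRAFANA_HOST": "http://localhost:3000",
        "RAY_GRAFANA_IFRAME_HOST": "http://grafana.example.com:3000"}
```

Splitting the two variables matters when the head node reaches Grafana over an internal address that the user's browser cannot resolve.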

For example, if the IP of the head node is and Grafana is hosted on port 3000, set the value to RAY_GRAFANA_HOST=

If all the environment variables are set properly, you should see time-series metrics in Ray Dashboard.

In notebooks for development, you can set the environment variables before starting Ray. For example:

```python
import os

os.environ['RAY_GRAFANA_HOST'] = ''
os.environ['RAY_PROMETHEUS_HOST'] = ''
os.environ['RAY_GRAFANA_IFRAME_HOST'] = ''
```

Setting up Prometheus

Use Prometheus to scrape metrics from Ray Clusters. Ray doesn't start Prometheus servers for users; you need to decide where to host Prometheus and configure it to scrape the metrics from your clusters.

Ray provides a Prometheus config that works out of the box. After running Ray, you can find the config at /tmp/ray/session_latest/metrics/prometheus/prometheus.yml. You can use this config as-is or fold its settings into your own. You need to start Prometheus and Grafana with the config files provided by Ray so that:

  • Prometheus can scrape the metrics from the Ray cluster properly
  • Grafana can talk to Prometheus and visualize the metrics with the template dashboard provided by Ray

See the details of the config below, and find the full documentation here. With this config, Prometheus uses the prom_metrics_service_discovery.json file provided by Ray to dynamically find the endpoints to scrape, which Ray auto-generates and updates.

```yaml
# Prometheus config file

# my global config
global:
  scrape_interval: 2s
  evaluation_interval: 2s

# Scrape from Ray.
scrape_configs:
- job_name: "ray"
  file_sd_configs:
  - files:
    - "/tmp/ray/prom_metrics_service_discovery.json"
```
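The prom_metrics_service_discovery.json file follows Prometheus's file-based service discovery format: a JSON array of objects with a `targets` list of host:port strings and optional `labels`. As a sketch, the snippet below writes and reads such a file with a made-up target (Ray auto-generates the real one, so you never create it yourself):

```python
import json
import os
import tempfile

# A made-up discovery file in the Prometheus file_sd format; Ray
# auto-generates the real one at
# /tmp/ray/prom_metrics_service_discovery.json.
discovery = [{"labels": {"job": "ray"}, "targets": ["127.0.0.1:61928"]}]

path = os.path.join(tempfile.mkdtemp(), "prom_metrics_service_discovery.json")
with open(path, "w") as f:
    json.dump(discovery, f)

# Collect every endpoint Prometheus would scrape from the file.
with open(path) as f:
    targets = [t for entry in json.load(f) for t in entry["targets"]]
```

Because Prometheus re-reads the file on change, Ray can add or remove scrape endpoints as nodes join and leave without restarting Prometheus.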

I ended up using the following to get it working because of some issues on Windows:

```shell
cp -r C:/Users/Shav/AppData/Local/Temp/ray/prom_metrics_service_discovery.json .
./prometheus.exe --config.file=prometheus.yml
```

This starts Prometheus with the Ray configuration and exposes it on port 9090.

Setting up Grafana

I faced similar issues creating a new Grafana server with the provided dashboards. The docs say to run:

```shell
./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web
```

But what worked for me was:

```shell
cd /tmp/ray
# Get the latest session folder
FOLDER=$(ls -td */ | head -1)
cd $FOLDER/metrics/grafana/
# Point the 'provisioning = ...' line in grafana.ini at the absolute
# Windows path of the provisioning directory
sed -i 's#provisioning = .*#provisioning = "C:/Users/Shav/AppData/Local/Temp/ray/'$FOLDER'/metrics/grafana/provisioning"#' grafana.ini
cat grafana.ini
# Back to the original working directory
cd $CWD/grafana/
./bin/grafana-server --config C:/Users/Shav/AppData/Local/Temp/ray/$FOLDER/metrics/grafana/grafana.ini web
```

This runs Grafana on port 3000. You can then access Grafana at http://localhost:3000 and log in with the default credentials admin:admin. See the default dashboard in Grafana:

Grafana Dashboard

The same metric graphs are accessible in Ray Dashboard after you integrate Grafana with Ray Dashboard.

Ray Dashboard UI

The Ray Dashboard offers a built-in mechanism for viewing the state of cluster resources, time series metrics, and other features at a glance. Information made available via the terminal with the State API is also available to view in the Dashboard UI.

Ray Dashboard UI

The overview page of the dashboard offers live monitoring and quick links to common views such as metrics, nodes, jobs, and events related to Ray job submission APIs and the Ray autoscaler. Along the top navigation bar, you will find five viewing displays:

  1. Jobs - View the status and logs of Ray jobs.
  2. Cluster - View state, resource utilization, and logs for each node and worker.
  3. Actors - View information about the actors that have existed on the Ray cluster.
  4. Metrics - View time series metrics that automatically refresh every 15 seconds; requires Prometheus and Grafana running for your cluster.
  5. Logs - View all logs, organized by node and file name; supports filter and search.

Each view offers plenty of categories to monitor, and you can refer to the dashboard references for a more complete description.