Cloud & MLOps ☁️
Deploying Ray & Ray Serve

Launching Ray Clusters on AWS

To start an AWS Ray cluster, you should use the Ray cluster launcher with the AWS Python SDK.

Using the Cluster Management CLI

Install Ray cluster launcher

The Ray cluster launcher is part of the ray CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as ray up, ray down, and ray attach. You can use pip to install the ray CLI with cluster launcher support. Follow the Ray installation documentation for more detailed instructions.

# install ray
pip install -U ray[default]

Install and Configure AWS Python SDK (Boto3)

Next, install the AWS SDK using pip install -U boto3 and configure your AWS credentials following the AWS guide. Boto3 searches a list of possible locations for credentials and stops as soon as it finds them. The search order is:

  1. Passing credentials as parameters in the boto3.client() method
  2. Passing credentials as parameters when creating a Session object
  3. Environment variables
    • This is my preferred method
  4. Shared credential file (~/.aws/credentials)
  5. AWS config file (~/.aws/config)
  6. Assume Role provider
  7. Boto2 config file (/etc/boto.cfg and ~/.boto)
  8. Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
# install AWS Python SDK (boto3)
pip install -U boto3
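
The first-match lookup described in the list above can be sketched with the standard library alone. This is an illustration of the idea, not Boto3's internal code: it covers only two of the eight locations (environment variables and the shared credentials file), and the function names resolve_credentials, from_environment, and from_shared_file are hypothetical helpers introduced for this example.

```python
import configparser
import os
from pathlib import Path


def from_environment():
    """Return credentials from environment variables, or None if unset."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"access_key": key, "secret_key": secret, "source": "env"}
    return None


def from_shared_file(path=None, profile="default"):
    """Return credentials from a shared credentials file, or None."""
    path = Path(path) if path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        key = parser[profile].get("aws_access_key_id")
        secret = parser[profile].get("aws_secret_access_key")
        if key and secret:
            return {"access_key": key, "secret_key": secret, "source": "shared-file"}
    return None


def resolve_credentials(providers):
    """Try each provider in order and stop at the first one that succeeds."""
    for provider in providers:
        credentials = provider()
        if credentials is not None:
            return credentials
    return None


if __name__ == "__main__":
    found = resolve_credentials([from_environment, from_shared_file])
    print(found["source"] if found else "no credentials found")
```

Because the chain stops at the first hit, credentials set in environment variables take precedence over anything stored in ~/.aws/credentials.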

And then in the .env file:

# AWS Credentials

To create, modify, or delete your own access keys, choose your user name in the upper right of the navigation bar in the AWS console, and then choose Security credentials. There you can create an access key. Then, to retrieve temporary credentials using AWS STS:

import boto3
from decouple import config


def get_aws_sts():
    """Get temporary AWS STS credentials."""
    # Read the long-lived access key from the .env file.
    access_key = config('AWS_ACCESS_KEY_ID')
    secret_access_key = config('AWS_SECRET_ACCESS_KEY')
    client = boto3.client(
        'sts',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_access_key,
    )
    response = client.get_session_token()
    expiry = response['Credentials']['Expiration']
    print(f"Credentials expire at {expiry}")
    return response['Credentials']


if __name__ == '__main__':
    get_aws_sts()

Start Ray with the Ray cluster launcher

Once Boto3 is configured to manage resources in your AWS account, you should be ready to launch your cluster using the cluster launcher. The example cluster config file provided by Ray creates a small cluster with an on-demand m5.large head node, configured to autoscale up to two m5.large spot-instance workers. Test that it works by running the following commands from your local machine:

# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up aws/cluster.yaml --no-config-cache
# Get a remote shell on the head node.
ray attach aws/cluster.yaml
# Try running a Ray program.
python -c 'import ray; ray.init()'
# Tear down the cluster.
ray down aws/cluster.yaml


By default, Ray nodes in a Ray AWS cluster have full EC2 and S3 permissions (i.e. arn:aws:iam::aws:policy/AmazonEC2FullAccess and arn:aws:iam::aws:policy/AmazonS3FullAccess). This is a good default for trying out Ray clusters, but you may want to change the permissions that Ray nodes have (e.g. to reduce them for security reasons). You can do so by providing a custom IamInstanceProfile in the node_config of the relevant node type.
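
As a rough sketch of where the custom IamInstanceProfile goes, the helper below patches a cluster config that has already been loaded into a Python dict. The node type name ray.worker.default, the account ID, and the profile name are assumptions for illustration, and with_instance_profile is a hypothetical helper, not part of Ray.

```python
import copy
import json


def with_instance_profile(cluster_config, node_type, profile_arn):
    """Return a copy of cluster_config whose node_type uses profile_arn."""
    updated = copy.deepcopy(cluster_config)
    node = updated.setdefault("available_node_types", {}).setdefault(node_type, {})
    # EC2 expects the instance profile as a mapping with an Arn key.
    node.setdefault("node_config", {})["IamInstanceProfile"] = {"Arn": profile_arn}
    return updated


if __name__ == "__main__":
    # Hypothetical values: replace the account ID and profile name with your own.
    base = {
        "available_node_types": {
            "ray.worker.default": {"node_config": {"InstanceType": "m5.large"}}
        }
    }
    arn = "arn:aws:iam::123456789012:instance-profile/my-ray-worker-profile"
    print(json.dumps(with_instance_profile(base, "ray.worker.default", arn), indent=2))
```

The printed structure corresponds to the nesting you would write by hand in the cluster YAML: available_node_types, then the node type, then node_config, then IamInstanceProfile with its Arn.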


Ray Serve: Kubernetes using the KubeRay RayService

For production, the Ray Serve docs recommend deploying on Kubernetes using the RayService controller that’s provided as part of KubeRay. This setup combines the user experience and scalable compute of Ray Serve with the operational benefits of Kubernetes, and it lets you integrate with existing applications that may already be running on Kubernetes. The RayService custom resource automatically handles important production requirements such as health checking, status reporting, failure recovery, and upgrades. If you’re not running on Kubernetes, you can also run Ray Serve on a Ray cluster directly using the Serve CLI. To do this, generate a Serve config file and deploy it using the Serve CLI.

A RayService Custom Resource (CR) encapsulates a multi-node Ray Cluster and a Serve application that runs on top of it into a single Kubernetes manifest. Deploying, upgrading, and getting the status of the application can be done using standard kubectl commands.

The Serve Config File

For apps we are building, in development, we would likely use the serve run command to iteratively run, develop, and repeat (see the Development Workflow for more information). When we’re ready to go to production, we will generate a structured config file that acts as the single source of truth for the application.

You can use the Serve config with the serve deploy CLI command to deploy on a VM, or embed it in a RayService custom resource in Kubernetes, to deploy and update your application in production. The config is a YAML file with the following format:

proxy_location: ...
http_options:
  host: ...
  port: ...
  request_timeout_s: ...
  keep_alive_timeout_s: ...
grpc_options:
  port: ...
  grpc_servicer_functions: ...
applications:
- name: ...
  route_prefix: ...
  import_path: ...
  runtime_env: ...
  deployments:
  - name: ...
    num_replicas: ...
  - name:
The file contains proxy_location, http_options, grpc_options, and applications. See details about each field in the docs.

We can also auto-generate this config file from the code. The serve build command takes an import path to your deployment graph and it creates a config file containing all the deployments and their settings from the graph. You can tweak these settings to manage your deployments in production.

Note that the runtime_env field will always be empty when using serve build and must be set manually. In my case, if modin or QuantLib are not installed globally, these pip packages must be added to the runtime_env.
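
Setting runtime_env by hand can also be scripted. The sketch below assumes the Serve config has already been loaded into a Python dict (loading and dumping YAML needs a third-party library such as PyYAML and is left out), and that runtime_env uses the list form of the pip field. add_pip_packages is a hypothetical helper for this example.

```python
import copy


def add_pip_packages(serve_config, packages):
    """Return a copy of serve_config with packages added to every application's runtime_env."""
    updated = copy.deepcopy(serve_config)
    for application in updated.get("applications", []):
        # serve build leaves runtime_env empty, so start from an empty dict if needed.
        runtime_env = application.get("runtime_env") or {}
        existing = list(runtime_env.get("pip", []))
        runtime_env["pip"] = existing + [p for p in packages if p not in existing]
        application["runtime_env"] = runtime_env
    return updated


if __name__ == "__main__":
    generated = {"applications": [{"name": "app1", "runtime_env": {}}]}
    print(add_pip_packages(generated, ["modin", "QuantLib"]))
```

The function returns a new dict instead of mutating its input, so the generated config stays available for comparison.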

This config file can be generated using serve build:

serve build app.main:app -o serve_config.yaml

For me, the generated config file looks like this:

# This file was generated using the `serve build` command on Ray v2.7.0.
proxy_location: EveryNode
http_options:
  port: 8000
grpc_options:
  port: 9000
  grpc_servicer_functions: []
applications:
  - name: app1
    route_prefix: /
    import_path: app.main:app
    runtime_env: {}
    deployments:
      - name: PGMaster
      - name: API

The generated version of this file contains an import_path, runtime_env, and configuration options for each deployment in the application. The application needs additional packages, so modify the runtime_env field of the generated config to include the required pip packages. Save this config locally in serve_config.yaml:

# This file was generated using the `serve build` command on Ray v2.7.0.
proxy_location: EveryNode
http_options:
  port: 8000
grpc_options:
  port: 9000
  grpc_servicer_functions: []
applications:
  - name: app1
    route_prefix: /
    import_path: app.main:app
    runtime_env:
      pip:
        - QuantLib
        - asyncpg
        - numpy
        - modin
    deployments:
      - name: PGMaster
      - name: QuadraAPI

You can use serve deploy to deploy the application to a local Ray cluster and serve status to get the status at runtime:

# Start a local Ray cluster.
ray start --head
# Start the application.
serve deploy serve_config.yaml
# Check the application status.
serve status

And to stop the ray cluster:

# Stop the application.
serve shutdown
# Stop the local Ray cluster.
ray stop

To update the application, modify the config file and use serve deploy again.

Deploying on Kubernetes using KubeRay

KubeRay is a powerful, open-source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. Read more in the docs. It offers three custom resource definitions (CRDs):

  • RayCluster: KubeRay fully manages the lifecycle of RayCluster, including cluster creation/deletion, autoscaling, and ensuring fault tolerance.
  • RayJob: With RayJob, KubeRay automatically creates a RayCluster and submits a job when the cluster is ready. You can also configure RayJob to automatically delete the RayCluster once the job finishes.
  • RayService: RayService is made up of two parts: a RayCluster and Ray Serve deployment graphs. RayService offers zero-downtime upgrades for RayCluster and high availability.

We will deploy a Ray Serve application using a RayService.

1. Create a Kubernetes cluster with Kind

First, create a Kubernetes cluster with Kind for local development:

kind create cluster --image=kindest/node:v1.23.0

2. Install the KubeRay operator

Install the KubeRay operator from its Helm repository.

$ helm repo add kuberay
$ helm repo update
# Install both CRDs and KubeRay operator v1.0.0-rc.0.
$ helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0-rc.0
# Confirm that the operator is running in the namespace `default`.
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
kuberay-operator-68cc555c9-qc7cf   1/1     Running   0          22s

3. Set up a RayService Custom Resource (CR)

A RayService manages two components:

  • RayCluster: Manages resources in a Kubernetes cluster.
  • Ray Serve Applications: Manages users’ applications.

RayService provides:

  • Kubernetes-native support for Ray clusters and Ray Serve applications: After using a Kubernetes config to define a Ray cluster and its Ray Serve applications, you can use kubectl to create the cluster and its applications.
  • In-place updates for Ray Serve applications: Users can update the Ray Serve config in the RayService CR config and use kubectl apply to update the applications.
  • Zero downtime upgrades for Ray clusters: Users can update the Ray cluster config in the RayService CR config and use kubectl apply to update the cluster. RayService temporarily creates a pending cluster and waits for it to be ready, then switches traffic to the new cluster and terminates the old one.
  • Services HA: RayService monitors the Ray cluster and Serve deployments' health statuses. If RayService detects an unhealthy status for a period of time, RayService tries to create a new Ray cluster and switch traffic to the new cluster when it's ready.

So, to manage the Ray Serve application, create and update a RayService CR. To try the demo RayService CR from the QuickStart, run this:

# Step 3.1: Download `ray_v1alpha1_rayservice.yaml`
curl -LO
# Step 3.2: Create a RayService
kubectl apply -f ray_v1alpha1_rayservice.yaml

kubectl apply creates the underlying Ray cluster, consisting of a head and worker node pod (see Ray Clusters Key Concepts for more details on Ray clusters), as well as the service that can be used to query our application.

For custom apps, we will need to generate a Serve config file and embed it in a RayService CR for Kubernetes to deploy and update the application in production. To understand this YAML file better, see the docs.

4. Check RayService Status

When the RayService is created, the KubeRay controller first creates a Ray cluster using the provided configuration. Then, once the cluster is running, it deploys the Serve application to the cluster using the REST API. The controller also creates a Kubernetes Service that can be used to route traffic to the Serve application.

# Step 4.1: List all RayService custom resources in the `default` namespace.
$ kubectl get rayservice
NAME                AGE
rayservice-sample   3m58s
# Step 4.2: List all RayCluster custom resources in the `default` namespace.
$ kubectl get raycluster
NAME                                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
rayservice-sample-raycluster-9gr8f   1                 1                   ready    4m44s
# Step 4.3: List all Ray Pods in the `default` namespace.
$ kubectl get pods
NAME                                                          READY   STATUS    RESTARTS   AGE
rayservice-sample-raycluster-9gr8f-worker-small-group-tnl77   1/1     Running   0          4m59s
rayservice-sample-raycluster-9gr8f-head-5d65p                 1/1     Running   0          4m59s
# Step 4.4: List services in the `default` namespace.
$ kubectl get services
NAME                                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                   AGE
kuberay-operator                              ClusterIP   <none>        8080/TCP                                                  7m33s
kubernetes                                    ClusterIP       <none>        443/TCP                                                   8m20s
rayservice-sample-head-svc                    ClusterIP    <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   91s
rayservice-sample-raycluster-9gr8f-head-svc   ClusterIP   <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   5m29s
rayservice-sample-serve-svc                   ClusterIP    <none>        8000/TCP                                                  91s

When the Ray Serve applications are healthy and ready, KubeRay creates a head service and a Ray Serve service for the RayService custom resource. For example, rayservice-sample-head-svc and rayservice-sample-serve-svc in Step 4.4. Note that the rayservice-sample-serve-svc is the one that can be used to send queries to the Serve application – this will be used in the next section.

Users can access the head Pod through both the head service managed by RayService (that is, rayservice-sample-head-svc) and the head service managed by RayCluster (that is, rayservice-sample-raycluster-9gr8f-head-svc). However, during a zero downtime upgrade, a new RayCluster is created, along with a new head service for it. If you don't use rayservice-sample-head-svc, you need to update the ingress configuration to point to the new head service. If you do use rayservice-sample-head-svc, KubeRay automatically updates the selector to point to the new head Pod, eliminating the need to update the ingress configuration.

5. Querying the Application

Once the RayService is running, we can query it over HTTP using the service created by the KubeRay controller. This service can be queried directly from inside the cluster, but to access it from your laptop you’ll need to configure a Kubernetes ingress or use port forwarding as below:

kubectl port-forward service/rayservice-sample-serve-svc 8000

Forward the dashboard port to localhost as well, and check the Serve page in the Ray dashboard at http://localhost:8265/#/serve:

kubectl port-forward svc/rayservice-sample-head-svc 8265:8265

For example, you can call the Fruit demo app with:

curl -X POST -H 'Content-Type: application/json' http://localhost:8000/fruit/ -d '["MANGO", 2]'
# Output: 6
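
The same query can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; it assumes the port-forward on localhost:8000 is active, and the function names build_fruit_request and query_fruit are hypothetical.

```python
import json
import urllib.request


def build_fruit_request(base_url="http://localhost:8000"):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps(["MANGO", 2]).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/fruit/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_fruit(base_url="http://localhost:8000", timeout=10):
    """Send the request and return the response body as text."""
    request = build_fruit_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling query_fruit() while the port-forward is running should return the same response body that curl prints. Building the request separately from sending it makes it possible to inspect the payload without a running cluster.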

The default ports and their definitions are:

Port    Definition
6379    Ray GCS
8265    Ray Dashboard
10001   Ray Client
8000    Ray Serve
52365   Ray Dashboard Agent

6. Getting the status of the application

As the RayService is running, the KubeRay controller continually monitors it and writes relevant status updates to the CR. You can view the status of the application using kubectl describe. This includes the status of the cluster, events such as health check failures or restarts, and the application-level statuses reported by serve status:

$ kubectl get rayservices
NAME                AGE
rayservice-sample   7m59s
$ kubectl describe rayservice rayservice-sample
Name:         rayservice-sample
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:
  Type    Reason                       Age                     From                   Message
  ----    ------                       ----                    ----                   -------
  Normal  Running                      3m45s (x13 over 4m6s)   rayservice-controller  The Serve applicaton is now running and healthy.

7. Updating the application

To update the RayService, modify the manifest and apply it using kubectl apply. There are two types of updates that can occur:

  • Application-level updates: when only the Serve config options are changed, the update is applied in-place on the same Ray cluster. This enables lightweight updates such as scaling a deployment up or down or modifying autoscaling parameters.
  • Cluster-level updates: when the RayCluster config options are changed, such as updating the container image for the cluster, it may result in a cluster-level update. In this case, a new cluster is started, and the application is deployed to it. Once the new cluster is ready, the Kubernetes service is updated to point to the new cluster and the previous cluster is terminated. There should not be any downtime for the application, but note that this requires the Kubernetes cluster to be large enough to schedule both Ray clusters.
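
The two update types above can be told apart by looking at which part of the RayService spec changed. The sketch below is an illustration of that rule only, not KubeRay's actual controller logic. It assumes the spec keeps the Serve config under serveConfigV2 and the cluster settings under rayClusterConfig, and classify_update is a hypothetical helper.

```python
def classify_update(old_spec, new_spec):
    """Classify a RayService spec change following the two cases above."""
    # Assumed field names: rayClusterConfig for the cluster, serveConfigV2 for Serve.
    cluster_changed = old_spec.get("rayClusterConfig") != new_spec.get("rayClusterConfig")
    serve_changed = old_spec.get("serveConfigV2") != new_spec.get("serveConfigV2")
    if cluster_changed:
        # A cluster config change may start a new cluster next to the old one.
        return "cluster-level"
    if serve_changed:
        # A Serve-only change is applied in place on the same cluster.
        return "application-level"
    return "no-change"


if __name__ == "__main__":
    old = {"serveConfigV2": "num_replicas: 1", "rayClusterConfig": {"image": "a"}}
    new = {"serveConfigV2": "num_replicas: 2", "rayClusterConfig": {"image": "a"}}
    print(classify_update(old, new))
```

A check like this can be handy before applying a manifest, since a cluster-level update needs enough spare capacity in the Kubernetes cluster to schedule both Ray clusters at once.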

In the Text ML example, change the language of the Translator in the Serve config to German:

- name: Translator
  num_replicas: 1
  user_config:
    language: german

Now to update the application we apply the modified manifest:

kubectl apply -f ray-service.text-ml.yaml
kubectl describe rayservice rayservice-sample

The process of updating the RayCluster config is the same as updating the Serve config. For example, we can update the number of worker nodes to 2 in the manifest:

  # the number of pods in the worker group.
  - replicas: 2

8. Clean up

Clean up the Kubernetes cluster

# Delete the RayService.
kubectl delete -f ray_v1alpha1_rayservice.yaml
# Uninstall the KubeRay operator.
helm uninstall kuberay-operator
# Delete the curl Pod.
kubectl delete pod curl

Deploy on VM

You can deploy your Serve application to production on a Ray cluster using the Ray Serve CLI. serve deploy takes a config file path and deploys that file to a Ray cluster over HTTP. This can be either a local, single-node cluster or a remote, multi-node cluster started with the Ray Cluster Launcher. See more in the docs.