Ray libraries

Here is the stack of Ray libraries:

Figure: the stack of Ray libraries - a unified toolkit for ML workloads

There are four layers to Ray's unified compute framework:

  1. Ray cluster
  2. Ray Core
  3. Ray AI Runtime (native libraries)
  4. Integrations and ecosystem

Ray cluster

ℹ️

Ray cluster is a set of worker nodes connected to a common Ray head node. Ray clusters can be fixed-size, or they can autoscale up and down according to the resources requested by applications running on the cluster.

Starting at the bottom with a cluster, Ray sets up and manages cloud-agnostic clusters of computers so that you can run distributed applications on them. You can deploy a Ray cluster on AWS, GCP, or Kubernetes via the officially supported KubeRay project.
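As a minimal sketch of how an application attaches to such a cluster (assuming Ray is installed, e.g. via pip install ray), ray.init() with no arguments starts a local, single-node Ray instance, while passing an address connects to an existing cluster:

```python
import ray

# With no address, this starts a single-node Ray instance on the local machine.
# On an existing cluster, connect instead with e.g. ray.init(address="auto")
# when running from one of the cluster's nodes.
ray.init()

# Inspect the resources the cluster (here: this machine) exposes.
print(ray.cluster_resources())  # e.g. {'CPU': 8.0, 'memory': ..., ...}
```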

Ray Core

ℹ️

Ray Core is an open-source, general-purpose, distributed computing library that enables ML engineers and Python developers to scale Python applications and accelerate machine learning workloads.

Ray Core is the foundation that Ray's ML libraries (Ray AIR) and third-party integrations (Ray ecosystem) are built on. This library enables Python developers to easily build scalable, distributed systems that can run on a laptop, a cluster, the cloud, or Kubernetes.

Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications. Expanding on the key abstractions mentioned before (a short code sketch follows the list):

  1. Tasks: Remote, stateless Python functions.
    • Ray tasks are arbitrary Python functions that are executed asynchronously on separate Python workers across a Ray cluster's nodes. Users can specify their resource requirements in terms of CPUs, GPUs, and custom resources, which the cluster scheduler uses to distribute tasks for parallelized execution.
  2. Actors: Remote, stateful Python classes.
    • What tasks are to functions, actors are to classes: actors extend the Ray API from functions (tasks) to classes. An actor is a stateful worker (or a service), and the methods of an actor are scheduled on that specific worker, where they can access and mutate its state. Like tasks, actors support CPU, GPU, and custom resource requirements.
  3. Objects: In-memory, immutable objects or values that can be accessed anywhere in the computing cluster.
    • In Ray, tasks and actors create and compute on objects. These remote objects can be stored anywhere in a Ray cluster; object references are used to refer to them, and the objects themselves are cached in Ray's distributed shared-memory object store.
  4. Placement Groups
    • Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can then be used to schedule Ray tasks and actors packed as close together as possible for locality (PACK) or spread apart (SPREAD).
    • Placement groups are generally used for gang-scheduling actors, but they also support tasks.
  5. Environment Dependencies
    • When Ray executes tasks and actors on remote machines, their environment dependencies (e.g., Python packages, local files, environment variables) must be available for the code to run. To address this problem, you can (1) prepare your dependencies on the cluster in advance using the Ray Cluster Launcher, or (2) use Ray's runtime environments to install them on the fly.
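To make these primitives concrete, here is a minimal, hedged sketch using Ray's public API (ray.remote, ray.get, and ray.util.placement_group); the function and class names are invented for the example:

```python
import ray
from ray.util.placement_group import placement_group

# Dependencies can also be shipped on the fly with a runtime environment,
# e.g. ray.init(runtime_env={"pip": ["requests"]}).
ray.init()

# 1. Task: a stateless function, executed asynchronously on a remote worker.
@ray.remote
def square(x):
    return x * x

# 2. Actor: a stateful class; its methods run on one dedicated worker
#    and can access and mutate that worker's state.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# 3. Objects: .remote() immediately returns object references; the values
#    live in the distributed object store until fetched with ray.get().
refs = [square.remote(i) for i in range(4)]
print(ray.get(refs))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1

# 4. Placement group: atomically reserve two 1-CPU bundles, packed together.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # blocks until the reservation is granted
```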

Ray AI Runtime

ℹ️

Ray AI Runtime (AIR) is an open-source, domain-specific set of Python libraries that equips ML engineers, data scientists, and researchers with a scalable and unified toolkit for ML applications.

Ray AIR is built on top of Ray Core and focuses on distributing both individual ML workloads and end-to-end machine learning workflows.

Ray AIR enables end-to-end ML development and provides multiple options to integrate with other tools and libraries from the MLOps ecosystem. Each of the five native libraries that Ray AIR wraps distributes a specific ML task:

  1. Ray Data: Scalable, framework-agnostic data loading and transformation across training, tuning, and prediction.
  2. Ray Train: Distributed multi-node and multi-core model training with fault tolerance that integrates with popular training libraries.
  3. Ray Tune: Scales hyperparameter tuning to optimize model performance.
  4. Ray Serve: Deploys models for online inference, with optional microbatching to improve performance.
  5. Ray RLlib: Distributed reinforcement learning workloads that integrate with the other Ray AIR libraries.

For more detail and a walkthrough, see this page.
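As a small taste of how these libraries feel in practice, here is a minimal, hedged Ray Tune sketch (assuming a Ray 2.x release with the AIR APIs; the objective function and search space are invented for the example):

```python
from ray import tune
from ray.air import session

# Toy objective: Tune searches for the x that minimizes (x - 3)^2.
def objective(config):
    score = (config["x"] - 3) ** 2
    session.report({"score": score})  # report the metric back to Tune

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(0, 10)},
    tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)  # best x found, close to 3
```

The same Tuner pattern scales out across a cluster without code changes, which is the point of building these libraries on Ray Core.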

Integrations and ecosystem libraries

Ray integrates with a rich ecosystem of the most popular distributed computation libraries, machine learning libraries, and frameworks. Instead of trying to impose a new standard, Ray allows you to scale existing workloads by unifying tools behind a common interface. This interface enables you to run ML tasks in a distributed way, a property most of the respective backends don't have, or not to the same extent.

Ray also provides an easy way to "glue" and integrate distributed libraries, with code that retains the look and feel of standard Python.

Shifting the focus towards libraries means users can ignore the complexities of distributed computation and instead zero in on the algorithmic aspects. Drawing parallels, this is reminiscent of third-generation GPU architectures, when tools like Unity or Unreal Engine 4 became popular: the intricacies of the GPU were de-emphasized in favor of the engines' programmability and versatility.