MLOps: Bridging the Gap Between ML and DevOps

MLOps Overview

In a production machine learning (ML) system, the ML code is often only a small part of the overall architecture, depicted as the central black box. Much of the difficulty lies in the surrounding infrastructure, which is large and complex. This view comes from the influential 2015 paper by D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems."

The concept of MLOps grew out of this Google research. It underscores that while popular discussion and media coverage gravitate toward model accuracy (represented by the central black box), far less attention goes to the many systems required to deploy and sustain a model in production.

MLOps isn't merely about technology integration; it requires a profound organizational transformation. This involves a coordinated blend of teams, processes, and technology to launch ML solutions. The ideal setup ensures robustness, scalability, and reliability while streamlining collaboration across teams.

However, adopting MLOps is not without challenges. The regulatory landscape and the inherent opacity of some algorithms, especially in deep learning, complicate auditability and explainability. The pursuit of a robust ML governance structure can entangle enterprises in bureaucracy, potentially stalling ML adoption in specific sectors or applications.

AI Compute

The amount of compute used in the largest AI training runs far outpaces the processing power of individual CPUs, GPUs, and TPUs. Original diagram from OpenAI, with overlaid annotations.

Distributed computing is becoming increasingly relevant for modern machine learning systems as well. OpenAI's analysis AI and Compute suggests that the amount of compute used in the largest AI training runs has roughly doubled every 3.5 months since 2012.
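To get a feel for what such a doubling time implies, here is a small sketch of the compounding arithmetic. The function name `growth_factor` and the one-year time span are illustrative assumptions; the 3.5-month figure is the one quoted above.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after `elapsed_months` for a quantity
    that doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


DOUBLING_MONTHS = 3.5  # doubling time quoted in the text above

# Growth over a single year at that doubling rate.
per_year = growth_factor(12, DOUBLING_MONTHS)
print(f"Growth over one year: about {per_year:.1f}x")
```

At this rate compute demand grows by roughly an order of magnitude per year, which is why a single machine quickly stops being enough.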

However, distributed systems are hard to program. Scaling a Python application to a cluster introduces challenges in communication, scheduling, security, failure handling, heterogeneity, transparency, and much more.

The exponential growth of data and model sizes means distributed computing is fast becoming a staple of ML architectures. Yet obstacles persist. Many ML libraries operate in isolation and are not well integrated with one another. For example, you can use Spark or Dask to scale data preprocessing, then Horovod or PyTorch to scale training. Such disjointed tooling can lead to compatibility issues (such as Java versus Python) and little interoperability among the libraries. As a consequence, enterprises often store intermediate results on disk or in cloud storage and turn to workflow orchestrators to stitch these separate steps into a single process.
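The disk handoff described above can be sketched with the standard library alone. This is a toy illustration, not how Spark, Dask, Horovod, or PyTorch are actually invoked: `preprocess`, `train`, and `run_pipeline` are hypothetical stand-ins for a preprocessing job, a training job, and a workflow orchestrator.

```python
import json
import tempfile
from pathlib import Path


def preprocess(raw_rows, out_path: Path) -> None:
    """Stage 1 (stand-in for a preprocessing job): drop incomplete rows
    and persist the cleaned data to disk."""
    cleaned = [
        {"x": float(r["x"]), "y": float(r["y"])}
        for r in raw_rows
        if r.get("x") is not None and r.get("y") is not None
    ]
    out_path.write_text(json.dumps(cleaned))


def train(in_path: Path) -> float:
    """Stage 2 (stand-in for a training job): reload the data from disk
    and fit y = w * x by least squares."""
    rows = json.loads(in_path.read_text())
    denominator = sum(r["x"] ** 2 for r in rows)
    if denominator == 0:
        raise ValueError("cannot fit: all x values are zero")
    return sum(r["x"] * r["y"] for r in rows) / denominator


def run_pipeline(raw_rows) -> float:
    """Stand-in for an orchestrator: runs the stages in order and
    connects them through an intermediate file."""
    with tempfile.TemporaryDirectory() as tmp:
        handoff = Path(tmp) / "cleaned.json"
        preprocess(raw_rows, handoff)
        return train(handoff)


raw = [{"x": 1, "y": 2}, {"x": None, "y": 5}, {"x": 2, "y": 4}, {"x": 3, "y": 6}]
weight = run_pipeline(raw)
print(f"fitted weight: {weight}")
```

The two stages share nothing but a file path, so every handoff costs a serialization round trip and the ordering logic lives outside both stages.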

This context drove the development of Ray: a solution to enable developers to run Python code on clusters without having to think about how to orchestrate and utilize individual machines.
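The programming model this points to is task-based: submit ordinary Python functions, receive futures, and collect the results. The sketch below illustrates that pattern on a single machine using only the standard library's `concurrent.futures`. It is a stand-in and not Ray's API; `square`, `run_tasks`, and the worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n: int) -> int:
    """A trivial unit of work standing in for a real task."""
    return n * n


def run_tasks(values, max_workers: int = 4):
    """Submit one task per value and gather results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(square, v) for v in values]
        return [f.result() for f in futures]


results = run_tasks(range(5))
print(results)
```

The caller never decides which worker runs which task. A cluster framework extends the same idea from a local pool to many machines, which is where the communication, scheduling, and failure-handling challenges listed earlier come in.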