Cloud & MLOps ☁️

Cloud & MLOps Knowledge Repository

This section of my knowledge repo houses my notes and insights from the AWS certifications I've earned over the years. It traces my journey chronologically, from the foundational AWS Certified Cloud Practitioner, to the more specialized AWS Certified Machine Learning - Specialty, and finishes with details of MLOps using Ray.


This section covers topics related to AWS, including the various services offered by the cloud provider. It was initially based on notes from the certifications I've taken, as well as notes I've made along the way while working with AWS.

AWS Certified Cloud Practitioner

The AWS Certified Cloud Practitioner certification is an intro to the AWS Cloud, its services, and the terminology around them. It's the ideal stepping stone for newcomers to IT or the cloud, and it provides foundational literacy for professionals in various lines of business. Naturally, it was where I began when starting out with AWS.

AWS Certified Machine Learning Specialty

The AWS Certified Machine Learning - Specialty was the next accreditation I took. It's used to vouch for one's ability to build, train, tune, and deploy ML models on AWS, and it can also signal an organization's commitment to cloud initiatives. This certification targets professionals in development or data science roles.

MLOps with Ray

While the ML code often sits at the heart of discussions, it just scratches the surface of the broader architecture it is part of. This complexity was outlined in the seminal paper by Sculley et al., aptly titled "Hidden Technical Debt in Machine Learning Systems."

MLOps Overview

This paper helped motivate the idea of MLOps, a concept that balances the mainstream fascination with model accuracy against the critical backdrop of the systems necessary for deploying and maintaining these models in production. OpenAI's analysis "AI and Compute" suggests that the amount of compute used in the largest AI training runs has doubled roughly every 3.4 months since 2012. Keeping up with that growth means spreading work across many machines. However, distributed systems are hard to program and come with a host of challenges, ranging from communication and scheduling to security and transparency.
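To get a feel for what a fixed doubling period implies, here is a small arithmetic sketch. The function name `growth_factor` and the doubling periods used are illustrative round numbers I chose for the example, not figures taken from the analysis.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after `elapsed_months` for a fixed doubling period."""
    return 2.0 ** (elapsed_months / doubling_months)


# Illustrative round numbers only (assumptions, not data from the analysis):
fast = growth_factor(12, 3)   # doubling every 3 months: 2**4 per year
slow = growth_factor(12, 24)  # doubling every 24 months: 2**0.5 per year

print(f"fast: {fast:.2f}x per year, slow: {slow:.2f}x per year")
```

Swapping in a different doubling period shows how quickly a short one compounds over a few years, which is the practical reason single machines stop being enough.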

That's what this section, which centers on Ray, is for. Ray is an open-source framework designed to simplify distributed computing for developers, making it easy to run Python code on clusters.