Deployment, Implementations and Operations

It's one thing to build a machine learning model and train it offline, but how do you deploy it into production? Your models need not only to scale and perform reliably, but to be secure as well.

We'll talk about accelerating your machine learning systems using Elastic Inference, and about pushing your models to the edge and onto devices using SageMaker Neo. We'll also dive into the intersection of SageMaker and AWS security: how SageMaker interacts with IAM, KMS and private VPCs. Finally, we'll talk about choosing appropriate EC2 instance types for SageMaker and how to run A/B tests in a production environment to try out new models on real-world data at scale.

SageMaker and Docker Containers

All models in SageMaker are hosted in Docker containers that are registered with ECR (Elastic Container Registry). These containers can hold a number of things:
- Pre-built deep learning
- Pre-built scikit-learn and Spark ML
- Pre-built TensorFlow, MXNet, Chainer, PyTorch
  - Distributed training via Horovod or Parameter Servers
  - TensorFlow does not get distributed across multiple machines automatically.
  - If you need to distribute training across multiple machines, each of which might have multiple GPUs, there are two ways to do it: Horovod or Parameter Servers. These are terms worth remembering.
- Your own training and inference code! Or extend a pre-built image.
  - This allows you to use any script or algorithm within SageMaker, regardless of runtime or language
- Containers are isolated, and contain all dependencies and resources needed to run

Using Docker

Docker containers are created from images
Images are built from a Dockerfile
Images are saved in a Repository
- Amazon Elastic Container Registry
The Docker images are prepared ahead of time and stored in Amazon ECR.
- These can include both training code and inference code
For training, SageMaker pulls the training image from ECR and runs it as a container, pulling in training data from S3.
The resulting trained models (model artifacts) are then stored back in S3.
That S3 model artifact is then made accessible to the inference code in the model deployment stage.
A Docker container holding the inference code consumes the stored model artifacts to generate inferences in real time.
That model is deployed to a fleet of servers.
Endpoints are exposed for runtime usage from outside applications.

Structure of a Training Container

Everything is inside the /opt/ml directory:

- input/
  - config/
    - hyperparameters.json
    - resourceConfig.json
  - data/
    - <channel_name>/
      - <input data>
- model/
  - <model files>
- code/
  - <script files>

The model directory is used for deployment.
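As a rough illustration of how a training script might use this layout, the sketch below reads hyperparameters from /opt/ml/input/config/hyperparameters.json and writes a model file under /opt/ml/model. The function names, the `base_dir` parameter, the `epochs` and `learning_rate` hyperparameters, and the `model.json` file name are all hypothetical, and the "training" step is only a stand-in.

```python
import json
import os


def load_hyperparameters(base_dir="/opt/ml"):
    """Read the hyperparameters file SageMaker places in the container."""
    path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(path) as f:
        return json.load(f)


def train(base_dir="/opt/ml"):
    """Hypothetical training entry point; returns the path of the saved model."""
    hps = load_hyperparameters(base_dir)
    # Hyperparameter values arrive as strings, so cast them explicitly.
    epochs = int(hps.get("epochs", "1"))
    learning_rate = float(hps.get("learning_rate", "0.1"))

    # Stand-in for real training: the "model" is just the settings used.
    model = {"epochs": epochs, "learning_rate": learning_rate}

    # Anything written to the model directory is what gets packaged for deployment.
    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump(model, f)
    return model_path
```

Keeping `base_dir` as a parameter is a design choice that lets the same script run against a temporary directory outside of SageMaker.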

Library for making containers compatible with SageMaker

RUN pip install sagemaker-containers in your Dockerfile

Structure of your Docker image

WORKDIR
- nginx.conf
  - The configuration file for the nginx front end.
  - A web server runs at deployment time, and this file is how you configure it.
- predictor.py
  - The program that implements a Flask web server for making predictions at runtime.
  - You will need to customize this code to perform predictions in whatever way your application requires.
- serve/
  - Contains the deployment code.
  - The program in here is started when the container is started for hosting.
  - It just launches the gunicorn server, which runs multiple instances of the Flask application defined in your predictor.py script.
- train/
  - Contains the program that's invoked when you run the container for training.
- wsgi.py
  - A small wrapper that's used to invoke your Flask application for serving results.

You can have separate training and inference images if you want to, or you can combine them together into this structure. Assembling it all in a Dockerfile:

FROM tensorflow/tensorflow:2.0.0a0
RUN pip install sagemaker-containers
# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py
# Defines train.py as script entrypoint in opt/ml/code
ENV SAGEMAKER_PROGRAM train.py

Environment variables

But there are many other environment variables you might want to set as well.

SAGEMAKER_PROGRAM # Run a script inside /opt/ml/code
SAGEMAKER_TRAINING_MODULE
SAGEMAKER_SERVICE_MODULE
SM_MODEL_DIR # Model checkpoints are saved and pushed into S3
SM_CHANNELS # or SM_CHANNEL_<channel name> for a single channel
SM_HPS # or SM_HP_<hyperparameter name> for a single hyperparameter
SM_USER_ARGS

There are many more.
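To make these variables concrete, here is a minimal sketch of how a training script might read a few of them. It assumes that SM_HPS holds a JSON object, that SM_CHANNELS holds a JSON list of channel names, and that each channel's directory is exposed as SM_CHANNEL_ followed by the upper-cased channel name. The function name `read_sagemaker_env` and the shape of the returned dictionary are hypothetical.

```python
import json
import os


def read_sagemaker_env(env=None):
    """Collect a few SageMaker-provided settings into a plain dictionary."""
    env = os.environ if env is None else env
    # SM_CHANNELS is assumed to be a JSON list such as '["training", "validation"]'.
    channels = json.loads(env.get("SM_CHANNELS", "[]"))
    return {
        # Directory where the trained model should be written.
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        # SM_HPS is assumed to be a JSON object holding all hyperparameters.
        "hyperparameters": json.loads(env.get("SM_HPS", "{}")),
        # One input directory per channel; None if the variable is not set.
        "channel_dirs": {
            name: env.get("SM_CHANNEL_" + name.upper()) for name in channels
        },
    }
```

Passing a plain dictionary as `env` makes the function easy to exercise outside a container.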

Using your own image

cd dockerfile
!docker build -t foo .

from sagemaker.estimator import Estimator
 
estimator = Estimator(
    image_name='foo',
    role='SageMakerRole',
    train_instance_count=1,
    train_instance_type='local'
)
 
estimator.fit()

Production Variants

You can test out multiple models on live traffic using Production Variants
- Run 2 models in parallel to see which one performs better
- Variant Weights tell SageMaker how to distribute traffic among them
  - For example, you could roll out a new iteration of your model at a 10% variant weight
  - That sends 10% of traffic to the new model
- Once you're confident the new variant is performing better than the older one, ramp it up to 100% and discard the older production variant
This lets you run A/B tests and validate performance in real-world settings
Offline validation isn't always useful
- It's always risky to launch new code in general
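The variant-weight idea can be sketched as follows. The code builds a list in the shape of the `ProductionVariants` argument to the boto3 SageMaker `create_endpoint_config` call and works out the traffic share each variant receives (its weight divided by the sum of all weights). The helper names, model names, and instance type are hypothetical, and nothing here calls AWS.

```python
def build_production_variants(weights, instance_type="ml.m5.large", instance_count=1):
    """Build one variant entry per model; `weights` maps model name -> variant weight."""
    return [
        {
            "VariantName": f"{model_name}-variant",
            "ModelName": model_name,
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,
            "InitialVariantWeight": float(weight),
        }
        for model_name, weight in weights.items()
    ]


def traffic_shares(variants):
    """Fraction of traffic each variant receives: its weight over the total weight."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical A/B test: the existing model keeps 90% of traffic, the new one gets 10%.
variants = build_production_variants({"model-v1": 9, "model-v2": 1})
shares = traffic_shares(variants)
```

Because shares are relative, ramping the new model up is a matter of changing the weights, not redeploying either model.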

SageMaker on the Edge (IoT)

**IoT Core** collects data from each shared bike. IoT Analytics retrieves messages from the shared bikes as they stream data.

**IoT Analytics** also enriches the streaming data with your external data sources.

AWS IoT Greengrass seamlessly extends AWS to edge devices so they can act locally on the data they generate, while still using the cloud for management, analytics, and durable storage. With AWS IoT Greengrass, connected devices can run AWS Lambda functions, Docker containers, or both, execute predictions based on machine learning models, keep device data in sync, and communicate with other devices securely, even when not connected to the Internet.

SageMaker uses two strategies to search for the optimum hyperparameters for your model: Random Search and Bayesian Search. For most models, Bayesian Search requires fewer training jobs to reach your optimal hyperparameter settings.

Amazon SageMaker Neo enables developers to train machine learning models once and run them anywhere in the cloud and at the edge. Amazon SageMaker Neo optimizes models to run up to twice as fast, with less than a tenth of the memory footprint, with no loss in accuracy.

For training, the data needs to go through a series of conversions and transformations, including:

- Training data serialization (handled by you)
- Training data deserialization (handled by the algorithm)
- Trained model serialization (handled by the algorithm)
- Trained model deserialization (optional, handled by you)

SageMaker Neo

Train once, run anywhere
- Edge devices
  - ARM, Intel, Nvidia processors
  - Embedded in whatever, such as your car
  - The model can be deployed to run locally on the device itself.
  - This matters in applications where you care a lot about latency. A self-driving car, for example, can't wait several hundred milliseconds for a response from somewhere on the internet to decide whether or not to slam on the brakes.
Optimizes code for specific devices
- Neo compiles your inference code for edge devices and optimizes it for the specific hardware it will be embedded in, so you can train once and run anywhere
- TensorFlow, MXNet, PyTorch, ONNX, XGBoost
It supports multiple architectures for edge devices including ARM, Intel, and Nvidia processors.
Consists of a compiler and a runtime
- The compiler recompiles the model into the bytecode expected by those edge processors
- The runtime component runs on those edge devices to consume that Neo-generated code

Neo & AWS IoT Greengrass

Neo pairs well with AWS IoT Greengrass
Neo-compiled models can be deployed to an HTTPS endpoint
- Hosted on C5, M5, M4, P3, or P2 instances
- Must be same instance type used for compilation
OR! You can deploy to IoT Greengrass
- This is how you get the model to an actual edge device
- Inference at the edge with local data, using model trained in the cloud
- Uses Lambda inference applications
- Overall:
  - Neo compiles your trained model into specific architectures that might be deployed to the edge
  - IoT Greengrass is what gets it there.

SageMaker Security

General AWS Security

Use Identity and Access Management (IAM)
- Set up user accounts with only the permissions they need
Use MFA
Use SSL/TLS when connecting to anything
Use CloudTrail to log API and user activity
- CloudWatch - Monitoring logs, and setting alarms
- CloudTrail - Auditing, seeing what everyone did
Use encryption whenever appropriate
- Be careful with PII

Protecting your Data at Rest in SageMaker

AWS Key Management Service (KMS)
- You can provide a key managed by KMS to encrypt that data at rest.
- Accepted by notebooks and all SageMaker jobs
- That includes artifacts coming from training, tuning, batch transform, and from your endpoints
- Notebooks and everything under /opt/ml/ and /tmp can be encrypted with a KMS key as well
S3
- Can use encrypted S3 buckets for training data and hosting models
- S3 can also use KMS

Protecting Data in Transit in SageMaker

All traffic supports TLS / SSL within SageMaker
IAM roles are assigned to SageMaker to give it permissions to access resources
- Follow the principle of least privilege
Inter-node training communication may be optionally encrypted
- For training happening across multiple nodes, you can encrypt that traffic as well
- Can increase training time and cost with deep learning
- AKA inter-container traffic encryption
- Enabled via console or API when setting up a training or tuning job

SageMaker & VPCs

Training jobs run in a Virtual Private Cloud (VPC)
You can use a private VPC for even more security
- You'll need to set up S3 VPC endpoints to enable this communication
- Custom endpoint policies and S3 bucket policies can keep this secure
Notebooks are Internet-enabled by default
- This can be a security hole
- If disabled, your VPC needs an interface endpoint (PrivateLink) or a NAT Gateway, and must allow outbound connections, for training and hosting to work
Training and Inference Containers are also Internet-enabled by default
- Network isolation is an option, but this also prevents S3 access
- Will need to work around this somehow
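The encryption and VPC options above map onto a handful of fields in a training job request. The sketch below assembles only those fields, using the names from the boto3 SageMaker `create_training_job` call (`OutputDataConfig`, `ResourceConfig`, `VpcConfig`, `EnableInterContainerTrafficEncryption`, `EnableNetworkIsolation`). The helper name, instance settings, and all identifiers passed in are hypothetical placeholders, and a real request needs further fields such as the algorithm and role.

```python
def secure_training_job_settings(kms_key_id, subnets, security_group_ids, output_s3_uri):
    """Return the security-related portion of a training job request (not a full request)."""
    return {
        # Encrypt model artifacts written to S3 with a KMS key (data at rest).
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_id},
        # Encrypt the storage volume attached to the training instances.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # hypothetical choice
            "InstanceCount": 2,
            "VolumeSizeInGB": 50,
            "VolumeKmsKeyId": kms_key_id,
        },
        # Run the job inside your own VPC subnets and security groups.
        "VpcConfig": {
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Encrypt traffic between nodes (data in transit); may slow training.
        "EnableInterContainerTrafficEncryption": True,
        # Network isolation is left off here so the container keeps its network access.
        "EnableNetworkIsolation": False,
    }
```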

SageMaker & IAM

User permissions for:
- CreateTrainingJob
- CreateModel
- CreateEndpointConfig
- CreateTransformJob
- CreateHyperParameterTuningJob
- CreateNotebookInstance
- UpdateNotebookInstance
Predefined policies:
- AmazonSageMakerReadOnly
- AmazonSageMakerFullAccess
- AdministratorAccess
- DataScientist
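As an illustration of granting only the permissions a user needs, the sketch below builds an IAM policy document that allows a chosen subset of the SageMaker actions listed above. The helper name and the choice of actions are hypothetical, and the wildcard resource is a simplification that a real policy would likely narrow.

```python
import json


def build_sagemaker_policy(actions, resource="*"):
    """Build an IAM policy document allowing the given SageMaker API actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # IAM action names carry the service prefix, e.g. sagemaker:CreateModel.
                "Action": [f"sagemaker:{name}" for name in actions],
                "Resource": resource,
            }
        ],
    }


# Hypothetical role: someone who may train and create models but not manage notebooks.
policy = build_sagemaker_policy(["CreateTrainingJob", "CreateModel"])
policy_json = json.dumps(policy, indent=2)
```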

SageMaker Logging and Monitoring

CloudWatch can log, monitor and alarm on:
- Invocations and latency of endpoints
- Health of instance nodes (CPU, memory, etc)
- Ground Truth (active workers, how much they are doing)
CloudTrail records actions from users, roles, and services within SageMaker
- Log files delivered to S3 for auditing

Managing SageMaker Resources

Choosing your instance types

We covered this under "modeling", even though it's an operations concern
In general, algorithms that rely on deep learning will benefit from GPU instances (P2 or P3) for training
Inference is usually less demanding and you can often get away with compute instances there (C4, C5)
GPU instances can be really pricey

Managed Spot Training

Can use EC2 Spot instances for training
- Save up to 90% over on-demand instances
Spot instances can be interrupted!
- Use checkpoints to S3 so training can resume
Can increase training time as you need to wait for spot instance resources to become available
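A minimal sketch of the spot-related settings, using the field names from the boto3 SageMaker `create_training_job` call (`EnableManagedSpotTraining`, `StoppingCondition`, `CheckpointConfig`). The helper name and the example values are hypothetical. The check reflects the requirement that the maximum wait time, which includes time spent waiting for spot capacity, be at least the maximum run time.

```python
def spot_training_settings(max_run_seconds, max_wait_seconds, checkpoint_s3_uri):
    """Return the spot-related portion of a training job request (not a full request)."""
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            # Upper bound on actual training time.
            "MaxRuntimeInSeconds": max_run_seconds,
            # Upper bound on training time plus time spent waiting for spot capacity.
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        # Checkpoints go to S3 so an interrupted job can resume.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }
```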

Elastic Inference

By using Amazon Elastic Inference (EI), you can speed up the throughput and decrease the latency of getting real-time inferences from your deep learning models that are deployed as Amazon SageMaker hosted models, but at a fraction of the cost of using a GPU instance for your endpoint.

Accelerates deep learning inference
- At fraction of cost of using a GPU instance for inference
EI accelerators may be added alongside a CPU instance
- ml.eia1.medium / large / xlarge
EI accelerators may also be applied to notebooks
- Speedier experience
Works with TensorFlow and MXNet pre-built containers
- ONNX may be used to export models to MXNet to make them compatible
Works with custom containers built with EI-enabled TensorFlow or MXNet
Works with Image Classification and Object Detection built-in algorithms

SageMaker and Availability Zones

SageMaker automatically attempts to distribute instances across availability zones
But you need more than one instance for this to work!
Deploy multiple instances for each production endpoint, even if you only need 1
Configure VPCs with at least two subnets, each in a different AZ
- That way, deployments are spread across different Availability Zones, so if there's a catastrophic failure in one place, your application keeps running.

Automatic Scaling

When you deploy an inference model to production, you set up a scaling policy to define target metrics, min/max capacity, cooldown periods, etc.
Works with CloudWatch
- CloudWatch is a repository of performance metrics associated with your endpoints, which SageMaker can use to determine whether you have the right number of instances.
Dynamically adjusts number of instances for a production variant
- Automatically add inference nodes
Load test your configuration before using it!
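A scaling policy of this kind might be expressed as the two request payloads below, shaped for the boto3 Application Auto Scaling calls `register_scalable_target` and `put_scaling_policy`. The helper name, policy name, and numeric values are hypothetical. The sketch assumes target tracking on the predefined `SageMakerVariantInvocationsPerInstance` metric and does not call AWS.

```python
def build_autoscaling_requests(endpoint_name, variant_name, min_capacity,
                               max_capacity, target_invocations, cooldown_seconds=300):
    """Build payloads for registering a variant as scalable and attaching a policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Min/max capacity: bounds on the instance count for this production variant.
    scalable_target = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    scaling_policy = {
        **common,
        "PolicyName": f"{variant_name}-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Target metric: average invocations per instance.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Cooldown periods between scaling activities, in seconds.
            "ScaleInCooldown": cooldown_seconds,
            "ScaleOutCooldown": cooldown_seconds,
        },
    }
    return scalable_target, scaling_policy
```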

Serverless Inference

Takes Automatic Scaling to the next level.
Introduced for 2022
Specify your container, memory requirement, concurrency requirements
- Let AWS worry about the actual amount of hardware you need to make that happen
Underlying capacity is automatically provisioned and scaled
Good for infrequent or unpredictable traffic; will scale down to zero when there are no requests.
Charged based on usage
Monitor via CloudWatch
- ModelSetupTime
  - How long it took to deploy as it was being automatically scaled up or down.
- Invocations
  - Including, for example, the ones with errors
- MemoryUtilization
  - Shows whether the memory and concurrency requirements you specified for the containers were accurate.

Assuming that you give Serverless Inference good information about your containers and how they perform, it will automatically add and remove instances under the hood to handle the traffic that you're throwing at it.
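The memory and concurrency settings described above might be specified as in this sketch, which builds a production variant entry carrying a `ServerlessConfig` block (`MemorySizeInMB`, `MaxConcurrency`) in the shape used by the boto3 `create_endpoint_config` call. The helper name and variant name are hypothetical, and the list of allowed memory sizes (1 GB steps from 1024 to 6144 MB) is an assumption to check against current AWS documentation.

```python
# Assumed set of valid memory sizes in MB; verify against current AWS limits.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_mb, max_concurrency):
    """Build a production variant entry that uses serverless capacity."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",  # hypothetical name
        "ModelName": model_name,
        # No instance type or count: capacity is provisioned automatically.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }
```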

Amazon SageMaker Inference Recommender

Recommends best instance type & configuration for your models
- If you want to choose and deploy your own inference instance types, this at least gives you guidance on which types to use
Automates load testing & model tuning
Deploys to optimal inference endpoint from the results of those tests
How it works:
- Register your model to the model registry
- Benchmark different endpoint configurations
- Collect & visualize metrics to decide on instance types
- Existing models from zoos may have benchmarks already
Instance Recommendations
- Quick answer for the best instance for inference
- Runs load tests on recommended instance types
- Takes about 45 minutes
Endpoint Recommendations
- Custom load test
- You specify instances, traffic patterns, latency requirements, throughput requirements
- Takes about 2 hours
- More specific answer for your specific SLAs


Inference Pipelines

SageMaker Inference Pipelines allow you to bundle and export the pre- and post-processing steps from your training process and deploy them as part of your Inference Pipeline. Inference Pipelines are fully managed by AWS.

You can use the Amazon SageMaker model tracking capability to search key model attributes such as hyperparameter values, the algorithm used, and tags associated with your team's models. This SageMaker capability allows you to manage your team's experiments at the scale of up to thousands of model experiments.

Your inference pipeline is immutable. You change it by deploying a new one via the UpdateEndpoint API. SageMaker deploys the new inference pipeline, then switches incoming requests to the new one. SageMaker then deletes the resources associated with the old pipeline.

You can also use more than one container and string them together using inference pipelines.
Linear sequence of 2-15 containers
Any combination of pre-trained built-in algorithms, or your own algorithms in Docker containers, and put them together
Combine pre-processing, predictions, post-processing of those predictions all in different containers in a pipeline
Spark ML and scikit-learn containers OK
- Spark ML can be run with Glue or EMR
- Serialized into MLeap format
Can handle both real-time inference and batch transforms
It's just a way of chaining together multiple inference containers into one pipeline of results.
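A pipeline of this kind is defined as a single model with an ordered list of containers. The sketch below builds a payload in the shape of the boto3 SageMaker `create_model` call (`ModelName`, `ExecutionRoleArn`, `Containers`) and enforces the 2 to 15 container range noted above. The helper name, image URIs, and S3 paths are hypothetical placeholders.

```python
def build_pipeline_model_request(model_name, role_arn, containers):
    """`containers` is an ordered list of (image_uri, model_data_url or None) pairs."""
    if not 2 <= len(containers) <= 15:
        raise ValueError("an inference pipeline needs between 2 and 15 containers")
    container_defs = []
    for image_uri, model_data_url in containers:
        definition = {"Image": image_uri}
        # Some stages (e.g. pure pre-processing code) may have no model artifact.
        if model_data_url is not None:
            definition["ModelDataUrl"] = model_data_url
        container_defs.append(definition)
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        # Order matters: each container's output feeds the next container.
        "Containers": container_defs,
    }
```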

Modern SageMaker Higher-Level AI/ML Services