SageMaker's Built-In Algorithms

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly.

In general, recordIO-wrapped protobuf is the most efficient format for loading training data into SageMaker's built-in algorithms. It also lets you train in Pipe mode. Pipe mode combined with protobuf recordIO gives the best data-loading performance because the data is streamed directly from S3 to the training container instead of first being copied to the EBS volumes attached to the training instances.

Linear Learner

Linear models are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labelled examples (x, y). x is a high-dimensional vector and y is a numeric label. For binary classification problems, the label must be either 0 or 1.

For multiclass classification problems, the labels must be from 0 to num_classes - 1. For regression problems, y is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector x to an approximation of the label y.

The best model optimizes either of the following:

  • Continuous objectives, such as mean square error, cross entropy loss, absolute error.

  • Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy.

For training, the linear learner algorithm supports both recordIO-wrapped protobuf and CSV formats.

For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats.

For regression (predictor_type='regressor'), the score is the prediction produced by the model. For classification (predictor_type='binary_classifier' or predictor_type='multiclass_classifier'), the model returns a score and also a predicted_label. The predicted_label is the class predicted by the model and the score measures the strength of that prediction.

  • For Linear regression
    • Fit a line to your training data
    • Predictions based on that line
  • Can handle both regression (numeric) predictions and classification predictions
    • For classification, a linear threshold function is used.
    • Can do binary or multi-class
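
To make the score and predicted_label idea concrete, here is a minimal sketch of a linear function followed by a threshold for binary classification. The weights, bias, and 0.5 threshold are made-up values for illustration; they are not output from Linear Learner, and the sigmoid squashing is an assumption about how the score is scaled.

```python
import math

def linear_score(weights, bias, x):
    """Raw linear output w . x + b (what a regressor would return)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict_binary(weights, bias, x, threshold=0.5):
    """Squash the linear output with a sigmoid, then apply a threshold."""
    score = 1.0 / (1.0 + math.exp(-linear_score(weights, bias, x)))
    predicted_label = 1 if score >= threshold else 0
    return {"score": score, "predicted_label": predicted_label}

# Hypothetical learned parameters for a two-feature model.
weights, bias = [0.8, -0.4], 0.1

print(linear_score(weights, bias, [1.0, 2.0]))     # regression-style output
print(predict_binary(weights, bias, [3.0, 1.0]))   # classification-style output
```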

What training input does it expect?

  • RecordIO-wrapped protobuf

    • Float32 data only!
    • Most performant option
  • CSV

    • First column assumed to be the label
  • File or Pipe mode both supported

    • In File mode, SageMaker copies all of your training data from S3 to every training instance before training starts. In Pipe mode, the data is streamed in from S3 as it is needed.
      • Pipe mode streams the data directly to the container, which can give faster start times for training jobs and better throughput. It also reduces the size of the Amazon EBS volumes needed for your training instances.
    • Pipe mode is more efficient, especially with larger training sets. If a training job is slow to get started because it is waiting on the copy from S3, a simple optimization is to switch from File mode to Pipe mode.
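
As a rough illustration of where the input mode is chosen, the fragment below writes part of a CreateTrainingJob-style request as a plain Python dict. The image URI and bucket name are placeholders, and the fragment is deliberately incomplete (no role, output location, or resource config), so treat it as a sketch of the relevant fields rather than a working request.

```python
# Partial training job request, shown as a plain dict for illustration only.
training_job_fragment = {
    "AlgorithmSpecification": {
        "TrainingImage": "<linear-learner-image-uri>",  # placeholder
        "TrainingInputMode": "Pipe",  # "File" would copy the data set first
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "ContentType": "application/x-recordio-protobuf",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/train/",  # hypothetical bucket
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
}

print(training_job_fragment["AlgorithmSpecification"]["TrainingInputMode"])
```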

How is it used?

  • Preprocessing
    • Training data must be normalized (so all features are weighted the same)
    • Linear Learner can do this for you automatically
    • Input data should be shuffled
  • Training
    • Uses stochastic gradient descent
    • Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
    • Multiple models are optimized in parallel. Chooses the most optimal one at the validation step.
    • Tune L1, L2 regularization to prevent overfitting
      • L1 tends to drive some weights to zero, which effectively performs feature selection, whereas L2 shrinks all of the weights more smoothly
  • Validation
    • Most optimal model is selected

Important Hyperparameters

  • balance_multiclass_weights: Gives each class equal importance in loss functions
  • learning_rate, mini_batch_size
  • l1: L1 regularization
  • wd: Weight decay (L2 regularization)

Instance Types

  • Training

    • Single or multi-machine CPU or GPU

    • Multi-GPU does not help

      • Adding more machines helps, but adding more GPUs to a single machine does not


XGBoost

What's it for?

  • eXtreme Gradient Boosting
    • Boosted group of decision trees
    • New trees made to correct the errors of previous trees
    • Uses gradient descent to minimize loss as new trees are added
  • It's been winning a lot of Kaggle competitions
    • And it's fast, too
  • Can be used for classification
  • And also for regression
    • Using regression trees for numerical values

What training input does it expect?

  • XGBoost is weird, since it's not made for SageMaker.
    • It's just open source XGBoost
  • So, it takes CSV or libsvm input.
  • AWS recently extended it to accept recordIO-protobuf and Parquet as well.

How is it used?

  • Models are serialized/deserialized with Python's pickle
  • Can use as a framework within notebooks
    • sagemaker.xgboost
    • You don't have to deploy it to training hosts or reference a Docker image yourself.
  • Or as a built-in SageMaker algorithm
    • Refer to the XGBoost Docker image in ECR and deploy it to a fleet of training hosts for larger-scale training jobs.

Important Hyperparameters

There are a lot of them. A few important ones are:

  • Subsample
    • Prevents overfitting
  • eta
    • Step size shrinkage, prevents overfitting
  • gamma
    • Minimum loss reduction to create a partition; larger = more conservative
  • alpha
    • L1 regularization term; larger = more conservative
  • lambda
    • L2 regularization term; larger = more conservative
  • eval_metric
    • set the metric that you're optimizing on while training.
    • Optimize on AUC (for False Positives), error, rmse...
    • For example, if you care about false positives more than accuracy, you might use AUC here
  • scale_pos_weight
    • Adjusts balance of positive and negative weights
    • Helpful for unbalanced classes
    • Might set to sum(negative cases) / sum(positive cases)
  • max_depth
    • Max depth of the tree
    • Too high and you may overfit
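
The sketch below shows one way the scale_pos_weight rule of thumb could be computed and collected with the other hyperparameters listed above. The label list and every numeric value are toy numbers chosen for illustration, not recommended settings.

```python
# Toy labels: 8 negative cases and 2 positive cases.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

negatives = sum(1 for y in labels if y == 0)
positives = sum(1 for y in labels if y == 1)

# Rule of thumb from the notes: sum(negative cases) / sum(positive cases).
scale_pos_weight = negatives / positives

# Illustrative hyperparameter dictionary; values are assumptions, not tuned.
hyperparameters = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "eta": 0.2,
    "gamma": 4,
    "alpha": 0.0,
    "lambda": 1.0,
    "subsample": 0.8,
    "max_depth": 5,
    "scale_pos_weight": scale_pos_weight,
}

print(hyperparameters["scale_pos_weight"])
```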

Instance Types

  • Uses CPUs only for multiple-instance training

    • Historically, XGBoost was a CPU-only algorithm

    • Still the only choice if you want to train across multiple instances in a cluster

  • Is memory-bound, not compute-bound

    • So, M5 is a good choice
  • As of XGBoost 1.2, single-instance GPU training is available

    • For example P3

    • Must set tree_method hyperparameter to gpu_hist

    • Trains more quickly and can be more cost effective.


Sequence-to-Sequence (Seq2Seq)

What's it for?

  • Input is a sequence of tokens, output is a sequence of tokens

  • Machine Translation

  • Text summarization

  • Speech to text

  • Implemented with RNNs and CNNs with attention

What training input does it expect?

  • RecordIO-Protobuf

    • Tokens must be integers (this is unusual, since most algorithms want floating point data)
  • Start with tokenized text files

    • You can't just pass in a text file full of words. You need to build a vocabulary file that maps every word to a number
  • Convert to protobuf using sample code

    • Packs into integer tensors with vocabulary files

    • A lot like the TF/IDF lab we did earlier.

  • Must provide training data, validation data, and vocabulary files.
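
A minimal sketch of the vocabulary step described above: map every word to an integer, then encode sentences as integer lists. The special tokens `<pad>` and `<unk>` and the whitespace tokenizer are assumptions made for this example; the sample code shipped with SageMaker may use different conventions.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Assign each distinct word an integer id, after the special tokens."""
    vocab = {token: i for i, token in enumerate(specials)}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into integer tokens; unknown words map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in sentence.split()]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)

print(vocab)
print(encode("the bird sat", vocab))
```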

How is it used?

  • Training for machine translation can take days, even on SageMaker

  • Pre-trained models are available

    • See the example notebook
  • Public training datasets are available for specific translation tasks

Important Hyperparameters

  • batch_size

  • optimizer_type (adam, sgd, rmsprop)

  • learning_rate

  • num_layers_encoder

  • num_layers_decoder

  • Can optimize on:

    • Accuracy

      • Vs. provided validation dataset
    • BLEU score

      • Compares against multiple reference translations
    • Perplexity

      • Cross-entropy
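
Perplexity is the exponential of the average cross-entropy, which the short sketch below computes from the probabilities a model assigned to the correct tokens. The probabilities are invented for the example.

```python
import math

def cross_entropy(probs_of_correct_tokens):
    """Average negative log-likelihood of the correct tokens (in nats)."""
    total = sum(math.log(p) for p in probs_of_correct_tokens)
    return -total / len(probs_of_correct_tokens)

def perplexity(probs_of_correct_tokens):
    """Perplexity = exp(cross-entropy); lower means a more confident model."""
    return math.exp(cross_entropy(probs_of_correct_tokens))

# Hypothetical probabilities the model gave to each correct output token.
probs = [0.5, 0.25, 0.5, 0.25]

print(cross_entropy(probs))
print(perplexity(probs))
```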

Instance Types

  • As a deep learning algorithm, it takes advantage of GPUs, so you should just use GPUs

  • Can only use GPU instance types (P3 for example)

  • Can only use a single machine for training

    • Cannot parallelize

    • But can use multiple GPUs on one machine

DeepAR Forecasting Algorithm

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future.

What's it for?

  • Forecasting one-dimensional time series data

  • Uses RNNs

  • Allows you to train the same model over several related time series

    • Can have many input timeseries
  • Finds frequencies and seasonality

What training input does it expect?

  • JSON lines format

    • Gzip or Parquet
  • Each record must contain:

    • start: the starting time stamp

    • target: the time series values

  • Each record can contain:

    • dynamic_feat: dynamic features (for example, whether a promotion was applied to a product in a time series of product purchases)

    • cat: categorical features

{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}

How is it used?

  • Always include entire time series for training, testing, and inference

  • Use the entire dataset as the test set, and remove the last prediction_length time points from each series for training. Evaluate on the withheld values.

  • Don't use very large values for prediction length (> 400)

  • Train on many time series and not just one when possible
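
A small sketch of the hold-out idea: keep the full series for evaluation and train on a copy with the last prediction_length points removed. The function name split_for_backtest and the numbers are made up for illustration.

```python
def split_for_backtest(target, prediction_length):
    """Return (training series, withheld values) for one time series."""
    if len(target) <= prediction_length:
        raise ValueError("series is too short to hold out that many points")
    train = target[:-prediction_length]      # what the model is trained on
    held_out = target[-prediction_length:]   # what forecasts are scored against
    return train, held_out

full_series = [10, 12, 13, 15, 14, 16, 18, 17]
train, held_out = split_for_backtest(full_series, prediction_length=3)

print(train)
print(held_out)
```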

Important Hyperparameters

  • context_length

    • Number of time points the model sees before making a prediction

    • Can be smaller than the seasonal period; the model also uses lagged values from up to one year back anyway.

    • Can pick up seasonal trends based on historical data

  • epochs

  • mini_batch_size

  • learning_rate

  • num_cells

Instance Types

  • Can use CPU or GPU

  • Single or multi machine

  • Start with CPU (ml.c4.2xlarge, ml.c4.4xlarge)

  • Move up to GPU if necessary

    • Only helps with larger models
  • CPU-only for inference

  • May need larger instances for hyperparameter tuning

    • One good thing though is that during training you can use single or multiple machines, so it's easy to scale this out if you need to.


BlazingText

What's it for?

  • Text classification

    • Predict labels for a sentence

    • Useful in web searches, information retrieval

    • Supervised

  • Word2vec

    • Creates a vector representation of words

    • Semantically similar words are represented by vectors close to each other

    • This is called a word embedding

    • It is useful for NLP, but is not an NLP algorithm in itself!

    • Used in machine translation, sentiment analysis

    • Remember it only works on individual words, not sentences or documents

What training input does it expect?

  • For supervised mode (text classification):

    • One sentence per line

    • First "word" in the sentence is the string __label__ followed by the label

    • Notice we make it all lowercase and add spaces around all punctuation

__label__4 linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
  • Also accepts "augmented manifest text format"
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
  • Word2vec just wants a text file with one training sentence per line.
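
The supervised-mode preprocessing described above (lowercase, spaces around punctuation, __label__ prefix) might look like the following sketch. The punctuation set and the helper name to_blazingtext_line are assumptions for this example.

```python
import re

def to_blazingtext_line(label, text):
    """Format one training example for supervised (text classification) mode."""
    text = text.lower()
    # Put spaces around common punctuation so each mark becomes its own token.
    text = re.sub(r"([.,!?;:()])", r" \1 ", text)
    # Collapse the extra whitespace introduced above.
    text = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {text}"

print(to_blazingtext_line(4, "Linux ready for prime time, Intel says."))
```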

How is it used?

  • Word2vec has multiple modes

    • CBOW (Continuous Bag of Words)

      • Actual order of words is thrown out.
    • Skip-gram

    • Batch skip-gram

      • Computation can be distributed over many CPU nodes

Important Hyperparameters

  • Word2vec:

    • mode (batch_skipgram, skipgram, cbow)

    • learning_rate

    • window_size

    • vector_dim

    • negative_samples

  • Text classification:

    • epochs

    • learning_rate

    • word_ngrams

    • vector_dim

Instance Types

  • For cbow and skipgram, recommend a single ml.p3.2xlarge

    • Any single CPU or single GPU instance will work
  • For batch_skipgram, can use single or multiple CPU instances

    • Can scale horizontally this way
  • For text classification, C5 recommended if less than 2GB training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)


Object2Vec

What's it for?

  • Remember word2vec from BlazingText? It's like that, but for arbitrary objects

    • More general purpose

    • It creates low-dimensional dense embeddings of high-dimensional objects

  • It is basically word2vec, generalized to handle things other than words.

  • Compute nearest neighbors of objects

  • Can visualize clusters

  • Genre prediction

  • Recommendations (similar items or users)

Can see similar items in this embedding space.

What training input does it expect?

  • Data must be tokenized into integers

  • Training data consists of pairs of tokens and/or sequences of tokens

    • Sentence - sentence

      • Find relationship between things based on these paired attributes
    • Labels-sequence (genre to description?)

    • Customer-customer

    • Product-product

    • User-item

{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}

How is it used?

  • Process data into JSON Lines and shuffle it

  • Train with two input channels, two encoders, and a comparator

  • Encoder choices (for each input path):

    • Average-pooled embeddings

    • CNNs

    • Bidirectional LSTM

  • Comparator is followed by a feed-forward neural network

Important Hyperparameters

  • The usual deep learning ones...

    • Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
  • enc1_network, enc2_network

    • Choose hcnn, bilstm, pooled_embedding

Instance Types

  • Can only train on a single machine (CPU or GPU, multi-GPU OK)

    • ml.m5.2xlarge

    • ml.p2.xlarge

    • If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge

  • Inference: use ml.p3.2xlarge

    • Set the INFERENCE_PREFERRED_MODE environment variable when setting up the inference image

    • It lets you optimize for encoder embeddings, which is what Object2Vec produces, rather than for classification or regression

Object Detection

What's it for?

  • Identify all objects in an image with bounding boxes

  • Detects and classifies objects with a single deep neural network

  • Classes are accompanied by confidence scores

  • Can train from scratch (maybe with labels from Ground Truth), or use pre-trained models based on ImageNet


What training input does it expect?

  • RecordIO or image format (jpg or png)

  • With image format, supply a JSON file with annotation data for each image during training, e.g.:

{
  "file": "your_image_directory/sample_image1.jpg",
  "image_size": [
    {
      "width": 500,
      "height": 400,
      "depth": 3
    }
  ],
  "annotations": [
    {
      "class_id": 0,
      "left": 111,
      "top": 134,
      "width": 61,
      "height": 128
    }
  ],
  "categories": [
    {
      "class_id": 0,
      "name": "dog"
    }
  ]
}

How is it used?

  • Takes an image as input, outputs all instances of objects in the image with categories and confidence scores

  • Uses a CNN with the Single Shot multibox Detector (SSD) algorithm

    • The base CNN can be VGG-16 or ResNet-50
  • Transfer learning mode / incremental training

    • Use a pre-trained model for the base network weights, instead of random initial weights

    • Continue training this model further

  • Uses flip, rescale, and jitter internally to avoid overfitting

Important Hyperparameters

  • mini_batch_size

  • learning_rate

  • optimizer

    • sgd, adam, rmsprop, adadelta

Instance Types

  • Use GPU instances for training (multi-GPU and multi-machine OK)

    • ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge
  • Use CPU or GPU for inference

  • C5, M5, P2, P3 all OK

Image Classification

What is it for?

  • Assign one or more labels to an image

  • Doesn't tell you where objects are, just what objects are in the image

What training input does it expect?

  • Apache MXNet RecordIO

    • Not protobuf!

    • This is for interoperability with other deep learning frameworks.

  • Or, raw jpg or png images

  • Image format requires .lst files to associate image index, class label, and path to the image

  • Augmented Manifest Image Format enables Pipe mode

    • Allows it to stream that data in from S3 as opposed to copying everything over

Example of .lst file:

5 1 your_image_directory/train_img_dog1.jpg
1000 0 your_image_directory/train_img_cat1.jpg
22 1 your_image_directory/train_img_dog2.jpg
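
A .lst file like the one above can be generated with a few lines of Python. This sketch assumes tab-separated columns (image index, class label, relative path) and uses made-up file names; the index here is simply the position in the list, which is one possible choice of unique index.

```python
def make_lst_lines(samples):
    """Build .lst rows: image index, class label, relative image path."""
    return [
        "\t".join([str(index), str(label), path])
        for index, (label, path) in enumerate(samples)
    ]

# Hypothetical (label, path) pairs; 1 = dog, 0 = cat.
samples = [
    (1, "your_image_directory/train_img_dog1.jpg"),
    (0, "your_image_directory/train_img_cat1.jpg"),
    (1, "your_image_directory/train_img_dog2.jpg"),
]

lst_lines = make_lst_lines(samples)
print("\n".join(lst_lines))
```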

Augmented Manifest Image Format:

{"source-ref":"s3://image/filename1.jpg", "class":"0"}
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name": "cat", "type" : "groundtruth/image- classification"}}

How is it used?

  • ResNet CNN under the hood

  • Full training mode

    • Network initialized with random weights
  • Transfer learning mode

    • Initialized with pre-trained weights

    • The top fully-connected layer is initialized with random weights

    • Network is fine-tuned with new training data

  • Default image size is 3-channel 224x224 (ImageNet's dataset)

Important Hyperparameters

  • The usual suspects for deep learning

    • Batch size, learning rate, optimizer choice
  • Optimizer-specific parameters

    • Weight decay, beta 1, beta 2, eps, gamma

Instance Types

  • GPU instances for training (P2, P3). Multi-GPU and multi-machine OK.

  • CPU or GPU for inference (C4, P2, P3)

Semantic Segmentation Algorithm

The Amazon SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.

What's it for?

  • Pixel-level object classification

  • Different from image classification - that assigns labels to whole images

  • Different from object detection - that assigns labels to bounding boxes

  • Useful for self-driving vehicles, medical imaging diagnostics, robot sensing

  • Produces a segmentation mask

What training input does it expect?

  • JPG Images and PNG annotations

  • For both training and validation

  • Label maps to describe annotations in English

  • Augmented manifest image format supported for Pipe mode.

  • JPG images accepted for inference

How is it used?

  • Built on MXNet Gluon and Gluon CV

  • Choice of 3 algorithms:

    • Fully-Convolutional Network (FCN)

    • Pyramid Scene Parsing (PSP)

    • DeepLabV3

  • Choice of backbones:

    • ResNet50

    • ResNet101

    • Both trained on ImageNet

  • Incremental training, or training from scratch, supported too

Important Hyperparameters

  • Epochs, learning rate, batch size, optimizer, etc

  • Algorithm

  • Backbone

Instance Types

  • More restrictive: Only GPU supported for training (P2 or P3) on a single machine only

    • Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, or ml.p3.16xlarge

    • Cannot be parallelized

  • Inference on CPU (C5 or M5) or GPU (P2 or P3)

Random Cut Forest (RCF) Algorithm

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

What's it for?

  • Anomaly detection

  • Unsupervised

  • Detect unexpected spikes in time series data

  • Breaks in periodicity

  • Unclassifiable data points

  • Assigns an anomaly score to each data point

  • Based on an algorithm developed by Amazon that they seem to be very proud of!

What training input does it expect?

  • RecordIO-protobuf or CSV

  • Can use File or Pipe mode in either case

  • Optional test channel for computing accuracy, precision, recall, and F1 on labeled data (anomaly or not)

    • Unsupervised but can do for testing

How is it used?

  • Creates a forest of trees where each tree is a partition of the training data

    • It looks at the expected change in complexity of the tree as a result of adding a point to it

    • If adding a new data point causes a whole bunch of new branches to form, that point might be anomalous

    • In other words, a tree that needs many new branches to accommodate a data point suggests there is something unusual about that point.

  • Data is sampled randomly

  • Then trained

  • RCF shows up in Kinesis Analytics as well; it can work on streaming data too.

Important Hyperparameters

  • num_trees

    • Increasing reduces noise
  • num_samples_per_tree

    • Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
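
The rule of thumb for num_samples_per_tree is simple arithmetic, sketched below with invented counts. The helper name is hypothetical, and the result is only a starting point for tuning.

```python
def suggest_num_samples_per_tree(num_anomalies, num_normal):
    """Pick n so that 1/n approximates the anomalous-to-normal ratio."""
    if num_anomalies <= 0 or num_normal <= 0:
        raise ValueError("both counts must be positive")
    ratio = num_anomalies / num_normal
    return round(1 / ratio)

# Example: roughly 20 anomalies expected among 10,000 normal points.
print(suggest_num_samples_per_tree(20, 10_000))
```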

Instance Types

  • Does not take advantage of GPUs

  • Use M4, C4, or C5 for training

  • ml.c5.xlarge for inference

Neural Topic Model (NTM) Algorithm

Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a latent representation because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

What's it for?

  • Organize documents into topics

  • Classify or summarize documents based on topics

  • It's not just TF/IDF

    • "bike", "car", "train", "mileage", and "speed" might classify a document as "transportation" for example (although it wouldn't know to call it that)
  • Unsupervised methods

    • Algorithm is "Neural Variational Inference"

What training input does it expect?

  • Four data channels

    • "train" is required - not really training though as it is unsupervised

    • "validation", "test", and "auxiliary" optional

  • recordIO-protobuf or CSV

  • Words must be tokenized into integers

    • You can't pass in raw text. Every document must first be converted into tokens, one integer per word, and you also supply a vocabulary file that maps those integers back to words

    • The "auxiliary" channel is for the vocabulary

  • File or pipe mode

How is it used?

  • You define how many topics you want

  • These topics are a latent representation based on top ranking words

  • One of two topic modeling algorithms in SageMaker - you can try them both!

Important Hyperparameters

  • Lowering mini_batch_size and learning_rate can reduce validation loss

    • At expense of training time
  • Num_topics

Instance Types

  • GPU or CPU

    • GPU recommended for training

    • CPU OK for inference

    • CPU is cheaper


Latent Dirichlet Allocation (LDA)

What's it for?

  • Latent Dirichlet Allocation

  • Another topic modeling algorithm

    • Not deep learning
  • Unsupervised

    • The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
  • Can be used for things other than words

    • Cluster customers based on purchases

    • Harmonic analysis in music

What training input does it expect?

  • Train channel, optional test channel

    • If you want to measure accuracy

    • It is unsupervised, so doesn't really help

  • recordIO-protobuf or CSV

  • Need to tokenize everything

    • Integers coded for each word

    • Each document has counts for every word in vocabulary (in CSV format)

  • Pipe mode only supported with recordIO

How is it used?

  • Unsupervised; generates however many topics you specify

  • Optional test channel can be used for scoring results

    • Per-word log likelihood

    • Metric for measuring how well LDA does

  • Functionally similar to NTM, but CPU-based

  • Therefore maybe cheaper / more efficient

Important Hyperparameters

  • Num_topics

  • alpha0

    • Initial guess for 'concentration parameter'

    • Smaller values generate sparse topic mixtures

    • Larger values (>1.0) produce uniform mixtures

Instance Types

  • Single-instance CPU training

    • Cannot parallelise training or use GPU


K-Nearest Neighbors (KNN)

What's it for?

  • K-Nearest-Neighbors

  • Supervised

  • Simple classification or regression algorithm

  • Classification

    • Find the K closest points to a sample point and return the most frequent label
  • Regression

    • Find the K closest points to a sample point and return the average value
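
The two prediction rules above can be sketched in plain Python with a brute-force neighbor search. This is only the textbook idea on toy data; it leaves out the sampling, dimensionality reduction, and index-building stages that SageMaker adds.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    """Most frequent label among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average target value among the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

labelled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labelled, (1.1, 1.0), k=3))
print(knn_regress(valued, (2.1,), k=3))
```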

What training input does it expect?

  • Train channel contains your data

  • Test channel emits accuracy or MSE

    • If you want to measure it
  • recordIO-protobuf or CSV training

    • First column is label
  • File or pipe mode on either

How is it used?

  • SageMaker takes it to the next level

  • Data is first sampled

  • SageMaker includes a dimensionality reduction stage

    • Avoid sparse data ("curse of dimensionality")

    • At cost of noise / accuracy

    • "sign" or "fjlt" methods

  • Build an index for looking up neighbors

  • Serialize the model

  • Query the model for a given K

Important Hyperparameters

  • K!

  • sample_size

Instance Types

  • Training on CPU or GPU

    • ml.m5.2xlarge

    • ml.p2.xlarge

  • Inference

    • CPU for lower latency

    • GPU for higher throughput on large batches

K-Means Algorithm

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.

Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the training data.

What's it for?

  • Unsupervised clustering technique

  • Divide data into K groups, where members of a group are as similar as possible to each other

    • You define what "similar" means

    • Measured by Euclidean distance

  • SageMaker brings web-scale K-Means clustering

    • Clustering is difficult at large scale

    • SageMaker's version makes it practical

What training input does it expect?

  • Train channel, optional test

  • Use ShardedByS3Key distribution for the train channel and FullyReplicated for the test channel

  • recordIO-protobuf or CSV

  • File or Pipe on either

How is it used?

  • Every observation mapped to n-dimensional space (n = number of features)

  • Works to optimize the center of K clusters

    • "Extra cluster centers" may be specified to improve accuracy (they end up getting reduced to k)

    • K = k * x, where x is the extra_center_factor

      • Uses more clusters at the beginning and consolidates them down to the number you want over time
  • Algorithm:

    • Determine initial cluster centers

      • Random or k-means++ approach

      • K-means++ tries to make initial clusters far apart

    • Iterate over training data and calculate cluster centers

    • Reduce clusters from K to k

      • Using Lloyd's method with kmeans++ to do that reduction
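
For intuition, here is a minimal sketch of plain Lloyd's iterations with random initialization, plus the within-cluster sum of squares used by the elbow method. It is the textbook algorithm on toy data, not SageMaker's mini-batch, web-scale variant, and it does not implement extra cluster centers.

```python
import math
import random

def lloyd_kmeans(points, k, iterations=10, seed=0):
    """Alternate between assigning points and recomputing cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

def wcss(points, centers):
    """Within-cluster sum of squares: the 'tightness' the elbow method plots."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = lloyd_kmeans(points, k=2)

print(sorted(centers))
print(wcss(points, centers))
```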

Important Hyperparameters

  • K!

    • Choosing K is tricky

    • Plot within-cluster sum of squares as function of K

    • Use "elbow method"

    • Basically optimize for tightness of clusters

  • mini_batch_size

  • extra_center_factor

  • init_method

    • random or kmeans++

Instance Types

  • CPU or GPU, but CPU recommended

    • Only one GPU per instance is used, so choose a p*.xlarge instance if you're going to use GPU


Principal Component Analysis (PCA)

What's it for?

  • Principal Component Analysis

  • Dimensionality reduction

    • Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information

    • The reduced dimensions are called components

      • First component has largest possible variability

      • Second component has the next largest...

  • Unsupervised

What training input does it expect?

  • recordIO-protobuf or CSV

  • File or Pipe on either

How is it used?

  • Covariance matrix is created, then singular value decomposition (SVD)

  • Two modes

    • Regular

      • For sparse data and moderate number of observations and features
    • Randomized

      • For large number of observations and features

      • Uses approximation algorithm and scales better.
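
To show what the first component means, the sketch below builds a covariance matrix from mean-centered data and finds its dominant direction. Note the swap: it uses power iteration in pure Python instead of the SVD step named above, purely to stay dependency-free, and the data points are invented.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of mean-centered data (the subtract_mean step)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy 2D data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
component = first_component(covariance_matrix(rows))

# Project each point onto the component to reduce 2 dimensions to 1.
projected = [sum(a * b for a, b in zip(r, component)) for r in rows]
print(component)
print(projected)
```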

Important Hyperparameters

  • algorithm_mode

    • regular or randomized
  • subtract_mean

    • Unbiases the data upfront

Instance Types

  • GPU or CPU

  • It depends "on the specifics of the input data"

Factorization Machines

What's it for?

  • Specialize in classification or regression with sparse data

    • Item recommendations

      • Predict an item a user may like; a user doesn't interact with most items
    • Click prediction

      • A user probably won't interact with most pages on the internet
    • Since an individual user doesn't interact with most pages / products, the data is sparse

  • Supervised Method

  • Classification or regression

    • For example, classify whether a person likes a product or not, or regress the specific rating they might assign to that item
  • Limited to pair-wise interactions

    • User -> item for example

What training input does it expect?

  • recordIO-protobuf with Float32

  • Sparse data means CSV isn't practical

    • With sparse data, most of each CSV row would be empty fields (just commas)

How is it used?

  • Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things (users & items?)

  • Usually used in the context of recommender systems

  • Factorization machines find the factors of that matrix which, multiplied together, predict the missing entries: given the ratings a user gave to items they have seen, what ratings would they give to items they haven't seen yet?
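
To show what "pair-wise interactions" means in practice, here is a minimal sketch of the standard second-order factorization machine prediction: a bias, a linear term, and one interaction term per pair of active features, where each interaction weight is the dot product of two latent factor vectors. The function name and all parameter values are made up for illustration; in a real model they would be learned during training.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score.

    x  : feature vector (mostly zeros for sparse data)
    w0 : global bias
    w  : one linear weight per feature
    v  : one latent factor vector per feature
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        if x[i] == 0:
            continue  # sparse data: most features contribute nothing
        for j in range(i + 1, n):
            if x[j] == 0:
                continue
            interaction = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += interaction * x[i] * x[j]
    return w0 + linear + pairwise

# Hypothetical one-hot encoding: [user_0, user_1, item_0, item_1]
x = [1, 0, 0, 1]  # user_0 paired with item_1
w0 = 3.0
w = [0.1, -0.2, 0.3, 0.5]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 2.0]]
print(fm_predict(x, w0, w, v))  # predicted rating for this user/item pair
```

For regression the score is used directly as the predicted value; for binary classification it would typically be passed through a sigmoid to get a probability.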

Important Hyperparameters

  • Initialization methods for bias, factors, and linear terms

    • Uniform, normal, or constant

    • Can tune properties of each method

Instance Types

  • CPU or GPU

  • CPU recommended

  • GPU only works with dense data

IP Insights

What's it for?

  • All about finding fishy behavior in your web logs

    • Security tool to flag up suspicious behavior
  • Unsupervised learning of IP address usage patterns

  • Identifies suspicious behavior from IP addresses

    • Identify logins from anomalous IPs

    • Identify accounts creating resources from anomalous IPs

What training input does it expect?

  • User names and account IDs can be fed in directly; no need to pre-process a lot

  • Training channel, optional validation (computes AUC score)

  • CSV only

    • Entity, IP

How is it used?

  • Uses a neural network to learn latent vector representations of entities and IP addresses.

    • Tries to learn which IP addresses each entity typically uses
  • Entities are hashed and embedded

    • Need sufficiently large hash size
  • Automatically generates negative samples during training by randomly pairing entities and IPs

    • Needed because the training data only contains observed (positive) entity/IP pairs
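
The hashing and negative-sampling ideas can be sketched as follows. Everything here is an assumption made for illustration: the notes do not say which hash function or sampling strategy IP Insights uses, so the MD5-based bucket and the "one random pairing per observed pair" rule are stand-ins, and both function names are hypothetical.

```python
import hashlib
import random

def entity_bucket(entity, num_entity_vectors):
    """Map an entity name to one of num_entity_vectors embedding slots.
    MD5 is used only as a stable, illustrative hash (assumption)."""
    digest = hashlib.md5(entity.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_entity_vectors

def with_negative_samples(pairs, rng):
    """Label observed (entity, ip) pairs 1, then add randomly paired
    (entity, ip) combinations that were never observed, labelled 0."""
    ips = [ip for _, ip in pairs]
    observed = set(pairs)
    labelled = [(entity, ip, 1) for entity, ip in pairs]
    for entity, _ in pairs:
        candidate_ip = rng.choice(ips)
        if (entity, candidate_ip) not in observed:
            labelled.append((entity, candidate_ip, 0))
    return labelled

if __name__ == "__main__":
    logins = [("alice", "10.0.0.1"), ("bob", "10.0.0.2"), ("carol", "10.0.0.3")]
    # Hash size set to twice the number of unique entities.
    hash_size = 2 * len({entity for entity, _ in logins})
    print({entity: entity_bucket(entity, hash_size) for entity, _ in logins})
    print(with_negative_samples(logins, random.Random(0)))
```

A hash size that is too small makes different entities collide in the same slot, which is why a sufficiently large value matters.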

Important Hyperparameters

  • num_entity_vectors

    • Hash size

    • Set to twice the number of unique entity identifiers

  • vector_dim

    • Size of embedding vectors

    • Scales model size

    • Too large results in overfitting

  • Epochs, learning rate, batch size, etc.

Instance Types

  • CPU or GPU

    • GPU recommended

    • ml.p3.2xlarge or higher

    • Can use multiple GPUs

    • Size of CPU instance depends on vector_dim and num_entity_vectors

Reinforcement Learning

  • You have some sort of agent that "explores" some space

  • As it goes, it learns the value of different state changes in different conditions

  • Those values inform subsequent behavior of the agent

    • For a given position in the environment and a given set of surroundings, it learns what the best thing to do is

    • Explores the space over time

  • Examples: Pac-Man, Cat & Mouse game (game AI)

    • Supply chain management

    • HVAC systems

    • Industrial robotics

    • Dialog systems

    • Autonomous vehicles

  • Yields fast on-line performance once the space has been explored


Q-Learning

  • A specific implementation of reinforcement learning

  • You have:

    • A set of environmental states s

    • A set of possible actions in those states a

    • A value of each state/action Q

  • Start off with Q values of 0

  • Explore the space

  • As bad things happen after a given state/action, reduce its Q

  • As rewards happen after a given state/action, increase its Q

Q(s, a) += alpha * (reward(s, a) + max(Q(s_prime)) - Q(s, a))
# where s is the previous state, a is the previous action, s_prime is the current state, and alpha is the learning rate (set to 0.5 here).

  • What are some state/actions here?

    • Pac-Man has a wall to the West

    • Pac-Man dies if he moves one step South

    • Pac-Man just continues to live if going North or East

  • You can "look ahead" more than one step by using a discount factor when computing Q (here s is previous state, s' is current state)

    • We are 2 steps away from a power pill

    • So the reward for eating that power pill can boost the Q values of the states/actions encountered along the way

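  • Q(s,a) += alpha * (reward(s,a) + discount * max(Q(s')) - Q(s,a)), where alpha is the learning rate and discount (between 0 and 1) weights the value of future states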
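
A minimal sketch of the update rule in Python, assuming the standard Q-learning form with a separate learning rate (`alpha`) and discount factor (`discount`). The state names, rewards, and the `q_update` helper are hypothetical, loosely following the Pac-Man example.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, discount=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Best value reachable from the state we just arrived in.
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + discount * best_next - q[(state, action)])
    return q[(state, action)]

actions = ["N", "S", "E", "W"]
q = defaultdict(float)  # every Q value starts at 0

# Moving South kills Pac-Man: the Q value for that state/action drops.
print(q_update(q, "s0", "S", -10.0, "dead", actions))
# Moving East from s1 eats a power pill: that Q value rises.
print(q_update(q, "s1", "E", 10.0, "pill", actions))
# Moving East from s0 gives no reward itself, but leads to s1, so the
# discounted value of s1 is propagated back one step.
print(q_update(q, "s0", "E", 0.0, "s1", actions))
```

The third call is the "look ahead" effect: a state/action that earns nothing directly still gets credit because it leads somewhere valuable.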

The exploration problem

  • How do we efficiently explore all of the possible states?

    • Simple/naïve approach: always choose the action for a given state with the highest Q. If there's a tie, choose at random

    • But that's really inefficient, and you might miss a lot of paths that way

  • Better way: introduce an epsilon term

    • If a random number is less than epsilon, don't follow the highest Q, but choose at random

    • That way, exploration never totally stops

    • Lets us cover a much wider range of actions and states than we could otherwise

    • Choosing epsilon can be tricky
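
The epsilon approach described above is usually called epsilon-greedy selection. A minimal sketch, with a hypothetical `choose_action` helper and made-up Q values:

```python
import random

def choose_action(q_values, epsilon, rng):
    """Pick an action from a dict mapping action -> Q value.

    With probability epsilon, explore by choosing at random; otherwise
    exploit by taking the highest Q, breaking ties at random.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    best = max(q_values.values())
    return rng.choice([action for action, value in q_values.items() if value == best])

rng = random.Random(0)
q_values = {"N": 0.0, "S": -5.0, "E": 2.25, "W": 0.0}
print(choose_action(q_values, epsilon=0.1, rng=rng))
```

A common refinement, not covered in these notes, is to start with a large epsilon and shrink it as training progresses.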

Fancy Words

  • Markov Decision Process

    • From Wikipedia: Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.

    • Sound familiar? MDPs are just a way to describe what we just did using mathematical notation.

    • States are still described as s and s'

    • State transition functions are described as P_a(s, s')

    • Our "Q" values are described as a reward function R_a(s, s')

  • Even fancier words! An MDP is a discrete time stochastic control process.
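
One way to make the notation tangible is to write a tiny MDP down as plain data. In this sketch `P` plays the role of the transition function P_a(s, s') and `R` the reward function R_a(s, s'); the states, actions, probabilities, and rewards are all invented for illustration.

```python
import random

# P[(s, a)] maps each possible next state s' to its probability.
P = {
    ("hall", "north"): {"pill": 0.8, "hall": 0.2},  # the move sometimes fails
    ("hall", "south"): {"ghost": 1.0},
}

# R[(s, a, s')] is the reward for ending up in s' after taking a in s.
R = {
    ("hall", "north", "pill"): 10.0,
    ("hall", "north", "hall"): 0.0,
    ("hall", "south", "ghost"): -100.0,
}

def step(state, action, rng):
    """Sample one transition: partly random (P), partly under the
    agent's control (the chosen action)."""
    outcomes = P[(state, action)]
    next_state = rng.choices(list(outcomes), weights=list(outcomes.values()), k=1)[0]
    return next_state, R[(state, action, next_state)]

if __name__ == "__main__":
    print(step("hall", "north", random.Random(0)))
```

"Discrete time" shows up as the one-call-per-move `step` function, and "stochastic" as the sampling inside it.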


  • You can make an intelligent Pac-Man in a few steps:

    • Have it semi-randomly explore different choices of movement (actions) given different conditions (states)

    • Keep track of the reward or penalty associated with each choice for a given state/action (Q)

      • Can even propagate those rewards and penalties backward multiple steps if we want to make it even better
    • Use those stored Q values to inform its future choices

  • Pretty simple concept. But hey, now you can say you understand reinforcement learning, Q-learning, Markov Decision Processes, and Dynamic Programming!
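
Putting the recap together, the sketch below trains a tabular Q-learning agent on a hypothetical five-cell corridor (a pit at cell 0, a goal at cell 4, start in the middle) instead of a full Pac-Man board. The environment, rewards, and hyperparameter values are assumptions chosen to keep the example small.

```python
import random

ACTIONS = ("left", "right")
PIT, GOAL, START = 0, 4, 2

def move(state, action):
    """Deterministic corridor: +1 for reaching the goal, -1 for the pit."""
    next_state = state - 1 if action == "left" else state + 1
    reward = 1.0 if next_state == GOAL else -1.0 if next_state == PIT else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, discount=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(PIT, GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        while state not in (PIT, GOAL):
            # Semi-random exploration (epsilon-greedy with random tie-breaks).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward = move(state, action)
            # Track the reward or penalty, propagated back via the discount.
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + discount * best_next - q[(state, action)])
            state = next_state
    return q

q = train()
# Use the stored Q values to pick the preferred action in each open cell.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(policy)
```

The same loop structure applies to larger environments; only the state representation and the `move` function change.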

Reinforcement Learning in SageMaker

  • Uses a deep learning framework with TensorFlow and MXNet

  • Supports Intel Coach and Ray RLlib toolkits.

  • Custom, open-source, or commercial environments supported.

    • MATLAB, Simulink

    • EnergyPlus, RoboSchool, PyBullet

    • Amazon Sumerian, AWS RoboMaker

Distributed Training with SageMaker RL

  • Can distribute training and/or environment rollout

  • Multi-core and multi-instance

Key Terms

  • Environment

    • The layout of the board / maze / etc
  • State

    • Where the player / pieces are
  • Action

    • Move in a given direction, etc
  • Reward

    • Value associated with the action from that state
  • Observation (state of the environment right now)

    • i.e., surroundings in a maze, state of chess board

Hyperparameter Tuning

  • Parameters of your choosing may be abstracted

    • Can make your own if you want to
  • Hyperparameter tuning in SageMaker can then optimize them

Instance Types

  • No specific guidance given in developer guide

  • But, it's deep learning - so GPUs are helpful

  • And we know it supports multiple instances and cores

Incremental learning

Incremental learning is a machine learning (ML) technique for extending the knowledge of an existing model by training it further on new data. Both of the Amazon SageMaker built-in visual recognition algorithms - Image Classification and Object Detection - provide out of the box support for incremental learning.