Cloud & MLOps ☁️
SageMaker's Built-In Algorithms

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly.

In general, the protobuf recordIO format, used for training data, is the optimal way to load data into your model for training. When you use the protobuf recordIO format you can also take advantage of Pipe mode when training your model. Pipe mode, used together with the protobuf recordIO format, gives you the best data load performance by streaming your data directly from S3 to your training containers, rather than copying it all to the EBS volumes attached to your training instances first.
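
For example, a minimal sketch (assuming the SageMaker Python SDK and boto3; the bucket and key names are placeholders) of converting a NumPy training set to protobuf recordIO and uploading it to S3:

import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Toy training data: features must be float32, labels are a 1-D vector
X = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=1000).astype("float32")

# Serialize to protobuf recordIO in memory
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)

# Upload to S3; a training job can then read it in File or Pipe mode
boto3.resource("s3").Object("my-bucket", "train/data.rec").upload_fileobj(buf)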

Linear Learner

Linear models are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labelled examples (x, y). x is a high-dimensional vector and y is a numeric label. For binary classification problems, the label must be either 0 or 1.

For multiclass classification problems, the labels must be from 0 to num_classes - 1. For regression problems, y is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector x to an approximation of the label y.

The best model optimizes either of the following:

  • Continuous objectives, such as mean square error, cross entropy loss, absolute error.

  • Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy.

For training, the linear learner algorithm supports both recordIO-wrapped protobuf and CSV formats.

For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats.

For regression (predictor_type='regressor'), the score is the prediction produced by the model. For classification (predictor_type='binary_classifier' or predictor_type='multiclass_classifier'), the model returns a score and also a predicted_label. The predicted_label is the class predicted by the model and the score measures the strength of that prediction.

  • For linear regression
    • Fits a line to your training data
    • Predictions are based on that line
  • Can handle both regression (numeric) predictions and classification predictions
    • For classification, a linear threshold function is used.
    • Can do binary or multi-class

What training input does it expect?

  • RecordIO-wrapped protobuf

    • Float32 data only!
    • Most performant option
  • CSV

    • First column assumed to be the label
  • File or Pipe mode both supported

    • In File mode, SageMaker copies all of your training data to every training instance in your training fleet before training starts, whereas Pipe mode streams it in from S3 as needed.
      • Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs: your training job streams data directly from Amazon S3, giving faster start times and better throughput. With Pipe mode, you can also reduce the size of the Amazon EBS volumes for your training instances.
    • Pipe mode is more efficient, especially with larger training sets. If a training job is slow to even get started because of data transfer from S3, switching from File mode to Pipe mode is a simple optimization.

How is it used?

  • Preprocessing
    • Training data must be normalized (so all features are weighted the same)
    • Linear Learner can do this for you automatically
    • Input data should be shuffled
  • Training
    • Uses stochastic gradient descent
    • Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
    • Multiple models are optimized in parallel; the best one is chosen at the validation step.
    • Tune L1, L2 regularization to prevent overfitting
      • L1 tends to drive some weights to zero (effectively feature selection), whereas L2 shrinks all weights more smoothly
  • Validation
    • The best-performing model is selected

Important Hyperparameters

  • Balance_multiclass_weights: Gives each class equal importance in loss functions
  • Learning_rate, mini_batch_size
  • L1: L1 regularization
  • Wd: Weight decay (L2 regularization)
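
A minimal sketch of where these hyperparameters go when training Linear Learner with the SageMaker Python SDK (the role ARN, bucket, and values are placeholders, not a definitive recipe):

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Built-in Linear Learner container for this region
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",  # placeholder bucket
    sagemaker_session=session,
)

ll.set_hyperparameters(
    predictor_type="binary_classifier",
    feature_dim=10,       # number of features in the training data
    learning_rate=0.01,
    mini_batch_size=1000,
    l1=0.0,               # L1 regularization
    wd=0.0001,            # weight decay (L2 regularization)
)

# Training channel pointing at recordIO-protobuf data in S3 (File or Pipe mode)
ll.fit({"train": "s3://my-bucket/linear-learner/train"})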

Instance Types

  • Training

    • Single or multi-machine CPU or GPU

    • Multi-GPU does not help

      • It helps to have more than one machine, but it does not help to have more than one GPU on a single machine

XGBoost

  • eXtreme Gradient Boosting
    • Boosted group of decision trees
    • New trees made to correct the errors of previous trees
    • Uses gradient descent to minimize loss as new trees are added
  • It's been winning a lot of Kaggle competitions
    • And it's fast, too
  • Can be used for classification
  • And also for regression
    • Using regression trees for numerical values

What training input does it expect?

  • XGBoost is weird, since it's not made for SageMaker.
    • It's just open source XGBoost
  • So, it takes CSV or libsvm input.
  • AWS recently extended it to accept recordIO-protobuf and Parquet as well.

How is it used?

  • Models are serialized/deserialized with Pickle in Python
  • Can use as a framework within notebooks
    • Sagemaker.xgboost
    • You don't have to deploy it to training hosts or use a Docker image; you can run it directly in your notebook.
  • Or as a built-in SageMaker algorithm
    • Reference the XGBoost Docker image in ECR and deploy it to a fleet of training hosts for larger-scale training jobs (see the sketch below).
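
For illustration, a sketch of both ways of using XGBoost with the SageMaker Python SDK; the training script name, role ARN, and instance choices are placeholders:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.xgboost import XGBoost

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# 1) Framework mode: run your own train.py, which imports open-source xgboost
framework_estimator = XGBoost(
    entry_point="train.py",      # hypothetical training script
    framework_version="1.5-1",
    py_version="py3",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# 2) Built-in algorithm mode: use the XGBoost image from ECR directly
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
builtin_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=2,            # can scale out across CPU instances
    instance_type="ml.m5.2xlarge",
)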

Important Hyperparameters

There are a lot of them. A few important ones are:

  • Subsample
    • Prevents overfitting
  • eta
    • Step size shrinkage, prevents overfitting
  • gamma
    • Minimum loss reduction to create a partition; larger = more conservative
  • alpha
    • L1 regularization term; larger = more conservative
  • lambda
    • L2 regularization term; larger = more conservative
  • eval_metric
    • Sets the metric you optimize on while training
    • Optimize on AUC, error, rmse, ...
    • For example, if you care about false positives more than overall accuracy, you might use AUC here
  • scale_pos_weight
    • Adjusts balance of positive and negative weights
    • Helpful for unbalanced classes
    • Might set to sum(negative cases) / sum(positive cases)
  • max_depth
    • Max depth of the tree
    • Too high and you may overfit
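
A hedged example of how these hyperparameters might be specified for the built-in XGBoost container (all values and class counts below are illustrative):

# Example class counts taken from your training labels (made-up numbers)
num_negative = 95_000
num_positive = 5_000

xgb_hyperparameters = {
    "objective": "binary:logistic",
    "num_round": 200,
    "max_depth": 5,           # too deep may overfit
    "eta": 0.2,               # step size shrinkage
    "gamma": 4,               # minimum loss reduction to make a split
    "alpha": 0.5,             # L1 regularization
    "lambda": 1.0,            # L2 regularization
    "subsample": 0.8,         # row subsampling to reduce overfitting
    "eval_metric": "auc",     # e.g., when false positives matter most
    "scale_pos_weight": num_negative / num_positive,
}

# These could then be passed to a built-in XGBoost estimator, e.g.
# estimator.set_hyperparameters(**xgb_hyperparameters)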

Instance Types

  • Uses CPUs only for multi-instance training

    • Historically, XGBoost was a CPU-only algorithm

    • CPU is still the only choice if you want to train across multiple instances in a cluster

  • Is memory-bound, not compute-bound

    • So, M5 is a good choice
  • As of XGBoost 1.2, single-instance GPU training is available

    • For example P3

    • Must set tree_method hyperparameter to gpu_hist

    • Trains more quickly and can be more cost effective.

Seq2Seq

What's it for?

  • Input is a sequence of tokens, output is a sequence of tokens

  • Machine Translation

  • Text summarization

  • Speech to text

  • Implemented with RNN's and CNN's with attention

What training input does it expect?

  • RecordIO-Protobuf

    • Tokens must be integers (this is unusual, since most algorithms want floating point data)
  • Start with tokenized text files

    • You can't just pass in a text file full of words; you need to build a vocabulary file that maps every word to an integer (see the sketch below)
  • Convert to protobuf using sample code

    • Packs into integer tensors with vocabulary files

    • A lot like the TF/IDF lab we did earlier.

  • Must provide training data, validation data, and vocabulary files.
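
A toy sketch of the tokenization idea (not the official conversion script): build a vocabulary mapping each word to an integer, then encode sentences as integer sequences. The SageMaker seq2seq example notebook handles the full conversion to recordIO-protobuf.

from collections import Counter

sentences = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus

# Build a vocabulary: more frequent words get lower integer ids;
# ids 0-3 are typically reserved for special tokens like <pad>, <unk>, <s>, </s>
counts = Counter(word for s in sentences for word in s.split())
vocab = {word: i for i, (word, _) in enumerate(counts.most_common(), start=4)}

# Encode each sentence as a sequence of integer tokens
encoded = [[vocab[w] for w in s.split()] for s in sentences]
print(encoded)  # integer token sequences, one per sentence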

How is it used?

  • Training for machine translation can take days, even on SageMaker

  • Pre-trained models are available

    • See the example notebook
  • Public training datasets are available for specific translation tasks

Important Hyperparameters

  • Batch_size

  • Optimizer_type (adam, sgd, rmsprop)

  • Learning_rate

  • Num_layers_encoder

  • Num_layers_decoder

  • Can optimize on:

    • Accuracy

      • Vs. provided validation dataset
    • BLEU score

      • Compares against multiple reference translations
    • Perplexity

      • Cross-entropy

Instance Types

  • As a deep learning algorithm, it takes advantage of GPUs, and GPUs are what you should use

  • Can only use GPU instance types (P3 for example)

  • Can only use a single machine for training

    • Cannot parallelize

    • But can use multi-GPU's on one machine

DeepAR Forecasting Algorithm

The Amazon SageMaker DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future.

What's it for?

  • Forecasting one-dimensional time series data

  • Uses RNN's

  • Allows you to train the same model over several related time series

    • Can have many input time series
  • Finds frequencies and seasonality

What training input does it expect?

  • JSON lines format

    • Gzip or Parquet
  • Each record must contain:

    • Start: the starting time stamp

    • Target: the time series values

  • Each record can contain:

    • Dynamic_feat: dynamic features (such as whether a promotion was applied to a product in a time series of product purchases)

    • Cat: categorical features

{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
 
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
 
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}

How is it used?

  • Always include entire time series for training, testing, and inference

  • Use entire dataset as training set, remove last time points for testing. Evaluate on withheld values.

  • Don't use very large values for prediction length (> 400)

  • Train on many time series and not just one when possible
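
A sketch of preparing DeepAR train and test channels in JSON Lines, holding the last prediction_length points out of the training copy (the series values, dates, and file names are made up):

import json

prediction_length = 28
series = {
    "store_1": [12.0, 15.0, 11.0] * 100,   # made-up daily sales
    "store_2": [7.0, 9.0, 8.0] * 100,
}

with open("train.json", "w") as train_f, open("test.json", "w") as test_f:
    for i, (name, target) in enumerate(series.items()):
        # Test channel: the entire time series
        test_f.write(json.dumps({
            "start": "2024-01-01 00:00:00",
            "target": target,
            "cat": [i],
        }) + "\n")
        # Train channel: the same series with the last prediction_length points removed
        train_f.write(json.dumps({
            "start": "2024-01-01 00:00:00",
            "target": target[:-prediction_length],
            "cat": [i],
        }) + "\n")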

Important Hyperparameters

  • Context_length

    • Number of time points the model sees before making a prediction

    • Can be smaller than the seasonality period; the model automatically uses lagged values (up to about one year back) anyway

    • Can pick up seasonal trends based on historical data

  • Epochs

  • mini_batch_size

  • Learning_rate

  • Num_cells

Instance Types

  • Can use CPU or GPU

  • Single or multi machine

  • Start with CPU (C4.2xlarge, C4.4xlarge)

  • Move up to GPU if necessary

    • Only helps with larger models
  • CPU-only for inference

  • May need larger instances for hyperparameter tuning

    • During training you can use single or multiple machines, so it's easy to scale this out if you need to.

BlazingText

What's it for?

  • Text classification

    • Predict labels for a sentence

    • Useful in web searches, information retrieval

    • Supervised

  • Word2vec

    • Creates a vector representation of words

    • Semantically similar words are represented by vectors close to each other

    • This is called a word embedding

    • It is useful for NLP, but is not an NLP algorithm in itself!

    • Used in machine translation, sentiment analysis

    • Remember it only works on individual words, not sentences or documents

What training input does it expect?

  • For supervised mode (text classification):

    • One sentence per line

    • First "word" in the sentence is the string __label__ followed by the label

    • Notice the text is all lowercase with spaces added around all punctuation (see the preprocessing sketch after the examples)

__label__4 linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
  • Also accepts "augmented manifest text format"
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
 
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
  • For Word2Vec just wants a text file with one training sentence per line.
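
A rough sketch of the preprocessing described above for supervised mode - lowercasing and padding punctuation with spaces before prefixing the label (the label id and sentence are just examples):

import re

def to_blazingtext_line(label: int, sentence: str) -> str:
    # Lowercase, then put spaces around punctuation so each token is a separate "word"
    text = sentence.lower()
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {text}"

print(to_blazingtext_line(4, "Linux ready for prime time, Intel says."))
# __label__4 linux ready for prime time , intel says .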

How is it used?

  • Word2vec has multiple modes

    • Cbow (Continuous Bag of Words)

      • Actual order of words is thrown out.
    • Skip-gram

      • Predicts surrounding context words from a target word (word order matters)
    • Batch skip-gram

      • Can distribute computation over many CPU nodes

Important Hyperparameters

  • Word2vec:

    • Mode (batch_skipgram, skipgram, cbow)

    • Learning_rate

    • Window_size

    • Vector_dim

    • Negative_samples

  • Text classification:

    • Epochs

    • Learning_rate

    • Word_ngrams

    • Vector_dim

Instance Types

  • For cbow and skipgram, recommend a single ml.p3.2xlarge

    • Any single CPU or single GPU instance will work
  • For batch_skipgram, can use single or multiple CPU instances

    • Can scale horizontally this way
  • For text classification, C5 recommended if less than 2GB training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)

Object2Vec

What's it for?

  • Remember word2vec from Blazing Text? It's like that, but arbitrary objects

    • More general purpose

    • It creates low-dimensional dense embeddings of high-dimensional objects

  • It is basically word2vec, generalized to handle things other than words.

  • Compute nearest neighbors of objects

  • Can visualize clusters

  • Genre prediction

  • Recommendations (similar items or users)

Can see similar items in this embedding space.

What training input does it expect?

  • Data must be tokenized into integers

  • Training data consists of pairs of tokens and/or sequences of tokens

    • Sentence - sentence

      • Find relationship between things based on these paired attributes
    • Labels-sequence (genre to description?)

    • Customer-customer

    • Product-product

    • User-item

{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
 
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
 
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}

How is it used?

  • Process data into JSON Lines and shuffle it

  • Train with two input channels, two encoders, and a comparator

  • Encoder choices (for each input path):

    • Average-pooled embeddings

    • CNN's

    • Bidirectional LSTM

  • Comparator is followed by a feed-forward neural network

Important Hyperparameters

  • The usual deep learning ones...

    • Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
  • Enc1_network, enc2_network

    • Choose hcnn, bilstm, pooled_embedding

Instance Types

  • Can only train on a single machine (CPU or GPU, multi-GPU OK)

    • Ml.m5.2xlarge

    • Ml.p2.xlarge

    • If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge

  • Inference: use a GPU instance such as ml.p3.2xlarge

    • Use the INFERENCE_PREFERRED_MODE environment variable when setting up the image for inference

    • This environment variable lets you optimize for encoder embeddings (which is what Object2Vec produces) rather than classification or regression

Object Detection

What's it for?

  • Identify all objects in an image with bounding boxes

  • Detects and classifies objects with a single deep neural network

  • Classes are accompanied by confidence scores

  • Can train from scratch (perhaps using SageMaker Ground Truth for labeling), or use pre-trained models based on ImageNet

What training input does it expect?

  • RecordIO or image format (jpg or png)

  • With image format, supply a JSON file for annotation data for each image during training e.g.:

{
  "file": "your_image_directory/sample_image1.jpg",
  "image_size": [
    {
      "width": 500,
      "height": 400,
      "depth": 3
    }
  ],
  "annotations": [
    {
      "class_id": 0,
      "left": 111,
      "top": 134,
      "width": 61,
      "height": 128
    }
  ],
  "categories": [
    {
      "class_id": 0,
      "name": "dog"
    }
  ]
}

How is it used?

  • Takes an image as input, outputs all instances of objects in the image with categories and confidence scores

  • Uses a CNN with the Single Shot multibox Detector (SSD) algorithm

    • The base CNN can be VGG-16 or ResNet-50
  • Transfer learning mode / incremental training

    • Use a pre-trained model for the base network weights, instead of random initial weights

    • Continue training this model further

  • Uses flip, rescale, and jitter internally to avoid overfitting

Important Hyperparameters

  • Mini_batch_size

  • Learning_rate

  • Optimizer

    • Sgd, adam, rmsprop, adadelta

Instance Types

  • Use GPU instances for training (multi-GPU and multi-machine OK)

    • Ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge
  • Use CPU or GPU for inference

  • C5, M5, P2, P3 all OK

Image Classification

What is it for?

  • Assign one or more labels to an image

  • Doesn't tell you where objects are, just what objects are in the image

What training input does it expect?

  • Apache MXNet RecordIO

    • Not protobuf!

    • This is for interoperability with other deep learning frameworks.

  • Or, raw jpg or png images

  • Image format requires .lst files to associate image index, class label, and path to the image

  • Augmented Manifest Image Format enables Pipe mode

    • Allows it to stream that data in from S3 as opposed to copying everything over

Example of .lst file:

5 1 your_image_directory/train_img_dog1.jpg
1000 0 your_image_directory/train_img_cat1.jpg
22 1 your_image_directory/train_img_dog2.jpg

Augmented Manifest Image Format:

{"source-ref":"s3://image/filename1.jpg", "class":"0"}
 
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name": "cat", "type" : "groundtruth/image- classification"}}

How is it used?

  • ResNet CNN under the hood

  • Full training mode

    • Network initialized with random weights
  • Transfer learning mode

    • Initialized with pre-trained weights

    • The top fully-connected layer is initialized with random weights

    • Network is fine-tuned with new training data

  • Default image size is 3-channel 224x224 (the ImageNet convention)

Important Hyperparameters

  • The usual suspects for deep learning

    • Batch size, learning rate, optimizer choice
  • Optimizer-specific parameters

    • Weight decay, beta 1, beta 2, eps, gamma

Instance Types

  • GPU instances for training (P2, P3) Multi-GPU and multi-machine OK.

  • CPU or GPU for inference (C4, P2, P3)

Semantic Segmentation Algorithm

The Amazon SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.

What's it for?

  • Pixel-level object classification

  • Different from image classification - that assigns labels to whole images

  • Different from object detection - that assigns labels to bounding boxes

  • Useful for self-driving vehicles, medical imaging diagnostics, robot sensing

  • Produces a segmentation mask

What training input does it expect?

  • JPG Images and PNG annotations

  • For both training and validation

  • Label maps to describe annotations in English

  • Augmented manifest image format supported for Pipe mode.

  • JPG images accepted for inference

How is it used?

  • Built on MXNet Gluon and Gluon CV

  • Choice of 3 algorithms:

    • Fully-Convolutional Network (FCN)

    • Pyramid Scene Parsing (PSP)

    • DeepLabV3

  • Choice of backbones:

    • ResNet50

    • ResNet101

    • Both trained on ImageNet

  • Incremental training, or training from scratch, supported too

Important Hyperparameters

  • Epochs, learning rate, batch size, optimizer, etc

  • Algorithm

  • Backbone

Instance Types

  • More restrictive: Only GPU supported for training (P2 or P3) on a single machine only

    • Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, or ml.p3.16xlarge

    • Cannot be parallelized

  • Inference on CPU (C5 or M5) or GPU (P2 or P3)

Random Cut Forest (RCF) Algorithm

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

What's it for?

  • Anomaly detection

  • Unsupervised

  • Detect unexpected spikes in time series data

  • Breaks in periodicity

  • Unclassifiable data points

  • Assigns an anomaly score to each data point

  • Based on an algorithm developed by Amazon that they seem to be very proud of!

What training input does it expect?

  • RecordIO-protobuf or CSV

  • Can use File or Pipe mode in either case

  • Optional test channel for computing accuracy, precision, recall, and F1 on labeled data (anomaly or not)

    • The algorithm is unsupervised, but a labeled test set can be used for evaluation

How is it used?

  • Creates a forest of trees where each decision tree is a partition of the training data

    • It looks at the expected change in the complexity of the tree as a result of adding a point to it

    • If adding a new data point causes a lot of new branches to form, the point may be anomalous

    • In other words, if a decision tree needs many new branches to accommodate a data point, there is probably something unusual about that point.

  • Data is sampled randomly

  • Then trained

  • RCF shows up in Kinesis Analytics as well; it can work on streaming data too.

Important Hyperparameters

  • Num_trees

    • Increasing reduces noise
  • Num_samples_per_tree

    • Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
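
A minimal sketch using the SageMaker Python SDK's RandomCutForest estimator with these hyperparameters (role ARN, bucket names, and the toy data are placeholders):

import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    num_trees=100,
    num_samples_per_tree=256,   # ~1/256 of the data expected to be anomalous
    data_location="s3://my-bucket/rcf/input",    # placeholder bucket
    output_path="s3://my-bucket/rcf/output",
    sagemaker_session=session,
)

# record_set() converts a NumPy array to recordIO-protobuf and stages it in S3
values = np.random.rand(10000, 1).astype("float32")  # toy one-dimensional data
rcf.fit(rcf.record_set(values))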

Instance Types

  • Does not take advantage of GPUs

  • Use M4, C4, or C5 for training

  • ml.c5.xl for inference

Neural Topic Model (NTM) Algorithm

Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a latent representation because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

What's it for?

  • Organize documents into topics

  • Classify or summarize documents based on topics

  • It's not just TF/IDF

    • "bike", "car", "train", "mileage", and "speed" might classify a document as "transportation" for example (although it wouldn't know to call it that)
  • Unsupervised methods

    • Algorithm is "Neural Variational Inference"

What training input does it expect?

  • Four data channels

    • "train" is required - not really training though as it is unsupervised

    • "validation", "test", and "auxiliary" optional

  • recordIO-protobuf or CSV

  • Words must be tokenized into integers

    • You don't just pass in raw text; you first break each document into word tokens, convert them to integers, and also pass in a vocabulary file that maps those words to the numbers (see the sketch below)

    • The "auxiliary" channel is for the vocabulary

  • File or pipe mode
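
A toy sketch of the tokenization idea: map each vocabulary word to an integer and represent each document as counts of those integers (a real pipeline would also write the vocabulary file for the auxiliary channel):

from collections import Counter

docs = [
    "the bike and the car share the road",
    "train mileage and speed affect travel",
]

# Vocabulary mapping each word to an integer id (this would also be written
# out as the vocabulary file for the auxiliary channel)
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d.split()}))}

# Each document becomes a bag of integer tokens with counts
bows = [Counter(vocab[w] for w in d.split()) for d in docs]

print(vocab)
print(bows)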

How is it used?

  • You define how many topics you want

  • These topics are a latent representation based on top ranking words

  • One of two topic modeling algorithms in SageMaker - you can try them both!

Important Hyperparameters

  • Lowering mini_batch_size and learning_rate can reduce validation loss

    • At expense of training time
  • Num_topics

Instance Types

  • GPU or CPU

    • GPU recommended for training

    • CPU OK for inference

    • CPU is cheaper

LDA

What's it for?

  • Latent Dirichlet Allocation

  • Another topic modeling algorithm

    • Not deep learning
  • Unsupervised

    • The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
  • Can be used for things other than words

    • Cluster customers based on purchases

    • Harmonic analysis in music

What training input does it expect?

  • Train channel, optional test channel

    • If you want to measure accuracy

    • It is unsupervised, so doesn't really help

  • recordIO-protobuf or CSV

  • Need to tokenize everything

    • Integers coded for each word

    • Each document has counts for every word in vocabulary (in CSV format)

  • Pipe mode only supported with recordIO

How is it used?

  • Unsupervised; generates however many topics you specify

  • Optional test channel can be used for scoring results

    • Per-word log likelihood

    • Metric for measuring how well LDA does

  • Functionally similar to NTM, but CPU-based

  • Therefore maybe cheaper / more efficient

Important Hyperparameters

  • Num_topics

  • Alpha0

    • Initial guess for 'concentration parameter'

    • Smaller values generate sparse topic mixtures

    • Larger values (>1.0) produce uniform mixtures

Instance Types

  • Single-instance CPU training

    • Cannot parallelise training or use GPU

KNN

What's it for?

  • K-Nearest-Neighbors

  • Supervised

  • Simple classification or regression algorithm

  • Classification

    • Find the K closest points to a sample point and return the most frequent label
  • Regression

    • Find the K closest points to a sample point and return the average value

What training input does it expect?

  • Train channel contains your data

  • Test channel emits accuracy or MSE

    • If you want to measure it
  • recordIO-protobuf or CSV training

    • First column is label
  • File or pipe mode on either

How is it used?

  • SageMaker takes it to the next level

  • Data is first sampled

  • SageMaker includes a dimensionality reduction stage

    • Avoid sparse data ("curse of dimensionality")

    • At cost of noise / accuracy

    • "sign" or "fjlt" methods

  • Build an index for looking up neighbors

  • Serialize the model

  • Query the model for a given K

Important Hyperparameters

  • K!

  • Sample_size
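
A sketch of these settings on the SageMaker Python SDK's KNN estimator, including the dimensionality-reduction options mentioned above (role ARN and values are placeholders):

from sagemaker import KNN

knn = KNN(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    k=10,
    sample_size=200000,
    predictor_type="classifier",        # or "regressor"
    dimension_reduction_type="sign",    # or "fjlt"
    dimension_reduction_target=64,      # number of dimensions after reduction
)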

Instance Types

  • Training on CPU or GPU

    • Ml.m5.2xlarge

    • Ml.p2.xlarge

  • Inference

    • CPU for lower latency

    • GPU for higher throughput on large batches

K-Means Algorithm

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.

Amazon SageMaker uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker streams mini-batches (small, random subsets) of the training data.

What's it for?

  • Unsupervised clustering technique

  • Divide data into K groups, where members of a group are as similar as possible to each other

    • You define what "similar" means

    • Measured by Euclidean distance

  • SageMaker brings web-scale K-Means clustering

    • Doing K-Means at large scale has historically been difficult

    • SageMaker's implementation makes it practical at large scale

What training input does it expect?

  • Train channel, optional test

  • The train channel should use the ShardedByS3Key S3 distribution type; the test channel should use FullyReplicated

  • recordIO-protobuf or CSV

  • File or Pipe on either

How is it used?

  • Every observation mapped to n-dimensional space (n = number of features)

  • Works to optimize the center of K clusters

    • "extra cluster centers" may be specified to improve accuracy (which end up getting reduced to k)

    • K = k*x

      • Uses more clusters at the beginning and consolidates it down to the number you want over time
  • Algorithm:

    • Determine initial cluster centers

      • Random or k-means++ approach

      • K-means++ tries to make initial clusters far apart

    • Iterate over training data and calculate cluster centers

    • Reduce clusters from K to k

      • Using Lloyd's method with kmeans++ to do that reduction

Important Hyperparameters

  • K!

    • Choosing K is tricky

    • Plot within-cluster sum of squares as function of K

    • Use "elbow method"

    • Basically optimize for tightness of clusters

  • Mini_batch_size

  • Extra_center_factor

  • Init_method

    • Random or kmeans++
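
These hyperparameters map onto the built-in K-Means container roughly as follows (a sketch; the role ARN, bucket paths, and values are placeholders):

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
container = image_uris.retrieve("kmeans", session.boto_region_name)

kmeans = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/kmeans/output",  # placeholder bucket
    sagemaker_session=session,
)

kmeans.set_hyperparameters(
    k=10,                      # the number of clusters you actually want
    feature_dim=5,             # number of features per observation
    init_method="kmeans++",    # or "random"
    extra_center_factor=4,     # train with K = k * 4 centers, reduce to k
    mini_batch_size=500,
)

# Shard the training data across instances; keep the test channel fully replicated
train_input = TrainingInput(
    "s3://my-bucket/kmeans/train/",
    distribution="ShardedByS3Key",
)
kmeans.fit({"train": train_input})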

Instance Types

  • CPU or GPU, but CPU recommended

    • Only one GPU per instance is used on GPU, so use a p*.xlarge if you're going to use GPU

PCA

What's it for?

  • Principal Component Analysis

  • Dimensionality reduction

    • Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information

    • The reduced dimensions are called components

      • First component has largest possible variability

      • Second component has the next largest...

  • Unsupervised

What training input does it expect?

  • recordIO-protobuf or CSV

  • File or Pipe on either

How is it used?

  • Covariance matrix is created, then singular value decomposition (SVD)

  • Two modes

    • Regular

      • For sparse data and moderate number of observations and features
    • Randomized

      • For large number of observations and features

      • Uses approximation algorithm and scales better.
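
As a conceptual illustration only (plain NumPy, not SageMaker code): center the data, form the covariance matrix, and use SVD to get components ordered by explained variance.

import numpy as np

X = np.random.rand(200, 10)             # toy data: 200 observations, 10 features

X_centered = X - X.mean(axis=0)         # the "subtract_mean" step
cov = np.cov(X_centered, rowvar=False)  # covariance matrix of the features

# SVD of the covariance matrix; rows of Vt are the principal components,
# ordered by how much variance they explain
U, S, Vt = np.linalg.svd(cov)

num_components = 2
X_reduced = X_centered @ Vt[:num_components].T   # project onto the top components
print(X_reduced.shape)                           # (200, 2)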

Important Hyperparameters

  • Algorithm_mode

    • Regular or Randomised
  • Subtract_mean

    • Unbiases the data upfront

Instance Types

  • GPU or CPU

  • It depends "on the specifics of the input data"

Factorization Machines

What's it for?

  • Specializes in classification or regression with sparse data

    • Item recommendations

      • Predict an item a user may like; a given user doesn't interact with most items
    • Click prediction

      • A given user won't interact with most pages on the internet
    • Since an individual user doesn't interact with most pages / products, the data is sparse

  • Supervised Method

  • Classification or regression

    • For example, you might want to predict whether a user will like a product (classification), or predict the specific rating value they would assign to it (regression).
  • Limited to pair-wise interactions

    • User -> item for example

What training input does it expect?

  • recordIO-protobuf with Float32

  • Sparse data means CSV isn't practical

    • Most of each CSV row would just be commas (empty values) because the data is sparse

How is it used?

  • Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things (users & items?)

  • Usually used in the context of recommender systems

  • Factorization machines find factors of that matrix which can be multiplied together to fill in the missing entries - for example, given a matrix of items each user liked, estimate the ratings they would give to items they haven't seen yet (see the sketch below)
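
A conceptual sketch in plain NumPy (not SageMaker code) of the factorization machine score: a global bias, linear weights, and pairwise interactions computed from low-dimensional factor vectors.

import numpy as np

def fm_score(x, w0, w, V):
    """Factorization machine prediction for one (sparse) feature vector x.

    w0: global bias, w: linear weights (n,), V: factor matrix (n, k).
    The pairwise term uses the O(n*k) identity:
    sum_{i<j} <V_i, V_j> x_i x_j = 0.5 * sum_f ((V^T x)_f^2 - ((V^2)^T x^2)_f)
    """
    linear = w0 + w @ x
    interactions = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interactions

n, k = 6, 3                                      # 6 features, 3 latent factors
rng = np.random.default_rng(0)
x = np.array([1, 0, 0, 1, 0, 0], dtype=float)    # sparse one-hot user/item pair
print(fm_score(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))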

Important Hyperparameters

  • Initialization methods for bias, factors, and linear terms

    • Uniform, normal, or constant

    • Can tune properties of each method

Instance Types

  • CPU or GPU

  • CPU recommended

  • GPU only works with dense data

IP Insights

What's it for?

  • All about finding fishy behavior in your web logs

    • Security tool to flag up suspicious behavior
  • Unsupervised learning of IP address usage patterns

  • Identifies suspicious behavior from IP addresses

    • Identify logins from anomalous IP's

    • Identify accounts creating resources from anomalous IP's

What training input does it expect?

  • User names, account ID's can be fed in directly; no need to pre-process a lot

  • Training channel, optional validation (computes AUC score)

  • CSV only

    • Entity, IP
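
For illustration, the training CSV is simply one entity/IP pair per line (these values are made up):

user_alice,192.0.2.10
user_bob,198.51.100.7
service_account_1,203.0.113.42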

How is it used?

  • Uses a neural network to learn latent vector representations of entities and IP addresses.

    • It tries to learn what specific IP addresses typically do
  • Entities are hashed and embedded

    • Need sufficiently large hash size
  • Automatically generates negative samples during training by randomly pairing entities and IP's

    • Provides negative examples, since the training data itself contains only observed (positive) entity/IP pairs

Important Hyperparameters

  • Num_entity_vectors

    • Hash size

    • Set to twice the number of unique entity identifiers

  • Vector_dim

    • Size of embedding vectors

    • Scales model size

    • Too large results in overfitting

  • Epochs, learning rate, batch size, etc.

Instance Types

  • CPU or GPU

    • GPU recommended

    • Ml.p3.2xlarge or higher

    • Can use multiple GPU's

    • Size of CPU instance depends on vector_dim and num_entity_vectors

Reinforcement Learning

  • You have some sort of agent that "explores" some space

  • As it goes, it learns the value of different state changes in different conditions

  • Those values inform subsequent behavior of the agent

    • It learns, for a given position within the environment and a given set of surroundings, what the best thing to do is

    • Explores the space over time

  • Examples: Pac-Man, Cat & Mouse game (game AI)

    • Supply chain management

    • HVAC systems

    • Industrial robotics

    • Dialog systems

    • Autonomous vehicles

  • Yields fast on-line performance once the space has been explored

Q-Learning

  • A specific implementation of reinforcement learning

  • You have:

    • A set of environmental states s

    • A set of possible actions in those states a

    • A value of each state/action Q

  • Start off with Q values of 0

  • Explore the space

  • As bad things happen after a given state/action, reduce its Q

  • As rewards happen after a given state/action, increase its Q

 
Q(s, a) += alpha * (reward(s, a) + max(Q(s_prime)) - Q(s, a))
# where s is the previous state, a is the previous action, s_prime is the current state, and alpha is the discount factor (set to .5 here).

  • What are some state/actions here?

    • Pac-man has a wall to the West

    • Pac-man dies if he moves one step South

    • Pac-man just continues to live if going North or East

  • You can "look ahead" more than one step by using a discount factor when computing Q (here s is previous state, s' is current state)

    • We are 2 steps away from a power pill

    • So the Q value that I experience when I consume that power pill might actually give a boost to the previous Q values that I encountered along the way.

  • Q(s,a) += discount * (reward(s,a) + max(Q(s')) - Q(s,a))

The exploration problem

  • How do we efficiently explore all of the possible states?

    • Simple/naïve approach: always choose the action for a given state with the highest Q. If there's a tie, choose at random

    • But that's really inefficient, and you might miss a lot of paths that way

  • Better way: introduce an epsilon term

    • If a random number is less than epsilon, don't follow the highest Q, but choose at random

    • That way, exploration never totally stops

    • Lets us cover a much wider range of actions and states than we could otherwise

    • Choosing epsilon can be tricky
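
A toy sketch of epsilon-greedy Q-learning tying together the update rule and the exploration term described above (this is the standard form with a separate learning rate and discount factor; the environment itself is omitted):

import random
from collections import defaultdict

actions = ["N", "S", "E", "W"]
Q = defaultdict(float)          # Q[(state, action)], everything starts at 0
alpha, discount, epsilon = 0.5, 0.9, 0.1

def choose_action(state):
    # Exploration: with probability epsilon pick a random action,
    # otherwise exploit the action with the highest Q for this state
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move Q(s, a) toward the observed reward plus the discounted best future value
    Q[(state, action)] += alpha * (reward + discount * best_next - Q[(state, action)])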

Fancy Words

  • Markov Decision Process

    • From Wikipedia: Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.

    • Sound familiar? MDP's are just a way to describe what we just did using mathematical notation.

    • States are still described as s and s'

    • State transition functions are described as 𝑃𝑎(s, s')

    • Our "Q" values are described as a reward function 𝑅𝑎(s, s')

  • Even fancier words! An MDP is a discrete time stochastic control process.

Recap

  • You can make an intelligent Pac-Man in a few steps:

    • Have it semi-randomly explore different choices of movement (actions) given different conditions (states)

    • Keep track of the reward or penalty associated with each choice for a given state/action (Q)

      • Can even propagate those rewards and penalties backward multiple steps if we want to make it even better
    • Use those stored Q values to inform its future choices

  • Pretty simple concept. But hey, now you can say you understand reinforcement learning, Q-learning, Markov Decision Processes, and Dynamic Programming!

Reinforcement Learning in SageMaker

  • Built on deep learning frameworks: TensorFlow and MXNet

  • Supports the Intel Coach and Ray RLlib toolkits.

  • Custom, open-source, or commercial environments supported.

    • MATLAB, Simulink

    • EnergyPlus, RoboSchool, PyBullet

    • Amazon Sumerian, AWS RoboMaker

Distributed Training with SageMaker RL

  • Can distribute training and/or environment rollout

  • Multi-core and multi-instance

Key Terms

  • Environment

    • The layout of the board / maze / etc
  • State

    • Where the player / pieces are
  • Action

    • Move in a given direction, etc
  • Reward

    • Value associated with the action from that state
  • Observation (state of the environment right now)

    • i.e., surroundings in a maze, state of chess board

Hyperparameter Tuning

  • Parameters of your choosing may be abstracted

    • Can make your own if you want to
  • Hyperparameter tuning in SageMaker can then optimize them

Instance Types

  • No specific guidance given in developer guide

  • But, it's deep learning - so GPU's are helpful

  • And we know it supports multiple instances and cores

Incremental learning

Incremental learning is a machine learning (ML) technique for extending the knowledge of an existing model by training it further on new data. Both of the Amazon SageMaker built-in visual recognition algorithms - Image Classification and Object Detection - provide out-of-the-box support for incremental learning.