Cloud & MLOps ☁️
SageMaker and Apache Spark

You can integrate SageMaker and Spark.

  • Pre-process data as normal with Spark

    • Popular for pre-processing data
    • Generate DataFrames
      • Load your data into a DataFrame in Spark; Spark distributes the processing of that DataFrame across the entire cluster, letting you manipulate and massage the data at scale.
  • Use sagemaker-spark library

    • Use the power of both

    • Lets you use SageMaker within a Spark driver script.

  • SageMakerEstimator

    • KMeans, PCA, XGBoost

    • The code looks a lot like normal Spark code if you're familiar with it, but instead of using a Spark MLlib implementation, you use a SageMakerEstimator and a SageMakerModel (see the sketch after this list).

    • The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution. You can't deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.

  • SageMakerModel to make inferences

  • Notebooks can use the SparkMagic (PySpark) kernel
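
A minimal sketch of this flow using the sagemaker_pyspark library's KMeansSageMakerEstimator; the S3 paths, IAM role ARN, instance types, and k below are illustrative placeholders:

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Put the sagemaker-spark JARs on the Spark classpath
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

# Pre-process as usual in Spark -- e.g. MNIST in LibSVM format
# (the S3 paths are placeholders)
train_df = (spark.read.format("libsvm")
            .option("numFeatures", "784")
            .load("s3://my-bucket/mnist/train/"))
test_df = (spark.read.format("libsvm")
           .option("numFeatures", "784")
           .load("s3://my-bucket/mnist/test/"))

# Looks like Spark MLlib code, but training runs as a SageMaker job
kmeans = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/MySageMakerRole"),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1)
kmeans.setK(10)
kmeans.setFeatureDim(784)

# fit() launches a SageMaker training job and returns a SageMakerModel
# backed by a hosted endpoint
model = kmeans.fit(train_df)

# transform() sends the features column to that endpoint for inference
predictions = model.transform(test_df)
predictions.show()
```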

Integrating SageMaker and Spark in Practice

  • Connect SageMaker notebook to a remote EMR cluster running Spark (or use Zeppelin)

  • The training DataFrame should have:

    • A features column that is a vector of Doubles

    • An optional labels column of Doubles

  • Call fit on your SageMakerEstimator to get a SageMakerModel

  • Call transform on the SageMakerModel to make inferences

  • Works with Spark Pipelines as well (see the sketch below)
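
A sketch of preparing that schema and dropping a SageMakerEstimator into a Spark Pipeline, reusing the spark session from the sketch above; the input path, column names, and XGBoost hyperparameters here are illustrative assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Hypothetical raw data with numeric columns plus a Double "label" column
raw_df = spark.read.parquet("s3://my-bucket/training-data/")

# Assemble the numeric columns into the single "features" vector column
# of Doubles that the SageMakerEstimator expects
assembler = VectorAssembler(
    inputCols=["age", "income", "tenure"],   # hypothetical column names
    outputCol="features")

xgboost = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/MySageMakerRole"),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1)
xgboost.setObjective("binary:logistic")
xgboost.setNumRound(50)

# A SageMakerEstimator is a standard Spark Estimator, so it slots
# straight into a Pipeline alongside ordinary Spark stages
pipeline = Pipeline(stages=[assembler, xgboost])

model = pipeline.fit(raw_df)           # the last stage trains on SageMaker
predictions = model.transform(raw_df)  # inference goes through the endpoint
```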

Why bother?

  • Allows you to combine pre-processing big data in Spark with training and inference in SageMaker.
  • EMR and SageMaker are now very tightly integrated.