AWS Data Pipeline

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up. In short, it moves data from one place to another, which makes it an ETL service.

  • Popular destinations include S3, RDS, DynamoDB, Redshift and EMR

  • Manages task dependencies

    • It is just an orchestrator

    • The actual ETL doesn't happen within Data Pipeline itself; it runs on EC2 instances (or EMR clusters) that Data Pipeline manages

  • As an orchestrator, it can retry failed tasks and send notifications on failure

  • Data sources may be on-premises

  • Highly available

    • Failures should be rare, and if one does occur, the work fails over to another instance
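
To make the orchestration bullets above concrete, here is a toy sketch of what "manage dependencies, retry, and notify" means in practice. It is not how AWS Data Pipeline is implemented; the function `run_workflow` and its parameters (`tasks`, `depends_on`, `max_retries`, `notify`) are hypothetical names chosen for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, depends_on, max_retries=2, notify=print):
    """Run tasks in dependency order, retrying failures and notifying on give-up.

    tasks:      dict mapping task name -> zero-argument callable
    depends_on: dict mapping task name -> set of task names that must succeed first
    Returns the names of the tasks that completed successfully, in run order.
    """
    completed = []
    failed = set()
    # TopologicalSorter takes {node: predecessors} and yields predecessors first.
    graph = {name: depends_on.get(name, set()) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        # A task only runs if everything it depends on succeeded.
        if any(dep in failed for dep in depends_on.get(name, ())):
            failed.add(name)
            notify(f"SKIPPED {name}: an upstream task failed")
            continue
        for attempt in range(1, max_retries + 2):  # first try plus retries
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception as exc:
                if attempt > max_retries:
                    failed.add(name)
                    notify(f"FAILED {name} after {attempt} attempts: {exc}")
    return completed


if __name__ == "__main__":
    # Hypothetical two-step workflow: "load" depends on "extract".
    order = run_workflow(
        tasks={"extract": lambda: None, "load": lambda: None},
        depends_on={"load": {"extract"}},
    )
    print(order)
```

The orchestrator itself does no data processing here: the callables stand in for the work that, in Data Pipeline, would run on the managed EC2 or EMR resources.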

Say you have an RDS database that contains a data set you would like to perform some machine learning on. To do that, you want to move it into an S3 bucket so that you can use, for example, SageMaker. How do we move data from RDS into S3?

Here, we would spin up an AWS Data Pipeline, and it would create one or more EC2 instances that Data Pipeline manages. These EC2 instances would be tasked with moving the data from RDS into the S3 bucket. This isn't limited to RDS: the same process works for DynamoDB, for example.
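
The RDS-to-S3 scenario above could be described to Data Pipeline roughly as follows. This is a minimal sketch, not a complete definition: the helper `to_pipeline_objects`, every object id, the table name, the bucket path, the instance id, and the SNS topic ARN are placeholders made up for illustration, and a real pipeline would also need IAM roles, database credentials, and a log location. The object types and field names follow the Data Pipeline object reference as best recalled, so check them against the official documentation before use.

```python
def to_pipeline_objects(definition):
    """Convert {object_id: {field: value}} into the list-of-objects shape
    accepted by boto3's datapipeline put_pipeline_definition call.

    A value written as {"ref": "SomeId"} becomes a refValue (a pointer to
    another pipeline object); anything else becomes a stringValue.
    """
    objects = []
    for object_id, fields in definition.items():
        converted = []
        for key, value in fields.items():
            if isinstance(value, dict) and "ref" in value:
                converted.append({"key": key, "refValue": value["ref"]})
            else:
                converted.append({"key": key, "stringValue": str(value)})
        objects.append({"id": object_id, "name": object_id, "fields": converted})
    return objects


# Placeholder values throughout; credentials and IAM roles are deliberately omitted.
RDS_TO_S3 = {
    "Default": {"scheduleType": "ondemand"},
    "SourceDatabase": {
        "type": "RdsDatabase",
        "rdsInstanceId": "example-rds-instance",
        "databaseName": "exampledb",
    },
    "SourceTable": {
        "type": "SqlDataNode",
        "database": {"ref": "SourceDatabase"},
        "table": "customers",
        "selectQuery": "select * from #{table}",
    },
    "TargetFolder": {
        "type": "S3DataNode",
        "directoryPath": "s3://example-ml-bucket/customers/",
    },
    # The EC2 instance that Data Pipeline launches to do the actual copy.
    "Worker": {"type": "Ec2Resource", "terminateAfter": "1 Hour"},
    "FailureAlarm": {
        "type": "SnsAlarm",
        "topicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
        "subject": "RDS to S3 copy failed",
        "message": "The copy activity failed after all retries.",
    },
    "CopyToS3": {
        "type": "CopyActivity",
        "input": {"ref": "SourceTable"},
        "output": {"ref": "TargetFolder"},
        "runsOn": {"ref": "Worker"},
        "maximumRetries": 3,
        "onFail": {"ref": "FailureAlarm"},
    },
}

# Submitting it would look roughly like this (needs boto3 and AWS credentials):
# import boto3
# client = boto3.client("datapipeline")
# pipeline_id = client.create_pipeline(name="rds-to-s3", uniqueId="rds-to-s3-demo")["pipelineId"]
# client.put_pipeline_definition(pipelineId=pipeline_id,
#                                pipelineObjects=to_pipeline_objects(RDS_TO_S3))
# client.activate_pipeline(pipelineId=pipeline_id)
```

Note how the definition mirrors the bullets earlier: the `CopyActivity` only describes what to move and where, the `Ec2Resource` is the instance that does the work, and retries and failure notifications are declared as fields rather than coded by hand.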

AWS Data Pipeline vs Glue

Glue:

  • Glue ETL: runs Apache Spark code (Scala or Python) and focuses on the ETL itself

  • Glue ETL: no need to worry about configuring or managing the resources

  • Data Catalog makes the data available to Athena or Redshift Spectrum

  • Here, all the resources belong to AWS rather than to your own account

Data Pipeline:

  • Orchestration service

  • More control over the environment, the compute resources that run the code, and the code itself

  • Allows access to EC2 or EMR instances (creates resources in your own account)

They're both ETL services, but Glue is heavily Apache Spark focused and concentrates purely on ETL, that is, on running the transformations. Data Pipeline gives you a bit more control: it is an orchestration service, and it runs on EC2 or EMR instances from within your own account.