Cloud & MLOps ☁️
Full Data Engineering Pipeline Overview

Real-Time Layer

  1. We first have producers writing into a Kinesis Data Stream, and we can, for example, attach Kinesis Data Analytics to it to perform real-time analytics on the stream.

  2. We can use Lambda to read from that stream and react in real time, or we can have the output go to a new destination: another Kinesis Data Stream, or Kinesis Data Firehose if we want to ingest the data somewhere.

  3. If it goes into a Kinesis Data Stream, maybe we'll have an application on EC2 read off that stream and do some analytics as well; for machine learning, it could talk to the Amazon SageMaker service to do some real-time ML.

  4. If it goes into Kinesis Data Firehose, the data could be converted into ORC format, sent to Amazon S3, and then loaded into Redshift; or it could be sent in JSON format to Amazon Elasticsearch Service, for example.

  5. Also, we don't necessarily have to produce to Kinesis Data Streams; we can also produce directly to Kinesis Data Firehose and, from there, send the data in Parquet format into Amazon S3, for example. And, as we already know, we can connect Kinesis Data Firehose to Kinesis Data Analytics.

This just shows the many different combinations we can build with Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and so on.
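As a rough sketch of the "react in real time with Lambda" path above: Kinesis delivers records to Lambda base64-encoded, so the handler decodes each payload before acting on it. The record schema (a sensor reading with a `temperature` field) and the alert threshold are made-up assumptions for illustration.

```python
import base64
import json

def lambda_handler(event, context):
    """Sketch of a Lambda consumer for a Kinesis Data Stream.

    Kinesis hands Lambda base64-encoded record data; we assume each
    payload is a JSON document with a "temperature" field (hypothetical
    schema) and collect the readings that cross a threshold.
    """
    alerts = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("temperature", 0) > 100:  # hypothetical threshold
            alerts.append(payload)
    return {"alerts": alerts}

# A sample event shaped like a real Kinesis trigger payload:
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"sensor": "a", "temperature": 120}).encode()).decode()}},
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"sensor": "b", "temperature": 50}).encode()).decode()}},
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

In a real deployment the stream is attached to the function as an event source mapping, and Lambda polls the shards for you.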

Video Layer

  1. We have video producers, for example a camera or a DeepLens device, sending video into a Kinesis Video Stream.

    • Remember one stream per device.
  2. Maybe we'll have Rekognition read from that video stream and produce another Kinesis Data Stream full of metadata; from there we can have Amazon EC2, Kinesis Data Firehose, or Kinesis Data Analytics reading off that stream and doing whatever we need to do.

  3. Alternatively, we could have Amazon EC2 instances reading off the Kinesis Video Stream and talking to SageMaker to produce another Kinesis Data Stream, which may then be consumed by Lambda.

    • To react in real time to notifications, for example when something looks wrong in the video.
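The Rekognition step above can be sketched as a `CreateStreamProcessor` request (the face-search variant), wiring a Kinesis Video Stream in and a Kinesis Data Stream out. All ARNs, names, and the collection ID below are placeholders, not real resources.

```python
# Sketch of the request body for Rekognition Video's CreateStreamProcessor
# API. Every ARN, name, and ID here is a hypothetical placeholder.

def build_stream_processor_request(video_stream_arn, data_stream_arn,
                                   collection_id, role_arn):
    """Build a CreateStreamProcessor request: search faces in a Kinesis
    Video Stream and write match results to a Kinesis Data Stream."""
    return {
        "Name": "my-face-search-processor",  # hypothetical name
        "Input": {"KinesisVideoStream": {"Arn": video_stream_arn}},
        "Output": {"KinesisDataStream": {"Arn": data_stream_arn}},
        "Settings": {"FaceSearch": {"CollectionId": collection_id,
                                    "FaceMatchThreshold": 80.0}},
        "RoleArn": role_arn,
    }

request = build_stream_processor_request(
    "arn:aws:kinesisvideo:us-east-1:123456789012:stream/camera-1",
    "arn:aws:kinesis:us-east-1:123456789012:stream/rekognition-output",
    "my-face-collection",
    "arn:aws:iam::123456789012:role/RekognitionStreamRole",
)
# With boto3 this would be sent as:
#   boto3.client("rekognition").create_stream_processor(**request)
print(request["Output"])
```

Note the one-stream-per-device rule from above: each camera gets its own Kinesis Video Stream, so you would create one processor per stream.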

Batch Layer

  1. The batch layer is about taking data and transforming it, and so on.

  2. Say your data lives on-premises in MySQL; we first use Database Migration Service (DMS) to send it to RDS. This is a one-to-one mapping: there is no data transformation, it just replicates the data from on-premises to RDS.

  3. Then we want to analyze this data, so we'll use AWS Data Pipeline to take the data from RDS and place it in an Amazon S3 bucket in CSV format.

  4. We could do the exact same thing with DynamoDB, going through Data Pipeline and landing the data in JSON format in your Amazon S3 bucket.

  5. Then we need to transform the data with an ETL job into a format we like, so we can use Glue ETL and, at the very end, convert it into Parquet format in Amazon S3.

    • Say our jobs create many, many different files; we could add an AWS Batch task to clean up our S3 buckets once a day.

    • This would be a perfect use case for AWS Batch because Batch wouldn't transform the data; it would just clean up some files in your Amazon S3 buckets from time to time.

  6. Okay, so how do we orchestrate all these things? Step Functions would be great for that, to make sure the pipeline is reactive and responsive, and to track all the executions across the different services.

  7. Now that we have data in so many different places, it would be great to have the AWS Glue Data Catalog to know what data is where and what its schema is. So we will deploy crawlers on DynamoDB, RDS, S3, and so on to create that data catalog and keep it updated with the schemas.
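The crawler deployment in the last step could look roughly like this Glue `CreateCrawler` request, with one crawler covering S3, an RDS (JDBC) connection, and a DynamoDB table. The role ARN, database, paths, and connection names are hypothetical.

```python
# Sketch of a Glue crawler definition pointing at the S3 data lake, an
# RDS database (via a JDBC connection), and a DynamoDB table. All
# names, ARNs, and paths below are invented placeholders.

def build_crawler_request():
    return {
        "Name": "pipeline-data-crawler",  # hypothetical name
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "DatabaseName": "pipeline_catalog",  # target Data Catalog database
        "Targets": {
            "S3Targets": [{"Path": "s3://my-data-lake/parquet/"}],
            "JdbcTargets": [{"ConnectionName": "rds-mysql-connection",
                             "Path": "mydb/%"}],
            "DynamoDBTargets": [{"Path": "my-dynamodb-table"}],
        },
        # Re-crawl nightly so the catalog tracks schema changes.
        "Schedule": "cron(0 2 * * ? *)",
    }

request = build_crawler_request()
# With boto3: boto3.client("glue").create_crawler(**request)
print(sorted(request["Targets"]))
```

Each run updates table definitions in the catalog, so Athena, Redshift Spectrum, and Glue ETL jobs all see the current schemas.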

Analytics Layer

  1. Finally, there is the analytics layer. The data is in Amazon S3 and we want to analyze it, so we could use EMR, which gives us Hadoop, Spark, Hive, and so on.

  2. We could use Redshift or Redshift Spectrum to do data warehousing.

  3. Or we could use the Glue Data Catalog to have all the data in S3 indexed with its schema, and then use Amazon Athena in a serverless fashion to analyze the data directly in S3.

  4. Then, once we have analyzed the data, maybe we want to visualize it, so we'll use QuickSight to do visualization on top of Redshift or Athena.
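As a small sketch of the Athena path: a `StartQueryExecution` request runs SQL against a table registered in the Glue Data Catalog and writes results back to S3. The database, table, and bucket names below are invented for illustration.

```python
# Sketch of an Athena StartQueryExecution request over cataloged S3 data.
# Database, table, and bucket names are illustrative placeholders.

def build_athena_query(database, output_bucket):
    sql = (
        "SELECT sensor, avg(temperature) AS avg_temp "
        "FROM sensor_readings "  # hypothetical Glue Catalog table
        "GROUP BY sensor ORDER BY avg_temp DESC LIMIT 10"
    )
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": f"s3://{output_bucket}/athena-results/"},
    }

params = build_athena_query("pipeline_catalog", "my-query-results")
# With boto3: boto3.client("athena").start_query_execution(**params)
print(params["QueryExecutionContext"])
```

Since Athena is serverless, there is no cluster to manage; QuickSight can then point at the same table (or at the query results) for dashboards.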