Cloud & MLOps ☁️

AWS Kinesis

  • AWS Kinesis is a streaming service and is a managed alternative to Apache Kafka
  • Great for gathering application logs, metrics, IoT information, clickstreams
  • Great for "real-time" big data
  • Great for streaming processing frameworks (Spark, NiFi, etc...)
  • Data is automatically and synchronously replicated across 3 AZs
    • So data is highly durable, even though this is a real-time service.

There are 4 services we need to be aware of under the Kinesis umbrella:

  1. Kinesis Streams: low latency streaming ingest at scale. Create real-time machine learning applications
  2. Kinesis Analytics: perform real-time analytics on streams using SQL language
  3. Kinesis Firehose: ETL tool to store the data. Loads streams into S3, Redshift, ElasticSearch & Splunk. Ingests massive amounts of data in near real time. Data can be preprocessed using Lambda.
  4. Kinesis Video Streams: meant for streaming video in real-time. Can be used for ML applications

From an architectural standpoint this is what Kinesis looks like:

Kinesis Streams onboards a lot of data coming from clickstreams, IoT devices, metrics, logs, etc. Then we may want to perform some real-time analytics on it, so we use Kinesis Analytics as a service. And then finally, we use Amazon Kinesis Firehose to take the data coming out of these analytics and insert it into Amazon S3 buckets or Amazon Redshift for data warehousing.

This is the general idea: we onboard real-time data, do some analytics in real time, then we put it into stores like S3 or Redshift to perform deeper analytics, reporting, etc.

AWS Kinesis Streams

Streams are made of shards that carry the data, and each record has a partition key that determines which shard it goes to. A stream has 1-500 shards by default (a soft limit). You interact with a stream via a producer library (KPL), a client library (KCL), or the API (AWS SDK); see the producer sketch after the list below. Streams are used mainly when you don't need to store the data long term, but rather want to run some transformations on top of the data or feed it to another AWS service. Retention is 24 hours by default and can be extended up to 365 days. Encryption at rest uses AWS KMS. Streams are divided into ordered Shards / Partitions:

  • For example, a stream can be made of three shards; your producers write records to these shards and your consumers read from them.
  • In Kinesis Stream, you have data retention at 24 hours by default, and this can go up to 365 days.
    • This gives you the ability to reprocess / replay data because it is not gone after you consume it.
  • Multiple applications can consume the same stream, which allows you to get a great pub/sub type of architecture.
    • Once the data is inserted into the Kinesis stream, it can't be deleted (immutability)
  • Records can be up to 1MB in size.
    • Kinesis Data Streams is a great fit for small records moving quickly through your stream, but not for petabyte-scale batch analysis
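To make this concrete, here is a minimal producer sketch using the AWS SDK for Python (boto3). The stream name and payload are hypothetical; the partition key is what determines the shard a record lands on.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# The partition key is hashed to choose a shard; records sharing a key
# stay ordered within that shard.
response = kinesis.put_record(
    StreamName="clickstream",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "event": "click"}).encode("utf-8"),
    PartitionKey="user-42",
)
print(response["ShardId"], response["SequenceNumber"])
```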

Kinesis Data Streams: Capacity Modes

Provisioned mode:

  • You choose the number of shards provisioned, and scale them manually or using the API

  • Each shard gets 1MB/s in (or 1000 records per second)

  • Each shard gets 2MB/s out (classic or enhanced fan-out consumer)

  • You pay per shard provisioned per hour

    • Better if you can plan the capacity you will need in advance

On-demand mode:

  • No need to provision or manage the capacity

    • the capacity will be adjusted over time on-demand.
  • Default capacity provisioned (4 MB/s in or 4000 records per second)

  • Scales automatically based on observed throughput peak during the last 30 days

  • Pay per stream per hour & data in/out per GB

    • This is a different pricing model

    • Better if you don't know the capacity in advance
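As a rough sketch of how the two capacity modes differ at creation time (stream names are hypothetical; `StreamModeDetails` is part of the current Kinesis API, so check that your SDK version supports it):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you choose the shard count and pay per shard-hour.
kinesis.create_stream(
    StreamName="clickstream-provisioned",  # hypothetical
    ShardCount=3,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count to manage; capacity follows observed throughput.
kinesis.create_stream(
    StreamName="clickstream-on-demand",  # hypothetical
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```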

Kinesis Data Streams Limits to know

  • Producers:

    • Can send 1MB/s or 1000 messages/s at write time PER SHARD

    • "ProvisionedThroughputException" otherwise if you go over this

  • Consumer Classic:

    • Can read at 2MB/s PER SHARD across all consumers

    • Maximum 5 API calls per second PER SHARD across all consumers

      • What that means is that if you want more and more capacity in your streams, you need to add shards.

      • So Kinesis data streams only scales if you add shards over time.

  • Data Retention:

    • 24 hours data retention by default

    • Can be extended to 365 days
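As a sketch of what hitting the per-shard write limit looks like in practice, a producer can catch the throughput error and back off before retrying. The stream name and backoff values below are hypothetical.

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_retry(data: bytes, partition_key: str, retries: int = 5):
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="clickstream",  # hypothetical
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
                # Over 1 MB/s or 1000 records/s on this shard: back off and retry.
                time.sleep(2 ** attempt * 0.1)
            else:
                raise
    raise RuntimeError("put_record kept exceeding shard throughput")
```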

Kinesis Streams is the go-to option for real-time streaming applications.

Kinesis Data Firehose

Firehose is used to deliver data into target destinations. Producers can be a bunch of applications that send data directly into Kinesis Data Firehose, or Kinesis Data Firehose can read from your Kinesis data stream, from Amazon CloudWatch, AWS IoT, etc.

But the most common setup you're going to see is Firehose reading from Kinesis Data Streams.

It will read the records one by one, up to one megabyte at a time. Then, if you want to transform the records, a Lambda function can be created to apply some modification to each record (a sketch of such a function follows below), and then Kinesis Data Firehose will try to fill a big batch of data before writing it into a target database or destination. That's why these are called batch writes: it doesn't write records instantaneously; it batches them so they are written efficiently. Therefore, Kinesis Data Firehose is a near real-time service.
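Here is a minimal sketch of such a transformation Lambda. Firehose hands the function a batch of base64-encoded records and expects each one back with a `recordId`, a `result` status, and re-encoded `data`; the field being modified is hypothetical.

```python
import base64
import json

def lambda_handler(event, context):
    """Sketch of a Firehose transformation Lambda: upper-case one field."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["event"] = payload.get("event", "").upper()  # hypothetical transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```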

In terms of destinations, we have S3, Amazon Redshift (to write to Redshift, Kinesis data firehose will first write into Amazon S3 and then issue a copy command into Redshift) or Amazon ElasticSearch.

Other destinations include third party destinations from partners such as Datadog, Splunk, New Relic, and MongoDB, but there could be others coming up in the future.

Or also, there could be a custom destination as long as there is a valid HTTP endpoint as an API. You can have firehose deliver data into that destination.

Finally, if processing fails, or if you want to archive all data going through Firehose, it is possible to send either all of the data or just the failed data into a backup S3 bucket from Kinesis Data Firehose.

Firehose is a:

  • Fully Managed Service, no administration

  • Near Real Time (60 seconds latency minimum for non full batches)

  • Data Ingestion into Redshift / Amazon S3 / ElasticSearch / Splunk

    • Redshift is a data warehousing service

    • ElasticSearch is a whole service to index data

    • Splunk is an external third party service

  • Automatic scaling

    • No capacity to specify in advance
  • Supports many data formats

  • Data Conversions from CSV / JSON to Parquet / ORC (only for S3)

  • Data Transformation through AWS Lambda (ex: CSV → JSON)

  • Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)

  • You only pay for the amount of data going through Firehose
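Pulling several of these bullets together, here is a hedged sketch of creating a delivery stream that reads from a Kinesis data stream and batches GZIP-compressed records into S3. All names and ARNs are placeholders.

```python
import boto3

firehose = boto3.client("firehose")

# Delivery stream that reads from an existing Kinesis data stream and
# batches records into S3, compressed with GZIP.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-write-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},  # near real time
        "CompressionFormat": "GZIP",
    },
)
```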

Kinesis Data Firehose Delivery Diagram:

We have our source, our delivery stream, and then an optional Lambda function to perform some data transformation if we need to. We send the output to Amazon S3, and maybe it will even go all the way to Redshift via a COPY command. The source records, transformation failures, and delivery failures can all be sent to another Amazon S3 bucket, which gives us a way to recover from failures.

Kinesis Data Streams vs Firehose

Streams

  • For building real time services

  • You write custom code for the producer & consumer, which allows you to create real-time applications.

  • Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)

  • Automatic scaling with On-demand Mode of Kinesis streams

  • Data Storage for the stream is 1 to 365 days

    • replay capability - replay based on the data retained in the stream

    • multi consumers applications

    • multiple real-time applications reading from that stream

Firehose

  • it is a delivery service - an INGESTION service

  • Fully managed, send to S3, Splunk, Redshift, ElasticSearch

  • Serverless data transformations with Lambda

  • Near real time (lowest buffer time is 1 minute)

  • Automated Scaling

    • Don't need to provision capacity in advance
  • No data storage

  • Can convert record format into Apache Parquet or Apache ORC

    • Data in Apache Parquet or Apache ORC format is typically more efficient to query than JSON. Kinesis Data Firehose can convert your JSON-formatted source records using a schema from a table defined in AWS Glue. For records that aren't in JSON format, create a Lambda function that converts them to JSON using the Transform source records with AWS Lambda feature described above.
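As a sketch of what that conversion looks like when configuring the S3 destination via the API, the fragment below assumes a (hypothetical) Glue database and table already describe the schema; it would be placed inside `ExtendedS3DestinationConfiguration` when creating the delivery stream.

```python
# Fragment of ExtendedS3DestinationConfiguration enabling JSON -> Parquet conversion.
# Database, table, and role names are hypothetical.
data_format_conversion = {
    "Enabled": True,
    "SchemaConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role",
        "DatabaseName": "analytics_db",      # hypothetical Glue database
        "TableName": "clickstream_events",   # hypothetical Glue table
        "Region": "us-east-1",
    },
    "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
    "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
}
```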

Kinesis Data Analytics

Now that we have put our data into Kinesis, we may want to perform some real-time analytics on it, and for this we can use Kinesis Analytics. Conceptually, Kinesis Data Analytics takes data either from Kinesis Data Streams or from Kinesis Data Firehose.

We process it with SQL, and the output can then be sent to analytics tools or output destinations.

In more detail:

Input streams can be either a Kinesis data stream or a Kinesis Data Firehose, and these feed the inputs of our Kinesis Analytics application. Here we're able to set a SQL statement to define how we want to change or modify that stream: perform some aggregation, some counts, some windowing, and so on. We are also able to join it with reference data, and that reference table comes straight from an Amazon S3 bucket.

So if we had some reference data in this S3 bucket, we could join the input stream against a lookup reference table. The application does a lot of the work for us, and out of it we get an output stream and an error stream. The output stream can go into many different places: Kinesis Data Streams, Kinesis Data Firehose, or Lambda, and through Kinesis Data Firehose we can send it all the way to Amazon S3 or Redshift. A sketch of a simple SQL application follows below.
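As a sketch of what such a SQL application can look like, the snippet below counts events per type over a 60-second tumbling window; `SOURCE_SQL_STREAM_001` is the default in-application input stream name, and the column names are hypothetical.

```python
# SQL for a Kinesis Data Analytics application, held in a Python string so it
# can be pasted into the console or passed as ApplicationCode via the API.
TUMBLING_WINDOW_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (event_type VARCHAR(32), event_count INTEGER);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "event_type", COUNT(*) AS event_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "event_type",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""
```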

Use cases:

  • Streaming ETL: select columns, make simple transformations, on streaming data

    • Streaming ETL (Extract, Transform, Load) is the processing and movement of real-time data from one place to another; ETL is short for the database functions extract, transform, and load (definition from Informatica's "What is ETL? (Extract Transform Load)").

  • Can reduce the size of our data sets by selecting columns, making simple transformations

  • Continuous metric generation: live leaderboard for a mobile game

  • Responsive analytics: look for certain criteria and build alerting (filtering the input data set and looking for anomalies)

Features:

  • Pay only for resources consumed (but it's not cheap)

  • Serverless; scales automatically.

    • No capacity planning is needed
  • Use IAM permissions to access streaming source and destination(s)

  • SQL or Flink to write the computation

    • Apache Flink applications: they are very powerful, actually way more powerful than the SQL applications; you can do a lot more with them.

    • Apache Flink is now becoming a standard in the streaming analytics space.

  • Schema discovery

  • Lambda can be used for pre-processing

Machine Learning on Kinesis Data Analytics

There are two algorithms you can run directly in Kinesis Data Analytics, and you need to remember them:

  1. RANDOM_CUT_FOREST

    • Anomaly detection on a numeric column in a stream.

    • For example, if most of the data points sit in the center of a plot but a few points lie far outside it, those points look like anomalies, and the random cut forest algorithm will detect them.

    • You need to remember that random cut forest is an algorithm that adapts over time, so it only uses recent history to compute the model

  2. HOTSPOTS:

    • locates and returns information about relatively dense regions in your data

    • Example: a collection of overheated servers in a data center

    • Not as ever-changing as RANDOM_CUT_FOREST

These two algorithms are very different: the first one detects anomalies, the second one detects dense areas. Remember that the first one uses only recent history, so its model is ever-changing, while the second one is less ever-changing and lets you detect dense locations. A sketch of how RANDOM_CUT_FOREST is invoked in SQL follows below.
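As a sketch of how RANDOM_CUT_FOREST is invoked (following the pattern in the Kinesis Analytics SQL reference), the snippet below adds an anomaly score to each record; the input column name is hypothetical.

```python
# RANDOM_CUT_FOREST sketch: each input record comes back with an ANOMALY_SCORE
# column appended. The "metric" column is hypothetical.
ANOMALY_DETECTION_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("metric" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM *
  FROM TABLE(RANDOM_CUT_FOREST(
      CURSOR(SELECT STREAM "metric" FROM "SOURCE_SQL_STREAM_001")));
"""
```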

Kinesis Video Streams

  • Producers:

    • Security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds, images, RADAR data, RTSP camera etc.
  • Convention is to have one producer per video stream

    • If you have 1000 cameras, you will have 1000 video streams
  • Video playback capability

    • Can show live feed to applications / users
  • Consumers

    • build your own (MXNet, TensorFlow)

    • AWS SageMaker

    • Amazon Rekognition Video

  • Can keep data for 1 hour to 10 years

Kinesis Video Streams use case from the Amazon blog:

  1. The Kinesis video stream is consumed in real time by an application running in a Docker container on Fargate

    • Could be EC2 as well
  2. That application checkpoints the progress of its stream consumption into DynamoDB; because it is a consumer, it wants to make sure that if the Docker container is stopped, it can resume from the same point of consumption

  3. Then all the frames that have been decoded by the application are sent to SageMaker for machine learning-based inference.

  4. The application then publishes all the inference results into a Kinesis data stream that we have created.

  5. The Kinesis data stream can be consumed by a Lambda function for example to get notification in real time.

So with this kind of architecture, we're able to apply machine learning algorithms in real time to a video stream and convert it into tangible, actionable data in a Kinesis data stream, so that our other applications can consume that stream and do whatever is needed, for example send notifications.
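To close the loop on step 5, here is a minimal sketch of a Lambda function consuming the inference results from the Kinesis data stream; the payload fields are hypothetical.

```python
import base64
import json

def lambda_handler(event, context):
    """Consume inference results from the Kinesis data stream (step 5).

    Lambda delivers Kinesis records base64-encoded; decode each one and emit
    a (hypothetical) notification when a detection looks interesting.
    """
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("label") == "person" and payload.get("confidence", 0) > 0.9:
            print(f"Notify: person detected on stream {payload.get('stream_name')}")
```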