Cloud & MLOps ☁️
Storage & Databases
Databases & Analytics

Databases & Analytics

Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits. Sometimes, you want to store data in a database so you can structure the data. You build indexes to efficiently query / search through the data, you define relationships between your datasets. Databases are optimized for a purpose and come with different features, shapes and constraint

Types of Databases

Relational

Traditionally when we think of databases we often think of tables of data with headers, columns and rows. This is called a relational database - one where we can define set categories, put the data into established tables and define the relationship between them. Each table will represent a specific 'entity type', for example: users, orders and products. The reason for this is to make sure there is a separation of concerns between tables. Each table will have:

  • Headers known as Attributes
  • Rows named Records
  • A unique ID called a Key

Relational Model

With a relational database we can place constraints on an attribute to restrict the type of data that we are storing, for example ensuring height is stored as a number and restricting it to be between two realistic figures. We can also put a constraint in place to say if a piece of data is optional or not.

We put constraints in place to ensure that we maintain high quality data which will likely be easier to maintain over time.

The language used for accessing data in a relational database is called SQL.

Examples of SQL databases are SQLite (opens in a new tab), PostgreSQL (opens in a new tab) and MySQL (opens in a new tab).

Non-Relational

Relational databases are very useful if we are certain of the format of the data. However not all data will fit nicely into that kind of structure which has given rise to the non-relational database. A common type of non-relational database is a document data store, which looks a lot like a JavaScript object, typically stored in JSON. There is no mandatory way that each document is created, allowing for flexibility depending on the data we are storing. Much like a relational database though, each document will have a key so that it can be accessed uniquely.

Non-Relational Model

Non-relational databases are often referred to as NoSQL. Examples of NoSQL databases are MongoDB (opens in a new tab) and CouchDB (opens in a new tab).

Databases & Shared Responsibility on AWS

  • AWS offers use to manage different databases
  • Benefits include:
    • Quick Provisioning, High Availability, Vertical and Horizontal Scaling
    • Automated Backup & Restore, Operations, Upgrades
    • Operating System Patching is handled by AWS
    • Monitoring, alerting
  • Note: many databases technologies could be run on EC2, but you must handle yourself the resiliency, backup, patching, high availability, fault tolerance, scaling

Relational Databases on AWS

Amazon Relational Database Service (RDS)

RDS is a managed DB service for databases that use SQL as a query language. It enables you to run relational databases in the AWS Cloud Managed services that automate tasks such as:

  • Hardware provisioning

  • Database setup

  • Patching

  • Backups

  • Database Engines:

    • PostgreSQL
    • MySQL
    • Maria DB
    • Oracle
    • Microsoft SQL Server
    • Aurora (AWS Proprietary database)
      • Compatible with MySQL and PostgreSQL.
      • Five times faster than MySQL and three times faster than PostgreSQL.
      • Reduce unnecessary I/O operations.
      • Replicates six copies of your data across three availability zones.
      • Continuously backs up your data.

Advantage over using RDS versus deploying DB on EC2

  • RDS is a managed service:
    • Automated provisioning, OS patching
    • Continuous backups and restore to specific timestamp (Point in Time Restore)!
    • Monitoring dashboards
    • Read replicas for improved read performance
    • Multi AZ setup for DR (Disaster Recovery)
    • Maintenance windows for upgrades
    • Scaling capability (vertical and horizontal)
    • Storage backed by EBS (gp2 or io1)
  • BUT you can’t SSH into your instances

RDS Deployments: Read Replicas, Multi-AZ

Read ReplicasMulti-AZ
Scale the read workload of your DBFailover in case of AZ outage (high availability)
Can create up to 5 Read ReplicasData is only read/written to the main database
Data is only written to the main DBCan only have 1 other AZ as failover

Read Replicas Multi-AZ

RDS Deployments: Multi-Region

  • Multi-Region (Read Replicas)
    • Disaster recovery in case of region issue
    • Local performance for global reads
    • Replication cost

Multi-Region

Amazon Aurora

  • Aurora is a proprietary technology from AWS (not open sourced)
  • PostgreSQL and MySQL are both supported as Aurora DB
  • Aurora is "AWS cloud optimized" and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
  • Aurora storage automatically grows in increments of 10GB, up to 64 TB.
  • Aurora costs more than RDS (20% more) - but is more efficient
  • Not in the free tier

Amazon Redshift

Data Warehousing that you can use for big data analytics. It offers the ability to collect data from many sources and helps you to understand relationships, and trends across your data.

  • Redshift is based on PostgreSQL, but it’s not used for OLTP (Online Transactional Processing)
  • It’s OLAP - online analytical processing (analytics and data warehousing)
  • Load data once every hour, not every second
  • 10x better performance than other data warehouses, scale to PBs of data
  • Columnar storage of data (instead of row based)
  • Massively Parallel Query Execution (MPP), highly available
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as AWS Quicksight or Tableau integrate with it

Non-Relational Databases on AWS

DynamoDB

DynamoDB is a key-value database. It delivers single-digit millisecond performance at any scale.

  • Its serverless, which means that you do not have to provision, patch or manage servers.
  • You also do not have to install, maintain or operate the software.
  • Automatically scales and maintain performance.
  • Suitable choice for use cases that require high performance while scaling.
  • Fully Managed Highly available with replication across 3 AZ
  • Scales to massive workloads, distributed "serverless" database
  • Millions of requests per seconds, trillions of row, 100s of TB of storage
  • Fast and consistent in performance
  • Single-digit millisecond latency - low latency retrieval
  • Integrated with IAM for security, authorization and administration
  • Low cost and auto scaling capabilities
  • Standard & Infrequent Access (IA) Table Class

DynamoDB Accelerator (DAX)

Is an in-memory cache for DynamoDB, it helps to improve response times from single-digit milliseconds to microseconds.

  • Fully Managed in-memory cache for DynamoDB
  • 10x performance improvement - single- digit millisecond latency to microseconds latency - when accessing your DynamoDB tables
  • Secure, highly scalable & highly available
  • Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases

DynamoDB Global Tables

  • Make a DynamoDB table accessible with low latency in multiple-regions
  • Active-Active replication (read/write to any AWS Region)

DocumentDB

A document database that supports MongoDB workloads.

  • The way Aurora is an "AWS-implementation" of PostgreSQL / MySQL, DocumentDB is the same for MongoDB (which is a NoSQL database)
  • MongoDB is used to store, query, and index JSON data
  • Similar "deployment concepts" as Aurora
  • Fully Managed, highly available with replication across 3 AZ
  • Aurora storage automatically grows in increments of 10GB, up to 64 TB.
  • Automatically scales to workloads with millions of requests per seconds

Graph Databases

Amazon Neptune

A graph database service. You can use it to build and run applications that work with highly connected datasets, such as recommendations engines, fraud detection, and knowledge graphs.

  • Fully managed graph database
  • A popular graph dataset would be a social network
    • Users have friends
    • Posts have comments
    • Comments have likes from users
    • Users share and like posts…
  • Highly available across 3 AZ, with up to 15 read replicas
  • Build and run applications working with highly connected datasets - optimized for these complex and hard queries
  • Can store up to billions of relations and query the graph with milliseconds latency
  • Highly available with replications across multiple AZs
  • Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking

Other Databases

Amazon Quantum Ledger Database (QLDB)

  • QLDB stands for "Quantum Ledger Database"
  • A ledger is a book recording financial transactions
  • Fully Managed, Serverless, High available, Replication across 3 AZ
  • Used to review history of all the changes made to your application data over time
  • Immutable system: no entry can be removed or modified, cryptographically verifiable
  • 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
  • Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules

Amazon Managed Blockchain

You can use it to create and manage blockchain networks with open-source frameworks.

  • Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
  • Amazon Managed Blockchain is a managed service to:
    • Join public blockchain networks
    • Or create your own scalable private network
  • Compatible with the frameworks Hyperledger Fabric & Ethereum

Amazon ElastiCache

Add caching layers on top of your databases to help improve the read times of common requests. Supports Redis and Memcached.

  • ElastiCache is to get managed Redis or Memcached
  • Caches are in-memory databases with high performance, low latency
  • Helps reduce load off databases for read intensive workloads
  • AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup

Analytics

Amazon EMR

  • EMR stands for "Elastic MapReduce"
  • EMR helps creating Hadoop clusters (Big Data) to analyze and process vast amount of data
  • The clusters can be made of hundreds of EC2 instances
  • Also supports Apache Spark, HBase, Presto, Flink
  • EMR takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot instances
  • Use cases: data processing, machine learning, web indexing, big data

Amazon Athena

  • Serverless query service to analyze data stored in Amazon S3
  • Uses standard SQL language to query the files
  • Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
  • Pricing: $5.00 per TB of data scanned
  • Use compressed or columnar data for cost-savings (less scan)
  • Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc...
  • analyze data in S3 using serverless SQL, use Athena

Amazon QuickSight

  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Use cases:
    • Business analytics
    • Building visualizations
    • Perform ad-hoc analysis
    • Get business insights using data
  • Integrated with RDS, Aurora, Athena, Redshift, S3…

AWS Glue

  • Managed extract, transform, and load (ETL) service
  • Useful to prepare and transform data for analytics
  • Fully serverless service
  • Glue Data Catalog: catalog of datasets
    • can be used by Athena, Redshift, EMR

AWS Database Migration Service (DMS)

Enables you to migrate relational databases, non-relational databases, and other types of data store

  • Quickly and securely migrate databases to AWS, resilient, self healing
  • During the migration, your source database remains operational.
  • Other uses cases:
    • Development and test database migration.
    • Database consolidation.
    • Continuous Data Replication using CDC (Change Data Capture)
      • You must create an EC2 instance to perform the replication tasks
  • The source and target database can be of the same type or different types. It supports:
    • Homogeneous migrations: e.g. an Oracle database on premise to an Oracle database in the AWS cloud
    • Heterogeneous migrations: ex a Microsoft SQL server on premise, and you migrate to Aurora.