Databases & Analytics

Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits. Sometimes, you want to store data in a database so you can structure the data. You build indexes to efficiently query / search through the data, you define relationships between your datasets. Databases are optimized for a purpose and come with different features, shapes and constraint

Types of Databases

Relational

Traditionally when we think of databases we often think of tables of data with headers, columns and rows. This is called a relational database - one where we can define set categories, put the data into established tables and define the relationship between them. Each table will represent a specific 'entity type', for example: users, orders and products. The reason for this is to make sure there is a separation of concerns between tables. Each table will have:

Headers known as Attributes
Rows named Records
A unique ID called a Key

Relational Model

With a relational database we can place constraints on an attribute to restrict the type of data that we are storing, for example ensuring height is stored as a number and restricting it to be between two realistic figures. We can also put a constraint in place to say if a piece of data is optional or not.

We put constraints in place to ensure that we maintain high quality data which will likely be easier to maintain over time.

The language used for accessing data in a relational database is called SQL.

Examples of SQL databases are SQLite (opens in a new tab), PostgreSQL (opens in a new tab) and MySQL (opens in a new tab).

Non-Relational

Relational databases are very useful if we are certain of the format of the data. However not all data will fit nicely into that kind of structure which has given rise to the non-relational database. A common type of non-relational database is a document data store, which looks a lot like a JavaScript object, typically stored in JSON. There is no mandatory way that each document is created, allowing for flexibility depending on the data we are storing. Much like a relational database though, each document will have a key so that it can be accessed uniquely.

Non-Relational Model

Non-relational databases are often referred to as NoSQL. Examples of NoSQL databases are MongoDB (opens in a new tab) and CouchDB (opens in a new tab).

Databases & Shared Responsibility on AWS

AWS offers use to manage different databases
Benefits include:
- Quick Provisioning, High Availability, Vertical and Horizontal Scaling
- Automated Backup & Restore, Operations, Upgrades
- Operating System Patching is handled by AWS
- Monitoring, alerting
Note: many databases technologies could be run on EC2, but you must handle yourself the resiliency, backup, patching, high availability, fault tolerance, scaling

Relational Databases on AWS

Amazon Relational Database Service (RDS)

RDS is a managed DB service for databases that use SQL as a query language. It enables you to run relational databases in the AWS Cloud Managed services that automate tasks such as:

Hardware provisioning
Database setup
Patching
Backups
Database Engines:
- PostgreSQL
- MySQL
- Maria DB
- Oracle
- Microsoft SQL Server
- Aurora (AWS Proprietary database)
  - Compatible with MySQL and PostgreSQL.
  - Five times faster than MySQL and three times faster than PostgreSQL.
  - Reduce unnecessary I/O operations.
  - Replicates six copies of your data across three availability zones.
  - Continuously backs up your data.

Advantage over using RDS versus deploying DB on EC2

RDS is a managed service:
- Automated provisioning, OS patching
- Continuous backups and restore to specific timestamp (Point in Time Restore)!
- Monitoring dashboards
- Read replicas for improved read performance
- Multi AZ setup for DR (Disaster Recovery)
- Maintenance windows for upgrades
- Scaling capability (vertical and horizontal)
- Storage backed by EBS (gp2 or io1)
BUT you can’t SSH into your instances

RDS Deployments: Read Replicas, Multi-AZ

Read Replicas	Multi-AZ
Scale the read workload of your DB	Failover in case of AZ outage (high availability)
Can create up to 5 Read Replicas	Data is only read/written to the main database
Data is only written to the main DB	Can only have 1 other AZ as failover

Read Replicas Multi-AZ

RDS Deployments: Multi-Region

Multi-Region (Read Replicas)
- Disaster recovery in case of region issue
- Local performance for global reads
- Replication cost

Multi-Region

Amazon Aurora

Aurora is a proprietary technology from AWS (not open sourced)
PostgreSQL and MySQL are both supported as Aurora DB
Aurora is "AWS cloud optimized" and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
Aurora storage automatically grows in increments of 10GB, up to 64 TB.
Aurora costs more than RDS (20% more) - but is more efficient
Not in the free tier

Amazon Redshift

Data Warehousing that you can use for big data analytics. It offers the ability to collect data from many sources and helps you to understand relationships, and trends across your data.

Redshift is based on PostgreSQL, but it’s not used for OLTP (Online Transactional Processing)
It’s OLAP - online analytical processing (analytics and data warehousing)
Load data once every hour, not every second
10x better performance than other data warehouses, scale to PBs of data
Columnar storage of data (instead of row based)
Massively Parallel Query Execution (MPP), highly available
Pay as you go based on the instances provisioned
Has a SQL interface for performing the queries
BI tools such as AWS Quicksight or Tableau integrate with it

Non-Relational Databases on AWS

DynamoDB

DynamoDB is a key-value database. It delivers single-digit millisecond performance at any scale.

Its serverless, which means that you do not have to provision, patch or manage servers.
You also do not have to install, maintain or operate the software.
Automatically scales and maintain performance.
Suitable choice for use cases that require high performance while scaling.
Fully Managed Highly available with replication across 3 AZ
Scales to massive workloads, distributed "serverless" database
Millions of requests per seconds, trillions of row, 100s of TB of storage
Fast and consistent in performance
Single-digit millisecond latency - low latency retrieval
Integrated with IAM for security, authorization and administration
Low cost and auto scaling capabilities
Standard & Infrequent Access (IA) Table Class

DynamoDB Accelerator (DAX)

Is an in-memory cache for DynamoDB, it helps to improve response times from single-digit milliseconds to microseconds.

Fully Managed in-memory cache for DynamoDB
10x performance improvement - single- digit millisecond latency to microseconds latency - when accessing your DynamoDB tables
Secure, highly scalable & highly available
Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases

DynamoDB Global Tables

Make a DynamoDB table accessible with low latency in multiple-regions
Active-Active replication (read/write to any AWS Region)

DocumentDB

A document database that supports MongoDB workloads.

The way Aurora is an "AWS-implementation" of PostgreSQL / MySQL, DocumentDB is the same for MongoDB (which is a NoSQL database)
MongoDB is used to store, query, and index JSON data
Similar "deployment concepts" as Aurora
Fully Managed, highly available with replication across 3 AZ
Aurora storage automatically grows in increments of 10GB, up to 64 TB.
Automatically scales to workloads with millions of requests per seconds

Graph Databases

Amazon Neptune

A graph database service. You can use it to build and run applications that work with highly connected datasets, such as recommendations engines, fraud detection, and knowledge graphs.

Fully managed graph database
A popular graph dataset would be a social network
- Users have friends
- Posts have comments
- Comments have likes from users
- Users share and like posts…
Highly available across 3 AZ, with up to 15 read replicas
Build and run applications working with highly connected datasets - optimized for these complex and hard queries
Can store up to billions of relations and query the graph with milliseconds latency
Highly available with replications across multiple AZs
Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking

Other Databases

Amazon Quantum Ledger Database (QLDB)

QLDB stands for "Quantum Ledger Database"
A ledger is a book recording financial transactions
Fully Managed, Serverless, High available, Replication across 3 AZ
Used to review history of all the changes made to your application data over time
Immutable system: no entry can be removed or modified, cryptographically verifiable
2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules

Amazon Managed Blockchain

You can use it to create and manage blockchain networks with open-source frameworks.

Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
Amazon Managed Blockchain is a managed service to:
- Join public blockchain networks
- Or create your own scalable private network
Compatible with the frameworks Hyperledger Fabric & Ethereum

Amazon ElastiCache

Add caching layers on top of your databases to help improve the read times of common requests. Supports Redis and Memcached.

ElastiCache is to get managed Redis or Memcached
Caches are in-memory databases with high performance, low latency
Helps reduce load off databases for read intensive workloads
AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup

Analytics

Amazon EMR

EMR stands for "Elastic MapReduce"
EMR helps creating Hadoop clusters (Big Data) to analyze and process vast amount of data
The clusters can be made of hundreds of EC2 instances
Also supports Apache Spark, HBase, Presto, Flink
EMR takes care of all the provisioning and configuration
Auto-scaling and integrated with Spot instances
Use cases: data processing, machine learning, web indexing, big data

Amazon Athena

Serverless query service to analyze data stored in Amazon S3
Uses standard SQL language to query the files
Supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
Pricing: $5.00 per TB of data scanned
Use compressed or columnar data for cost-savings (less scan)
Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc...
analyze data in S3 using serverless SQL, use Athena

Amazon QuickSight

Serverless machine learning-powered business intelligence service to create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Use cases:
- Business analytics
- Building visualizations
- Perform ad-hoc analysis
- Get business insights using data
Integrated with RDS, Aurora, Athena, Redshift, S3…

AWS Glue

Managed extract, transform, and load (ETL) service
Useful to prepare and transform data for analytics
Fully serverless service
Glue Data Catalog: catalog of datasets
- can be used by Athena, Redshift, EMR

AWS Database Migration Service (DMS)

Enables you to migrate relational databases, non-relational databases, and other types of data store

Quickly and securely migrate databases to AWS, resilient, self healing
During the migration, your source database remains operational.
Other uses cases:
- Development and test database migration.
- Database consolidation.
- Continuous Data Replication using CDC (Change Data Capture)
  - You must create an EC2 instance to perform the replication tasks
The source and target database can be of the same type or different types. It supports:
- Homogeneous migrations: e.g. an Oracle database on premise to an Oracle database in the AWS cloud
- Heterogeneous migrations: ex a Microsoft SQL server on premise, and you migrate to Aurora.

Amazon S3 IAM: Identity Access & Management