Cloud & MLOps ☁️
Storage & Databases
Amazon S3

Amazon Simple Storage Service (S3)

When you modify a file in block storage, only the changed blocks are updated; when a file in object storage is modified, the entire object is rewritten. S3 provides object-level storage. It offers effectively unlimited storage space, although the maximum size of a single object is 5 TB. In general:

  • 11 9’s of durability (99.999999999% durability)
  • You can set permissions to control visibility of and access to your data.
  • You can also use the versioning feature to track changes to your objects over time.
  • With S3 you only pay for what you use. You can choose from a range of storage classes; to select the right fit, consider two factors: how often you plan to retrieve your data and how available you need it to be.
  • Buckets must have a globally unique name
  • Objects (files) have a Key. The key is the FULL path:
    • my_bucket/my_file.txt
    • my_bucket/my_folder1/another_folder/my_file.txt
  • This will become relevant when we look at partitioning
  • Max object size is 5 TB; larger files must be split into multiple objects
  • Object Tags (key/value pairs, up to 10) are useful for security / lifecycle management

Use cases for S3 are:

  • Backup and storage
  • Disaster Recovery
  • Archive
  • Hybrid Cloud storage
  • Application hosting
  • Media hosting
  • Data lakes & big data analytics
  • Software delivery
  • Static website

S3 for ML

S3 is also the backbone for many AWS ML services (e.g. SageMaker). It is useful for creating a Data Lake. Decoupling storage (S3) from compute (EC2, Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, and other AWS services) allows for a massively scalable Data Lake. It is a centralized architecture, with all your data in one place.

Data partitioning is a pattern for speeding up range queries (e.g. in Amazon Athena). You can partition, for example:

  • By Date: s3://bucket/my-data-set/year/month/day/hour/data_00.csv
  • By Product: s3://bucket/my-data-set/product-id/data_32.csv

This way you can quickly find the right data. You can define whatever partitioning strategy you like! Data partitioning is also handled for you by some of the tools we use (e.g. AWS Glue, Kinesis).
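
As a minimal sketch, assuming a placeholder bucket name and a local file called data_00.csv, here is how you might build a date-partitioned key and upload to it with boto3:

import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

# Build a date-partitioned key like my-data-set/2024/01/15/09/data_00.csv
# (Hive-style prefixes such as year=2024/month=01/... also work well with Athena/Glue)
now = datetime.now(timezone.utc)
key = f"my-data-set/{now:%Y}/{now:%m}/{now:%d}/{now:%H}/data_00.csv"

# Upload a local file under that partitioned key
s3.upload_file("data_00.csv", "my-example-bucket", key)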

S3 Features

Buckets

  • Amazon S3 allows people to store objects (files) in "buckets" (directories)
  • Buckets must have a globally unique name (across all regions all accounts)
  • Buckets are defined at the region level
  • S3 looks like a global service but buckets are created in a region
  • Naming convention
    • No uppercase
    • No underscore
    • 3-63 characters long
    • Not an IP
    • Must start with lowercase letter or number

Objects

  • Objects (files) have a Key
  • The key is the FULL path:
    • s3://my-bucket/my_file.txt
    • s3://my-bucket/my_folder1/another_folder/my_file.txt
  • The key is composed of prefix + object name
    • s3://my-bucket/my_folder1/another_folder/my_file.txt
  • There’s no concept of "directories" within buckets (although the UI will trick you to think otherwise)
  • Just keys with very long names that contain slashes ("/")
  • Object values are the content of the body:
    • Max Object Size is 5TB (5000GB)
    • If uploading more than 5GB, must use "multi-part upload"
  • Metadata (list of text key / value pairs - system or user metadata)
    • Tags (Unicode key / value pair - up to 10) - useful for security / lifecycle
    • Version ID (if versioning is enabled)
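
A minimal boto3 sketch of uploading an object with user metadata and tags; the bucket name, key, metadata, and tag values are placeholders:

import boto3

s3 = boto3.client("s3")

# Put a small object with user-defined metadata and two tags.
# Metadata keys are stored as x-amz-meta-* headers on the object;
# tags are passed as a URL-encoded query string.
s3.put_object(
    Bucket="my-example-bucket",
    Key="my_folder1/another_folder/my_file.txt",
    Body=b"hello world",
    Metadata={"project": "demo", "owner": "data-team"},
    Tagging="Department=Finance&Classification=Internal",
)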

Security

  • User based
    • IAM policies - which API calls should be allowed for a specific user from IAM console
  • Resource Based
    • Bucket Policies - bucket wide rules from the S3 console - allows cross account
    • Object Access Control List (ACL) - finer grain
    • Bucket Access Control List (ACL) - less common
  • Note: an IAM principal can access an S3 object if
    • the user IAM permissions allow it OR the resource policy ALLOWS it
    • AND there’s no explicit DENY
  • Encryption: encrypt objects in Amazon S3 using encryption keys

Lifecycle Configuration

  • You can transition objects between storage classes
  • For infrequently accessed objects, move them to STANDARD_IA
  • For archive objects you don't need in real time, rule of thumb is GLACIER or DEEP_ARCHIVE
  • Moving objects can be automated using a lifecycle configuration

S3 Lifecycle Rules

You can define:

Transition Actions: define when objects are transitioned to another storage class, e.g.

  • Move objects to Standard IA class 60 days after creation
  • Move to Glacier for archiving after 6 months

Expiration Actions: configure objects to expire (be deleted) after some time, e.g.

  • Access log files can be set to be deleted after 365 days
  • Can be used to delete old versions of files (if versioning is enabled)
    • If versioning is enabled, you can expire the older versions after a set amount of time
  • Can be used to delete incomplete multi part uploads

In addition:

  • Rules can be created for a certain prefix (e.g. s3://mybucket/mp3/*)
  • Rules can be created for certain objects tags (e.g. Department: Finance)

Lifecycle rules let you optimize storage costs for objects uploaded to S3.
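
A hedged boto3 sketch of rules like the ones above (transition to Standard-IA after 60 days and Glacier after roughly 6 months, expire access logs after a year, abort incomplete multi-part uploads); the bucket name and prefixes are assumptions:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                # Transition rule: Standard-IA after 60 days, Glacier after ~6 months
                "ID": "archive-old-data",
                "Filter": {"Prefix": "data/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 60, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            },
            {
                # Expiration rule: delete access logs after 365 days and
                # clean up incomplete multi-part uploads after 7 days
                "ID": "expire-access-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    },
)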

S3 Bucket Policies

  • JSON based policies
    • Resources: buckets and objects
    • Actions: Set of API to Allow or Deny
    • Effect: Allow / Deny
    • Principal: The account or user to apply the policy to
  • Use S3 bucket policies to:
    • Grant public access to the bucket
    • Force objects to be encrypted at upload
    • Grant access to another account (Cross Account)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicRead",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::examplebucket/*"]
    }
  ]
}
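
Assuming you want to apply such a policy programmatically rather than from the console, a minimal boto3 sketch (the bucket name is a placeholder, and Block Public Access must also permit public policies):

import json
import boto3

s3 = boto3.client("s3")

public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::my-example-bucket/*"],
        }
    ],
}

# Attach the policy to the bucket
s3.put_bucket_policy(Bucket="my-example-bucket", Policy=json.dumps(public_read_policy))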

Bucket settings for Block Public Access

  • Block all public access: On

    • Block public access to buckets and objects granted through new access control lists (ACLs): On
    • Block public access to buckets and objects granted through any access control lists (ACLs): On
    • Block public access to buckets and objects granted through new public bucket or access point policies: On
    • Block public and cross-account access to buckets and objects through any public bucket or access point policies: On
  • These settings were created to prevent company data leaks

  • If you know your bucket should never be public, leave these on

  • Can be set at the account level

Websites

  • S3 can host static websites and have them accessible on the web
  • The website URL will be: bucket-name.s3-website-AWS-region.amazonaws.com or bucket-name.s3-website.AWS-region.amazonaws.com
  • If you get a 403 (Forbidden) error, make sure the bucket policy allows public reads!
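
A minimal sketch of enabling static website hosting with boto3; the bucket name and document names are assumptions:

import boto3

s3 = boto3.client("s3")

# Turn on static website hosting; objects still need public read access
# (via a bucket policy) for the site to be reachable without a 403.
s3.put_bucket_website(
    Bucket="my-example-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)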

Versioning

  • You can version your files in Amazon S3
  • It is enabled at the bucket level
  • Same key overwrite will increment the "version": 1, 2, 3….
  • It is best practice to version your buckets
    • Protect against unintended deletes (ability to restore a version)
    • Easy roll back to previous version
  • Notes:
    • Any file that is not versioned prior to enabling versioning will have version "null"
    • Suspending versioning does not delete the previous versions
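
Enabling versioning is a single bucket-level call; a minimal boto3 sketch with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")

# Enable versioning; use Status="Suspended" to stop creating new versions
# (existing versions are kept either way).
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)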

Access Logs

  • For audit purpose, you may want to log all access to S3 buckets
  • Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
  • That data can be analyzed using data analysis tools…
  • Very helpful to drill down to the root cause of an issue, audit usage, view suspicious patterns, etc.

Replication (CRR & SRR)

  • Must enable versioning in source and destination
  • Cross Region Replication (CRR)
  • Same Region Replication (SRR)
  • Buckets can be in different accounts
  • Copying is asynchronous
  • Must give proper IAM permissions to S3
  • CRR - Use cases: compliance, lower latency access, replication across accounts
  • SRR - Use cases: log aggregation, live replication between production and test accounts
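
A hedged boto3 sketch of a replication rule; the bucket names and the IAM role ARN are placeholders, and both buckets must already have versioning enabled:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        # Role that S3 assumes to replicate objects on your behalf (hypothetical ARN)
        "Role": "arn:aws:iam::123456789012:role/my-s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                # Destination can be in another region (CRR) or the same region (SRR)
                "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
            }
        ],
    },
)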

Storage Classes

S3 offers several storage classes: Standard (General Purpose), Standard-IA, One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive, and Intelligent-Tiering.

You can move objects between classes manually or using S3 Lifecycle configurations.

S3 Durability and Availability

  • Durability:
    • High durability (99.999999999%, 11 9’s) of objects across multiple AZ
    • If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years
    • Same for all storage classes
  • Availability:
    • Measures how readily available a service is
    • Varies depending on storage class
    • Example: S3 standard has 99.99% availability = not available 53 minutes a year
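
As a rough sanity check on the 53-minute figure: a year has about 365 × 24 × 60 = 525,600 minutes, and 99.99% availability allows up to 0.01% downtime, i.e. 525,600 × 0.0001 ≈ 53 minutes per year.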

Standard General Purpose

Provides high availability for objects, at a higher cost.

  • 99.99% Availability
  • Used for frequently accessed data
  • Low latency and high throughput
  • Store data in a minimum of three Availability Zones.
  • Sustain 2 concurrent facility failures
  • Use Cases: Big Data analytics, mobile & gaming applications, content distribution…

Standard Infrequent Access (Standard-IA)

Ideal for data that is infrequently accessed but requires high availability when needed. It has the same durability as S3 Standard but a lower storage price (with a retrieval fee) and slightly lower availability.

  • Ideal for infrequently accessed data.
  • Similar to the S3 standard but lower storage price and higher retrieval price.
  • For data that is less frequently accessed, but requires rapid access when needed
  • Lower cost than S3 Standard
  • 99.9% Availability
  • Use cases: Disaster Recovery, backups

One Zone-Infrequent Access (One Zone-IA)

  • Store data in a single availability zone.
  • Lower storage price than S3 Standard-IA.
  • High durability (99.999999999%) within a single AZ; data is lost if that AZ is destroyed
  • 99.5% Availability
  • Use Cases: storing secondary backup copies of on-premises data, or data you can recreate

Glacier Storage Classes

These are low-cost object storage classes meant for archiving / backup. Pricing: price for storage + object retrieval cost. Depending on the class, retrieval takes anywhere from milliseconds to hours.

Glacier Instant Retrieval

  • Millisecond retrieval, great for data accessed once a quarter
  • Minimum storage duration of 90 days

Glacier Flexible Retrieval (formerly Glacier)

  • Retrieval options: Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours, free)
  • Minimum storage duration of 90 days

Glacier Deep Archive

  • Standard (12 hours), Bulk (48 hours)
  • Minimum storage duration of 180 days
  • Lowest-cost object storage class, ideal for archiving.

S3 Intelligent-Tiering

  • Ideal for data with unknown or changing access patterns.
  • Requires a small monthly monitoring and auto-tiering fee per object.
  • Moves objects automatically between Access Tiers based on usage
  • There are no retrieval charges in S3 Intelligent Tiering

S3 will move the objects for you.

  1. Frequent Access tier (automatic): default tier
  2. Infrequent Access tier (automatic): objects not accessed for 30 days
  3. Archive Instant Access tier (automatic): objects not accessed for 90 days
  4. Archive Access tier (optional): configurable from 90 days to 700+ days
  5. Deep Archive Access tier (optional): config. from 180 days to 700+ days

You can move between classes manually or using S3 Lifecycle configurations

Class Comparison

| | Standard | Intelligent-Tiering | Standard-IA | One Zone-IA | Glacier Instant Retrieval | Glacier Flexible Retrieval | Glacier Deep Archive |
|---|---|---|---|---|---|---|---|
| Durability | 11 9's (99.999999999%) | 11 9's | 11 9's | 11 9's | 11 9's | 11 9's | 11 9's |
| Availability | 99.99% | 99.9% | 99.9% | 99.5% | 99.9% | 99.99% | 99.99% |
| Availability SLA | 99.9% | 99% | 99% | 99% | 99% | 99.9% | 99.9% |
| Availability Zones | ≥ 3 | ≥ 3 | ≥ 3 | 1 | ≥ 3 | ≥ 3 | ≥ 3 |
| Min. Storage Duration Charge | None | None | 30 days | 30 days | 90 days | 90 days | 180 days |
| Min. Billable Object Size | None | None | 128 KB | 128 KB | 128 KB | 40 KB | 40 KB |
| Retrieval Fee | None | None | Per GB retrieved | Per GB retrieved | Per GB retrieved | Per GB retrieved | Per GB retrieved |

Price Comparison

Here is a price comparison from us-east-1:

| | Standard | Intelligent-Tiering | Standard-IA | One Zone-IA | Glacier Instant Retrieval | Glacier Flexible Retrieval | Glacier Deep Archive |
|---|---|---|---|---|---|---|---|
| Storage Cost (per GB per month) | $0.023 | $0.0025 – $0.023 | $0.0125 | $0.01 | $0.004 | $0.0036 | $0.00099 |
| Retrieval Cost (per 1,000 requests) | GET $0.0004, POST $0.005 | GET $0.0004, POST $0.005 | GET $0.001, POST $0.01 | GET $0.001, POST $0.01 | GET $0.01, POST $0.02 | GET $0.0004, POST $0.03; Expedited $10, Standard $0.05, Bulk free | GET $0.0004, POST $0.05; Standard $0.10, Bulk $0.025 |
| Retrieval Time | Instantaneous | Instantaneous | Instantaneous | Instantaneous | Instantaneous | Expedited (1–5 min), Standard (3–5 hours), Bulk (5–12 hours) | Standard (12 hours), Bulk (48 hours) |
| Monitoring Cost (per 1,000 objects) | – | $0.0025 | – | – | – | – | – |

Security: S3 Encryption for Objects

SSE stands for Server-Side Encryption. There are 4 methods of encrypting objects in S3:

  1. SSE-S3 (AES-256): encrypts S3 objects using keys handled & managed by AWS
    • S3's own managed data key is combined with the object from the request to encrypt it before it is added to the bucket.
    • We don't manage this key
  2. SSE-KMS: use AWS Key Management Service to manage encryption keys
    • Additional security (user must have access to KMS key)
    • Audit trail for KMS key usage
    • The key used to encrypt that object is the KMS Customer Master Key (CMK), which is something we can manage ourselves in AWS
  3. SSE-C: when you want to manage your own encryption keys
  4. Client-Side Encryption
    • Encrypt data outside of AWS before sending it to S3

From an ML perspective, SSE-S3 and SSE-KMS will most likely be used.

SSE-S3

We have an Amazon S3 bucket and we want to put an object in it. We send that object to S3 with an HTTP request; for SSE-S3, S3 takes its own managed data key plus the object, performs the encryption, and then adds the encrypted object to the bucket. This key is managed entirely by S3: we have no idea what it is, we don't manage it, and that's fine.

SSE-KMS

SSE-KMS follows the exact same pattern: we have our object and we want to insert it into S3, so we send it to S3 via HTTP, but the key used to encrypt that object is generated from the KMS Customer Master Key, and this KMS Customer Master Key is something we can manage ourselves within AWS.

So we have more control, and perhaps more safety and an increased level of security, by using this pattern. AWS will use this Customer Master Key (CMK) to encrypt the objects and put them into the bucket.
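
A minimal boto3 sketch of requesting SSE-S3 and SSE-KMS at upload time; the bucket, key, and KMS key alias are placeholders:

import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 encrypts the object with a key it fully manages (AES-256)
s3.put_object(
    Bucket="my-example-bucket",
    Key="models/artifact.bin",
    Body=b"example payload",
    ServerSideEncryption="AES256",
)

# SSE-KMS: S3 encrypts the object under a KMS key that you manage and can audit
s3.put_object(
    Bucket="my-example-bucket",
    Key="models/artifact.bin",
    Body=b"example payload",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-example-key",  # hypothetical key alias
)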

S3 Security & Bucket Policies

Security:

  1. User based:

    • IAM policies: which API calls should be allowed for a specific user
  2. Resource Based:

    • Bucket Policies: bucket-wide rules from the S3 console; allows cross-account access, etc.

    • Object Access Control List (ACL): finer grain

    • Bucket Access Control List (ACL): less common

S3 Bucket Policies:

  1. JSON based policies:

    • Resources: buckets and objects

    • Actions: Set of API to Allow or Deny

    • Effect: Allow / Deny

    • Principal: The account or user to apply the policy to

  2. Use S3 bucket for policy to:

    • Grant public access to the bucket

    • Force objects to be encrypted at upload

    • Grant access to another account (Cross Account)

Instead of using bucket policies, we can set default encryption. The old way to enable default encryption was to use a bucket policy and refuse any HTTP command without the proper encryption headers:
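
A hedged sketch of that "old way", applied with boto3: a bucket policy (placeholder bucket name) that denies PutObject requests carrying the wrong encryption header or none at all:

import json
import boto3

s3 = boto3.client("s3")

deny_unencrypted_uploads = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Reject uploads whose x-amz-server-side-encryption header is not AES256
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-example-bucket/*",
            "Condition": {"StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}},
        },
        {
            # Reject uploads that send no encryption header at all
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-example-bucket/*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}

s3.put_bucket_policy(
    Bucket="my-example-bucket", Policy=json.dumps(deny_unencrypted_uploads)
)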

The new way is to use the "default encryption" option in S3: with default encryption enabled, S3 encrypts every single object that is sent to the bucket.
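
And a sketch of the "new way" with boto3, turning on SSE-S3 default encryption for a placeholder bucket (swap in aws:kms plus a key ID for SSE-KMS):

import boto3

s3 = boto3.client("s3")

# Default encryption: any object uploaded without an encryption header
# is encrypted with SSE-S3 (AES-256) automatically.
s3.put_bucket_encryption(
    Bucket="my-example-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)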

Note: Bucket Policies are evaluated before "default encryption". You can choose default encryption for all the items in a bucket. Other things to remember for security:

  1. Networking: VPC (Virtual Private Cloud) Endpoint Gateway:

    • By default, when we use S3 we go over the public internet, so all our data goes over the public internet.

    • If you have private traffic within your VPC (Virtual Private Cloud), you generally want it to remain within that VPC.

    • For this we can create a VPC Endpoint Gateway for S3, which keeps the traffic within the VPC instead of going through the public web (see the sketch after this list).

    • Makes sure your private services (e.g. Amazon SageMaker) can access S3

    • Very important for the AWS ML Exam: if you want your S3 traffic to remain within your VPC and not go over the public web, creating a VPC Endpoint Gateway is the way to go

  2. Logging and Audit:

    • S3 access logs can be stored in other S3 bucket

    • API calls can be logged in AWS CloudTrail

    • Can see who did what and when

  3. Tagged Based (combined with IAM policies and bucket policies)

    • Example: Add tag Classification=PHI (personal health information) to your objects

    • Can restrict access to data with this tag using a bucket/IAM policy

    • We have tagged this data as quite sensitive
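
As referenced in the networking point above, a hedged boto3 sketch of creating a Gateway VPC endpoint for S3; the VPC ID, route table ID, and region in the service name are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint for S3: adds a route so S3 traffic from the VPC
# stays on the AWS network instead of going over the public internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # hypothetical VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",  # adjust to your region
    RouteTableIds=["rtb-0123456789abcdef0"],   # hypothetical route table
)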

S3 Object Lock & Glacier Vault Lock

  • S3 Object Lock
    • Adopt a WORM (Write Once Read Many) model
    • Block an object version deletion for a specified amount of time
  • Glacier Vault Lock
    • Adopt a WORM (Write Once Read Many) model
    • Lock the policy for future edits (can no longer be changed)
    • Helpful for compliance and data retention

Shared Responsibility Model for S3

| AWS | Customer |
|---|---|
| Infrastructure (global security, durability, availability, sustain concurrent loss of data in two facilities) | S3 Versioning, S3 Bucket Policies, S3 Replication Setup |
| Configuration and vulnerability analysis | Logging and Monitoring, S3 Storage Classes |
| Compliance validation | Data encryption at rest and in transit |

AWS Snow Family

  • Highly-secure, portable devices to collect and process data at the edge, and migrate data into and out of AWS
  • Data migration:
    • Snowcone
    • Snowball Edge
    • Snowmobile
  • Edge computing:
    • Snowcone
    • Snowball Edge

Data Migrations with AWS Snow Family

  • AWS Snow Family: offline devices to perform data migrations. If it takes more than a week to transfer over the network, use Snowball devices!

  • Challenges:

    • Limited connectivity
    • Limited bandwidth
    • High network cost
    • Shared bandwidth (can’t maximize the line)
    • Connection stability

Time to Transfer

| Data | 100 Mbps | 1 Gbps | 10 Gbps |
|---|---|---|---|
| 10 TB | 12 days | 30 hours | 3 hours |
| 100 TB | 124 days | 12 days | 30 hours |
| 1 PB | 3 years | 124 days | 12 days |
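
As a rough check on the first row: 10 TB is about 8 × 10^13 bits, so at 100 Mbps the raw transfer takes 8 × 10^13 / 10^8 = 800,000 seconds ≈ 9.3 days; the 12-day figure presumably allows for sustaining only around 80% of the line rate in practice.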

Snowball Edge (for data transfers)

  • Physical data transport solution: move TBs or PBs of data in or out of AWS
  • Alternative to moving data over the network (and paying network fees)
  • Pay per data transfer job
  • Provide block storage and Amazon S3-compatible object storage
  • Snowball Edge Storage Optimized
    • 80 TB of HDD capacity for block volume and S3 compatible object storage
  • Snowball Edge Compute Optimized
    • 42 TB of HDD capacity for block volume and S3 compatible object storage
  • Use cases: large data cloud migrations, DC decommission, disaster recovery

AWS Snowcone

  • Small, portable computing, anywhere, rugged & secure, withstands harsh environments
  • Light (4.5 pounds, 2.1 kg)
  • Device used for edge computing, storage, and data transfer
  • 8 TBs of usable storage
  • Use Snowcone where Snowball does not fit (space-constrained environment)
  • Must provide your own battery / cables
  • Can be sent back to AWS offline, or connect it to internet and use AWS DataSync to send data

AWS Snowmobile

  • Transfer exabytes of data (1 EB = 1,000 PB = 1,000,000 TBs)
  • Each Snowmobile has 100 PB of capacity (use multiple in parallel)
  • High security: temperature controlled, GPS, 24/7 video surveillance
  • Better than Snowball if you transfer more than 10 PB

| Properties | Snowcone | Snowball Edge Storage Optimized | Snowmobile |
|---|---|---|---|
| Storage Capacity | 8 TB usable | 80 TB usable | < 100 PB |
| Migration Size | Up to 24 TB, online and offline | Up to petabytes, offline | Up to exabytes, offline |

Usage Process

  1. Request Snowball devices from the AWS console for delivery
  2. Install the snowball client / AWS OpsHub on your servers
  3. Connect the snowball to your servers and copy files using the client
  4. Ship back the device when you’re done (goes to the right AWS facility)
  5. Data will be loaded into an S3 bucket
  6. Snowball is completely wiped

Edge Computing

What is Edge Computing?

  • Process data while it’s being created on an edge location
    • A truck on the road, a ship on the sea, a mining station underground...
  • These locations may have
    • Limited / no internet access
    • Limited / no easy access to computing power
  • We set up a Snowball Edge / Snowcone device to do edge computing
  • Use cases of Edge Computing:
    • Preprocess data
    • Machine learning at the edge
    • Transcoding media streams
  • Eventually (if need be) we can ship back the device to AWS (for transferring data for example)

Edge Computing Devices

  • Snowcone (smaller)
    • 2 CPUs, 4 GB of memory, wired or wireless access
    • USB-C power using a cord or the optional battery
  • Snowball Edge - Compute Optimized
    • 52 vCPUs, 208 GiB of RAM
    • Optional GPU (useful for video processing or machine learning)
    • 42 TB usable storage
  • Snowball Edge - Storage Optimized
    • Up to 40 vCPUs, 80 GiB of RAM
    • Object storage clustering available
  • All: Can run EC2 Instances & AWS Lambda functions (using AWS IoT Greengrass)
  • Long-term deployment options: 1 and 3 years discounted pricing

AWS OpsHub

  • Historically, to use Snow Family devices, you needed a CLI (Command Line Interface tool)
  • Today, you can use AWS OpsHub (a software you install on your computer / laptop) to manage your Snow Family Device
    • Unlocking and configuring single or clustered devices
    • Transferring files
    • Launching and managing instances running on Snow Family Devices
    • Monitor device metrics (storage capacity, active instances on your device)
    • Launch compatible AWS services on your devices (ex: Amazon EC2 instances, AWS DataSync, Network File System (NFS))

Hybrid Cloud for Storage

  • AWS is pushing for "hybrid cloud"
    • Part of your infrastructure is on-premises
    • Part of your infrastructure is on the cloud
  • This can be due to
    • Long cloud migrations
    • Security requirements
    • Compliance requirements
    • IT strategy
  • S3 is a proprietary storage technology (unlike EFS / NFS), so how do you expose the S3 data on-premises?
  • AWS Storage Gateway!

AWS Storage Gateway

  • Bridge between on-premises data and cloud data in S3
  • Hybrid storage service to allow on-premises to seamlessly use the AWS Cloud
  • Use cases: disaster recovery, backup & restore, tiered storage
  • Types of Storage Gateway:
    • File Gateway
    • Volume Gateway
    • Tape Gateway
  • No need to know the types in detail for the exam

Summary

  • Buckets vs Objects: global unique name, tied to a region
  • S3 security: IAM policy, S3 Bucket Policy (public access), S3 Encryption
  • S3 Websites: host a static website on Amazon S3
  • S3 Versioning: multiple versions for files, prevent accidental deletes
  • S3 Access Logs: log requests made within your S3 bucket
  • S3 Replication: same-region or cross-region, must enable versioning
  • S3 Storage Classes: Standard, IA, 1Z-IA, Intelligent, Glacier, Glacier Deep Archive
  • S3 Lifecycle Rules: transition objects between classes
  • S3 Glacier Vault Lock / S3 Object Lock: WORM (Write Once Read Many)
  • Snow Family: import data onto S3 through a physical device, edge computing
  • OpsHub: desktop application to manage Snow Family devices
  • Storage Gateway: hybrid solution to extend on-premises storage to S3