Amazon Athena
Serverless interactive queries of S3 data lake. This content overlaps with the AWS certified data analysis or a Big Data exam, but is more high level for the ML specialty exam.
-
Athena is an Interactive query service for S3 (SQL)
-
No need to load data, it stays in S3
-
Can have a data lake of csv files for example.
-
-
Under the hood, uses open-source Presto
-
Serverless
- No servers to manage
-
Supports many data formats
-
CSV (human readable)
-
JSON (human readable)
-
ORC (columnar, splittable)
-
Parquet (columnar, splittable)
-
Avro (splittable)
-
-
it lets you do SQL queries on structured, unstructured, or semi structured data in an S3 data lake
Some examples of usage include:
-
Allows for impromptu queries of web logs
-
Querying staging data before loading it into Redshift
-
Analyze CloudTrail / CloudFront / VPC / ELB etc logs in S3
-
Integration with Jupyter, Zeppelin, RStudio notebooks
-
Can run queries from within your notebooks
-
Good for analyzing data from there
-
-
Integration with QuickSight
- A Data Visualization Tool
-
Integration via ODBC / JDBC with other visualization tools
Athena & Glue
It's important to understand the relationship between Athena and AWS glue. If you're using glue to impart some structure to your S3 data lake and extract what some columns might be that correlate with the data that's sitting in there, Athena can take advantage of that.
The glue data catalogue can create a metadata repository across various services including Athena, it can crawl your data and S3, extract that schema from it and Athena can use that schema to issue SQL queries and come up with names for your columns.
Glue's fully managed ETL capabilities can be used to transform data, or convert it into columnar formats, to optimize the costs and improve the performance of your Athena queries. A typical pipeline might look like this:
you have data sitting in S3, you have a glue crawler that extracts the actual meaning of that data (its structure). Athena then sits on top of that, issues queries on it and you could turn around and feed those Athena queries into QuickSight to visualize them.
Athena cost model
-
Pay-as-you-go
-
$5 per TB scanned i.e., Cheap
-
Charged per amount of data scanned per query
-
you can keep that cost down by compressing your data by the way, so if you want to keep Athena cheap, you want to compress the data going into it.
-
Successful or cancelled queries count, failed queries do not.
-
No charge for DDL (CREATE/ALTER/DROP etc.)
-
-
Save LOTS of money by using columnar formats
-
it allows Athena to selectively read only the columns that it needs to process the data for that query.
-
ORC, Parquet in conjunction with Athena is a good idea
-
Save 30-90%, and get better performance
-
-
Glue and S3 have their own charges in addition to Athena
Athena Security
-
Access control
-
IAM, ACLs, S3 bucket policies
-
There are IAM policies for AmazonAthenaFullAccess / AWSQuicksightAthenaAccess
-
-
Can encrypt results at rest in an S3 staging directory
-
Have various means of encrypting data at your disposal:
-
Server-side encryption with S3-managed key (SSE-S3)
-
Server-side encryption with KMS key (SSE-KMS)
-
Client-side encryption with KMS key (CSE-KMS)
-
-
Cross-account access in S3 bucket policy possible
-
For in-transit data we have Transport Layer Security (TLS) used at all times to encrypt in- transit data between Athena and S3
What not to use Athena for
-
Highly formatted reports / visualization
- That's what QuickSight is for
-
ETL
- Use Glue instead
Athena is just for doing ad-hoc queries using SQL against your S3 data lake.