Cloud & MLOps ☁️

AWS Glue

When using the AWS Glue FindMatches ML Transform, the labeling file must be encoded as UTF-8 without a BOM. The AWS Glue DynamicFrame makes each record self-describing, so it can handle unknown or changing schemas.
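
As a minimal sketch of how a DynamicFrame copes with changing schemas, assuming a hypothetical `sensor_db` database and `readings` table already registered in the Data Catalog:

```python
# Runs inside a Glue job; the awsglue library is only available on Glue workers.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Each DynamicFrame record carries its own schema, so fields that vary
# across records do not break the read ("sensor_db" / "readings" are
# hypothetical Data Catalog names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings"
)

# resolveChoice settles fields whose type differs across records,
# e.g. a column that is sometimes a string and sometimes a number.
dyf = dyf.resolveChoice(specs=[("temperature", "cast:double")])
dyf.printSchema()
```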

Currently, Amazon DocumentDB is not supported as an input for AWS Glue crawlers.

Glue Data Catalog

  • The Glue Data Catalog is a metadata repository for all your tables in your account

    • Automated Schema Inference for all these tables

    • Schemas are versioned

    • The idea is that the Glue Data Catalog indexes, for example, all of your datasets within Amazon S3.

  • This metadata repository integrates with Athena or Redshift Spectrum, to do schema & data discovery

    • The idea is that you can take the schemas from the Glue Data Catalog and use them with your favorite data warehouse or serverless SQL query tool
  • Glue Crawlers can help build the Glue Data Catalog

    • The crawlers go through your data tables, databases, and S3 buckets, and figure out what data is there for you.

Crawlers

  • Crawlers go through your data to infer schemas and partitions
  • Works with JSON, Parquet, CSV, and relational data stores
  • Crawlers work for different data sources
    • S3, Amazon Redshift, Amazon RDS etc.
  • Run the Crawler on a Schedule or On Demand (a boto3 sketch follows this list)
  • Need an IAM role / credentials to access the data stores
  • Add a database and put the output there
    • The crawler will add tables based on the schema it has inferred
    • Will figure out partitioning schemes
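
As a hedged illustration, a crawler can be created and started with boto3; the role ARN, database, and bucket names below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# The IAM role must allow Glue to read the target S3 path (names are hypothetical).
glue.create_crawler(
    Name="sensor-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sensor_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/dataset/"}]},
    Schedule="cron(0 * * * ? *)",  # hourly; omit to run on demand only
)

# Kick off a run immediately rather than waiting for the schedule.
glue.start_crawler(Name="sensor-data-crawler")
```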

Glue and S3 Partitions

Glue also has the concept of partitions built in. The Glue crawler will extract partitions based on how your S3 data is organized. You therefore need to think up front about how you will organize your data, because the partitioning, and in turn query performance, will follow from that layout.

Example: devices send sensor data every hour. How should you organize the data?

  • Do you query primarily by time ranges? If so, organize your buckets as s3://my-bucket/dataset/yyyy/mm/dd/device
  • Do you query primarily by device? If so, organize your buckets as s3://my-bucket/dataset/device/yyyy/mm/dd
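
As a small sketch of the time-first layout (bucket and key names are hypothetical), writing objects under date prefixes is what lets the crawler register yyyy/mm/dd/device as partition columns:

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
now = datetime.now(timezone.utc)

# Time-first layout: a query filtered on a date range only scans the
# matching prefixes once the crawler has registered the partitions.
key = f"dataset/{now:%Y/%m/%d}/device-42/readings.json"
s3.put_object(Bucket="my-bucket", Key=key, Body=b'{"temperature": 21.5}')
```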

Glue ETL

  • Stands for Extract, Transform and Load.
    • It allows you to transform, clean, and enrich data (before doing analysis or training an ML model on it)
      • Input a table and it will output another table, in an S3 bucket for example
    • Generates ETL code in Python or Scala, which you can modify directly (see the job sketch after this list)
    • Can provide your own Spark or PySpark scripts
    • Targets can be S3, JDBC (RDS, Redshift), or tables in the Glue Data Catalog
  • Fully managed, cost effective; pay only for the resources consumed
  • Jobs are run on a serverless Spark platform
    • Do not need to pay for provisioning the Spark Cluster
  • Glue Scheduler to schedule the jobs
  • Glue Triggers to automate job runs based on "events"
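
A minimal PySpark job skeleton, similar to what Glue generates (the catalog and S3 path names are hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table that a crawler registered (hypothetical names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings"
)

# (Transform steps would go here; see the transformations below.)

# Load: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()
```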

Glue ETL provides transformations (a sketch using some of them follows this list). These transformations can be:

  • Bundled Transformations:

    • DropFields, DropNullFields - remove (null) fields
    • Filter - specify a function to filter records
    • Join - to enrich data
    • Map - add fields, delete fields, perform external lookups
  • Machine Learning Transformations:

    • FindMatches ML transformer: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
    • Can use ML to do some de-duplication
  • Format conversions: CSV, JSON, Avro, Parquet, ORC, XML

  • Apache Spark transformations (example: K-Means algorithm)
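
A sketch of two bundled transforms applied inside a Glue job; `dyf` is assumed to be a DynamicFrame read from the catalog as in the skeleton above:

```python
from awsglue.transforms import DropNullFields, Filter

# Filter: keep only records matching a predicate function.
warm = Filter.apply(frame=dyf, f=lambda row: row["temperature"] > 20.0)

# DropNullFields: drop fields that are null across the dataset.
cleaned = DropNullFields.apply(frame=warm)
```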

Athena Integration

There is good integration between Athena and Glue. You can see the Glue-generated database and the tables generated there. You can run SQL queries on your JSON data, Parquet data, etc., and Athena responds with what you need.

Athena allows you to run SQL commands directly against data that sits in S3, without provisioning any servers. But for that S3 data to appear in Athena, the Glue Data Catalog must contain the corresponding tables (built by the crawlers) so that Athena knows the schema of those tables and can query them.
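
For instance, once a crawler has registered a table, a query can be issued from boto3 (database, table, and output location are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# "sensor_db"/"readings" must already exist in the Glue Data Catalog,
# typically created by a crawler; names here are hypothetical.
resp = athena.start_query_execution(
    QueryString="SELECT device, COUNT(*) FROM readings GROUP BY device",
    QueryExecutionContext={"Database": "sensor_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution with this ID
```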

AWS Batch vs Glue

Glue:

  • Glue ETL - Run Apache Spark code (Scala or Python based); focus on the ETL
  • Glue ETL - Do not worry about configuring or managing the resources
  • Data Catalog to make the data available to Athena or Redshift Spectrum
  • Here all the resources belong to AWS

Batch:

  • For any computing job regardless of the job (must provide Docker image)
    • NOT just ETL.
    • For anything that is batch oriented
    • For example, a good fit for Batch is a cleanup task in an S3 bucket
    • Run a batch script every so often to clean up (see the sketch after this list)
  • Resources are created in your account, managed by Batch
    • You have access to those resources in your account
  • For any non-ETL related work, Batch is probably better
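
As a hedged sketch of the Batch side, submitting a containerized cleanup job looks like this (the job queue and job definition are assumed to already exist):

```python
import boto3

batch = boto3.client("batch")

# "s3-cleanup" is a hypothetical job definition wrapping a Docker image
# that deletes stale objects; "default-queue" must already be configured.
resp = batch.submit_job(
    jobName="nightly-s3-cleanup",
    jobQueue="default-queue",
    jobDefinition="s3-cleanup",
)
print(resp["jobId"])
```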

AWS DMS vs Glue

Glue:

  • Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
  • Glue ETL - Do not worry about configuring or managing the resources
  • Data Catalog to make the data available to Athena or Redshift Spectrum

AWS DMS:

  • Continuous Data Replication
    • Glue is batch oriented; the minimum batch interval is 5 minutes
  • No data transformation
  • Once the data is in AWS, you can use Glue to transform it
