Feature Engineering

Feature engineering is the process of applying your knowledge of the data, and of the model you are using, to create better features to train that model with. This is covered in the AI Wiki.

AWS in particular talks a lot about its own Random Cut Forest algorithm, which is designed for outlier detection. It creeps into many AWS services, including QuickSight, Kinesis Analytics, SageMaker, and more.

SageMaker Ground Truth

Annotation consolidation helps to improve the accuracy of your data objects' labels. It combines the results of multiple workers' annotation tasks into one high-fidelity label. Automated data labelling uses machine learning to label portions of your data automatically, without having to send them to human workers.
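To make the consolidation idea concrete, here is a minimal sketch that merges several workers' labels by simple majority vote. This is an assumption chosen for illustration, not Ground Truth's actual consolidation algorithm, and the function name and sample data are hypothetical.

```python
from collections import Counter


def consolidate_labels(annotations):
    """Merge one data object's worker labels into a single label.

    Returns (label, agreement), where agreement is the fraction of
    workers who chose the winning label. Ties go to the label that
    appears first in the input list.
    """
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Hypothetical worker output: three workers labeled each image.
worker_labels = {
    "img-001": ["cat", "cat", "dog"],
    "img-002": ["dog", "dog", "dog"],
}

consolidated = {
    item_id: consolidate_labels(labels)
    for item_id, labels in worker_labels.items()
}
```

The agreement score gives a rough signal of which objects the workers disagreed on and might deserve a second look.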

The idea is that if you have missing data that can be inferred easily by a human, Ground Truth lets you farm these tasks out to actual human beings to fill in that missing data for you. Sometimes you don't have training data at all, and it needs to be generated by humans first.

Example: training an image classification model. Somebody needs to tag a bunch of images with what they are images of before you can train a neural network. Ground Truth manages the humans who will label your data for training purposes.

But it's more than that!

  • Ground Truth creates its own model as images are labeled by people
  • As this model learns, only images the model isn't sure about are sent to human labelers going forward
  • This can reduce the cost of labeling jobs by up to 70%
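The routing step in the list above can be sketched as a confidence threshold: predictions the model is sure about are accepted automatically, and the rest go to human labelers. This is only an illustration of the idea. The threshold value, function name, and sample predictions are assumptions, not how Ground Truth decides internally.

```python
def route_for_labeling(predictions, threshold=0.9):
    """Split model predictions into auto-labeled and human-review sets.

    predictions maps an item id to (predicted_label, confidence).
    Items at or above the threshold keep the model's label; the rest
    are returned as a list of ids to send to human labelers.
    """
    auto_labeled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= threshold:
            auto_labeled[item_id] = label
        else:
            needs_human.append(item_id)
    return auto_labeled, needs_human


# Hypothetical model output for three unlabeled images.
predictions = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.91),
}

auto_labeled, needs_human = route_for_labeling(predictions)
```

As the model improves, more items clear the threshold and fewer need a paid human label, which is where the cost saving comes from.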

Who are these human labelers?

  • Mechanical Turk
    • A huge workforce of people around the world who will label your data, and do other simple tasks, for a very small amount of money
  • Your own internal team
  • Professional labeling companies

Ground Truth Plus

  • Turnkey solution, with no tinkering required
  • "Our team of AWS Experts" manages the workflow and the team of labelers
    • You fill out an intake form
    • They contact you and discuss pricing
    • In effect, you hire someone at AWS to set it all up and manage the whole project for you
  • You track progress via the Ground Truth Plus Project Portal
  • You get the labeled data from S3 when the job is done

Other ways to generate training labels

  • Rekognition
    • AWS service for image recognition
    • Pretrained model
    • Automatically classifies images
  • Comprehend
    • AWS service for text analysis and topic modeling
    • Automatically classifies text by topic or sentiment
  • Any pre-trained model or unsupervised technique that may be helpful
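As a rough sketch of using these services to generate labels, the helpers below wrap the boto3 `detect_labels` (Rekognition) and `detect_sentiment` (Comprehend) calls. The helper names, bucket, key, and the confidence cutoff are hypothetical. The clients are passed in, so the functions can be tried with a stub before touching a real AWS account.

```python
def image_labels(rekognition, bucket, key, min_confidence=80.0):
    """Return label names Rekognition detects for an image stored in S3.

    `rekognition` is expected to be boto3.client("rekognition").
    The bucket, key, and 80% cutoff are illustrative assumptions.
    """
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
        MinConfidence=min_confidence,
    )
    return [label["Name"] for label in response["Labels"]]


def text_sentiment(comprehend, text, language_code="en"):
    """Return Comprehend's overall sentiment for a piece of text.

    `comprehend` is expected to be boto3.client("comprehend").
    """
    response = comprehend.detect_sentiment(
        Text=text, LanguageCode=language_code
    )
    return response["Sentiment"]


# With real credentials configured, usage might look like:
#   import boto3
#   labels = image_labels(boto3.client("rekognition"),
#                         "my-bucket", "photos/dog.jpg")
#   mood = text_sentiment(boto3.client("comprehend"), "I love this!")
```

Labels produced this way are machine-generated, so it is usually worth spot-checking a sample before using them as training data.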

Resources: