Feature Engineering

This is the process of applying your knowledge of the data - and the model you're using - to create better features to train your model with.

  • How do I handle missing data? Unbalanced data? Outliers?
  • Which features should I use?
  • Do I need to transform these features in some way?
  • Should I create new features from the existing ones?
    • Perhaps the numerical trends in the data for a given feature are better represented by taking the log or the square of it (see the sketch after this list)
    • Maybe you're better off taking several features and combining them mathematically into one to reduce your dimensionality.
  • You can't just throw in raw data and expect good results
  • This is the art of machine learning; where expertise is applied
    • "Applied machine learning is basically feature engineering" - Andrew Ng

Imputing Missing Data

Mean Replacement

  • Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)

  • Fast & easy, won't affect mean or sample size of overall data set

  • Median may be a better choice than mean when outliers are present

    • These outliers will skew the mean
  • But it's generally pretty terrible.

    • Only works on column level, misses correlations between features

      • If there is a relationship between age and income, you may miss it - e.g., it could say a 10-year-old is earning 50K a year. It is a very naïve approach.
    • Can't use on categorical features (imputing with most frequent value can work in this case, though)

    • Not very accurate - a ham-handed attempt at imputation
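
A minimal sketch of column-level mean/median imputation with scikit-learn's SimpleImputer (the DataFrame here is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [10, 35, np.nan, 52], "income": [0, 40_000, 65_000, np.nan]})

# strategy="median" is often safer than "mean" when outliers are present;
# strategy="most_frequent" is the usual fallback for categorical columns
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```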

Dropping

  • If:

    • not many rows contain missing data

    • dropping those rows doesn't bias your data

      • What if there's an actual relationship between which rows are missing data and some other attribute of those observations?

      • E.g., looking at income again: people with very high or very low incomes might be more likely to not report it

      • By dropping all of those observations, you're removing a lot of people with very high or low incomes from your model, which could have a very bad effect on the accuracy of the model you end up with.

    • And you don't have a lot of time

    • Then it could be a reasonable thing to do.

  • But it's never going to be the right answer for the "best" approach.

  • Almost anything is better

  • Can you substitute another similar field perhaps? (i.e., review summary vs. full text)
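
If you do go the dropping route, a quick pandas sketch (hypothetical DataFrame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [10, 35, np.nan], "income": [0, np.nan, 65_000]})

# Check how much data you'd lose, and whether what's missing looks biased
print(df.isna().mean())          # fraction of missing values per column

# Drop any row containing at least one missing value
cleaned = df.dropna()
print(len(df), "->", len(cleaned), "rows")
```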

Machine Learning

The thing you probably really want to do in production is to use machine learning itself to impute your missing data for your machine learning training.

  • KNN: Find K "nearest" (most similar) rows and average their values

    • Assumes numerical data, not categorical

    • There are ways to handle categorical data (Hamming distance), but categorical data is probably better served by Deep Learning

  • Deep Learning

    • Build a machine learning model to impute data for your machine learning model

    • Works well for categorical data. Really well

    • But it's complicated.

    • Deep learning is better suited to imputing categorical data; a numerical feature like square footage is better served by KNN. While simply dropping rows of missing data or imputing mean values is a lot easier, it won't give the best results.

  • Regression

    • Find linear or non-linear relationships between the missing feature and other features

    • Most advanced technique: MICE (Multiple Imputation by Chained Equations)
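
A sketch of machine-learning-based imputation with scikit-learn: KNNImputer averages the K most similar rows, and IterativeImputer is a regression-style imputer in the spirit of MICE (the columns are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age": [10, 35, np.nan, 52, 47],
    "income": [0, 40_000, 65_000, np.nan, 80_000],
    "sqft": [np.nan, 1_200, 1_500, 2_000, 1_800],
})

# KNN: fill each missing value with the mean of the 2 most similar rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)

# Regression/MICE-style: model each incomplete feature as a function of the others
mice_filled = IterativeImputer(random_state=0).fit_transform(df)
print(knn_filled, mice_filled, sep="\n")
```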

Just Get More Data

What's better than imputing data? Getting more real data! Sometimes you just have to try harder or collect more data so that you don't have to worry about the rows with missing values at all. Again, be careful that dropping data doesn't bias your data set in some way, but really, the best way to deal with not having enough data is to get more of it.

Handling Unbalanced Data

What is unbalanced data?

  • When we have a large discrepancy between "positive" and "negative" cases in our training data, e.g., fraud detection.

    • Fraud is rare, and most rows will be non-fraudulent

    • This makes it hard to build a model that can identify fraud, because it has so few fraud data points to learn from compared to all of the non-fraud ones. It's very easy for a model to notice that fraud only happens, say, 0.01 percent of the time, just predict "not fraud" all the time, and report awesome accuracy.

    • Don't let the terminology confuse you; "positive" doesn't mean "good". It means the thing you're testing for is what happened.

    • If your machine learning model is made to detect fraud, then fraud is the positive case.

  • Mainly a problem with neural networks

Oversampling

Duplicate samples from the minority class - just fabricate more of your minority cases by making copies of existing samples from that class. Can be done at random.

Undersampling

Instead of creating more minority (positive) samples, remove majority (negative) ones. In the case of fraud, we'd be talking about removing some of those non-fraudulent cases to balance things out a little bit.

Throwing data away is usually not the right answer though, unless you are specifically trying to avoid "big data" scaling issues.
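
A sketch of naive random oversampling and undersampling with pandas (the tiny fraud data set here is made up for illustration):

```python
import pandas as pd

# Hypothetical, heavily unbalanced training set
df = pd.DataFrame({
    "amount": [10, 25, 40, 12, 9_500, 33, 18, 8_800],
    "is_fraud": [0, 0, 0, 0, 1, 0, 0, 1],
})

majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Oversampling: duplicate minority rows at random until the classes are balanced
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=42),
])

# Undersampling: throw away majority rows at random instead
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=42),
    minority,
])
```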

SMOTE

  • Synthetic Minority Over-sampling TEchnique

  • Artificially generates new samples of the minority class using nearest neighbors

    • Run K-nearest-neighbors of each sample of the minority class

    • Create a new sample from the KNN result (mean of the neighbors)

  • So instead of just naively making copies of existing minority-class samples, this actually fabricates new ones based on averages of their neighbors

  • Both generates new samples and undersamples majority class

  • Generally better than just oversampling
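
A sketch using the SMOTE implementation from the imbalanced-learn package on a toy unbalanced data set:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy problem: roughly 95% negative, 5% positive
X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority samples from each minority point's nearest neighbors
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```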

Adjusting thresholds

  • When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you'll flag something as the positive case (fraud)

    • A simpler approach is just adjusting the threshold at inference time, when you're actually applying your model to your data
  • If you have too many false positives, one way to fix that is to simply increase that threshold.

    • Guaranteed to reduce false positives

    • But, could result in more false negatives

    • You need to think about the cost of a false positive versus a false negative, and choose your threshold accordingly.
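
A sketch of adjusting the decision threshold on top of any classifier that exposes predicted probabilities (the 0.9 cutoff is just an arbitrary example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

# Probability of the positive (e.g., fraud) class for each sample
proba = model.predict_proba(X)[:, 1]

# model.predict() effectively uses a 0.5 threshold; raising the threshold
# trades fewer false positives for potentially more false negatives
flagged_default = proba >= 0.5
flagged_strict = proba >= 0.9
print(flagged_default.sum(), "vs", flagged_strict.sum(), "samples flagged")
```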

Handling Outliers

Mathematical Background

Variance measures how "spread-out" the data is. This is usually used as a way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered unusual. You can talk about how extreme a data point is by talking about "how many sigmas" away from the mean it is.
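
For reference, population variance and standard deviation are defined as (the sample versions divide by N - 1 instead of N):

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
```

A data point is "k sigmas" from the mean when its absolute distance from the mean equals k times sigma.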

Dealing with Outliers

Sometimes it's appropriate to remove outliers from your training data. Do this responsibly! Understand why you are doing it.

For example: in collaborative filtering, a single user who rates thousands of movies could have a big effect on everyone else's ratings. That may not be desirable. Another example: in web log data, outliers may represent bots or other agents that should be discarded. But if someone really wants the mean income of US citizens for example, don't toss out billionaires just because you want to.

Standard deviation provides a principled way to classify outliers: find data points more than some multiple of a standard deviation from the mean in your training data. What multiple? You just have to use common sense.
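
A minimal NumPy sketch of sigma-based filtering (the 3-sigma cutoff here is just a common starting point, not a rule):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 plausible incomes plus one extreme outlier
incomes = np.append(rng.normal(50_000, 10_000, size=100), 10_000_000)

mean, sigma = incomes.mean(), incomes.std()

# Keep points within 3 standard deviations of the mean; the multiple is a judgment call
mask = np.abs(incomes - mean) <= 3 * sigma
print(len(incomes), "->", mask.sum(), "points kept")
```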

AWS's Random Cut Forest algorithm creeps into many of its services - it is made for outlier detection. It is found within QuickSight, Kinesis Analytics, SageMaker, and more.

Binning

  • Transforms numeric data to ordinal data.

  • Bucket observations together based on ranges of values.

  • Example: estimated ages of people

    • Put all 20-somethings in one classification, 30-somethings in another, etc.
  • Quantile binning categorizes data by their place in the data distribution

    • Ensures that every one of your bins has an equal number of samples within them.
  • Especially useful when there is uncertainty in the measurements

    • If the measurements aren't precise, you aren't adding any information by saying this person is 20.234 years old.
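
A sketch of both fixed-range binning and quantile binning with pandas:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 34, 42, 47, 55, 63, 71])

# Fixed ranges: 20-somethings, 30-somethings, etc.
decades = pd.cut(ages, bins=[10, 20, 30, 40, 50, 60, 70, 80])

# Quantile binning: every bin gets (roughly) the same number of samples
quartiles = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"age": ages, "decade": decades, "quartile": quartiles}))
```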

Transforming Data

Encoding

  • Transforming data into some new representation required by the model

  • E.g., One-hot encoding

    • Create "buckets" for every category

    • The bucket for your category has a 1, all others have a 0

    • Very common in deep learning, where categories are represented by individual output "neurons"
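
A one-hot encoding sketch with pandas (the category values are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One "bucket" column per category: a 1 marks the row's category, 0 everywhere else
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)
```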

Scaling / Normalization

  • Some models prefer feature data to be normally distributed around 0 (most neural nets)

  • Most models require feature data to at least be scaled to comparable values

    • Otherwise features with larger magnitudes will have more weight than they should

    • Example: modeling age and income as features - incomes will be much higher values than ages

  • Scikit-learn has a preprocessing module that helps (MinMaxScaler, etc.)

  • Remember to scale your results back up
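
A sketch with scikit-learn's preprocessing scalers, including inverse_transform for scaling results back to the original units:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical age and income features on very different scales
X = np.array([[25, 30_000], [40, 80_000], [60, 250_000]], dtype=float)

# MinMaxScaler squeezes each feature into [0, 1]
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)

# StandardScaler centers each feature at 0 with unit variance
X_std = StandardScaler().fit_transform(X)

# Remember to scale results back to the original units when needed
X_back = minmax.inverse_transform(X_minmax)
print(np.allclose(X, X_back))  # True
```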

Shuffling

  • Sometimes there's sort of a residual signal in your training data resulting from the order in which that data was collected

  • Many algorithms benefit from shuffling their training data

  • Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
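
A shuffling sketch with scikit-learn's shuffle utility (toy arrays), which keeps features and labels aligned:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # features, still in collection order
y = np.array([0, 0, 0, 1, 1])     # labels

# Shuffle features and labels together so rows stay aligned
X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
print(X_shuffled)
print(y_shuffled)
```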