
Data Cleaning and Normalization

Data cleaning is crucial for any machine learning project. The quality of your results depends heavily on how clean and well understood your raw input data is. Raw data usually arrives messy, and dirty data leads to incorrect results that can misinform business decisions. As a data scientist, you will most likely spend more time cleaning data than experimenting with algorithms.

Common Data Issues

  1. Outliers: Unusual data points, such as a bot inflating website traffic, should usually be removed (a short cleanup sketch covering several of these issues follows this list).
  2. Missing Data: Handle absent information wisely. Before discarding it, investigate why it is missing; sometimes the missingness itself is worth encoding as a category.
  3. Malicious Data: Be alert to intentional data manipulation, especially in systems like recommender engines.
  4. Erroneous Data: Errors can be due to software bugs. Investigate if data looks suspicious.
  5. Irrelevant Data: Discard data unrelated to your project, like data from different geographical regions if you're focusing on one.
  6. Inconsistent Data: The same information in different representations must be unified; for example, address fields written in varying formats, or names in different languages describing the same item.
  7. Formatting: Be mindful of different data formats (like date formats) and unify them.
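
The sketch below pulls a few of these cleanup steps together with pandas. Everything here is invented for illustration: the DataFrame, the column names, and the visit threshold are assumptions, and real cleanup rules depend entirely on your own data.

```python
import pandas as pd

# Hypothetical web-traffic log; every column name and value here is made up.
df = pd.DataFrame({
    "page":    ["/Home ", "/home", "/pricing", "/home", None],
    "country": ["US", "US", "UK", "US", "US"],
    "visits":  [120, 135, 90, 250_000, 110],   # 250_000 looks like a bot
})

# Missing data: look at what is missing before simply dropping it.
print(df[df["page"].isna()])
df = df.dropna(subset=["page"])

# Outliers: here a simple domain threshold; in practice, investigate first.
df = df[df["visits"] < 10_000]

# Irrelevant data: keep only the region this project is about.
df = df[df["country"] == "US"]

# Inconsistent formatting: unify text fields to one canonical form.
df["page"] = df["page"].str.strip().str.lower()
```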

Remember: Garbage in, garbage out. Quality data is often better than fancy algorithms.

Always Question Your Results - don't accept data as valid just because you like the conclusions it supports. Constant scrutiny is essential for unbiased outcomes. Even a seemingly simple task like finding the most popular pages on a website requires intense data cleanup.

Normalizing Numerical Data

Normalization puts all of your features onto similar scales, which helps many machine learning models work better. Imagine comparing ages and incomes: age ranges from 0 to 100, while income can reach billions. Without normalization, a large-scale feature like income can overshadow a small-scale one like age, biasing the model toward whichever attribute happens to have the biggest numbers (the sketch after this list makes the problem concrete). Overall, we normalize for the following reasons:

  • Comparable Data: Puts everything on an even playing field.
  • Better Performance: Helps certain models work more accurately.
  • Avoids Bias: Ensures no single feature dominates the model.
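
To see why scale matters, here is a tiny, purely illustrative example: two made-up people compared by Euclidean distance on (age, income). The numbers and the assumed ranges (age 0-100, income 0-1,000,000) are only for the illustration.

```python
import numpy as np

# Two made-up people: (age, annual income).
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 52_000.0])

# Without normalization the income gap swamps the 35-year age gap.
print(np.linalg.norm(a - b))        # ~2000.3, driven almost entirely by income

# Crude min-max scaling with assumed ranges: age 0-100, income 0-1,000,000.
ranges = np.array([100.0, 1_000_000.0])
print(np.linalg.norm(a / ranges - b / ranges))   # ~0.35, now age matters too
```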

Types of Normalization

  • Z-score Normalization: Focuses on how far each point is from the mean in terms of standard deviation.
Z = \frac{X - \text{mean}}{\text{std. deviation}}
  • Min-Max Scaling: Transforms data to fit within a specific range, usually 0 to 1.
X' = \frac{X - \text{Min}}{\text{Max} - \text{Min}}
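
Both formulas are available as off-the-shelf scalers in scikit-learn. The sketch below applies them to a made-up income column; the inverse_transform call at the end is what the "Rescaling Results" note later in this section refers to.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One made-up numeric feature (incomes), shaped as a column.
X = np.array([[35_000.0], [52_000.0], [48_000.0], [120_000.0]])

# Z-score normalization: (X - mean) / std. deviation
z_scaled = StandardScaler().fit_transform(X)

# Min-max scaling: (X - Min) / (Max - Min), mapping the feature into [0, 1]
minmax = MinMaxScaler()
mm_scaled = minmax.fit_transform(X)

# To report results in the original units, undo the scaling.
original = minmax.inverse_transform(mm_scaled)
```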

When to Normalize?

Always check the documentation for the specific model you're using: some models need normalized data, others don't. scikit-learn's PCA implementation, for example, has a whiten parameter that normalizes the transformed data for you, and scikit-learn's preprocessing module provides scalers that handle normalization and scaling automatically.
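
As a minimal sketch of those two options: whiten=True tells scikit-learn's PCA to rescale the projected components to unit variance, while the preprocessing scalers can be chained in front of any model via a pipeline. The random data here is just a stand-in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.default_rng(0).normal(size=(100, 5))  # stand-in data

# Option 1: let PCA whiten its output components for you.
X_whitened = PCA(n_components=2, whiten=True).fit_transform(X)

# Option 2: scale the inputs explicitly with a preprocessing step.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)
```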

Special Cases

  • Textual to Numerical: Convert 'yes' or 'no' answers to 1 or 0.
  • Rescaling Results: If you scaled your data down, remember to scale predictions back up (inverse-transform them) so the results make sense in the original units.
  • One-Hot Encoding: Expand a categorical feature into one binary column per category (see the sketch below).
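
A short pandas sketch of the first and last points; the column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "subscribed": ["yes", "no", "yes"],    # textual yes/no answer
    "color":      ["red", "green", "red"]  # categorical feature
})

# Textual to numerical: map 'yes'/'no' onto 1/0.
df["subscribed"] = df["subscribed"].map({"yes": 1, "no": 0})

# One-hot encoding: one binary column per category value.
df = pd.get_dummies(df, columns=["color"])
print(df)
```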

In summary, normalization is not always required, but when it is, it's crucial for your model's performance. Always refer to documentation to know when and how to normalize.