Exploratory Data Analysis
Data engineering gets your data where it needs to be, but machine learning algorithms often require data in a specific format with no missing values. It is also essential to understand your data before you start training algorithms on it; that is what exploratory data analysis is all about. This may include:
Sanitize and prepare data for modeling.
- Identify and handle missing data, corrupt data, stop words, etc.
- Formatting, normalizing, augmenting, and scaling data
- Labeled data (recognizing when you have enough labeled data and identifying mitigation strategies)
- Data labeling tools (Mechanical Turk, manual labor)
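As a rough illustration of two of the preparation steps above (handling missing data and scaling), here is a minimal sketch in plain Python. The function names `impute_mean` and `min_max_scale` and the `ages` column are hypothetical, and mean imputation is only one of several possible strategies; in practice a library such as pandas or scikit-learn would usually do this work.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with one missing entry.
ages = [20, None, 40, 60]
filled = impute_mean(ages)      # missing value replaced by the mean, 40
scaled = min_max_scale(filled)  # smallest value maps to 0.0, largest to 1.0
print(filled)
print(scaled)
```

Imputing before scaling matters here: the scaler cannot compute a minimum or maximum over a column that still contains `None`.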
Perform feature engineering.
- Identify and extract features from data sets, including from data sources such as text, speech, image, public datasets, etc.
- Analyze/evaluate feature engineering concepts (binning, tokenization, outliers, synthetic features, one-hot encoding, reducing dimensionality of data)
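Two of the concepts listed above, binning and one-hot encoding, can be sketched in a few lines of plain Python. The helper names `bin_index` and `one_hot`, the bin edges, and the sample values are all assumptions made for illustration, not a prescribed method.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the index of the bin a value falls into.

    `edges` is a sorted list of cut points; a value equal to a cut point
    goes into the bin above it.
    """
    return bisect_right(edges, value)


def one_hot(values):
    """Encode categorical values as 0/1 indicator rows.

    Returns the sorted category list and one row per input value.
    """
    categories = sorted(set(values))
    position = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical data: ages binned into under 18, 18 to 64, 65 and over.
ages = [15, 32, 70]
age_bins = [bin_index(a, [18, 65]) for a in ages]

colors = ["red", "green", "red"]
categories, encoded = one_hot(colors)
print(age_bins)
print(categories, encoded)
```

Binning turns a numeric feature into an ordered category, while one-hot encoding turns an unordered category into numeric columns that do not imply any ranking between the categories.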
Analyze and visualize data for machine learning.
- Graphing (scatter plot, time series, histogram, box plot)
- Interpreting descriptive statistics (correlation, summary statistics, p-value)
- Clustering (hierarchical, diagnosing, elbow plot, cluster size)
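To make the descriptive statistics item above more concrete, the following sketch computes summary statistics and a Pearson correlation coefficient with the Python standard library. The names `summarize` and `pearson` and the `hours` and `scores` samples are hypothetical, chosen so that the two columns are perfectly linearly related.

```python
from math import sqrt
from statistics import mean, median, stdev


def summarize(values):
    """Return basic summary statistics for a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sx * sy)


# Hypothetical sample: scores are exactly twice the hours studied.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
stats = summarize(hours)
r = pearson(hours, scores)
print(stats)
print(round(r, 6))
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship; it says nothing about causation or about non-linear patterns, which is why plots such as scatter plots are usually inspected alongside it.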