Data Mining Principles TA Session 4 (February 4, 2021)
Agenda
Classification and link functions
Linear Classifiers
Feature Engineering
Training, Test, Validation
Loss function
Classification
- Objective of classification - predict a categorical outcome (binary or multinomial response)
- By default, the class with the highest predicted probability becomes the predicted class
Types of link functions in classification
Sigmoid:
σ(x) = 1/(1 + exp(-x))
Logit:
logit(x) = log(x / (1 - x))
Probit:
probit(p) = Φ⁻¹(p), the inverse CDF of the standard normal distribution
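A minimal sketch of these three link functions in Python (NumPy/SciPy), with made-up input values purely for illustration:

import numpy as np
from scipy.stats import norm

def sigmoid(x):
    """Map a real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Inverse of the sigmoid: map a probability in (0, 1) to the real line."""
    return np.log(p / (1.0 - p))

def probit(p):
    """Inverse CDF of the standard normal, the link used in probit regression."""
    return norm.ppf(p)

x = np.array([-2.0, 0.0, 2.0])
p = sigmoid(x)
print(p)               # approximately [0.119, 0.5, 0.881]
print(logit(p))        # recovers x
print(probit(0.975))   # approximately 1.96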
Linear Classifier
A linear classifier is a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold
In two dimensions, a linear classifier is a line
A classifier corresponds to a decision boundary, or a hyperplane such that the positive examples lie on one side, and negative examples lie on the other side
Linear classifiers compute a linear function of the inputs, and determine whether or not the value is larger than some threshold
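A minimal sketch of this decision rule; the weights, intercept, and threshold below are made up for illustration, not learned from data:

import numpy as np

w = np.array([0.8, -0.5])   # hypothetical learned weights
b = 0.1                     # hypothetical learned intercept
threshold = 0.0

def predict(X):
    """Return +1 if the linear score w·x + b exceeds the threshold, else -1."""
    scores = X @ w + b
    return np.where(scores > threshold, 1, -1)

X = np.array([[1.0, 0.2], [0.1, 1.5]])
print(predict(X))  # [ 1 -1]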
Evaluation - Accuracy
Evaluation - Precision and Recall
Evaluation - Area Under Curve
Example
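A sketch of the evaluation metrics named above, computed with scikit-learn on a small made-up set of labels and predicted probabilities:

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.6, 0.7]   # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # default 0.5 cutoff

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))   # AUC uses the probabilities, not the hard labels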
Feature Engineering
Feature engineering - adjusting and reworking the predictors to enable models to better uncover predictor-response relationships
The word “engineering” implies that we know the steps to take to fix poor performance and to guide predictive improvement
We often do not know the best re-representation of the predictors to improve model performance
Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations
A Book About Feature Engineering
Types of Feature Engineering
Missing values
Changing types of data
Removal of highly correlated variables
Transformation of variables
Transforming categorical variables
Creating bins
Dealing with imbalanced data
Missing Data
Structural deficiencies in the data
Random occurrences
Specific causes
Dealing with Missing Values
Visualize or check NaNs in each column
Probably no perfect solution - there is a trade-off and potential bias in both cases (imputing or dropping)
If the percentage of missing data is low, it is easier to impute and easier to drop
If you remove NaNs, select your own rule of thumb - what is your threshold? Although not a general rule in any sense, 20% missing data within a column might be a reasonable cutoff to observe
Some algorithms can deal with missing values - others cannot
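A sketch of checking the share of missing values per column and dropping columns above a chosen threshold with pandas; the column names, values, and the 20% cutoff are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 41, 37, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 71_000],
    "city":   ["NYC", "LA", "LA", np.nan, "NYC"],
})

missing_share = df.isna().mean()          # fraction of NaNs per column
print(missing_share)

threshold = 0.20                          # your own rule of thumb
df_reduced = df.loc[:, missing_share <= threshold]
print(df_reduced.columns.tolist())        # "age" is dropped at 40% missing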
Imputation
If you impute, imputing the mean is the most common approach - just fill in all missing data with the mean value
Another popular technique - the K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value
Several other techniques are possible
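A sketch of the two imputation approaches mentioned above using scikit-learn (mean imputation and K-nearest-neighbor imputation); the data is made up for illustration:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
    [5.0, 6.0],
])

mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))      # NaNs replaced by column means

knn_imputer = KNNImputer(n_neighbors=2)   # average the 2 closest rows
print(knn_imputer.fit_transform(X))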
Algorithms and Missing Data
Most predictive modeling techniques cannot handle any missing values
Some algorithms can handle missing data (CART, Naive Bayes)
Others cannot (SVM, neural networks)
If you decide to keep missing data for some reason, always check whether your algorithm can handle it
Data types
Check the type of each column in your dataset
Sometimes numeric values can be strings - convert them
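A sketch of checking column types and converting numeric values stored as strings, using pandas on a made-up frame:

import pandas as pd

df = pd.DataFrame({"price": ["10.5", "12.0", "9.75"], "units": ["3", "5", "2"]})
print(df.dtypes)                          # both columns are object (string)

df["price"] = pd.to_numeric(df["price"])  # convert to float
df["units"] = df["units"].astype(int)     # convert to int
print(df.dtypes)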
Transformation of x variables
- Normalization - values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling
- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation
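A sketch of both scaling techniques with scikit-learn, on a tiny made-up feature:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: values in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: mean 0, unit standard deviation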
Transformation of y variable
Frequently the response (y variable) is heavily skewed
Always plot your numeric y variable to see the distribution
Try to transform (Box-Cox, log, square root)
Do not forget to transform it back at the end
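A sketch of transforming a skewed response and transforming predictions back, using a log transform (Box-Cox via scipy.stats.boxcox is a drop-in alternative); the y values are made up:

import numpy as np

y = np.array([1.0, 2.0, 5.0, 40.0, 300.0])   # heavily skewed, illustrative values

y_log = np.log1p(y)            # fit the model on log1p(y) ...
# ... model training on y_log would happen here ...
y_back = np.expm1(y_log)       # ... and transform predictions back with expm1
print(np.allclose(y, y_back))  # True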
Creating bins
Might be useful if you have numeric data and want to create groups
Age: 20-29, 30-39 etc
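A sketch of binning a numeric variable into the age groups above with pandas.cut; ages and bin edges are illustrative:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45])
bins = [20, 30, 40, 50]
labels = ["20-29", "30-39", "40-49"]

print(pd.cut(ages, bins=bins, labels=labels, right=False))  # right=False gives [20, 30), [30, 40), ...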
Transforming categorical variables
One-hot encoding variables
Goal: convert categorical variables into numeric ones
Example: Education (Bachelor, Master, PhD, etc)
Transforming strings into numeric variables might unlock some additional insights
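A sketch of one-hot encoding the Education example with pandas.get_dummies (scikit-learn's OneHotEncoder is an alternative):

import pandas as pd

df = pd.DataFrame({"Education": ["Bachelor", "Master", "PhD", "Master"]})
print(pd.get_dummies(df, columns=["Education"]))  # one 0/1 column per category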
Dealing with Imbalanced data
Imbalanced data can have a significant impact on model predictions and performance
Data is often imbalanced when you have a classification problem
Fraud vs non-fraud, cancer vs no cancer, etc
It is difficult to make accurate predictions when you have a class ratio of 100:1 or even 10:1
What is the threshold? Again you decide
Solutions for unbalanced datasets
Down-sampling - balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient
Up-sampling - used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of rarer samples. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping
There is no absolute advantage of one sampling method over another
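A sketch of down-sampling and up-sampling with sklearn.utils.resample, on a made-up imbalanced frame (90 majority vs 10 minority rows):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})
majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Down-sampling: shrink the majority class to the minority class size
down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_down = pd.concat([down, minority])

# Up-sampling: grow the minority class by sampling with replacement
up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, up])

print(balanced_down["y"].value_counts())
print(balanced_up["y"].value_counts())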
Techniques for unbalanced datasets
SMOTE - Synthetic Minority Over-Sampling Technique (for classification)
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line
A random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k = 5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space
SMOGN - Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (for regression)
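A sketch of SMOTE using the imbalanced-learn package (a separate install, pip install imbalanced-learn); the imbalanced data set is generated purely for illustration:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

sm = SMOTE(k_neighbors=5, random_state=42)   # k = 5 nearest neighbors, as described above
X_res, y_res = sm.fit_resample(X, y)
print("After :", Counter(y_res))             # classes are now balanced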
Training, Test, and Validation
Training set - used for training your model
Test set - used for testing your model
Validation set - created by splitting the training set further into two parts: a training set and a validation (holdout) set
We can then train our model(s) on the new training set and estimate the performance on the validation set
Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets
As the size of your data set reduces, this concern increases
The most commonly used method - k-fold cross-validation
k-fold Cross-Validation
- How does it work? The data is split into k folds of roughly equal size; each fold is held out once as a validation set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged. See the sketch below
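A sketch of 5-fold cross-validation with scikit-learn's cross_val_score, using logistic regression on a synthetic data set purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores)          # one accuracy estimate per fold
print(scores.mean())   # average over the 5 folds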
Loss function
Loss functions - metrics that compare the predicted values to the actual value (the output of a loss function is often referred to as the error or pseudo residual)
If predictions deviate too much from the actual values, the loss function will produce a very large number
There are many loss functions to choose from when assessing the performance of a predictive model, each providing a unique understanding of the predictive accuracy
Different for classification and regression
For regression - for instance, MAE (mean absolute error) or MSE (mean squared error)
For classification - commonly binary or categorical cross-entropy
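A sketch of the loss functions named above, computed directly with NumPy on made-up actual vs predicted values:

import numpy as np

# Regression: mean absolute error and mean squared error
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 4.0])
mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
print(mae, mse)

# Binary classification: cross-entropy (log loss) on predicted probabilities
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)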
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
Feature Engineering and Selection: A Practical Approach for Predictive Models (Johnson & Kuhn)