Data Mining Principles TA Session 4 (February 4, 2021)
Agenda
Classification and link functions
Linear Classifiers
Feature Engineering
Training, Test, Validation
Loss function
Classification
- Objective of classification - predict a categorical outcome (binary or multinomial response)
- By default, the class with the highest predicted probability becomes the predicted class
Types of link functions in classification
Sigmoid:
σ(x) = 1/(1 + exp(-x))
Logit:
logit(x) = log(x / (1 - x))
Probit:
probit(p) = Φ⁻¹(p), the inverse CDF of the standard normal distribution
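A minimal sketch of these three link functions in Python (NumPy/SciPy), with made-up input values purely for illustration:

import numpy as np
from scipy.stats import norm

def sigmoid(x):
    """Map a real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Inverse of the sigmoid: map a probability in (0, 1) to the real line."""
    return np.log(p / (1.0 - p))

def probit(p):
    """Inverse CDF of the standard normal, the link used in probit regression."""
    return norm.ppf(p)

x = np.array([-2.0, 0.0, 2.0])
p = sigmoid(x)
print(p)               # approximately [0.119, 0.5, 0.881]
print(logit(p))        # recovers x
print(probit(0.975))   # approximately 1.96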
Linear Classifier
A linear classifier is a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold
In two dimensions, a linear classifier is a line
A classifier corresponds to a decision boundary, or a hyperplane such that the positive examples lie on one side, and negative examples lie on the other side
Linear classifiers compute a linear function of the inputs, and determine whether or not the value is larger than some threshold
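A minimal sketch of this decision rule; the weights, intercept, and threshold below are made up for illustration, not learned from data:

import numpy as np

w = np.array([0.8, -0.5])   # hypothetical learned weights
b = 0.1                     # hypothetical learned intercept
threshold = 0.0

def predict(X):
    """Return +1 if the linear score w·x + b exceeds the threshold, else -1."""
    scores = X @ w + b
    return np.where(scores > threshold, 1, -1)

X = np.array([[1.0, 0.2], [0.1, 1.5]])
print(predict(X))  # [ 1 -1]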
Evaluation - Accuracy
Evaluation - Precision and Recall
Evaluation - Area Under Curve
Example
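A sketch of the evaluation metrics named above, computed with scikit-learn on a small made-up set of labels and predicted probabilities:

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.6, 0.7]   # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # default 0.5 cutoff

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))   # AUC uses the probabilities, not the hard labels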
Feature Engineering
Feature engineering - adjusting and reworking the predictors to enable models to better uncover predictor-response relationships
The word “engineering” implies that we know the steps to take to fix poor performance and to guide predictive improvement
We often do not know the best re-representation of the predictors to improve model performance
Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations
A Book About Feature Engineering
Types of Feature Engineering
Missing values
Changing types of data
Removal of highly correlated variables
Transformation of variables
Transforming categorical variables
Creating bins
Dealing with imbalanced data
Missing Data
Structural deficiencies in the data
Random occurrences
Specific causes
Dealing with Missing Values
Visualize or check NaNs in each column
Probably no perfect solution - there is a trade-off and potential bias in both cases (imputing or dropping)
If the percentage of missing data is low, it is easier to impute and easier to drop
If you remove NaNs, select your own rule of thumb - what is your threshold? Although not a general rule in any sense, 20% missing data within a column might be a reasonable cutoff to observe
Some algorithms can deal with missing values - others cannot
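A sketch of checking the share of missing values per column and dropping columns above a chosen threshold with pandas; the column names, values, and the 20% cutoff are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 41, 37, np.nan],
    "income": [50_000, 62_000, np.nan, 58_000, 71_000],
    "city":   ["NYC", "LA", "LA", np.nan, "NYC"],
})

missing_share = df.isna().mean()          # fraction of NaNs per column
print(missing_share)

threshold = 0.20                          # your own rule of thumb
df_reduced = df.loc[:, missing_share <= threshold]
print(df_reduced.columns.tolist())        # "age" is dropped at 40% missing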
Imputation
If you impute, imputing the mean is the most common approach - just fill in all missing data with the mean value
Another popular technique - the K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value
Several other techniques are possible
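A sketch of the two imputation approaches mentioned above using scikit-learn (mean imputation and K-nearest-neighbor imputation); the data is made up for illustration:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [4.0, np.nan],
    [5.0, 6.0],
])

mean_imputer = SimpleImputer(strategy="mean")
print(mean_imputer.fit_transform(X))      # NaNs replaced by column means

knn_imputer = KNNImputer(n_neighbors=2)   # average the 2 closest rows
print(knn_imputer.fit_transform(X))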
Algorithms and Missing Data
Most predictive modeling techniques cannot handle any missing values
Some algorithms can handle missing data (CART, Naive Bayes)
Others cannot (SVM, neural networks)
If you decide to keep missing data for some reason, always check whether your algorithm can handle it
Data types
Check the type of each column in your dataset
Sometimes numeric values can be strings - convert them
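A sketch of checking column types and converting numeric values stored as strings, using pandas on a made-up frame:

import pandas as pd

df = pd.DataFrame({"price": ["10.5", "12.0", "9.75"], "units": ["3", "5", "2"]})
print(df.dtypes)                          # both columns are object (string)

df["price"] = pd.to_numeric(df["price"])  # convert to float
df["units"] = df["units"].astype(int)     # convert to int
print(df.dtypes)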
Transformation of x variables
- Normalization - values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling
- Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation
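A sketch of both scaling techniques with scikit-learn, on a tiny made-up feature:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: values in [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: mean 0, unit standard deviation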
Transformation of y variable
Frequently the response (y variable) is heavily skewed
Always plot your numeric y variable to see the distribution
Try to transform (Box-Cox, log, square root)
Do not forget to transform it back at the end
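A sketch of transforming a skewed response and transforming predictions back, using a log transform (Box-Cox via scipy.stats.boxcox is a drop-in alternative); the y values are made up:

import numpy as np

y = np.array([1.0, 2.0, 5.0, 40.0, 300.0])   # heavily skewed, illustrative values

y_log = np.log1p(y)            # fit the model on log1p(y) ...
# ... model training on y_log would happen here ...
y_back = np.expm1(y_log)       # ... and transform predictions back with expm1
print(np.allclose(y, y_back))  # True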
Creating bins
Might be useful if you have numeric data and want to create groups
Age: 20-29, 30-39 etc
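A sketch of binning a numeric variable into the age groups above with pandas.cut; ages and bin edges are illustrative:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45])
bins = [20, 30, 40, 50]
labels = ["20-29", "30-39", "40-49"]

print(pd.cut(ages, bins=bins, labels=labels, right=False))  # right=False gives [20, 30), [30, 40), ...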
Transforming categorical variables
One-hot encoding variables
Goal: convert categorical variables into numeric ones
Example: Education (Bachelor, Master, PhD, etc)
Transforming strings into numeric variables might unlock some additional insights
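A sketch of one-hot encoding the Education example with pandas.get_dummies (scikit-learn's OneHotEncoder is an alternative):

import pandas as pd

df = pd.DataFrame({"Education": ["Bachelor", "Master", "PhD", "Master"]})
print(pd.get_dummies(df, columns=["Education"]))  # one 0/1 column per category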
Dealing with Imbalanced data
Imbalanced data can have a significant impact on model predictions and performance
Data is often imbalanced when you have a classification problem
Fraud vs non-fraud, cancer vs no cancer, etc
It is difficult to make accurate predictions when you have a class ratio of 100:1 or even 10:1
What is the threshold? Again you decide
Solutions for unbalanced datasets
Down-sampling - balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient
Up-sampling - used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of rarer samples. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping
There is no absolute advantage of one sampling method over another
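A sketch of down-sampling and up-sampling with sklearn.utils.resample, on a made-up imbalanced frame (90 majority vs 10 minority rows):

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})
majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Down-sampling: shrink the majority class to the minority class size
down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_down = pd.concat([down, minority])

# Up-sampling: grow the minority class by sampling with replacement
up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, up])

print(balanced_down["y"].value_counts())
print(balanced_up["y"].value_counts())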
Techniques for unbalanced datasets
SMOTE - Synthetic Minority Over-Sampling Technique (for classification)
SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line
A random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k = 5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space
SMOGN - Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (for regression)
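A sketch of SMOTE using the imbalanced-learn package (a separate install, pip install imbalanced-learn); the imbalanced data set is generated purely for illustration:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

sm = SMOTE(k_neighbors=5, random_state=42)   # k = 5 nearest neighbors, as described above
X_res, y_res = sm.fit_resample(X, y)
print("After :", Counter(y_res))             # classes are now balanced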
Training, Test, and Validation
Training set - used for training your model
Test set - used for testing your model
Validation set - created by splitting the training set further into two parts: a training set and a validation (holdout) set
We can then train our model(s) on the new training set and estimate the performance on the validation set
Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets
As the size of your data set reduces, this concern increases
The most commonly used method - k-fold cross-validation
k-fold Cross-Validation
- How does it work? The data is split into k folds of roughly equal size; each fold is held out once as a validation set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged. See the sketch below
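A sketch of 5-fold cross-validation with scikit-learn's cross_val_score, using logistic regression on a synthetic data set purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores)          # one accuracy estimate per fold
print(scores.mean())   # average over the 5 folds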
Loss function
Loss functions - metrics that compare the predicted values to the actual value (the output of a loss function is often referred to as the error or pseudo residual)
If predictions deviate too much from the actual values, the loss function will produce a very large number
There are many loss functions to choose from when assessing the performance of a predictive model, each providing a unique understanding of the predictive accuracy
Different for classification and regression
For regression - for instance, MAE (mean absolute error) or MSE (mean squared error)
For classification - commonly binary or categorical cross-entropy
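A sketch of the loss functions named above, computed directly with NumPy on made-up actual vs predicted values:

import numpy as np

# Regression: mean absolute error and mean squared error
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 4.0])
mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
print(mae, mse)

# Binary classification: cross-entropy (log loss) on predicted probabilities
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)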
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
Feature Engineering and Selection: A Practical Approach for Predictive Models (Johnson & Kuhn)