Data Mining Principles TA Session 4 (February 4, 2021)

Agenda

  • Classification and link functions

  • Linear Classifiers

  • Feature Engineering

  • Training, Test, Validation

  • Loss function

Classification

  • Objective of classification - predict a categorical outcome (binary or multinomial response)

  • By default, the class with the highest predicted probability becomes the predicted class

Linear Classifier

  • A linear classifier is a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold

  • In two dimensions, the decision boundary of a linear classifier is a line

  • A classifier corresponds to a decision boundary - a hyperplane such that the positive examples lie on one side and the negative examples lie on the other

  • Linear classifiers compute a linear function of the inputs and check whether the value is larger than some threshold (a minimal sketch follows below)
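
A minimal sketch of this decision rule in Python; the weights, bias, and threshold below are made up for illustration, not taken from any fitted model:

```python
import numpy as np

# Hypothetical weights, bias, and threshold for a 3-feature problem
w = np.array([0.4, -1.2, 0.7])   # one weight per feature
b = 0.1                          # intercept / bias term
threshold = 0.0

def predict(x):
    """Return 1 if the linear score exceeds the threshold, else 0."""
    score = np.dot(w, x) + b     # linear combination of the features
    return int(score > threshold)

print(predict(np.array([1.0, 0.2, -0.5])))  # score = -0.09, so this prints 0
```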

Evaluation - Accuracy

Evaluation - Precision and Recall

Evaluation - Area Under Curve
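
A small sketch of how accuracy, precision, recall, and the area under the ROC curve can be computed with scikit-learn; the labels and predicted probabilities below are invented for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Invented true labels, predicted labels, and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("AUC      :", roc_auc_score(y_true, y_prob))     # area under the ROC curve
```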

Example

Feature Engineering

  • Feature engineering - adjusting and reworking the predictors to enable models to better uncover predictor-response relationships

  • The word “engineering” implies that we know the steps to take to fix poor performance and to guide predictive improvement

  • We often do not know the best re-representation of the predictors to improve model performance

  • Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations

A Book About Feature Engineering

Types of Feature Engineering

  • Missing values

  • Changing types of data

  • Removal of highly correlated variables

  • Transformation of variables

  • Transforming categorical variables

  • Creating bins

  • Dealing with imbalanced data

Missing Data

  • Structural deficiencies in the data

  • Random occurrences

  • Specific causes

Dealing with Missing Values

  • Visualize or check NaNs in each column (see the sketch after this list)

  • There is probably no perfect solution - both dropping and imputing involve a trade-off and potential bias

  • If the percentage of missing data is low, it is easier to impute and easier to drop

  • If you remove columns with NaNs, select your own rule of thumb for the threshold. Although not a general rule in any sense, 20% missing data within a column might be a reasonable cutoff to observe

  • Some algorithms can deal with missing values - others cannot
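
A quick way to check the share of missing values per column and apply such a threshold with pandas; the DataFrame and its columns are made up for illustration:

```python
import pandas as pd

# Made-up data: "age" is complete, "income" is 40% missing
df = pd.DataFrame({"age": [25, 31, 40, 28, 52],
                   "income": [50_000, None, 62_000, None, 58_000]})

print(df.isna().sum())    # count of missing values per column
print(df.isna().mean())   # fraction of missing values per column

# Keep only columns with at most 20% missing data (the chosen rule of thumb)
df_reduced = df.loc[:, df.isna().mean() <= 0.20]
print(df_reduced.columns.tolist())   # ['age']
```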

Imputation

  • If you impute, imputing the mean is the most common approach - just fill in all missing values in a column with that column's mean

  • Another popular technique is a K-nearest neighbor model: a new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value, as sketched below

  • Several other techniques are possible
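
A sketch of both approaches using scikit-learn's imputers (SimpleImputer and KNNImputer); the DataFrame is made up for illustration:

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Made-up numeric data with missing values
df = pd.DataFrame({"age": [25, None, 40, 31],
                   "income": [50_000, 62_000, None, 58_000]})

# Mean imputation: replace each NaN with its column mean
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# KNN imputation: fill each NaN using the average of the k nearest rows
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)
```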

Algorithms and Missing Data

  • Most predictive modeling techniques cannot handle any missing values

  • Some algorithms can handle missing data (CART, Naive Bayes)

  • Others cannot (SVM, neural networks)

  • If you decide to keep missing data for some reason, always check whether your algorithm can handle it

Data types

  • Check the type of each column in your dataset

  • Sometimes numeric values can be strings - convert them
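
A minimal pandas sketch of this check and conversion; the column name and values are made up:

```python
import pandas as pd

# Made-up column where numbers were read in as strings
df = pd.DataFrame({"price": ["10.5", "7.2", "not available", "3.9"]})

print(df.dtypes)                                            # price is object (string)
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # non-numeric values become NaN
print(df.dtypes)                                            # price is now float64
```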

Removal of highly correlated variables

  • Sometimes variables are highly correlated - with correlations approaching -0.99 or 0.99

  • Again, some algorithms can handle it (tree-based models) and some cannot (regressions)

  • Even when a predictive model is insensitive to extra predictors, it makes good scientific sense to include the minimum possible set that provides acceptable results

  • Before removing variables, it is also good to check the feature importances

  • What is the rule of thumb? Again, you decide
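
One common recipe is to scan the upper triangle of the correlation matrix and drop one variable from each highly correlated pair; the data and the 0.95 cutoff below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + rng.normal(scale=0.01, size=100),
                   "x3": rng.normal(size=100)})

# Upper triangle of the absolute correlation matrix (ignoring the diagonal)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one variable from each pair whose correlation exceeds the chosen cutoff
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)   # ['x2'] for this synthetic data
```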

Transformation of x variables

  • Normalization - values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling

  • Standardization is another scaling technique in which values are centered around the mean and scaled to unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one (both techniques are sketched below)
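
A sketch of both scalers using scikit-learn; the feature values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])   # made-up single feature

# Normalization (Min-Max scaling): rescale to the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: zero mean, unit standard deviation
print(StandardScaler().fit_transform(X).ravel())
```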

Transformation of y variable

  • Frequently the data (y variable) is not distributed well - it is heavily skewed

  • Always plot your numeric y variable to see the distribution

  • Try transforming it (Box-Cox, log, square root)

  • Do not forget to transform predictions back to the original scale at the end
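
A minimal sketch of a log transform and the back-transform; the target values are made up, and the model fit is left as a placeholder (scipy.stats.boxcox would be one option for a Box-Cox transform):

```python
import numpy as np

# Made-up, right-skewed target variable
y = np.array([1.2, 3.5, 4.1, 80.0, 250.0])

y_log = np.log(y)        # fit the model on the transformed target

# Placeholder: in practice these would be model predictions on the log scale
pred_log = y_log.copy()

pred = np.exp(pred_log)  # transform predictions back to the original scale
print(pred)
```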

Creating bins

  • Might be useful if you have numeric data and want to create groups

  • Age: 20-29, 30-39 etc
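
A pandas sketch of binning an age column into such groups; the ages, bin edges, and labels are made up:

```python
import pandas as pd

ages = pd.Series([23, 37, 45, 28, 61])   # made-up ages

# Cut the numeric column into labelled, left-closed bins: [20, 30), [30, 40), ...
bins = [20, 30, 40, 50, 70]
labels = ["20-29", "30-39", "40-49", "50+"]
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_groups)
```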

Transforming categorical variables

  • One-hot encoding variables

  • Goal: convert categorical variables into numeric ones

  • Example: Education (Bachelor, Master, PhD, etc)

  • Transforming strings into numeric variables might unlock some additional insights
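
A sketch of one-hot encoding the education example with pandas; the column and its levels are taken from the bullet above:

```python
import pandas as pd

df = pd.DataFrame({"education": ["Bachelor", "Master", "PhD", "Master"]})

# One-hot encoding: one 0/1 indicator column per category level
encoded = pd.get_dummies(df, columns=["education"])
print(encoded)
```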

Dealing with Imbalanced data

  • Imbalanced data can have a significant impact on model predictions and performance

  • Imbalanced data usually comes up in classification problems

  • Fraud vs non-fraud, cancer vs no cancer, etc

  • It is difficult to make accurate predictions for the rare class when the ratio is 100:1 or even 10:1

  • What is the threshold? Again, you decide

Solutions for unbalanced datasets

  • Down-sampling - balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient

  • Up-sampling - used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of the rarer class. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping

  • There is no absolute advantage of one sampling method over another
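
A sketch of both approaches using sklearn.utils.resample; the tiny DataFrame and class ratio are made up for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced binary data (8 majority vs 2 minority rows)
df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})
majority = df[df["y"] == 0]
minority = df[df["y"] == 1]

# Down-sampling: shrink the majority class to the minority class size
down = pd.concat([
    resample(majority, replace=False, n_samples=len(minority), random_state=1),
    minority,
])

# Up-sampling: grow the minority class by sampling with replacement
up = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=1),
])
```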

Techniques for unbalanced datasets

  • SMOTE - Synthetic Minority Over-Sampling Technique (for classification)

  • SMOTE works by selecting minority-class examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line

  • A random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k = 5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space

  • SMOGN - Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (for regression)
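
A sketch of SMOTE using the imbalanced-learn package on a synthetic, roughly 10:1 dataset from scikit-learn:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE            # imbalanced-learn package
from sklearn.datasets import make_classification

# Synthetic, imbalanced binary classification data (roughly 10:1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE with the default k = 5 nearest neighbours
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))                                # classes are now balanced
```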

Training, Test, and Validation

  • Training set - used for training your model

  • Test set - used for testing your model

  • Validation set - created by splitting the training set further into two parts: a new training set and a validation (holdout) set

  • We can then train our model(s) on the new training set and estimate the performance on the validation set

  • Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets

  • As the size of your data set decreases, this concern increases

  • The most commonly used alternative is k-fold cross-validation
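
A sketch of a 60/20/20 train/validation/test split using two calls to scikit-learn's train_test_split on synthetic data; the proportions are just one common choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)   # synthetic data

# First split off a test set, then carve a validation set out of the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200 -> 60% / 20% / 20%
```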

k-fold Cross-Validation

  • How does it work? The data is split into k folds of roughly equal size; each fold is held out once as a validation set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged (see the sketch below)
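
A sketch of 5-fold cross-validation with scikit-learn; the data and the choice of logistic regression are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)    # synthetic data

# 5-fold cross-validation: each fold is held out once as a validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores)          # one accuracy estimate per fold
print(scores.mean())   # averaged performance estimate
```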

Loss function

  • Loss functions - metrics that compare the predicted values to the actual values (the output of a loss function is often referred to as the error or pseudo-residual)

  • If predictions deviate too much from the actual results, the loss function produces a very large number

  • There are many loss functions to choose from when assessing the performance of a predictive model, each providing a unique understanding of the predictive accuracy

  • Different for classification and regression

  • For regression - for instance, MAE (mean absolute error) or MSE (mean squared error)

  • For classification - commonly binary or multi-class cross-entropy
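
A sketch of these losses computed with scikit-learn's metrics; the actual and predicted values are made up:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, log_loss

# Regression: made-up actual vs predicted values
y_true = [3.0, 5.5, 8.0]
y_pred = [2.5, 6.0, 9.0]
print(mean_absolute_error(y_true, y_pred))   # MAE
print(mean_squared_error(y_true, y_pred))    # MSE

# Classification: binary cross-entropy on predicted probabilities for class 1
labels = [0, 1, 1, 0]
probs = [0.1, 0.8, 0.6, 0.3]
print(log_loss(labels, probs))
```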

Sources

  • Hands-On Machine Learning with R (Boehmke & Greenwell)

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Géron)

  • Applied Predictive Modeling (Kuhn & Johnson)

  • An Introduction to Statistical Learning: With Applications in R (James et al.)

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn & Johnson)