# Data Mining Principles TA Session 4 (February 4, 2021)

## Agenda

Classification and link functions

Linear Classifiers

Feature Engineering

Training, Test, Validation

Loss function

## Classification

- Objective of classification - predict a categorical outcome (binary or multinomial response)

- By default, the class with the highest predicted probability becomes the predicted class

## Types of link functions in classification

Sigmoid:

`σ(x) = 1/(1 + exp(-x))`

Logit:

`logit(p) = log(p / (1 - p))` (the inverse of the sigmoid)

Probit:

`probit(p) = Φ⁻¹(p)`, the inverse CDF of the standard normal distribution
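In plain Python (standard library only; `statistics.NormalDist.inv_cdf` provides the probit), these link functions might look like:

```python
import math
from statistics import NormalDist

def sigmoid(x: float) -> float:
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    """Inverse of the sigmoid: map a probability back to the real line."""
    return math.log(p / (1.0 - p))

def probit(p: float) -> float:
    """Inverse CDF of the standard normal distribution."""
    return NormalDist().inv_cdf(p)

print(sigmoid(0.0))         # 0.5
print(logit(sigmoid(1.7)))  # recovers 1.7 (up to float error)
print(probit(0.5))          # 0.0
```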

## Linear Classifier

A linear classifier is a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold

In two dimensions, a linear classifier's decision boundary is a line

A classifier corresponds to a decision boundary - a hyperplane such that the positive examples lie on one side and the negative examples lie on the other

Linear classifiers compute a linear function of the inputs, and determine whether or not the value is larger than some threshold
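A minimal sketch of this decision rule, with made-up weights and points:

```python
def linear_classify(w, x, b=0.0, threshold=0.0):
    """Predict class 1 if the linear score w·x + b exceeds the threshold."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > threshold else 0

# Toy decision boundary x1 + x2 = 1 (weights and points are invented for illustration)
w = [1.0, 1.0]
print(linear_classify(w, [2.0, 0.5], b=-1.0))  # 1: above the line
print(linear_classify(w, [0.2, 0.3], b=-1.0))  # 0: below the line
```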

## Evaluation - Accuracy

## Evaluation - Precision and Recall

## Evaluation - Area Under Curve
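These evaluation metrics can be computed by hand on a toy set of labels and predictions (a pure-Python sketch; `auc` here uses the rank interpretation of AUC, the probability that a random positive outscores a random negative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    """Of the examples predicted positive, how many are truly positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)

def recall(y_true, y_pred):
    """Of the truly positive examples, how many were found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def auc(y_true, scores):
    """AUC as P(random positive outscores random negative); ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))   # 0.6
print(precision(y_true, y_pred))  # 2/3
print(recall(y_true, y_pred))     # 2/3
```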

## Example

## Feature Engineering

**Feature Engineering** - adjusting and reworking the predictors to enable models to better uncover predictor-response relationships

The word “engineering” implies that we know the steps to take to fix poor performance and to guide predictive improvement

We often do not know the best re-representation of the predictors to improve model performance

Instead, the re-working of predictors is more of an **art**, requiring the right tools and experience to find better predictor representations

## A Book About Feature Engineering

## Types of Feature Engineering

Missing values

Changing types of data

Removal of highly correlated variables

Transformation of variables

Transforming categorical variables

Creating bins

Dealing with imbalanced data

## Missing Data

Structural deficiencies in the data

Random occurrences

Specific causes

## Dealing with Missing Values

Visualize or check NaNs in each column

Probably no perfect solution - a trade-off and potential bias in both cases

If the percentage of missing data is low, it is easier to impute and easier to drop

If you remove `NaN` values, select your own rule of thumb - what is your threshold? Although not a general rule in any sense, 20% missing data within a column might be a good “line of dignity” to observe

Some algorithms can deal with missing values - others cannot
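That 20% rule of thumb can be sketched in plain Python, using a hypothetical dict-of-columns table with `None` standing in for `NaN`:

```python
def drop_sparse_columns(table, max_missing_frac=0.20):
    """Drop columns whose fraction of missing (None) values exceeds the threshold."""
    kept = {}
    for name, values in table.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missing_frac:
            kept[name] = values
    return kept

# Hypothetical columns: 'income' is 40% missing and gets dropped
table = {
    "age":    [23, 35, 41, 29, 52],
    "income": [None, 50000, None, 62000, 58000],
}
print(list(drop_sparse_columns(table)))  # ['age']
```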

## Imputation

If you impute, imputing the mean is the most common approach - just fill in all missing data with the mean value

Another popular technique - the K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averaging these nearby points to fill in the value

Several other techniques are possible
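Both approaches can be sketched in plain Python on toy data (a simplified, single-cell version of k-NN imputation, not a full imputer):

```python
from statistics import mean

def impute_mean(column):
    """Fill missing entries (None) with the mean of the observed entries."""
    m = mean(v for v in column if v is not None)
    return [m if v is None else v for v in column]

def impute_knn(rows, row_i, col_j, k=3):
    """Fill rows[row_i][col_j] with the average of that column over the k rows
    closest to rows[row_i] in the remaining features."""
    def dist(a, b):
        return sum((a[c] - b[c]) ** 2 for c in range(len(a)) if c != col_j)
    donors = sorted((r for i, r in enumerate(rows) if i != row_i),
                    key=lambda r: dist(r, rows[row_i]))
    return mean(r[col_j] for r in donors[:k])

print(impute_mean([1.0, None, 3.0]))            # [1.0, 2.0, 3.0]
rows = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
print(impute_knn(rows, row_i=3, col_j=1, k=2))  # 11.0: average of the two nearest rows
```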

## Algorithms and Missing Data

Most predictive modeling techniques cannot handle any missing values

Some algorithms can handle missing data (CART, Naive Bayes)

Others cannot (SVM, neural networks)

If you decide to keep missing data for some reason, always check whether your algorithm can handle it

## Data types

Check the type of each column in your dataset

Sometimes numeric values can be strings - convert them

## Transformation of x variables

**Normalization** - values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling

**Standardization** is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation
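Both scalings in a few lines of standard-library Python (toy values; `pstdev` is the population standard deviation):

```python
from statistics import mean, pstdev

def min_max_scale(xs):
    """Normalization: rescale values to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: shift to zero mean and unit standard deviation (z-scores)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))    # mean 0, standard deviation 1
```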

## Transformation of y variable

Frequently the data (y variable) is not properly distributed (heavily skewed)

Always plot your numeric y variable to see the distribution

Try to transform (Box-Cox, log, square root)

Do not forget to transform it back at the end
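A minimal log-transform round trip on a made-up skewed target, showing the back-transformation at the end:

```python
import math

y = [1.0, 10.0, 100.0, 1000.0]         # heavily right-skewed toy target
y_log = [math.log(v) for v in y]       # model would be trained on this scale
y_back = [math.exp(v) for v in y_log]  # transform predictions back at the end
print(y_back)  # recovers the original scale (up to float error)
```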

## Creating bins

Might be useful if you have numeric data and want to create groups

Age: 20-29, 30-39, etc.
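A toy helper for the age example above:

```python
def age_bin(age):
    """Map a numeric age to a decade label such as '20-29'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

print([age_bin(a) for a in [23, 35, 41]])  # ['20-29', '30-39', '40-49']
```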

## Transforming categorical variables

One-hot encoding variables

Goal: convert categorical variables into numeric ones

Example: Education (Bachelor, Master, PhD, etc)

Transforming strings into numeric variables might unlock some additional insights
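A pure-Python sketch of one-hot encoding the education example (one 0/1 column per category):

```python
def one_hot(values):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return {c: [int(v == c) for v in values] for c in categories}

education = ["Bachelor", "Master", "PhD", "Bachelor"]
print(one_hot(education)["Bachelor"])  # [1, 0, 0, 1]
```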

## Dealing with Imbalanced data

Imbalanced data can have a significant impact on model predictions and performance

Imbalanced data typically arises in classification problems

Fraud vs non-fraud, cancer vs no cancer, etc

It is difficult to make accurate predictions when you have a 100:1 or 10:1 class ratio

What is the threshold? Again, you decide

## Solutions for unbalanced datasets

Down-sampling - balances the dataset by reducing the size of the abundant class(es) to match the frequencies in the least prevalent class. This method is used when the quantity of data is sufficient

Up-sampling - used when the quantity of data is insufficient. It tries to balance the dataset by increasing the size of rarer samples. Rather than getting rid of abundant samples, new rare samples are generated by using repetition or bootstrapping

There is no absolute advantage of one sampling method over another
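Both strategies can be sketched with the standard-library `random` module (toy class sizes, invented for illustration):

```python
import random

def down_sample(majority, minority, seed=0):
    """Randomly shrink the majority class to the minority class size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, seed=0):
    """Randomly repeat minority samples until the classes are balanced."""
    rng = random.Random(seed)
    return list(majority) + rng.choices(minority, k=len(majority))

majority = list(range(100))  # e.g. 100 non-fraud rows
minority = [900, 901]        # 2 fraud rows
print(len(down_sample(majority, minority)))  # 4
print(len(up_sample(majority, minority)))    # 200
```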

## Techniques for unbalanced datasets

**SMOTE** - Synthetic Minority Over-Sampling Technique (for classification)

SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and generating a new sample at a point along that line

A random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k = 5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space
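The steps just described can be sketched in pure Python (a toy single-sample version, not the `imbalanced-learn` implementation):

```python
import math
import random

def smote_sample(minority, k=5, seed=0):
    """One SMOTE step: pick a random minority point, find its k nearest minority
    neighbors, pick one, and interpolate a synthetic point between the two."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: math.dist(p, base))[:k]
    partner = rng.choice(neighbors)
    t = rng.random()  # random position along the line segment
    return tuple(b + t * (p - b) for b, p in zip(base, partner))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
new_point = smote_sample(minority, k=2)
print(new_point)  # lies on a segment between two existing minority points
```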

**SMOGN**- Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (for regression)

## Training, Test, and Validation

Training set - used for training your model

Test set - used for testing your model

**Validation set** - involves splitting the training set further to create two parts: a training set and a validation (holdout) set. We can then train our model(s) on the new training set and estimate the performance on the validation set

Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets

As the size of your data set reduces, this concern increases

The most commonly used method - k-fold cross-validation

## k-fold Cross-Validation

- How does it work? Split the data into k folds of roughly equal size; each fold serves once as the validation set while the model is trained on the remaining k-1 folds, and the k performance estimates are averaged
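A simple index-generating sketch (this version interleaves indices rather than shuffling contiguous chunks, purely for illustration):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((train, val))
    return splits

for train, val in k_fold_indices(n=6, k=3):
    print(val)  # each index appears in exactly one validation fold
```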

## Loss function

**Loss functions** - metrics that compare the predicted values to the actual values (the output of a loss function is often referred to as the error or pseudo-residual). If predictions deviate too much from actual results, the loss function coughs up a very large number

There are many loss functions to choose from when assessing the performance of a predictive model, each providing a unique understanding of the predictive accuracy

Different for classification and regression

For regression - for instance, MAE (mean absolute error) or MSE (mean squared error)

For classification - it is commonly binary and multi-categorical cross entropy
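A few of these losses in plain Python, on toy values:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: penalizes large residuals quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross entropy: confident wrong predictions cost a lot."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, p_pred)) / len(y_true)

print(mae([1.0, 2.0], [1.5, 1.5]))  # 0.5
print(mse([1.0, 2.0], [1.5, 1.5]))  # 0.25
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # small: predictions are good
```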

## Sources

- **Hands-On Machine Learning with R** (Boehmke & Greenwell)
- **Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** (Geron)
- **Applied Predictive Modeling** (Kuhn & Johnson)
- **An Introduction to Statistical Learning: With Applications in R** (James et al.)
- **Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions** (Taddy)
- **Feature Engineering and Selection: A Practical Approach for Predictive Models** (Kuhn & Johnson)