# Data Mining Principles TA Session 6 (February 19, 2021)

## Agenda

Overfitting, Underfitting, and Generalization Error

Bias - Variance trade-off

Cross-Validation

Ensemble Learning

Bagging and Boosting

Stacking and Blending

Random Forest

## Fitting the model

**Overfitting** - the model performs well on the training data, but it does not generalize well. The model is too complex relative to the amount and noisiness of the training data

**Underfitting** - the model is too simple to learn the underlying structure of the data

## Overfitting and Underfitting

Overfitting - low bias and high variance

Underfitting - high bias and low variance

Overfitting - small training MSE but a large test MSE

Underfitting - large training MSE and large test MSE
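The training-versus-test MSE pattern can be reproduced with a small simulation. This is a minimal sketch under assumptions that are not in the slides: the data come from a noisy quadratic curve, and polynomials of degree 1, 2, and 10 stand in for an underfitting, a well-specified, and an overly flexible model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw noisy samples from a quadratic curve (an assumed 'true' structure)."""
    x = rng.uniform(-3, 3, size=n)
    y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, size=n)
    return x, y

x_train, y_train = make_data(30)     # small training set
x_test, y_test = make_data(1000)     # large test set from the same process

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

results = {}
for degree in (1, 2, 10):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.2f}, "
          f"test MSE = {results[degree][1]:.2f}")
```

The degree-1 model should show a large error on both sets (underfitting). The training MSE can only go down as the degree grows, so a gap between a small training MSE and a larger test MSE for the degree-10 model is the signature of overfitting.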

## Solutions for Overfitting

Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model

Gather more training data

Reduce the noise in the training data (e.g., fix data errors and remove outliers)

## Solutions for Underfitting

Select a more powerful model, with more parameters

Feed better features to the learning algorithm (feature engineering)

Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)

## Generalization Error

Generalization error is another name for the error on test data

The error rate on new cases is called the generalization error (or out-of-sample error)

If the training error is low but the generalization error is high, it means that your model is overfitting the training data

It is common to use 80% of the data for training and hold out 20% for testing

However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error

The model’s generalization error can be expressed as the sum of three very different errors:

## Bias - Variance Trade-off

Bias - this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data

Variance - this part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data

Irreducible error - this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers)

Increasing a model’s complexity will typically increase its variance and reduce its bias

Conversely, reducing a model’s complexity increases its bias and reduces its variance
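For squared-error loss, the three-part decomposition can be written out explicitly. The notation is introduced here and is not from the slides: assume $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat{f}$ be the model fitted on a randomly drawn training set, so that expectations are taken over training sets and noise.

$$
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$

Only the first two terms depend on the model, which is why changing model complexity trades one against the other while the third term stays fixed.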

## Bias

Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict

It measures how far off in general a model’s predictions are from the correct value, which provides a sense of how well a model can conform to the underlying structure of the data

## Variance

On the other hand, error due to variance is defined as the variability of a model prediction for a given data point.

Highly flexible (high-variance) models run the risk of overfitting to the training data

Although you may achieve very good performance on your training data, the model will not automatically generalize well to unseen data

Since high-variance models are more prone to overfitting, resampling procedures are critical to reduce this risk

Moreover, many algorithms that are capable of achieving high generalization performance have lots of hyperparameters that control the level of model complexity
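Bias and variance at a single point can be estimated by refitting a model on many independently drawn training sets. The sketch below is an illustration under assumed settings (a quadratic true function, noise standard deviation 1, 30 training points, evaluation at `x0 = 0`); none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    """Assumed true regression function for the simulation."""
    return 0.5 * x**2 + x + 2

x0 = 0.0                                   # point at which predictions are compared
n_train, n_repeats, noise_sd = 30, 300, 1.0

def predictions_at_x0(degree):
    """Fit a polynomial on many fresh training sets and collect its prediction at x0."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        x = rng.uniform(-3, 3, size=n_train)
        y = true_f(x) + rng.normal(0, noise_sd, size=n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    return preds

summary = {}
for degree in (1, 2, 9):
    preds = predictions_at_x0(degree)
    bias_sq = float((preds.mean() - true_f(x0)) ** 2)   # (average prediction - truth)^2
    variance = float(preds.var())                       # spread of predictions across training sets
    summary[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

The linear model should show the largest squared bias, and the degree-9 model the largest variance, which matches the definitions on the Bias and Variance slides.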

## Cross Validation

Training set - used for training your model

Test set - used for testing your model

**Validation set** - created by splitting the training set further into two parts: a new training set and a validation (holdout) set

We can then train our model(s) on the new training set and estimate the performance on the validation set

Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets

As the size of your data set shrinks, this concern grows

The most commonly used method - k-fold cross-validation

## k-fold Cross-Validation

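- How does it work? The training data is split into k folds of roughly equal size. The model is trained on k-1 folds and evaluated on the remaining fold; this is repeated k times so that each fold serves as the validation set exactly once, and the k error estimates are averaged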

## The Number of Folds

How many folds to choose?

Usually 5 or 10

Depends on your data size
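The mechanics of k-fold cross-validation can be sketched in plain Python. The helper `k_fold_indices` and the trivial "predict the training mean" model are hypothetical, chosen only to keep the example short; in practice a library routine would normally do the splitting.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k folds of nearly equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

# Toy target values; the "model" below simply predicts the mean of its training folds.
data_rng = random.Random(42)
y = [data_rng.gauss(10.0, 2.0) for _ in range(103)]

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    prediction = sum(y[j] for j in train_idx) / len(train_idx)   # "fit" on k-1 folds
    errors = [(y[j] - prediction) ** 2 for j in val_idx]         # score on the held-out fold
    fold_mse.append(sum(errors) / len(errors))

cv_mse = sum(fold_mse) / len(fold_mse)   # average of the k fold estimates
print(f"per-fold MSE: {[round(m, 2) for m in fold_mse]}")
print(f"5-fold CV estimate of MSE: {cv_mse:.2f}")
```

The spread of the per-fold values gives a rough sense of how variable a single holdout estimate would have been.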

## Ensemble Learning

Ensembles - combinations of different models, or of multiple copies of the same model, whose predictions are aggregated (e.g., by averaging or by majority vote)

Examples of ensembles:

Bagging

Boosting

Pasting

## Hard Voting Classifier

The majority-vote classifier is called a hard voting classifier

This voting classifier often achieves a higher accuracy than the best classifier in the ensemble

## Hard Voting Classifier

Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms

This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy

Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy)
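The weak-learner claim can be checked with a short simulation. It assumes a binary task and, importantly, that the classifiers' errors are fully independent, which real classifiers trained on the same data rarely achieve; the counts and the 55% accuracy are arbitrary illustrative values.

```python
import random

rng = random.Random(0)
n_classifiers, n_instances, p_correct = 501, 1000, 0.55

ensemble_correct = 0
individual_correct = 0
for _ in range(n_instances):
    # Each classifier is right independently with probability p_correct.
    votes_correct = sum(rng.random() < p_correct for _ in range(n_classifiers))
    individual_correct += votes_correct
    # Binary task: the majority vote is right when more than half of the votes are right.
    if votes_correct > n_classifiers / 2:
        ensemble_correct += 1

individual_acc = individual_correct / (n_instances * n_classifiers)
ensemble_acc = ensemble_correct / n_instances
print(f"average individual accuracy: {individual_acc:.3f}")
print(f"hard-voting ensemble accuracy: {ensemble_acc:.3f}")
```

With correlated classifiers the gain is smaller, which is why diversity among the predictors matters.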

## Soft Voting Classifier

If all classifiers are able to estimate class probabilities, then you can predict the class with the
highest class probability, averaged over all the individual classifiers. This is called **soft
voting**. It often achieves higher performance than hard voting because it gives more
weight to highly confident votes.
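A tiny worked example shows how hard and soft voting can disagree. The probabilities below are made up for illustration: two classifiers lean weakly toward class 0 and one is very confident in class 1.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)] from three classifiers.
probas = [
    [0.55, 0.45],
    [0.55, 0.45],
    [0.05, 0.95],
]

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_prediction = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the class probabilities, then pick the most likely class.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=mean_proba.__getitem__)

print(f"hard votes: {hard_votes} -> class {hard_prediction}")
print(f"mean probabilities: {[round(m, 3) for m in mean_proba]} -> class {soft_prediction}")
```

Hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the confident classifier pulls the average probability of class 1 above 0.5.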

## Bagging

Bootstrap aggregating, also called bagging, is one of the first ensemble algorithms machine learning practitioners learn

Decision trees suffer from high variance

By model averaging, bagging helps to reduce variance and minimize overfitting

Bootstrap aggregating (bagging) prediction models is a general method for fitting multiple versions of a prediction model and then combining (or ensembling) them into an aggregated prediction

Optimal performance is often found by bagging 50–500 trees

## Bagging

Uses the same algorithm for every predictor and trains each one on a different random subset of the training set

Uses sampling with replacement

The same instance can be sampled several times for the same predictor

## Bagging

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors

The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression

Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all

By default, a `BaggingClassifier` samples m training instances with replacement (`bootstrap=True`), where m is the size of the training set

This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called **out-of-bag (oob)** instances. Note that they are not the same 37% for all predictors

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set
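The 63% / 37% split follows from the chance that an instance is never picked in m draws with replacement, (1 - 1/m)^m, which approaches 1/e (about 0.37) as m grows. A quick check in plain Python, with an arbitrary m:

```python
import random

rng = random.Random(0)
m = 10_000   # training-set size (arbitrary value for the illustration)

# One bootstrap sample: m draws with replacement from the indices 0..m-1.
sample = [rng.randrange(m) for _ in range(m)]

in_bag_fraction = len(set(sample)) / m          # distinct instances that were sampled
oob_fraction = 1 - in_bag_fraction              # instances never sampled (out-of-bag)
theoretical_in_bag = 1 - (1 - 1 / m) ** m       # approaches 1 - 1/e as m grows

print(f"simulated in-bag fraction:   {in_bag_fraction:.3f}")
print(f"simulated oob fraction:      {oob_fraction:.3f}")
print(f"theoretical in-bag fraction: {theoretical_in_bag:.3f}")
```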

## Example
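A minimal scikit-learn sketch of bagging with out-of-bag evaluation. The dataset (`make_moons`), the number of trees, and the random seeds are illustrative choices, not taken from the session.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset (an assumption for the example).
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),   # base predictor, passed positionally
    n_estimators=200,           # number of bagged trees
    bootstrap=True,             # sample with replacement (bagging rather than pasting)
    oob_score=True,             # evaluate each tree on the instances it did not see
    random_state=42,
)
bag_clf.fit(X_train, y_train)

oob_accuracy = bag_clf.oob_score_
test_accuracy = bag_clf.score(X_test, y_test)
print(f"oob accuracy:  {oob_accuracy:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

The oob accuracy is computed without touching the test set, so it can serve as a validation estimate; it would normally be expected to land close to the test accuracy.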

## Boosting

Boosting - any ensemble method that can combine several weak learners into a strong learner

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor

Examples:

AdaBoost

Gradient Boosting

Extreme Gradient Boosting (xgboost)

Light Gradient Boosting Machine (LGBM)

## How does boosting work?

The main idea of boosting is to add new models to the ensemble sequentially

Boosting attacks the bias-variance trade-off by starting with a weak model and sequentially boosting its performance: it keeps building new trees, where each new tree in the sequence tries to fix up where the previous one made the biggest mistakes (i.e., each new tree in the sequence will focus on the training rows where the previous tree had the largest prediction errors)

Boosting is a framework that iteratively improves any weak learning model

A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each model in the sequence slightly improves upon the performance of the previous one
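The sequential idea can be sketched by hand for regression: each new shallow tree is fitted to the residuals left by the ensemble so far. This is a gradient-boosting-style illustration for squared error, not AdaBoost, and the data, learning rate, and tree depth are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

learning_rate = 0.5   # shrinks each tree's contribution
n_trees = 20

prediction = np.zeros_like(y)   # start from a trivial model that predicts 0
trees, train_mse = [], []
for _ in range(n_trees):
    residuals = y - prediction                           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)        # add the new tree's correction
    trees.append(tree)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

def boosted_predict(X_new):
    """Ensemble prediction: the shrunken sum of all trees' outputs."""
    return learning_rate * sum(tree.predict(X_new) for tree in trees)

print(f"training MSE after 1 tree:   {train_mse[0]:.3f}")
print(f"training MSE after {n_trees} trees: {train_mse[-1]:.3f}")
```

Each tree alone is a weak model, but the training error falls as trees are added. Test error does not fall indefinitely, which is why the number of trees needs tuning.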

## Boosting Drawbacks

The main drawback - it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated

As a result, it does not scale as well as bagging or pasting

## Boosting hyperparameters

The number of trees

The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem

The number d of splits in each tree, which controls the complexity of the boosted ensemble
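In scikit-learn these three hyperparameters map roughly onto arguments of `GradientBoostingRegressor`. The mapping and the values below are an illustrative sketch: scikit-learn controls tree size through `max_depth` rather than a split count, so a depth-1 tree (a stump) corresponds to d = 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=500,     # the number of trees
    learning_rate=0.01,   # the shrinkage parameter (lambda)
    max_depth=1,          # stumps: a single split per tree
    random_state=0,
)
gbr.fit(X_train, y_train)

# Test MSE after each additional tree, to see how many trees are actually needed.
test_mse_by_stage = [
    mean_squared_error(y_test, y_pred) for y_pred in gbr.staged_predict(X_test)
]
best_n_trees = int(np.argmin(test_mse_by_stage)) + 1
print(f"test MSE with 1 tree:    {test_mse_by_stage[0]:.3f}")
print(f"test MSE with 500 trees: {test_mse_by_stage[-1]:.3f}")
print(f"number of trees with the lowest test MSE: {best_n_trees}")
```

A smaller learning rate generally needs more trees to reach the same fit, so the two are usually tuned together.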

## Stacking and Blending

- Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0)

## Stacking and Blending

To train the blender, a common approach is to use a hold-out set

First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer

## Stacking and Blending

Next, the first layer’s predictors are used to make predictions on the second (held out) set

This ensures that the predictions are “clean,” since the predictors never saw these instances during training

## Stacking and Blending

- It is possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression), to get a whole layer of blenders
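The hold-out procedure for training a blender can be sketched as follows. The three first-layer models, the linear blender, and the synthetic data are assumptions made for the example; any regressors could take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split the training set into two subsets: one for the first layer, one held out for the blender.
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

first_layer = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]
for model in first_layer:
    model.fit(X_sub1, y_sub1)           # first layer sees only the first subset

def first_layer_predictions(X_new):
    """One column of predictions per first-layer model."""
    return np.column_stack([model.predict(X_new) for model in first_layer])

# The blender is trained on "clean" predictions for the held-out subset.
blender = LinearRegression().fit(first_layer_predictions(X_sub2), y_sub2)

def stacked_predict(X_new):
    """Pass new instances through the first layer, then through the blender."""
    return blender.predict(first_layer_predictions(X_new))

test_mse = float(np.mean((stacked_predict(X_test) - y_test) ** 2))
print(f"stacked ensemble test MSE: {test_mse:.3f}")
```

Training the blender on the held-out subset keeps it from learning to trust first-layer predictions that merely memorized their own training data.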

## Random Forest

Random Forest is an ensemble of Decision Trees, generally trained via the bagging method

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features

The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model

## Hyperparameters

Each algorithm has hyperparameters

For instance, in a Random Forest you can specify the number of trees

The number of trees needs to be sufficiently large to stabilize the error rate

Tree complexity - node size, max depth, etc

## Sampling

The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement

Decreasing the sample size leads to more diverse trees and thereby lower between-tree correlation, which can have a positive effect on the prediction accuracy

Consequently, if there are a few dominating features in your data set, reducing the sample size can also help to minimize between-tree correlation
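The hyperparameters and sampling options above correspond to arguments of scikit-learn's `RandomForestClassifier`. The values and the synthetic dataset below are illustrative assumptions, not recommendations from the session.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,       # number of trees; large enough for the error to stabilize
    max_features="sqrt",    # size of the random feature subset considered at each split
    max_depth=None,         # tree complexity: grow each tree fully ...
    min_samples_leaf=1,     # ... down to leaves holding a single instance
    bootstrap=True,         # sample observations with replacement
    max_samples=0.8,        # draw 80% of the rows per tree instead of the default 100%
    oob_score=True,         # out-of-bag estimate of accuracy
    random_state=0,
)
rf.fit(X_train, y_train)

print(f"oob accuracy:  {rf.oob_score_:.3f}")
print(f"test accuracy: {rf.score(X_test, y_test):.3f}")
```

Lowering `max_samples` or `max_features` makes the trees more diverse at the cost of weaker individual trees, so both are usually tuned with cross-validation or the oob score.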

## Sources

**Hands-On Machine Learning with R** (Boehmke & Greenwell)

**Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** (Geron)

**Applied Predictive Modeling** (Johnson & Kuhn)

**An Introduction to Statistical Learning: With Applications in R** (James et al.)

**Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions** (Taddy)

**Feature Engineering and Selection: A Practical Approach for Predictive Models** (Johnson & Kuhn)