Data Mining Principles TA Session 6 (February 19, 2021)
Agenda
Overfitting, Underfitting, and Generalization Error
Bias - Variance trade-off
Cross-Validation
Ensemble Learning
Bagging and Boosting
Stacking and Blending
Random Forest
Fitting the model
Overfitting - the model performs well on the training data, but it does not generalize well. The model is too complex relative to the amount and noisiness of the training data
Underfitting - the model is too simple to learn the underlying structure of the data
Overfitting and Underfitting
Overfitting - low bias and high variance
Underfitting - high bias and low variance
Overfitting - small training MSE but a large test MSE
Underfitting - large training MSE and a large test MSE
Solutions for Overfitting
Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
Gather more training data
Reduce the noise in the training data (e.g., fix data errors and remove outliers)
Solutions for Underfitting
Select a more powerful model, with more parameters
Feed better features to the learning algorithm (feature engineering)
Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
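For the last point, a minimal sketch (assuming scikit-learn's Ridge, where alpha plays the role of the regularization hyperparameter; the data set and values are only illustrative): a large alpha constrains the model heavily, a small alpha relaxes the constraint.

```python
# A minimal sketch (illustrative, not from the slides): the regularization
# hyperparameter alpha in Ridge controls how strongly the model is constrained.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for alpha in [100.0, 1.0, 0.01]:  # large alpha = heavily constrained, small alpha = flexible
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: train R2={model.score(X_train, y_train):.3f}, "
          f"test R2={model.score(X_test, y_test):.3f}")
```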
Generalization Error
Another name for the error a model makes on new, unseen data (estimated using the test set)
The error rate on new cases is called the generalization error (or out-of-sample error)
If the training error is low but the generalization error is high, it means that your model is overfitting the training data
It is common to use 80% of the data for training and hold out 20% for testing
If the dataset contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error
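A minimal sketch of the 80/20 holdout split with scikit-learn (the data set and the decision tree are only placeholders):

```python
# A minimal sketch of the 80/20 train/test split described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("training accuracy:", tree.score(X_train, y_train))  # usually near 1.0 (overfitting)
print("test accuracy:", tree.score(X_test, y_test))        # estimate of the generalization error
```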
The model’s generalization error can be expressed as the sum of three very different errors:
Bias - Variance Trade-off
Bias - this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data
Variance - this part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data
Irreducible error - this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers)
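In symbols, this is the standard decomposition of the expected test MSE at a point x_0 (as in James et al., An Introduction to Statistical Learning):

```latex
% Expected test MSE at x_0 = variance + squared bias + irreducible error
\mathbb{E}\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
  = \operatorname{Var}\!\big(\hat{f}(x_0)\big)
  + \big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \operatorname{Var}(\varepsilon)
```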
Increasing a model’s complexity will typically increase its variance and reduce its bias
Conversely, reducing a model’s complexity increases its bias and reduces its variance
Bias
Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict
It measures how far off in general a model’s predictions are from the correct value, which provides a sense of how well a model can conform to the underlying structure of the data
Variance
On the other hand, error due to variance is defined as the variability of a model prediction for a given data point.
However, highly flexible models bring their own problems, as they run the risk of overfitting to the training data
Although you may achieve very good performance on your training data, the model will not automatically generalize well to unseen data
Since high-variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk
Moreover, many algorithms that are capable of achieving high generalization performance have lots of hyperparameters that control the level of model complexity
Cross Validation
Training set - used for training your model
Test set - used for testing your model
Validation set - created by splitting the training set further into two parts: a (smaller) training set and a validation (holdout) set
We can then train our model(s) on the new training set and estimate the performance on the validation set
Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets
As the size of your data set reduces, this concern increases
The most commonly used method - k-fold cross-validation
k-fold Cross-Validation
- How does it work? The training data is split into k folds of roughly equal size; the model is trained on k-1 folds and validated on the remaining fold, repeating so that each fold serves as the validation set once, and the k scores are averaged
The Number of Folds
How many folds to choose?
Usually 5 or 10
Depends on your data size
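A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score (the estimator and data set are placeholders):

```python
# A minimal sketch of k-fold cross-validation (k = 5 here).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
print("accuracy per fold:", scores)
print("mean CV accuracy:", scores.mean())
```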
Ensemble Learning
Ensembles - combinations of several different models, or several instances of the same model, whose predictions are aggregated (e.g., averaged or voted on)
Examples of ensembles:
Bagging
Boosting
Pasting
Hard voting classifier
The majority-vote classifier is called a hard voting classifier
This voting classifier often achieves a higher accuracy than the best classifier in the ensemble
Hard Voting Classifier
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms
This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy
Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy)
Soft Voting Classifier
If all classifiers are able to estimate class probabilities, then you can predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
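A minimal sketch comparing hard and soft voting with scikit-learn's VotingClassifier (the choice of base classifiers and data set is only illustrative):

```python
# A minimal sketch of hard vs. soft voting with three different algorithms.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

estimators = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("rf", RandomForestClassifier(random_state=42)),
    # probability=True lets the SVM estimate class probabilities for soft voting
    ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]
hard = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)
print("hard voting accuracy:", hard.score(X_test, y_test))
print("soft voting accuracy:", soft.score(X_test, y_test))
```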
Bagging
Bootstrap aggregating, also called bagging, is one of the first ensemble algorithms machine learning practitioners learn
The decision trees suffer from high variance
By model averaging, bagging helps to reduce variance and minimize overfitting
Bootstrap aggregating (bagging) prediction models is a general method for fitting multiple versions of a prediction model and then combining (or ensembling) them into an aggregated prediction
Optimal performance is often found by bagging 50–500 trees
Bagging
Uses the same algorithm for every predictor and trains each one on a random subset of the training data
Uses sampling with replacement
Samples several times for the same predictor
Bagging
Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors
The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression
Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.
Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced
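A minimal sketch of bagging decision trees with scikit-learn's BaggingClassifier (500 trees here; the data set and values are illustrative):

```python
# A minimal sketch of bagging: the same algorithm (a decision tree) trained on
# bootstrap samples, with predictions aggregated by majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # same base algorithm for every predictor
    n_estimators=500,          # number of bootstrap samples / trees
    bootstrap=True,            # sampling with replacement (bagging rather than pasting)
    n_jobs=-1,
    random_state=42,
).fit(X_train, y_train)
single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("single tree accuracy:", single_tree.score(X_test, y_test))
print("bagged ensemble accuracy:", bag.score(X_test, y_test))
```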
Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all
By default, a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set
This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors
Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set
Example
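A minimal sketch of out-of-bag evaluation using BaggingClassifier's oob_score option (this reconstruction is an assumption, not necessarily the slide's original code):

```python
# A minimal sketch of out-of-bag evaluation: each predictor is scored on the
# ~37% of training instances it never saw, so no separate validation set is needed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    bootstrap=True,
    oob_score=True,   # evaluate each predictor on its out-of-bag instances
    random_state=42,
).fit(X_train, y_train)
print("oob accuracy estimate:", bag.oob_score_)
print("actual test accuracy:", bag.score(X_test, y_test))
```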
Boosting
Boosting - any Ensemble method that can combine several weak learners into a strong learner
The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor
Examples:
AdaBoost
Gradient Boosting
Extreme Gradient Boosting (xgboost)
Light Gradient Boosting Machine (LGBM)
How does boosting work?
The main idea of boosting is to add new models to the ensemble sequentially
Boosting attacks the bias-variance trade-off by starting with a weak model and sequentially boosting its performance by continuing to build new trees, where each new tree in the sequence tries to fix up where the previous one made the biggest mistakes (i.e., each new tree in the sequence will focus on the training rows where the previous tree had the largest prediction errors)
Boosting is a framework that iteratively improves any weak learning model
A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each model in the sequence slightly improves upon the performance of the previous one
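A minimal sketch of this idea using AdaBoost (scikit-learn's AdaBoostClassifier with decision stumps as the weak learners; the settings and data set are only illustrative):

```python
# A minimal sketch of boosting: shallow trees trained sequentially, each one
# paying more attention to the instances its predecessors misclassified.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # a decision stump: a weak learner
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
).fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```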
Boosting Drawbacks
The main drawback - it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated
As a result, it does not scale as well as bagging or pasting
Boosting hyperparameters
The number of trees
The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem
The number d of splits in each tree, which controls the complexity of the boosted ensemble
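A minimal sketch mapping these three hyperparameters onto scikit-learn's GradientBoostingRegressor (the parameter names are scikit-learn's; the data set and values are only illustrative):

```python
# A minimal sketch of the three boosting hyperparameters listed above.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

gbrt = GradientBoostingRegressor(
    n_estimators=500,    # the number of trees
    learning_rate=0.01,  # the shrinkage parameter (lambda)
    max_depth=2,         # controls d, the number of splits in each tree
    random_state=42,
)
print("5-fold CV R^2:", cross_val_score(gbrt, X, y, cv=5).mean())
```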
Stacking and Blending
- Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0)
Stacking and Blending
To train the blender, a common approach is to use a hold-out set
First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer
Stacking and Blending
Next, the first layer’s predictors are used to make predictions on the second (held out) set
This ensures that the predictions are “clean,” since the predictors never saw these instances during training
Stacking and Blending
- It is possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression), to get a whole layer of blenders
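A minimal sketch of stacking with scikit-learn's StackingRegressor (note that it trains the blender on cross-validated predictions rather than on a single hold-out set, but the idea is the same; the base models chosen here are only illustrative):

```python
# A minimal sketch of stacking: three first-layer predictors feed a linear blender.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("rf", RandomForestRegressor(random_state=42)),
        ("svr", SVR()),
    ],
    final_estimator=LinearRegression(),  # the blender / meta-learner
    cv=5,  # first-layer predictions for the blender come from held-out folds
).fit(X_train, y_train)
print("test R^2:", stack.score(X_test, y_test))
```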
Random Forest
Random Forest is an ensemble of Decision Trees, generally trained via the bagging method
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features
The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model
Hyperparameters
Each algorithm has hyperparameters
For instance, for a Random Forest you can specify the number of trees
The number of trees needs to be sufficiently large to stabilize the error rate
Tree complexity - node size, max depth, etc
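A minimal sketch of these hyperparameters for a random forest in scikit-learn (parameter names are scikit-learn's; the chosen values are only illustrative):

```python
# A minimal sketch of the random forest hyperparameters mentioned above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees: large enough to stabilize the error rate
    max_depth=None,        # tree complexity: maximum depth (None = grow trees fully)
    min_samples_leaf=1,    # tree complexity: minimum node size
    max_features="sqrt",   # size of the random feature subset tried at each split
    n_jobs=-1,
    random_state=42,
)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```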
Sampling
The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement
Decreasing the sample size leads to more diverse trees and thereby lower between-tree correlation, which can have a positive effect on the prediction accuracy
In addition, if there are a few dominating features in your data set, reducing the sample size can help to minimize between-tree correlation
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
Feature Engineering and Selection: A Practical Approach for Predictive Models (Johnson & Kuhn)