Data Mining Principles TA Session 6 (February 19, 2021)

Agenda

  • Overfitting, Underfitting, and Generalization Error

  • Bias - Variance trade-off

  • Cross-Validation

  • Ensemble Learning

  • Bagging and Boosting

  • Stacking and Blending

  • Random Forest

Fitting the model

  • Overfitting - the model performs well on the training data, but it does not generalize well. The model is too complex relative to the amount and noisiness of the training data

  • Underfitting - the model is too simple to learn the underlying structure of the data

Overfitting and Underfitting

  • Overfitting - low bias and high variance

  • Underfitting - high bias and low variance

  • Overfitting - small training MSE but a large test MSE

  • Underfitting - large training MSE and a large test MSE

Solutions for Overfitting

  • Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model

  • Gather more training data

  • Reduce the noise in the training data (e.g., fix data errors and remove outliers)

Solutions for Underfitting

  • Select a more powerful model, with more parameters

  • Feed better features to the learning algorithm (feature engineering)

  • Reduce the constraints on the model (e.g., reduce the regularization hyperparameter)
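
As a minimal sketch, assuming scikit-learn, a synthetic regression dataset, and a Ridge model (none of which come from the course material): increasing the regularization strength alpha constrains the model (a remedy for overfitting), while decreasing it relaxes the constraint (a remedy for underfitting).

```python
# Minimal sketch: tuning regularization strength (assumes scikit-learn
# and a synthetic dataset, not the course data).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A large alpha constrains the model (helps against overfitting);
# a small alpha relaxes the constraint (helps against underfitting).
for alpha in (100.0, 1.0, 0.01):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, model.score(X_train, y_train), model.score(X_test, y_test))
```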

Generalization Error

  • Another name for the error on new (test) data

  • The error rate on new cases is called the generalization error (or out-of-sample error)

  • If the training error is low but the generalization error is high, it means that your model is overfitting the training data

  • It is common to use 80% of the data for training and hold out 20% for testing

  • If the dataset contains 10 million instances, then holding out just 1% gives a test set of 100,000 instances, which is probably more than enough to get a good estimate of the generalization error

  • The model’s generalization error can be expressed as the sum of three very different errors: bias, variance, and irreducible error

Bias - Variance Trade-off

  • Bias - this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data

  • Variance - this part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data

  • Irreducible error - this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers)

  • Increasing a model’s complexity will typically increase its variance and reduce its bias

  • Conversely, reducing a model’s complexity increases its bias and reduces its variance
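
In symbols, for squared-error loss the expected test error at a point x_0 decomposes into exactly these three pieces (the standard result, as in James et al.):

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\left[\mathrm{Bias}\big(\hat{f}(x_0)\big)\right]^2}_{\text{bias}}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible error}}
```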

Bias

  • Bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict

  • It measures how far off in general a model’s predictions are from the correct value, which provides a sense of how well a model can conform to the underlying structure of the data

Variance

  • On the other hand, error due to variance is defined as the variability of a model prediction for a given data point.

  • However, highly flexible models bring their own problems, as they run the risk of overfitting to the training data

  • Although you may achieve very good performance on your training data, the model will not automatically generalize well to unseen data

  • Since high-variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk

  • Moreover, many algorithms that are capable of achieving high generalization performance have lots of hyperparameters that control the level of model complexity

Cross Validation

  • Training set - used for training your model

  • Test set - used for testing your model

  • Validation set - created by splitting the training set further into two parts: a new training set and a validation (holdout) set; a minimal split sketch follows this list

  • We can then train our model(s) on the new training set and estimate the performance on the validation set

  • Unfortunately, validation using a single holdout set can be highly variable and unreliable unless you are working with very large data sets

  • As the size of your data set shrinks, this concern grows

  • The most commonly used method - k-fold cross-validation
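
Before turning to k-fold cross-validation, here is a minimal sketch of the train/validation/test split described above, assuming scikit-learn; the synthetic dataset and the logistic regression model stand in for your own data and model.

```python
# Minimal sketch of a train/validation/test split (assumes scikit-learn;
# the synthetic dataset stands in for your own data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the data as the final test set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remaining data again to create the validation (holdout) set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_val, y_val))    # used for model selection and tuning
print(model.score(X_test, y_test))  # touched only once, for the final estimate
```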

k-fold Cross-Validation

  • How does it work? The training data are split into k roughly equal-sized folds; the model is trained on k − 1 folds and evaluated on the remaining fold, and this is repeated k times so that each fold serves once as the validation set. The k validation scores are then averaged
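
A minimal sketch with scikit-learn's cross_val_score (the synthetic dataset and the choice of cv=5 are illustrative):

```python
# Minimal sketch of 5-fold cross-validation (assumes scikit-learn;
# the synthetic dataset stands in for your own data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# cv=5: the data are split into 5 folds, and each fold is the validation set once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # the cross-validated performance estimate
```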

The Number of Folds

  • How many folds to choose?

  • Usually 5 or 10

  • Depends on your data size

Ensemble Learning

Ensembles - combinations of several models (either different algorithms or multiple instances of the same algorithm) whose predictions are aggregated, for example by averaging or by majority vote

  • Examples of ensembles:

    • Bagging

    • Boosting

    • Pasting

Hard Voting Classifier

  • The majority-vote classifier is called a hard voting classifier

  • This voting classifier often achieves a higher accuracy than the best classifier in the ensemble

Hard Voting Classifier

  • Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms

  • This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy

  • Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy)

Soft Voting Classifier

If all classifiers are able to estimate class probabilities, then you can predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
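
As a minimal sketch, assuming scikit-learn, a synthetic dataset, and an illustrative choice of base estimators, both voting schemes can be compared with VotingClassifier:

```python
# Minimal sketch of hard vs. soft voting (assumes scikit-learn;
# dataset and base estimators are illustrative, not the course setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probability=True enables soft voting
]

hard = VotingClassifier(estimators, voting="hard").fit(X_train, y_train)  # majority vote
soft = VotingClassifier(estimators, voting="soft").fit(X_train, y_train)  # averaged probabilities
print(hard.score(X_test, y_test), soft.score(X_test, y_test))
```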

Bagging

  • Bootstrap aggregating, also called bagging, is one of the first ensemble algorithms machine learning practitioners learn

  • Decision trees suffer from high variance

  • By model averaging, bagging helps to reduce variance and minimize overfitting

  • Bootstrap aggregating (bagging) prediction models is a general method for fitting multiple versions of a prediction model and then combining (or ensembling) them into an aggregated prediction

  • Optimal performance is often found by bagging 50–500 trees

Bagging

  • Trains multiple predictors with the same algorithm on different random subsets of the training set

  • Uses sampling with replacement (the bootstrap)

  • The same instance may be sampled several times for the same predictor

Bagging

  • Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors

  • The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression

  • Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.

  • Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced

Out-of-Bag Evaluation

  • With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all

  • By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set

  • This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors

  • Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set

Example
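
A minimal sketch of bagged decision trees with out-of-bag evaluation, assuming scikit-learn's BaggingClassifier (the dataset and settings are illustrative, not the course setup):

```python
# Minimal sketch of bagging decision trees with out-of-bag evaluation
# (assumes scikit-learn; dataset and settings are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,   # number of bagged trees
    bootstrap=True,     # sample with replacement (bootstrap=False would be pasting)
    oob_score=True,     # evaluate each tree on its out-of-bag instances
    n_jobs=-1,
    random_state=42,
)
bag.fit(X_train, y_train)
print(bag.oob_score_)             # oob estimate of generalization accuracy
print(bag.score(X_test, y_test))  # compare with the held-out test accuracy
```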

Boosting

  • Boosting - any Ensemble method that can combine several weak learners into a strong learner

  • The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor

  • Examples:

  • AdaBoost

  • Gradient Boosting

  • Extreme Gradient Boosting (xgboost)

  • Light Gradient Boosting Machine (LGBM)

How does boosting work?

  • The main idea of boosting is to add new models to the ensemble sequentially

  • Boosting attacks the bias-variance trade-off by starting with a weak model and sequentially boosting its performance by continuing to build new trees, where each new tree in the sequence tries to fix up where the previous one made the biggest mistakes (i.e., each new tree focuses on the training rows where the previous trees had the largest prediction errors)

  • Boosting is a framework that iteratively improves any weak learning model

  • A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each model in the sequence slightly improves upon the performance of the previous one
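
To make the "fix the predecessor's mistakes" idea concrete, here is a minimal hand-rolled sketch of boosting for regression under squared error, assuming scikit-learn trees and a synthetic dataset: each new tree is fit to the residuals of the ensemble built so far.

```python
# Minimal hand-rolled sketch of boosting for regression (assumes scikit-learn
# and synthetic data): each tree is fit to the residuals left by the ensemble so far.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=42)

learning_rate = 0.1                    # shrinkage: how much each new tree contributes
trees, prediction = [], np.zeros(len(y))

for _ in range(100):
    residuals = y - prediction                        # the current biggest mistakes
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)                            # new weak learner focuses on those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))                 # training MSE after boosting
```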

Boosting Drawbacks

  • The main drawback is that training cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated

  • As a result, it does not scale as well as bagging or pasting

Boosting hyperparameters

  • The number of trees

  • The shrinkage parameter λ, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem

  • The number d of splits in each tree, which controls the complexity of the boosted ensemble
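
As an illustrative mapping (not a recommendation of specific values), these three hyperparameters correspond to the following parameters of scikit-learn's gradient boosting implementation:

```python
# Minimal sketch mapping the boosting hyperparameters above to scikit-learn
# parameter names (values are illustrative, not recommendations).
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(
    n_estimators=1000,    # the number of trees
    learning_rate=0.01,   # the shrinkage parameter (lambda)
    max_depth=3,          # the depth of each tree (the analogue of d), controlling complexity
    random_state=42,
)
# gbm.fit(X_train, y_train)  # X_train, y_train stand in for your training data
```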

Stacking and Blending

  • For example, suppose each of three base predictors predicts a different value (say 3.1, 2.7, and 2.9); the final predictor (called a blender, or a meta learner) then takes these predictions as inputs and makes the final prediction (3.0)

Stacking and Blending

  • To train the blender, a common approach is to use a hold-out set

  • First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer

Stacking and Blending

  • Next, the first layer’s predictors are used to make predictions on the second (held out) set

  • This ensures that the predictions are “clean,” since the predictors never saw these instances during training

Stacking and Blending

  • It is possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression), to get a whole layer of blenders
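
A minimal sketch of hold-out blending as described above, assuming scikit-learn; the dataset, the base models, and the choice of a linear blender are illustrative. (scikit-learn's StackingRegressor/StackingClassifier implement a related approach that uses cross-validated predictions instead of a single hold-out set.)

```python
# Minimal sketch of blending with a hold-out set (assumes scikit-learn;
# dataset, base models, and the blender choice are illustrative).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=42)

# First layer: train the base predictors on the first subset.
base_models = [RandomForestRegressor(random_state=42), Ridge(), SVR()]
for model in base_models:
    model.fit(X_train, y_train)

# Second layer: their "clean" predictions on the held-out subset
# become the training features for the blender.
blend_features = np.column_stack([m.predict(X_hold) for m in base_models])
blender = LinearRegression().fit(blend_features, y_hold)
print(blender.coef_)   # how much weight the blender gives each base model
```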

Random Forest

  • Random Forest is an ensemble of Decision Trees, generally trained via the bagging method

  • The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features

  • The algorithm results in greater tree diversity, which (again) trades a higher bias for a lower variance, generally yielding an overall better model

Hyperparameters

  • Each algorithm has hyperparameters

  • For instance, for a Random Forest you can specify the number of trees

  • The number of trees needs to be sufficiently large to stabilize the error rate

  • Tree complexity - node size, maximum depth, etc.
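
As an illustrative sketch, assuming scikit-learn's RandomForestClassifier (the values are placeholders, not recommendations), these hyperparameters look as follows:

```python
# Minimal sketch of the main random forest hyperparameters
# (assumes scikit-learn; values are illustrative, not recommendations).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # number of trees: large enough to stabilize the error rate
    max_depth=None,       # tree complexity: maximum depth (None = grow fully)
    min_samples_leaf=1,   # tree complexity: minimum node size
    max_features="sqrt",  # size of the random feature subset tried at each split
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train)  # X_train, y_train stand in for your training data
```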

Sampling

  • The default sampling scheme for random forests is bootstrapping where 100% of the observations are sampled with replacement

  • Decreasing the sample size leads to more diverse trees and thereby lower between-tree correlation, which can have a positive effect on the prediction accuracy

  • Consequently, if there are a few dominating features in your data set, reducing the sample size can also help to minimize between-tree correlation
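
A minimal sketch of reducing the per-tree sample size, assuming scikit-learn >= 0.22, where the forest estimators expose a max_samples parameter:

```python
# Minimal sketch of shrinking the per-tree sample size to decorrelate trees
# (assumes scikit-learn >= 0.22, where RandomForestClassifier accepts max_samples).
from sklearn.ensemble import RandomForestClassifier

rf_small_samples = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,    # sample with replacement (the default scheme)
    max_samples=0.5,   # each tree sees a bootstrap sample of 50% of the rows
    random_state=42,
)
```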

Sources

  • Hands-On Machine Learning with R (Boehmke & Greenwell)

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)

  • Applied Predictive Modeling (Johnson & Kuhn)

  • An Introduction to Statistical Learning: With Applications in R (James et al.)

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • Feature Engineering and Selection: A Practical Approach for Predictive Models (Johnson & Kuhn)