Data Mining Principles TA Session 5 (February 12, 2021)
Agenda
Decision Trees
Classification and Regression Trees (CART) algorithm
Support Vector Machines (SVM)
Decision Trees
DT are versatile Machine Learning algorithms that can perform both classification and regression tasks
DT are also the fundamental components of Random Forests
DT require very little data preparation. In fact, they don’t require feature scaling or centering at all
Decision Trees
DT can also perform regression tasks
The main difference is that, instead of predicting a class in each node, the tree predicts a value
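A minimal sketch of a regression tree, assuming scikit-learn and an illustrative toy dataset (neither is prescribed by the slides):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a noisy quadratic signal (illustrative only)
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# Each leaf predicts the mean target value of the training instances it contains
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)
print(reg.predict([[1.5]]))  # piecewise-constant prediction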
Elements of Decision Trees
Root node - the top of the tree, contains all training data
Final subgroups - terminal nodes or leaves
Nodes in the middle - internal nodes
Connections between nodes - branches
Hyperparameters of Decision Trees
max_depth (the maximum depth of the tree)
min_samples_split (the minimum number of samples a node must have before it can be split)
min_samples_leaf (the minimum number of samples a leaf node must have)
min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances)
max_leaf_nodes (the maximum number of leaf nodes)
max_features (the maximum number of features that are evaluated for splitting at each node)
criterion (the impurity function used to measure the quality of a split)
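These map directly onto scikit-learn's DecisionTreeClassifier; a minimal sketch with arbitrary, purely illustrative values:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    max_depth=3,                   # maximum depth of the tree
    min_samples_split=10,          # min samples a node needs before it can be split
    min_samples_leaf=5,            # min samples a leaf node must have
    min_weight_fraction_leaf=0.0,  # min_samples_leaf as a fraction of weighted instances
    max_leaf_nodes=8,              # maximum number of leaf nodes
    max_features=None,             # features evaluated at each split (None = all)
    criterion="gini",              # impurity measure: "gini" or "entropy"
)
clf.fit(X, y)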
Impurity
Node’s gini attribute measures its impurity: a node is “pure” (gini = 0) if all training instances it applies to belong to the same class
Two common impurity measures - Gini and entropy
Gini impurity measures how often a randomly chosen instance would be mislabeled if it were labeled randomly according to the class distribution in the node
Entropy is an information-theoretic measure of disorder; it is zero when all instances in a node belong to a single class
The lower the impurity, the better the split
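In symbols (following the notation in Geron's book, where p_{i,k} is the ratio of class-k instances among the training instances in node i):

G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2}            (Gini impurity of node i)
H_i = -\sum_{k=1,\; p_{i,k} \neq 0}^{n} p_{i,k} \log_2(p_{i,k})   (entropy of node i)

Both are zero for a pure node and grow as the class mix becomes more even.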
Advantages of Decision Trees
Simple, easy to explain, very fast
A good starting point of the analysis
Minimum preprocessing of data
Some algorithms can handle missing data
Disadvantages of Decision Trees
Do not achieve state-of-the-art results
Deep trees tend to have high variance (and low bias), while shallow trees tend to be overly biased (but low variance)
A small change in the data can give you completely different results
CART
There are many methodologies for constructing decision trees but the most well-known is the classification and regression tree (CART) algorithm proposed by Breiman et al. (1984)
The algorithm works by first splitting the training set into two subsets using a single feature k and a threshold t (e.g., “petal length ≤ 2.45 cm”)
Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively
It stops recursing once it reaches the maximum depth or if it cannot find a split that will reduce impurity
CART
The CART algorithm is a greedy algorithm
It greedily searches for an optimum split at the top level, then repeats the process at each subsequent level
It does not check whether or not the split will lead to the lowest possible impurity several levels down
A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal
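A hedged sketch of CART in practice, assuming scikit-learn (whose trees are an optimized CART implementation) and the iris data behind the "petal length ≤ 2.45 cm" example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and petal width only

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# The printed rules show the greedy splits; the root split is on
# petal length at roughly 2.45 cm
print(export_text(tree, feature_names=iris.feature_names[2:]))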
Support Vector Machines (SVM)
SVM offers a direct approach to binary classification - find a hyperplane in some feature space that "best" separates the two classes
In practice, perfect separation is often difficult to achieve
SVM solutions: loosen the "perfectly separates" requirement (soft margin) and use the "kernel trick"
Hard Margin Classifier
HMC draws a boundary that provides the maximum separation between the two classes
The “optimal” separating hyperplane
Separates the two classes by maximizing the distance between them
Hard Margin Classifier
Black line - decision boundary (a separating hyperplane)
Dashed lines - the boundaries of the margins (shaded regions) on each side of the hyperplane
HMC finds the separating hyperplane that provides the largest margin/gap between the two classes
Soft Margin Classifier
In some situations HMC will not perform well
The goal is to find a good balance between keeping the street (margin) as large as possible and limiting margin violations
We can loosen the constraints (or soften the margin) by allowing some points to be on the wrong side of the margin; this is referred to as the soft margin classifier
SMC finds the separating hyperplane that provides the largest margin/gap between the two classes, but allows for some of the points to cross over the margin boundaries
C is the allowable budget for the total amount of overlap and is our first tunable hyperparameter for the SVM
C lets us control how tight or loose the classifier is (note: in scikit-learn, C is the inverse of this budget - a smaller C allows more margin violations)
Examples of Different Cs
Left: Zero budget for overlap (i.e., the HMC).
Right: Maximum allowable overlap. The solid black points represent the support vectors that define the margin boundaries
If your SVM model is overfitting, you can try regularizing it by reducing C
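A minimal sketch of the effect of C, assuming scikit-learn's SVC with a linear kernel on two iris classes (values are illustrative; in scikit-learn a smaller C means a softer, more regularized margin):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]  # keep two classes for a binary problem

# Small C: soft margin, more violations tolerated (more regularization)
soft = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C: hard(er) margin, few violations tolerated
hard = SVC(kernel="linear", C=1000).fit(X, y)

# The softer margin typically keeps more support vectors
print(len(soft.support_), len(hard.support_))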
Kernel Trick
Linear decision boundaries are rare in real-world problems
The kernel trick helps to solve the problem of nonlinear relationships
In essence, SVMs use the kernel trick to enlarge the feature space using basis functions
In this enlarged (kernel induced) feature space, a hyperplane can often separate the two classes
Kernel Functions for SVM
The radial basis function (RBF) kernel is extremely flexible; as a rule of thumb, we generally start with this kernel when fitting SVMs in practice
Example
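A hedged sketch, assuming scikit-learn's SVC with the RBF kernel on a toy two-moons dataset (dataset and gamma value are illustrative choices):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable in the original feature space
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# The RBF kernel implicitly enlarges the feature space, so a separating
# hyperplane there corresponds to a nonlinear boundary in the original space
rbf_svm = SVC(kernel="rbf", gamma=2, C=1)
rbf_svm.fit(X, y)
print(rbf_svm.score(X, y))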
SVM Regression
Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations
The width of the street is controlled by a hyperparameter, ϵ
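A minimal sketch of SVM regression, assuming scikit-learn's LinearSVR on an illustrative toy dataset (epsilon sets the width of the street):

import numpy as np
from sklearn.svm import LinearSVR

# Toy linear data with noise (illustrative only)
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X[:, 0] + rng.randn(100)

# Instances that fall inside the epsilon-wide street do not affect the fit
svr = LinearSVR(epsilon=1.5, random_state=42)
svr.fit(X, y)
print(svr.predict([[1.0]]))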
Multiclass classification
The basic SVM formulation is applicable to only two classes
Two approaches: one-versus-all (OVA) and one-versus-one (OVO)
In OVA, we fit an SVM for each class (one class versus the rest) and classify to the class for which the margin is the largest
In OVO, we fit all pairwise SVMs and classify to the class that wins the most pairwise competitions
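A hedged sketch of both strategies using scikit-learn's explicit wrappers (shown for illustration; scikit-learn's SVC already applies OVO internally for multiclass data):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes

# OVA: one SVM per class (that class versus the rest)
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
# OVO: one SVM per pair of classes
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 3 per-class vs. 3 pairwise classifiers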
Additional information
Scaling can improve the performance of SVM
In the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street ends up close to horizontal
After scaling (right plot), the decision boundary looks much better
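A minimal sketch of feature scaling before an SVM, assuming scikit-learn's StandardScaler inside a Pipeline so the scaling is learned only from the training data:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Standardizing each feature (zero mean, unit variance) keeps one feature
# from dominating the margin simply because of its larger scale
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
scaled_svm.fit(X, y)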
Advantages of SVM
SVMs try to maximize generalizability. Since fitting an SVM is essentially a convex optimization problem, we're always guaranteed to find a global optimum
By softening the margin using a budget (or cost) parameter (𝐶), SVMs are relatively robust to outliers (opinions vary)
And finally, using kernel functions, SVMs are flexible enough to adapt to complex nonlinear decision boundaries
Disadvantages of SVM
Slow to train, especially on large datasets
Need extra work to handle multiclass classification
Interactive SVM tutorials
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
Feature Engineering and Selection: A Practical Approach for Predictive Models (Johnson & Kuhn)