# Data Mining Principles TA Session 5 (February 12, 2021)

## Agenda

Decision Trees

Classification and Regression Trees (CART) algorithm

Support Vector Machines (SVM)

## Decision Trees

DT are versatile Machine Learning algorithms that can perform both classification and regression tasks

DT are also the fundamental components of Random Forests

DT require very little data preparation. In fact, they don’t require feature scaling or centering at all

## Decision Trees

DT can also perform regression tasks

The main difference is that instead of predicting a class at each leaf, the tree predicts a numeric value
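A minimal scikit-learn sketch of both uses (the toy arrays below are made up purely for illustration):

```python
# Minimal sketch: the same tree machinery handles classification and regression.
# The toy arrays below are made-up illustration data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Classification: each leaf predicts a class label
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, [0, 0, 0, 1, 1, 1])
print(clf.predict([[2.5]]))   # -> [0]

# Regression: each leaf predicts a value (mean of the training targets in that leaf)
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X, [1.1, 1.9, 3.2, 9.8, 11.0, 12.1])
print(reg.predict([[2.5]]))   # -> roughly the mean of the nearby targets
```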

## Elements of Decision Trees

Root node - the top of the tree, containing all training data

Final subgroups - terminal nodes or leaves

Nodes in the middle - internal nodes

Connections between nodes - branches

## Hyperparameters of Decision Trees

`max_depth` (the maximum depth of the tree)

`min_samples_split` (the minimum number of samples a node must have before it can be split)

`min_samples_leaf` (the minimum number of samples a leaf node must have)

`min_weight_fraction_leaf` (same as `min_samples_leaf` but expressed as a fraction of the total number of weighted instances)

`max_leaf_nodes` (the maximum number of leaf nodes)

`max_features` (the maximum number of features that are evaluated for splitting at each node)

`criterion` (the impurity function used to measure the quality of a split)
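A hedged sketch of setting these hyperparameters on scikit-learn's `DecisionTreeClassifier`; the specific values are arbitrary and would normally be tuned, e.g., by cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Arbitrary illustrative values; tune them in practice.
tree = DecisionTreeClassifier(
    criterion="gini",          # impurity measure: "gini" or "entropy"
    max_depth=3,               # maximum depth of the tree
    min_samples_split=10,      # min samples a node needs before it can be split
    min_samples_leaf=5,        # min samples required in each leaf
    max_leaf_nodes=8,          # cap on the number of leaves
    max_features=None,         # features considered at each split (None = all)
)
tree.fit(X, y)
```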

## Impurity

Node’s gini attribute measures its impurity: a node is “pure” (gini = 0) if all training instances it applies to belong to the same class

Two common impurity measures - Gini and entropy

Gini impurity measures how often a randomly chosen element would be mislabeled if it were labeled randomly according to the class distribution in the node

Entropy is an information-theoretic measure of disorder: it is zero when a node contains only one class and maximal when the classes are evenly mixed

The lower the impurity measure, the better (the formulas are sketched below)
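For reference, a small Python sketch of the two standard formulas, Gini = 1 − Σ pₖ² and entropy = −Σ pₖ log₂ pₖ, where pₖ is the fraction of a node's instances belonging to class k (the example class counts are made up):

```python
from math import log2

def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    """Entropy: -sum of p * log2(p) over classes with p > 0."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(gini([50, 0]), entropy([50, 0]))      # pure node      -> 0.0 0.0
print(gini([25, 25]), entropy([25, 25]))    # evenly mixed   -> 0.5 1.0
```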

## Advantages of Decision Trees

Simple, easy to explain, very fast

A good starting point of the analysis

Minimum preprocessing of data

Some algorithms can handle missing data

## Disadvantages of Decision Trees

Do not achieve state-of-the-art results

Deep trees tend to have high variance (and low bias), while shallow trees tend to be overly biased (but have low variance)

A small change in the data can give completely different results

## CART

There are many methodologies for constructing decision trees, but the most well-known is the classification and regression tree (CART) algorithm proposed by Breiman (1984)

The algorithm works by first splitting the training set into two subsets using a single feature k and a threshold t (e.g., “petal length ≤ 2.45 cm”)

Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively

It stops recursing once it reaches the maximum depth or if it cannot find a split that will reduce impurity
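A minimal sketch (an illustration, not scikit-learn's actual implementation) of how one greedy CART split could be chosen: scan every feature k and candidate threshold t and keep the pair that minimizes the weighted Gini impurity of the two resulting subsets. Real implementations typically try midpoints between sorted feature values; this sketch simply reuses observed values:

```python
# Minimal sketch of one greedy CART split (illustration only).
def gini_of_labels(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return (feature_index, threshold) minimizing weighted Gini of the two subsets."""
    best = (None, None, float("inf"))
    for k in range(len(X[0])):
        for t in sorted({row[k] for row in X}):
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            if not left or not right:
                continue
            score = (len(left) * gini_of_labels(left)
                     + len(right) * gini_of_labels(right)) / len(y)
            if score < best[2]:
                best = (k, t, score)
    return best[0], best[1]

# Toy data (made up): feature 0 cleanly separates the two classes
X = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5]]
y = [0, 0, 1, 1]
print(best_split(X, y))   # -> (0, 1.4): "feature 0 <= 1.4" already isolates class 0
```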

## CART

CART algorithm is a greedy algorithm

It greedily searches for an optimum split at the top level, then repeats the process at each subsequent level

It does not check whether or not the split will lead to the lowest possible impurity several levels down

A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal

## Support Vector Machines (SVM)

SVM offers a direct approach to binary classification: find a hyperplane in some feature space that “best” separates the two classes

In practice, a perfectly separating hyperplane is often difficult to achieve

SVM’s solutions: loosen the “perfectly separates” requirement (the soft margin) and use the “kernel trick”

## Hard Margin Classifier

HMC draws a boundary that provides the maximum separation between the two classes

The “optimal” separating hyperplane

Separates the two classes by maximizing the distance between them

## Hard Margin Classifier

Black line - the decision boundary (a separating hyperplane)

Dashed lines - the boundaries of the margins (shaded regions) on each side of the hyperplane

HMC finds the separating hyperplane that provides the largest margin/gap between the two classes

## Soft Margin Classifier

In some situations HMC will not perform well

The objective is to find a good balance between keeping the street (margin) as wide as possible and limiting margin violations

We can loosen the constraints (or soften the margin) by allowing some points to be on the wrong side of the margin; this is referred to as the **soft margin classifier** (SMC)

The SMC finds the separating hyperplane that provides the largest margin/gap between the two classes, but allows some points to cross the margin boundaries

C governs the total amount of overlap allowed and is our first tunable hyperparameter for the SVM

C lets us control how tight or loose the classifier is (in scikit-learn's parameterization, a smaller C allows more overlap)

## Examples of Different Cs

Left: Zero budget for overlap (i.e., the HMC).

Right: Maximum allowable overlap. The solid black points represent the support vectors that define the margin boundaries

If your SVM model is overfitting, you can try regularizing it by reducing C
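A minimal scikit-learn sketch of the effect of C (the values are arbitrary). In scikit-learn's parameterization a small C softens the margin (more regularization), while a large C approaches hard-margin behavior:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X, y = X[y != 2, 2:], y[y != 2]      # keep two classes and two features for illustration

soft = LinearSVC(C=0.01).fit(X, y)   # small C: soft, wide margin (more regularization)
hard = LinearSVC(C=100.0).fit(X, y)  # large C: nearly hard margin, few violations tolerated
```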

## Kernel Trick

Linear decision boundaries are rare in real life

The **kernel trick** helps to handle nonlinear relationships

In essence, SVMs use the kernel trick to enlarge the feature space using basis functions

In this enlarged (kernel induced) feature space, a hyperplane can often separate the two classes

## Kernel Functions for SVM

The radial basis function (RBF) kernel is extremely flexible and, as a rule of thumb, we generally start with this kernel when fitting SVMs in practice
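A hedged sketch with scikit-learn's `SVC`: the RBF kernel adds a second hyperparameter, `gamma`, which controls how far each training point's influence reaches (the values below are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Nonlinearly separable toy data
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# RBF kernel: gamma controls the reach of each point, C the margin softness
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.5, C=1.0))
rbf_svm.fit(X, y)
print(rbf_svm.score(X, y))   # training accuracy on the toy data
```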

## Example

## SVM Regression

Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations

The width of the street is controlled by a hyperparameter, ϵ
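A minimal sketch with scikit-learn's `SVR`, where `epsilon` sets the width of the street around the fitted function (the toy data and values are made up):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)   # noisy sine wave as toy data

# epsilon controls the street width: points inside the street contribute no loss
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[2.5]]))   # prediction near sin(2.5)
```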

## Multiclass classification

The basic SVM classifier is applicable to only two classes

Two approaches: one-versus-all (OVA) and one-versus-one (OVO)

In OVA, we fit an SVM for each class (one class versus the rest) and classify to the class for which the margin is the largest

In OVO, we fit all pairwise SVMs and classify to the class that wins the most pairwise competitions
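A hedged scikit-learn sketch: `SVC` handles multiclass problems with a one-versus-one scheme internally, and either strategy can also be imposed explicitly via the `sklearn.multiclass` wrappers (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

# One-versus-one: one SVM per pair of classes (3 pairwise models here)
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

# One-versus-all (one-versus-rest): one SVM per class against the rest (3 models here)
ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovo.predict(X[:3]), ova.predict(X[:3]))
```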

## Additional information

Scaling can improve the performance of SVM

In the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal

After scaling, the right plot looks much better
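A minimal sketch of adding scaling before an SVM via a scikit-learn pipeline (this is the standard pattern; the dataset is just a convenient example with features on very different scales):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)   # features on very different scales

unscaled = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Scaling typically improves SVM accuracy noticeably on data like this
print(cross_val_score(unscaled, X, y, cv=5).mean())
print(cross_val_score(scaled, X, y, cv=5).mean())
```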

## Advantages of SVM

SVMs try to maximize generalizability. Since SVM training is essentially a convex optimization problem, we are always guaranteed to find a global optimum

By softening the margin using a budget (or cost) parameter (𝐶), SVMs are relatively robust to outliers (opinions vary)

And finally, using kernel functions, SVMs are flexible enough to adapt to complex nonlinear decision boundaries

## Disadvantages of SVM

Slow to train on large datasets

Need extra work to handle multiclass problems

## Interactive SVM tutorials

## Sources

**Hands-On Machine Learning with R** (Boehmke & Greenwell)

**Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** (Geron)

**Applied Predictive Modeling** (Johnson & Kuhn)

**An Introduction to Statistical Learning: With Applications in R** (James et al.)

**Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions** (Taddy)

**Feature Engineering and Selection: A Practical Approach for Predictive Models** (Johnson & Kuhn)