Data Mining Principles TA Session 5 (February 12, 2021)

Agenda

  • Decision Trees

  • Classification and Regression Trees (CART) algorithm

  • Support Vector Machines (SVM)

Decision Trees

  • DT are versatile Machine Learning algorithms that can perform both classification and regression tasks

  • DT are also the fundamental components of Random Forests

  • DT require very little data preparation. In fact, they don’t require feature scaling or centering at all
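
  • A minimal sketch of the points above, assuming scikit-learn: a classification tree fit directly on raw features, with no scaling or centering

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Raw, unscaled features go straight into the tree
    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # test-set accuracy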

Decision Trees

  • DT can also perform regression

  • The main difference is that instead of predicting a class at each leaf, the tree predicts a numeric value
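
  • A small sketch of the regression case, assuming scikit-learn: each leaf predicts the mean target value of the training instances that fall into it

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

    reg = DecisionTreeRegressor(max_depth=3, random_state=0)
    reg.fit(X, y)
    print(reg.predict([[2.5]]))   # a numeric value, not a class label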

Elements of Decision Trees

  • Root node - the top of the tree, containing all the training data

  • Final subgroups - terminal nodes or leaves

  • Nodes in the middle - internal nodes

  • Connections between nodes - branches
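
  • One way to see these elements, assuming scikit-learn: the text export prints the root split first, indented internal nodes below it, and "class: ..." lines for the leaves

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=42)
    clf.fit(iris.data, iris.target)

    # Root node at the top, internal nodes indented, leaves end in "class: ..."
    print(export_text(clf, feature_names=list(iris.feature_names)))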

Hyperparameters of Decision Trees

  • max_depth (the maximum depth of the tree)

  • min_samples_split (the minimum number of samples a node must have before it can be split)

  • min_samples_leaf (the minimum number of samples a leaf node must have)

  • min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances)

  • max_leaf_nodes (the maximum number of leaf nodes)

  • max_features (the maximum number of features that are evaluated for splitting at each node)

  • criterion (the function to measure the quality of a split, i.e., the impurity measure)
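
  • These names map onto scikit-learn's DecisionTreeClassifier constructor; a sketch with illustrative values (note that the impurity measure is passed as criterion)

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        max_depth=4,                  # maximum depth of the tree
        min_samples_split=10,         # minimum samples a node needs before splitting
        min_samples_leaf=5,           # minimum samples required in each leaf
        min_weight_fraction_leaf=0.0, # same idea, as a fraction of weighted samples
        max_leaf_nodes=20,            # cap on the number of leaf nodes
        max_features="sqrt",          # features evaluated at each split
        criterion="gini",             # impurity measure: "gini" or "entropy"
        random_state=42,
    )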

Impurity

  • Node’s gini attribute measures its impurity: a node is “pure” (gini = 0) if all training instances it applies to belong to the same class

  • Two common measures - Gini impurity and entropy

  • The Gini impurity measures how often a randomly chosen element of a node would be mislabeled if it were labeled randomly according to the node's class distribution

  • Entropy is an information-theoretic measure of how mixed (disordered) the classes are within a node

  • The lower the impurity measure - the better
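
  • A sketch of the two measures computed from a node's class proportions

    import numpy as np

    def gini(p):
        """Gini impurity: 1 - sum_k p_k^2 (zero for a pure node)."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        """Entropy: -sum_k p_k * log2(p_k), skipping zero proportions."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # pure node: both are zero
    print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 50/50 mix: 0.5 and 1.0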

Advantages of Decision Trees

  • Simple, easy to explain, very fast

  • A good starting point of the analysis

  • Minimum preprocessing of data

  • Some algorithms can handle missing data

Disadvantages of Decision Trees

  • Do not achieve state-of-the-art results

  • Deep trees tend to have high variance (and low bias), while shallow trees tend to have high bias (but low variance)

  • A small change in the data can give completely different results

CART

  • There are many methodologies for constructing decision trees, but the most well-known is the classification and regression tree (CART) algorithm proposed by Breiman et al. (1984)

  • The algorithm works by first splitting the training set into two subsets using a single feature k and a threshold t (e.g., “petal length ≤ 2.45 cm”)

  • Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively

  • It stops recursing once it reaches the maximum depth or if it cannot find a split that will reduce impurity
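
  • A sketch of a single CART split on one feature: try every candidate threshold and keep the one with the lowest weighted Gini impurity of the two resulting subsets

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(x, y):
        """Best threshold t for the split x <= t, by weighted impurity."""
        best_t, best_imp = None, np.inf
        for t in np.unique(x)[:-1]:                  # candidate thresholds
            left, right = y[x <= t], y[x > t]
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best_imp:
                best_t, best_imp = t, imp
        return best_t, best_imp

    x = np.array([1.0, 1.4, 2.45, 4.8, 5.1, 5.9])    # e.g., petal length (cm)
    y = np.array([0, 0, 0, 1, 1, 1])
    print(best_split(x, y))                           # threshold 2.45, weighted impurity 0.0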

CART

  • CART algorithm is a greedy algorithm

  • It greedily searches for an optimum split at the top level, then repeats the process at each subsequent level

  • It does not check whether or not the split will lead to the lowest possible impurity several levels down

  • A greedy algorithm often produces a solution that’s reasonably good but not guaranteed to be optimal

Support Vector Machines (SVM)

  • SVM offers a direct approach to binary classification - find a hyperplane in some feature space that “best” separates the two classes

  • In practice, the two classes are rarely perfectly separable

  • SVM solutions: loosen the “perfectly separates” requirement (soft margins) and use the “kernel trick”

Hard Margin Classifier

  • HMC draws a boundary that provides the maximum separation between the two classes

  • The “optimal” separating hyperplane

  • Separates the two classes by maximizing the distance between them

Hard Margin Classifier

  • Black line - decision boundary (a separating hyperplane)

  • Dashed lines - the boundaries of the margins (shaded regions) on each side of the hyperplane

  • HMC finds the separating hyperplane that provides the largest margin/gap between the two classes
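
  • A sketch, assuming scikit-learn (which has no explicit hard-margin mode): a linear SVC with a very large C behaves essentially like the HMC on separable data

    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    iris = load_iris()
    X = iris.data[iris.target != 2, 2:]   # two linearly separable classes, two features
    y = iris.target[iris.target != 2]

    hard_margin = SVC(kernel="linear", C=1e6).fit(X, y)
    print(hard_margin.support_vectors_)   # the points that define the margin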

Soft Margin Classifier

  • In some situations HMC will not perform well

  • Find a good balance between keeping the street as large as possible and limiting the margin violations

  • We can loosen the constraints (or soften the margin) by allowing some points to be on the wrong side of the margin; this is referred to as the soft margin classifier

  • SMC finds the separating hyperplane that provides the largest margin/gap between the two classes, but allows for some of the points to cross over the margin boundaries

  • C is the allowable budget for the total amount of margin overlap and is our first tunable hyperparameter for the SVM

  • C lets us control how tight or loose the classifier is (note that scikit-learn's SVC parameterizes C as a penalty on violations, so a larger C there means a tighter margin)
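
  • A sketch of a soft-margin fit, assuming scikit-learn: points with |decision_function| < 1 lie inside the street, i.e., they violate the margin

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
    soft_margin = SVC(kernel="linear", C=1.0).fit(X, y)

    # Margin boundaries sit at decision_function = +/- 1
    violations = np.sum(np.abs(soft_margin.decision_function(X)) < 1)
    print(violations, "margin violations out of", len(X))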

Examples of Different Cs

  • Left: Zero budget for overlap (i.e., the HMC).

  • Right: Maximum allowable overlap. The solid black points represent the support vectors that define the margin boundaries

  • If your SVM model is overfitting, you can try regularizing it by reducing C
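
  • A sketch of the effect of C in scikit-learn's parameterization: a small C tolerates more overlap (more support vectors, stronger regularization), while a large C approaches the hard margin

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)

    for C in (0.01, 1, 100):
        svm = SVC(kernel="linear", C=C).fit(X, y)
        print(f"C={C}: {len(svm.support_vectors_)} support vectors")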

Kernel Trick

  • Linear decision boundaries are rarely enough in real-world problems

  • Kernel trick helps to solve the problem of nonlinear relationships

  • In essence, SVMs use the kernel trick to enlarge the feature space using basis functions

  • In this enlarged (kernel induced) feature space, a hyperplane can often separate the two classes
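
  • A sketch on a toy non-linear dataset, assuming scikit-learn: a linear SVC cannot separate the two moons, while an RBF-kernel SVC can

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf", gamma=2, C=1).fit(X, y)

    print("linear:", linear_svm.score(X, y))   # underfits the curved boundary
    print("rbf:   ", rbf_svm.score(X, y))      # separates the two moons well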

Kernel Functions for SVM

  • The radial basis kernel is extremely flexible; as a rule of thumb, we generally start with this kernel when fitting SVMs in practice
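
  • The radial basis function kernel is K(x, x') = exp(-gamma * ||x - x'||^2); a quick sketch computing it by hand and via scikit-learn

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x1 = np.array([[1.0, 2.0]])
    x2 = np.array([[2.0, 3.5]])
    gamma = 0.5

    manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
    print(manual, rbf_kernel(x1, x2, gamma=gamma)[0, 0])   # identical values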

Example

SVM Regression

  • Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations

  • The width of the street is controlled by a hyperparameter, ϵ
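
  • A sketch, assuming scikit-learn's SVR: widening the street (larger epsilon) means more points fall inside it and contribute no loss, so fewer support vectors are needed

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

    narrow = SVR(kernel="rbf", epsilon=0.05).fit(X, y)
    wide = SVR(kernel="rbf", epsilon=0.5).fit(X, y)
    print(len(narrow.support_), len(wide.support_))   # wider street -> fewer support vectors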

Multiclass classification

  • The basic SVM classifier is applicable to only two classes

  • Two approaches: one-versus-all (OVA) and one-versus-one (OVO)

  • In OVA, we fit an SVM for each class (one class versus the rest) and classify to the class for which the margin is the largest

  • In OVO, we fit all pairwise SVMs and classify to the class that wins the most pairwise competitions
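
  • A sketch of the two strategies using scikit-learn's wrappers (SVC itself already uses OVO internally for multiclass data)

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # three classes

    ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # one SVM per class
    ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # one SVM per pair of classes

    print(len(ova.estimators_), len(ovo.estimators_))   # 3 one-vs-rest models, 3 pairwise models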

Additional information

  • Scaling can improve the performance of SVM

  • In the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal

  • After scaling (right plot), the decision boundary looks much better
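
  • A sketch, assuming scikit-learn: putting a scaler in front of the SVM in a pipeline so that no feature dominates the margin simply because of its units

    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))
    scaled_svm.fit(X, y)
    print(scaled_svm.score(X, y))   # training accuracy after scaling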

Advantages of SVM

  • SVMs try to maximize generalizability; since they are essentially convex optimization problems, we are always guaranteed to find a global optimum

  • By softening the margin using a budget (or cost) parameter (C), SVMs are relatively robust to outliers (opinions vary)

  • And finally, using kernel functions, SVMs are flexible enough to adapt to complex nonlinear decision boundaries

Disadvantages of SVM

  • Slow to train, especially on large datasets

  • Need extra work to handle multiclass classification

Sources

  • Hands-On Machine Learning with R (Boehmke & Greenwell)

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Géron)

  • Applied Predictive Modeling (Kuhn & Johnson)

  • An Introduction to Statistical Learning: With Applications in R (James et al.)

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • Feature Engineering and Selection: A Practical Approach for Predictive Models (Kuhn & Johnson)