Data Mining Principles TA Session 2 (January 22, 2021)

Agenda

  • Unsupervised Learning

  • K-Means

  • DBSCAN

  • Distances

What Is Unsupervised Learning?

Unsupervised learning - identifying groups in a data set when the data do not have labels

  • Unsupervised learning is often performed as part of exploratory data analysis (EDA)

  • In unsupervised learning, the dataset is a collection of unlabeled examples, where each example \(x\) is a feature vector

  • Goal: create a model that takes a feature vector \(x\) as input and either transforms it into another vector or into a value that can be used to solve a practical problem

Key Types of Unsupervised Learning

  • The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction)

  • Clustering - segment observations into similar groups based on the observed variables

  • Dimension reduction - reducing the number of variables in the dataset

Examples of Unsupervised Learning

  • Divide consumers into different homogeneous groups

  • Identify groups of online shoppers with similar browsing and purchase histories

  • Identify products that have similar purchasing behavior so that managers can manage them as product groups

Identifying Patterns

Unsupervised Learning Techniques

  • Clustering (K-Means, DBSCAN)

  • Anomaly detection and novelty detection (One-class SVM, Isolation Forest)

  • Visualization and dimensionality reduction (PCA, Kernel PCA, t-SNE)

  • Association rule learning (Apriori, Eclat)

K-Means

  • A simple and common algorithm capable of clustering data very quickly and efficiently

  • Proposed by Stuart Lloyd at Bell Labs in 1957; it was only published outside of the company in 1982

  • Have to specify the number of clusters k that the algorithm must find

  • Each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the observations assigned to the cluster

  • Basic idea - construct clusters so that the total within-cluster variation is minimized (written out below)
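Written out, with clusters \(C_1, \dots, C_k\) and cluster means (centroids) \(\mu_1, \dots, \mu_k\), K-Means minimizes the total within-cluster sum of squares:

\[
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]

where \(\lVert x_i - \mu_j \rVert\) is the Euclidean distance between observation \(x_i\) and the centroid of its cluster.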

Data

Ideal Result

K-Means Steps

  • Indicate the number of clusters (𝑘) that will be generated

  • Randomly select 𝑘 observations from the data set to serve as the initial centers for the clusters

  • Assign each observation to its closest centroid, where closest is defined using the distance between the observation and the cluster mean

  • Compute the new center (i.e., mean value) of each cluster

  • Check again to see whether each point might now be closer to a different cluster

  • Reassign observations using the updated cluster means

  • Repeat the update steps until the cluster assignments stop changing (see the sketch below)
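A minimal sketch of these steps with scikit-learn's KMeans; the toy blob data and the choice of k = 3 are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy data: 150 points around 3 centers
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Specify k; KMeans then initializes centroids, assigns each point to the
# closest centroid, recomputes the means, and repeats until assignments stabilize
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)       # cluster index for each observation
centroids = kmeans.cluster_centers_  # final cluster means
```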

Approach

Potential Problem

  • Although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum)

  • Whether it does or not depends on the centroid initialization

Interactive example

Visualizing K-Means Clustering

Important nuance

  • Due to randomization of the initial 𝑘 observations, we can get slightly different results each time
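For example, a single purely random initialization (n_init=1, init="random") can converge to different solutions with different inertia depending on the seed; the toy data below is an illustrative assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Different seeds can land in different (possibly suboptimal) solutions
for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)
```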

How to select the number of clusters?

  • We do not know it in advance, because the data are unlabeled

  • Goal: to ascertain what natural distinct groups exist in the data

  • A human-made decision

  • Larger values of 𝑘 can improve the homogeneity of the clusters; however, this risks overfitting

  • Popular methods (Elbow method, Inertia, Silhouette Score)

Elbow method

  • Compute k-means clustering for different values of 𝑘

  • For each 𝑘, calculate the total within-cluster sum of squares (WSS) (the lower - the better)

  • Each observation is allocated to the closest cluster, where the distance between an observation and a cluster is the Euclidean distance between the observation and the cluster center; each cluster center is then updated as the mean of the observations in that cluster

  • Plot the curve of WSS according to the number of clusters 𝑘

  • The location of a bend (elbow) in the plot is considered an indicator of the appropriate number of clusters (see the sketch after this list)
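A sketch of the elbow computation with scikit-learn on illustrative toy data; inertia_ is scikit-learn's name for the total WSS:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Fit K-Means for a range of k and record the total within-cluster sum of squares
ks = range(1, 10)
wss = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, "o-")
plt.xlabel("Number of clusters k")
plt.ylabel("Total within-cluster sum of squares")
plt.show()  # look for the bend (elbow) in the curve
```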

Elbow Example

Inertia as a Metric

  • The sum of squared distances between each instance and its closest centroid (exposed as inertia_ in scikit-learn)

  • Run several models and keep the model with the lowest inertia? Not that simple!

  • The inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k

  • The more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be

Inertia and Number of k

As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases much more slowly as we keep increasing k.

Silhouette Score

  • A more precise approach (but also more computationally expensive)

  • The mean silhouette coefficient over all the instances (see the snippet after this list)

  • Can vary between –1 and +1

  • Close to +1 means that the instance is well inside its own cluster and far from other clusters

  • Close to 0 means that it is close to a cluster boundary

  • Close to –1 means that the instance may have been assigned to the wrong cluster
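A sketch of computing the mean silhouette coefficient with scikit-learn for several values of k, again on illustrative toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Mean silhouette coefficient for several values of k
# (the score is undefined for k = 1, so start at 2)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))
```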

Example

  • Although it confirms that k = 4 is a very good choice, it also underlines the fact that k = 5 is quite good as well, and much better than k = 6 or 7

  • This was not visible when comparing inertias

Silhouette Diagram

  • Plot every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient (sketched after this list)

  • Each diagram contains one knife shape per cluster

  • The shape’s height indicates the number of instances the cluster contains, and its width represents the sorted silhouette coefficients of the instances in the cluster (wider is better)

  • The dashed line indicates the mean silhouette coefficient
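A compact sketch of such a diagram for a single value of k, using silhouette_samples from scikit-learn and matplotlib; k = 4 and the toy data are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

k = 4
labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
coeffs = silhouette_samples(X, labels)  # one silhouette coefficient per instance

pos = 0
for cluster in range(k):
    # The sorted coefficients of one cluster form one "knife" shape
    c = np.sort(coeffs[labels == cluster])
    plt.fill_betweenx(np.arange(pos, pos + len(c)), 0, c)
    pos += len(c)

# Dashed line at the mean silhouette coefficient over all instances
plt.axvline(silhouette_score(X, labels), color="k", linestyle="--")
plt.xlabel("Silhouette coefficient")
plt.ylabel("Instances, grouped by cluster")
plt.show()
```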

Silhouette Diagram

Interpretation

  • The vertical dashed lines represent the silhouette score for each number of clusters

  • We can see that when k = 3 and when k = 6, we get bad clusters

  • When k = 4 or k = 5, the clusters look pretty good: most instances extend beyond the dashed line, to the right and closer to 1.0

  • When k = 4, the cluster at index 1 is rather big. When k = 5, all clusters have similar sizes

  • So, even though the overall silhouette score for k = 4 is slightly greater than for k = 5, it seems like a good idea to use k = 5 to get clusters of similar sizes

Which number of k to choose?

  • Try different approaches (Elbow method, Inertia, Silhouette Score)

  • Applications requiring the exact optimal set of clusters are fairly rare

  • In most applications, it suffices to choose a 𝑘 based on convenience rather than strict performance requirements

  • Ideally you need to understand data or work with an industry expert

Limitations of K-Means

  • Not perfect

  • Manual selection of k

  • Different results due to randomness

  • Suboptimal solutions are possible

  • Does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes

  • It is important to scale the input features before you run K-Means, or the clusters may be very stretched and K-Means will perform poorly (scaling still does not guarantee good clusters); see the sketch below
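A sketch of scaling before clustering, using a StandardScaler followed by KMeans in a scikit-learn pipeline; the toy data and k = 3 are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Standardize every feature to zero mean and unit variance before clustering,
# so that no single feature dominates the Euclidean distances
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
labels = pipeline.fit_predict(X)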

K-Means++

  • Important improvement to the K-Means algorithm - K-Means++, proposed by David Arthur and Sergei Vassilvitskii in 2006

  • Smarter initialization step that tends to select centroids that are distant from one another, and this improvement makes the K-Means algorithm much less likely to converge to a suboptimal solution

  • The additional computation required for the smarter initialization step is well worth it, because it makes it possible to drastically reduce the number of times the algorithm needs to be run to find the optimal solution (see the snippet below)
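In scikit-learn, K-Means++ initialization is the default; the snippet below simply makes it explicit (the toy data and k = 3 are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# init="k-means++" is already the default in scikit-learn;
# init="random" would fall back to plain random initialization
kmeans_pp = KMeans(n_clusters=3, init="k-means++", random_state=42).fit(X)
```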

Conclusion

  • K-Means clustering is probably the most popular clustering algorithm

  • Start with K-Means, it is simple and fast

  • You will decide the number of clusters (try different options, use Elbow method, and talk to experts)

K-Means Summary of Steps

  • Specify the number of clusters (𝑘) to be created

  • Select 𝑘 observations at random from the data set to use as centroids

  • Assign each observation to its closest centroid based on the distance measure selected

  • For each of the 𝑘 clusters update the cluster centroid by calculating the new mean values of all the data points in the cluster

  • Iterate steps until the cluster assignments stop changing (good rule of thumb - 10–20 iterations)

DBSCAN

  • DBSCAN defines clusters as continuous regions of high density

  • For each instance, the algorithm counts how many instances are located within a small distance \(ε\) (epsilon) from it. This region is called the instance's ε-neighborhood

  • If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance; in other words, core instances are the ones located in dense regions

  • All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster

  • Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly

How does it work?

  • We choose two parameters - \(ε\) and min_samples

  • Begin by picking an arbitrary point in the dataset

  • If there are at least min_samples points within a distance of epsilon from that point (including the original point itself), we consider all of them to be part of a cluster

  • Expand that cluster by checking all of the new points and seeing if they, too, have at least min_samples points within distance epsilon

  • When we run out of points to add to the cluster, pick a new arbitrary point and repeat (see the sketch below)
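A sketch of DBSCAN with scikit-learn; the two-moons toy data and the values eps=0.2 and min_samples=5 are illustrative, not recommendations:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape K-Means handles poorly
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps is the neighborhood radius, min_samples the density threshold
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)              # label -1 marks anomalies (noise)

core_indices = dbscan.core_sample_indices_  # indices of the core instances
core_points = dbscan.components_            # coordinates of the core instances
```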

Example

  • If we widen each instance’s neighborhood by increasing eps to 0.2, we get the clustering on the right, which looks perfect

Interesting Fact

  • It is possible that two clusters will “fight over” a point, and it is arbitrary which of the two clusters it ends up in

  • Since DBSCAN considers the points in an arbitrary order, the middle point can end up in either the left or the right cluster on different runs

  • This kind of point is known as a “border point”

Interactive example

Visualizing DBSCAN Clustering

Interesting Fact

  • scikit-learn's DBSCAN class does not have a predict() method, although it has a fit_predict() method

  • It cannot predict which cluster a new instance belongs to

  • This implementation decision was made because different classification algorithms can be better for different tasks

  • The authors decided to let the user choose which one to use (see the sketch below)
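One common workaround, described in the Géron book listed in the sources, is to train a separate classifier on the core instances and their cluster labels and use it for new points; the choice of KNeighborsClassifier and the parameter values here are just one illustrative option:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Train a classifier on DBSCAN's core instances and their cluster labels
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

# New instances can now be assigned to one of the discovered clusters
X_new = [[-0.5, 0.0], [1.5, -0.5]]  # hypothetical new points
print(knn.predict(X_new))
```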

DBSCAN Summary

  • DBSCAN is a very simple yet powerful algorithm capable of identifying any number of clusters of any shape

  • It is robust to outliers, and it has just two hyperparameters (eps and min_samples)

  • If the density varies significantly across the clusters, however, it can be impossible for it to capture all the clusters properly

HDBSCAN

  • HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise

  • Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon

  • This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection

  • HDBSCAN usually returns a good clustering straight away, with little or no parameter tuning, and it is quite fast (see the sketch below)

  • HDBSCAN only has one important hyperparameter: n, the minimum number of examples to put in a cluster
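A minimal sketch using the separate hdbscan Python package (not part of scikit-learn); its min_cluster_size parameter plays the role of the "minimum number of examples to put in a cluster" mentioned above, and the value 10 and the toy data are illustrative:

```python
import hdbscan  # separate package: pip install hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# min_cluster_size is the main hyperparameter: the smallest group of points
# that should still count as a cluster; no epsilon needs to be chosen
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)  # -1 marks noise, as in DBSCAN
```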

Distances

  • A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain

  • Euclidean Distance - the shortest distance between two points

  • Manhattan Distance - the sum of absolute differences between points across all the dimensions

  • Minkowski Distance - the generalized form of Euclidean and Manhattan Distance

  • Hamming Distance - the number of positions at which the corresponding characters of two equal-length strings differ (see the sketch below)
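The four distances, sketched with plain NumPy on small illustrative inputs:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1 -> Manhattan, p=2 -> Euclidean

# Hamming distance between two equal-length strings: count of differing positions
s1, s2 = "karolin", "kathrin"
hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))  # -> 3

print(euclidean, manhattan, minkowski, hamming)
```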

Sources

  • Hands-On Machine Learning with R (Boehmke & Greenwell)

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)

  • Applied Predictive Modeling (Johnson & Kuhn)

  • An Introduction to Statistical Learning: With Applications in R (James et al.)

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • The Hundred-Page Machine Learning Book (Burkov)