Data Mining Principles TA Session 2 (January 22, 2021)
Agenda
Unsupervised Learning
K-Means
DBSCAN
Distances
What Is Unsupervised Learning?
Unsupervised learning - identifying groups in a data set when the data do not have labels
Unsupervised learning is often performed as part of exploratory data analysis (EDA)
In unsupervised learning, the dataset is a collection of unlabeled examples
x is a feature vector
Goal: create a model that takes a feature vector \(x\) as input and either transforms it into another vector or into a value that can be used to solve a practical problem
Key Types of Unsupervised Learning
The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction)
Clustering - segment observations into similar groups based on the observed variables
Dimension reduction - reducing the number of variables in the dataset
Examples of Unsupervised Learning
Divide consumers into different homogeneous groups
Identify groups of online shoppers with similar browsing and purchase histories
Identify products that have similar purchasing behavior so that managers can manage them as product groups
Identifying Patterns
Unsupervised Learning Techniques
Clustering (K-Means, DBSCAN)
Anomaly detection and novelty detection (One-class SVM, Isolation Forest)
Visualization and dimensionality reduction (PCA, Kernel PCA, t-SNE)
Association rule learning (Apriori, Eclat)
K-Means
A simple and common algorithm capable of clustering data very quickly and efficiently
Proposed by Stuart Lloyd at Bell Labs in 1957; it was only published outside the company in 1982
You have to specify the number of clusters 𝑘 that the algorithm must find
Each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the observations assigned to the cluster
Basic idea - constructing clusters so that the total within-cluster variation is minimized
Data
Ideal Result
K-Means Steps
Indicate the number of clusters (𝑘) that will be generated
Randomly select 𝑘 observations from the data set to serve as the initial centers for the clusters
Assign each observation to its closest centroid, where closest is defined by the distance between the observation and the cluster mean
Compute the new center (i.e., mean value) of each cluster
Check whether each point is now closer to a different cluster
Reassign observations using the updated cluster means
Repeat the assignment and update steps until the cluster assignments stop changing
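A minimal sketch of these steps with scikit-learn's KMeans; the feature matrix and the choice of k = 4 are illustrative assumptions, not part of the original slides:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for a real feature matrix (assumption)
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42)   # we must specify k up front
labels = kmeans.fit_predict(X)                   # assign each observation to a cluster

print(kmeans.cluster_centers_)  # the final centroids (cluster means)
print(labels[:10])              # cluster index of the first 10 observations
```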
Approach
Potential Problem
Although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum)
Whether it does or not depends on the centroid initialization
Important nuance
Due to the randomization of the initial 𝑘 observations, we can get slightly different results each time
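A small sketch of this nuance: two runs with different random seeds and a single random initialization each may end up with different centroids and inertia (toy data and seed values are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

# init="random", n_init=1: one plain random initialization per run,
# so the result can differ from seed to seed.
for seed in (0, 1):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)
```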
How to select the number of clusters?
We do not know it, because the data are unlabeled
Goal: to ascertain what natural distinct groups exist in the data
A human-made decision
Larger values of 𝑘 can improve the homogeneity of the clusters; however, they risk overfitting
Popular methods (Elbow method, Inertia, Silhouette Score)
Elbow method
Compute k-means clustering for different values of 𝑘
For each 𝑘, calculate the total within-cluster sum of squares (WSS); the lower, the better
Each observation is allocated to the closest cluster, and the distance between an observation and a cluster is calculated from the Euclidean distance between the observation and the cluster center. Each cluster center will then be updated as the mean for observations in each cluster
Plot the curve of WSS according to the number of clusters 𝑘
The location of a bend (elbow) in the plot is considered as an indicator of the appropriate number of clusters
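A sketch of the elbow method in scikit-learn, using inertia_ as the total within-cluster sum of squares; the data and range of k values are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

ks = range(1, 10)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    wss.append(km.inertia_)  # total within-cluster sum of squares for this k

plt.plot(ks, wss, "o-")
plt.xlabel("Number of clusters k")
plt.ylabel("WSS (inertia)")
plt.show()  # look for the bend (elbow) in the curve
```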
Elbow Example
Inertia as a Metric
The mean squared distance between each instance and its closest centroid
Run several models and keep the model with the lowest inertia? Not that simple!
The inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k
The more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be
Inertia and the Number of Clusters k
As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases much more slowly as we keep increasing k.
Silhouette Score
A more precise approach (but also more computationally expensive)
The mean silhouette coefficient over all the instances
Can vary between –1 and +1
Close to +1 means that the instance is well inside its own cluster and far from other clusters
Close to 0 means that it is close to a cluster boundary
Close to –1 means that the instance may have been assigned to the wrong cluster
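A sketch of computing the mean silhouette coefficient for several values of k with sklearn.metrics.silhouette_score (the toy data is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

# The silhouette score is only defined for 2 or more clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to +1 is better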
Example
Although it confirms that k = 4 is a very good choice, it also underlines the fact that k = 5 is quite good as well, and much better than k = 6 or 7
This was not visible when comparing inertias
Silhouette Diagram
Plot every instance’s silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient
Each diagram contains one knife shape per cluster
The shape’s height indicates the number of instances the cluster contains, and its width represents the sorted silhouette coefficients of the instances in the cluster (wider is better)
The dashed line indicates the mean silhouette coefficient
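A compact sketch of a silhouette diagram using sklearn.metrics.silhouette_samples; the per-instance coefficients are sorted within each cluster and drawn as horizontal "knife" shapes (the data and k = 4 are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data
k = 4
labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
coeffs = silhouette_samples(X, labels)  # one silhouette coefficient per instance

pos = 0
for cluster in range(k):
    c = np.sort(coeffs[labels == cluster])           # sorted coefficients of one cluster
    plt.fill_betweenx(np.arange(pos, pos + len(c)), 0, c)
    pos += len(c) + 10                               # gap between the knife shapes
plt.axvline(silhouette_score(X, labels), linestyle="--")  # mean silhouette coefficient
plt.xlabel("Silhouette coefficient")
plt.show()
```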
Silhouette Diagram
Interpretation
The vertical dashed lines represent the silhouette score for each number of clusters
We can see that when k = 3 and when k = 6, we get bad clusters
When k = 4 or k = 5, the clusters look pretty good: most instances extend beyond the dashed line, to the right and closer to 1.0
When k = 4, the cluster at index 1 is rather big. When k = 5, all clusters have similar sizes
So, even though the overall silhouette score for k = 4 is slightly greater than for k = 5, it seems like a good idea to use k = 5 to get clusters of similar sizes
Which number of k to choose?
Try different approaches (Elbow method, Inertia, Silhouette Score)
Applications requiring the exact optimal number of clusters are fairly rare
In most applications, it suffices to choose a 𝑘 based on convenience rather than strict performance requirements
Ideally, you need to understand the data or work with an industry expert
Limitations of K-Means
Not perfect
Manual selection of k
Different results due to randomness
Suboptimal solutions are possible
Does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes
It is important to scale the input features before you run K-Means, or the clusters may be very stretched and K-Means will perform poorly (scaling does not guarantee good clusters, but it generally helps)
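A sketch of scaling before clustering, here with StandardScaler in a scikit-learn pipeline (the data is again a stand-in):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

# Scale each feature to zero mean and unit variance before clustering,
# so no single feature dominates the distance computations.
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=4, random_state=42))
labels = pipeline.fit_predict(X)
```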
K-Means++
An important improvement to the K-Means algorithm is K-Means++ (Arthur & Vassilvitskii, 2006)
It uses a smarter initialization step that tends to select centroids that are distant from one another; this makes the K-Means algorithm much less likely to converge to a suboptimal solution
The additional computation required for the smarter initialization step is well worth it, because it drastically reduces the number of times the algorithm needs to be run to find a good solution
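In scikit-learn, K-Means++ initialization is exposed via the init parameter (and is the default); a short sketch on stand-in data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # stand-in data

# init="k-means++" spreads the initial centroids out (scikit-learn's default);
# init="random" gives the original random initialization.
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
```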
Conclusion
K-Means clustering is probably the most popular clustering algorithm
Start with K-Means, it is simple and fast
You will decide the number of clusters (try different options, use Elbow method, and talk to experts)
K-Means Summary of Steps
Specify the number of clusters (𝑘) to be created
Select 𝑘 observations at random from the data set to use as centroids
Assign each observation to its closest centroid based on the distance measure selected
For each of the 𝑘 clusters, update the cluster centroid by calculating the new mean value of all the data points in the cluster
Iterate the assignment and update steps until the cluster assignments stop changing (a good rule of thumb: 10–20 iterations)
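For intuition, a from-scratch sketch of these steps in NumPy (a teaching sketch, not an optimized or production implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Naive K-Means: random initial centroids, then alternate assign/update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):                                    # rule of thumb: 10-20 iterations
        # Assign each observation to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):              # stop when assignments settle
            break
        centroids = new_centroids
    return centroids, labels
```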
DBSCAN
DBSCAN defines clusters as continuous regions of high density
For each instance, the algorithm counts how many instances are located within a small distance \(ε\) (epsilon) from it. This region is called the instance's ε-neighborhood
If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance; in other words, core instances are located in dense regions
All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster
Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly
How does it work?
We choose two parameters: \(ε\) and min_samples
Begin by picking an arbitrary point in the dataset
If there are at least min_samples points within a distance of ε from that point (including the point itself), we consider all of them to be part of a cluster
Expand that cluster by checking all of the new points and seeing if they too have at least min_samples points within distance ε
When we run out of points to add to the cluster, pick a new arbitrary point and repeat
Example
If we widen each instance’s neighborhood by increasing eps to 0.2, we get the clustering on the right, which looks perfect
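A sketch of such an example in scikit-learn; make_moons provides nonspherical toy data, and eps=0.2 and min_samples=5 are illustrative values:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)  # nonspherical toy data

dbscan = DBSCAN(eps=0.2, min_samples=5)   # the two hyperparameters discussed above
labels = dbscan.fit_predict(X)            # label -1 marks anomalies (noise)

print(set(labels))                        # discovered cluster labels
print(len(dbscan.core_sample_indices_))   # number of core instances
```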
Interesting Fact
It is possible that two clusters will “fight over” a point, and it’s arbitrary which of the two clusters it ends up in
Since DBSCAN considers the points in an arbitrary order, the middle point can end up in either the left or the right cluster on different runs
This kind of point is known as a “border point”
Interesting Fact
The DBSCAN class does not have a predict() method, although it has a fit_predict() method
It cannot predict which cluster a new instance belongs to
This implementation decision was made because different classification algorithms can be better for different tasks
The authors decided to let the user choose which one to use
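One common approach (a sketch of the idea above, not the only option) is to train a classifier of your choice on the core instances and their cluster labels, then use it to assign new instances to clusters; the new points below are made-up values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)  # toy data
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Train a classifier on the core instances and their DBSCAN labels...
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

# ...then use it to predict the cluster of new instances.
X_new = np.array([[-0.5, 0.0], [0.0, 0.5], [1.0, -0.1], [2.0, 1.0]])
print(knn.predict(X_new))
```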
DBSCAN Summary
DBSCAN is a very simple yet powerful algorithm capable of identifying any number of clusters of any shape
It is robust to outliers, and it has just two hyperparameters (eps and min_samples)
If the density varies significantly across the clusters, however, it can be impossible for it to capture all the clusters properly
HDBSCAN
HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise
Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon
This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection
HDBSCAN usually returns a good clustering straight away, with little or no parameter tuning, and it is fairly fast
HDBSCAN only has one important hyperparameter: n, the minimum number of examples to put in a cluster
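A sketch assuming the third-party hdbscan package is installed (pip install hdbscan), where this hyperparameter is called min_cluster_size; the data and value 15 are illustrative:

```python
import hdbscan  # third-party package (assumption: installed via pip install hdbscan)
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)  # toy data

# min_cluster_size: the minimum number of examples needed to form a cluster
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)  # -1 marks noise points

print(set(labels))
```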
Distances
A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain
Euclidean Distance - the shortest distance between two points
Manhattan Distance - the sum of absolute differences between points across all the dimensions
Minkowski Distance - the generalized form of Euclidean and Manhattan Distance
Hamming Distance - measures the difference between two strings of the same length: the number of positions at which the corresponding characters differ
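A sketch of these distances with scipy.spatial.distance (Manhattan distance is called cityblock there); the vectors and strings are toy values:

```python
from scipy.spatial import distance

a, b = [0, 3, 4, 5], [7, 6, 3, -1]       # two numeric feature vectors (toy values)

print(distance.euclidean(a, b))          # Euclidean: straight-line distance
print(distance.cityblock(a, b))          # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))     # Minkowski: generalization (p=2 Euclidean, p=1 Manhattan)

# Hamming distance between two equal-length strings:
s1, s2 = "karolin", "kathrin"
print(sum(c1 != c2 for c1, c2 in zip(s1, s2)))  # number of differing positions -> 3
```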
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
The Hundred-Page Machine Learning Book (Burkov)