# Data Mining Principles TA Session 2 (January 22, 2021)

## Agenda

Unsupervised Learning

K-Means

DBSCAN

Distances

## What Is Unsupervised Learning?

**Unsupervised learning** - identifying groups in a data set when the data do not have labels

Unsupervised learning is often performed as part of an EDA

In unsupervised learning, the dataset is a collection of unlabeled examples

*x* is a feature vector

**Goal:** create a model that **takes a feature vector** \(x\) as input and either **transforms it into another vector or into a value** that can be used to solve a practical problem

## Key Types of Unsupervised Learning

The groups may be defined by the rows (i.e., *clustering*) or the columns (i.e., *dimension reduction*)

**Clustering** - segment observations into similar groups based on the observed variables

**Dimension reduction** - reduce the number of variables in the dataset

## Examples of Unsupervised Learning

Divide consumers into different homogeneous groups

Identify groups of online shoppers with similar browsing and purchase histories

Identify products that have similar purchasing behavior so that managers can manage them as product groups

## Identifying Patterns

## Unsupervised Learning Techniques

**Clustering** (K-Means, DBSCAN)

**Anomaly detection and novelty detection** (One-class SVM, Isolation Forest)

**Visualization and dimensionality reduction** (PCA, Kernel PCA, t-SNE)

**Association rule learning** (Apriori, Eclat)

## K-Means

A simple and common algorithm capable of clustering data very quickly and efficiently

Proposed by Stuart Lloyd at Bell Labs in 1957; it was only published outside the company in 1982

You have to specify the number of clusters k that the algorithm must find

Each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the observations assigned to the cluster

Basic idea - constructing clusters so that the total within-cluster variation is minimized
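Written out explicitly, with \(K\) clusters, observations \(x_i\), cluster sets \(C_k\), and centroids \(\mu_k\), K-Means minimizes the total within-cluster variation

\[
W = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
\]

i.e., the sum of squared Euclidean distances between each observation and the centroid of its assigned cluster.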

## Data

## Ideal Result

## K-Means Steps

1. Indicate the number of clusters (𝑘) that will be generated
2. Randomly select 𝑘 observations from the data set to serve as the initial centers for the clusters
3. Assign each observation to its closest centroid, where closest is defined using the distance between the observation and the cluster mean
4. Compute the new center (i.e., mean value) of each cluster
5. Check whether each point is now closer to a different cluster, and reassign observations using the updated cluster means
6. Repeat the update steps until the cluster assignments stop changing
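The steps above can be sketched in a few lines of NumPy. This is a toy implementation for illustration, not the optimized version found in libraries like scikit-learn; the function name and sample data are made up:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch for an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each observation to its closest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop once the centroids (and hence assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups: three points near the origin, three near (10, 10)
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```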

## Approach

## Potential Problem

Although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum)

Whether it does or not depends on the centroid initialization

## Important nuance

- Due to randomization of the initial 𝑘 observations, we can get slightly different results each time

## How to select the number of clusters?

We do not know it in advance, because the data are unlabeled

Goal: to ascertain what natural distinct groups exist in the data

A human-made decision

Larger values of 𝑘 can improve the homogeneity of the clusters; however, this risks overfitting

Popular methods: the Elbow method, inertia, and the Silhouette Score

## Elbow method

Compute k-means clustering for different values of 𝑘

For each 𝑘, calculate the total within-cluster sum of squares (WSS) (the lower, the better)

Each observation is allocated to the closest cluster, and the distance between an observation and a cluster is calculated from the Euclidean distance between the observation and the cluster center. Each cluster center will then be updated as the mean for observations in each cluster

Plot the curve of WSS according to the number of clusters 𝑘

The location of a bend (elbow) in the plot is considered as an indicator of the appropriate number of clusters
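A minimal sketch of the Elbow method, assuming scikit-learn is available; `inertia_` is scikit-learn's name for the total WSS, and the four well-separated blobs are made-up toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 well-separated groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=42)

# Fit K-Means for a range of k and record the total within-cluster
# sum of squares (exposed by scikit-learn as `inertia_`)
wss = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
# Plot wss against k and look for the "elbow" where the curve flattens
```

With data like this, the curve drops steeply until k = 4 and flattens afterwards, which is the bend the method looks for.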

## Elbow Example

## Inertia as a Metric

The mean squared distance between each instance and its closest centroid

**Run several models and keep the model with the lowest inertia?** Not that simple! Inertia is not a good performance metric when trying to choose k because **it keeps getting lower as we increase k**

The more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be

## Inertia and Number of k

As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases much more slowly as we keep increasing k.

## Silhouette Score

A more precise approach (but also more computationally expensive)

The mean silhouette coefficient over all the instances

Can vary between –1 and +1

Close to +1 means that the instance is well inside its own cluster and far from other clusters

Close to 0 means that it is close to a cluster boundary

Close to –1 means that the instance may have been assigned to the wrong cluster
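A sketch of comparing candidate values of k by mean silhouette coefficient, assuming scikit-learn; the four well-separated blobs are made-up toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=42)

# Mean silhouette coefficient for each candidate k (higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

For this data, k = 4 scores highest, matching the true number of blobs.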

## Example

Although it confirms that k = 4 is a very good choice, it also underlines the fact that k = 5 is quite good as well, and much better than k = 6 or 7

This was not visible when comparing inertias

## Silhouette Diagram

Plot every instance’s silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient

Each diagram contains one knife shape per cluster

The shape’s height indicates the number of instances the cluster contains, and its width represents the sorted silhouette coefficients of the instances in the cluster (wider is better)

The dashed line indicates the mean silhouette coefficient
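The values behind such a diagram can be computed with scikit-learn's `silhouette_samples` (the plotting itself is omitted here; the toy data is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Per-instance silhouette coefficients; each cluster's sorted values
# form one "knife" shape in the diagram
coeffs = silhouette_samples(X, labels)
for cluster in range(4):
    values = np.sort(coeffs[labels == cluster])
    print(cluster, len(values), round(values.mean(), 3))
```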


## Interpretation

The vertical dashed lines represent the silhouette score for each number of clusters

We can see that when k = 3 and when k = 6, we get bad clusters

When k = 4 or k = 5, the clusters look pretty good: most instances extend beyond the dashed line to the right, closer to 1.0

When k = 4, the cluster at index 1 is rather big. When k = 5, all clusters have similar sizes

So, even though the overall silhouette score from k = 4 is slightly greater than for k = 5, it seems like a good idea to use k = 5 to get clusters of similar sizes

## Which number of k to choose?

Try different approaches (Elbow method, Inertia, Silhouette Score)

Applications requiring the exact optimal number of clusters are fairly rare

In most applications, it suffices to choose a 𝑘 based on convenience rather than strict performance requirements

Ideally you need to understand data or work with an industry expert

## Limitations of K-Means

Not perfect

Manual selection of k

Different results due to randomness

Suboptimal solutions are possible

Does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes

It is important to scale the input features before you run K-Means, or the clusters may be very stretched and K-Means will perform poorly (though scaling alone does not guarantee good clusters)
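A common way to handle scaling, assuming scikit-learn: put a `StandardScaler` in front of K-Means in a pipeline. The income/age numbers below are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical income vs. age);
# without scaling, the income column would dominate the distance
X = np.array([[30_000, 25], [32_000, 27], [31_000, 26],
              [90_000, 55], [95_000, 60], [92_000, 58]], dtype=float)

# Standardize each feature, then cluster
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=2, n_init=10, random_state=42))
labels = pipeline.fit_predict(X)
print(labels)
```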

## K-Means++

Important improvement to the K-Means algorithm - K-Means++ (Arthur & Vassilvitskii, 2006)

Smarter initialization step that tends to select centroids that are distant from one another, and this improvement makes the K-Means algorithm much less likely to converge to a suboptimal solution

Additional computation required for the smarter initialization step is well worth it because it makes it possible to drastically reduce the number of times the algorithm needs to be run to find the optimal solution
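In scikit-learn, K-Means++ initialization is the default; it is spelled out explicitly below for clarity (the blob data is a made-up illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# init="k-means++" spreads the initial centroids apart;
# scikit-learn already uses this initialization by default
model = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
print(model.cluster_centers_.shape)
```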

## Conclusion

K-Means clustering is probably the most popular clustering algorithm

Start with K-Means, it is simple and fast

You will decide the number of clusters (try different options, use Elbow method, and talk to experts)

## K-Means Summary of Steps

1. Specify the number of clusters (𝑘) to be created
2. Select 𝑘 observations at random from the data set to use as centroids
3. Assign each observation to its closest centroid based on the distance measure selected
4. For each of the 𝑘 clusters, update the cluster centroid by calculating the new mean value of all the data points in the cluster
5. Iterate the steps until the cluster assignments stop changing (good rule of thumb - 10–20 iterations)

## DBSCAN

DBSCAN defines clusters as continuous regions of high density

For each instance, the algorithm counts how many instances are located within a small distance \(ε\) (epsilon) from it. This region is called its **ε-neighborhood**

If an instance has at least `min_samples` instances in its ε-neighborhood (including itself), then it is considered a **core instance**. In other words, core instances are located in dense regions

All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster

Any instance that is not a core instance and does not have one in its neighborhood is considered an **anomaly**

## How does it work?

We choose two parameters - \(ε\) and `min_samples`

Begin by picking an arbitrary point in the dataset

If there are at least `min_samples` points within a distance of ε from that point (including the original point itself), we consider all of them to be part of a cluster

Expand that cluster by checking all of the new points and seeing if they too have at least `min_samples` points within a distance of ε

When we run out of points to add to the cluster, pick a new arbitrary point and repeat
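A sketch of DBSCAN in scikit-learn on the classic two-moons data, a shape K-Means handles poorly; the eps value matches the 0.2 discussed on the next slide, and the data itself is synthetic:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles with a little noise
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# labels_ holds one cluster index per instance; -1 marks anomalies (noise)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)
```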

## Example

- If we widen each instance’s neighborhood by increasing eps to 0.2, we get the clustering on the right, which looks perfect

## Interesting Fact

It is possible that dense points will “fight over” the original point, and it’s arbitrary which of the two clusters it ends up in

Since DBSCAN considers the points in an arbitrary order, the middle point can end up in either the left or the right cluster on different runs

This kind of point is known as a “border point”

## Interesting Fact

The `DBSCAN` class does not have a `predict()` method, although it has a `fit_predict()` method

It cannot predict which cluster a new instance belongs to

This implementation decision was made because different classification algorithms can be better for different tasks

Authors decided to let the user choose which one to use
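One common workaround (used, for example, in Géron's book listed in the sources) is to train a classifier of your choice on DBSCAN's core instances and use it to assign new points to the discovered clusters. A sketch assuming scikit-learn; the new test points are made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN exposes its core instances (components_) and their labels;
# train a k-NN classifier on them to get a predict() for new points
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(db.components_, db.labels_[db.core_sample_indices_])

X_new = np.array([[-0.5, 0.0], [0.0, 0.5], [1.0, -0.1], [2.0, 1.0]])
preds = knn.predict(X_new)
print(preds)
```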

## DBSCAN Summary

DBSCAN is a very simple yet powerful algorithm capable of identifying any number of clusters of any shape

It is robust to outliers, and it has just two hyperparameters (eps and min_samples)

If the density varies significantly across the clusters, however, it can be impossible for it to capture all the clusters properly

## HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise

Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon

This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection

HDBSCAN often returns a good clustering straight away with little or no parameter tuning, and it is pretty fast

HDBSCAN has only one important hyperparameter: the minimum cluster size, i.e., the minimum number of examples required to form a cluster

## Distances

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain

**Euclidean Distance** - the shortest (straight-line) distance between two points

**Manhattan Distance** - the sum of absolute differences between points across all the dimensions

**Minkowski Distance** - the generalized form of Euclidean and Manhattan Distance

**Hamming Distance** - measures the similarity between two strings of the same length: the number of positions at which the corresponding characters are different
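These four distances are simple to implement directly; a small sketch (the function names are mine, not from a library):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    # Sum of absolute differences across all dimensions
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def minkowski(a, b, p):
    # Generalized form: p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)

def hamming(s1, s2):
    # Number of positions where two equal-length strings differ
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(euclidean([0, 0], [3, 4]))       # 5.0
print(manhattan([0, 0], [3, 4]))       # 7
print(minkowski([0, 0], [3, 4], p=2))  # 5.0
print(hamming("karolin", "kathrin"))   # 3
```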

## Sources

**Hands-On Machine Learning with R** (Boehmke & Greenwell)

**Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** (Géron)

**Applied Predictive Modeling** (Kuhn & Johnson)

**An Introduction to Statistical Learning: With Applications in R** (James et al.)

**Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions** (Taddy)

**The Hundred-Page Machine Learning Book** (Burkov)