# Data Mining Principles TA Session 3 (January 29, 2021)

## Agenda

HC (Hierarchical Clustering)

PCA (Principal Component Analysis)

t-SNE (t-distributed stochastic neighbor embedding)

## Hierarchical Clustering

HC is an alternative approach to k-means clustering for identifying groups in a data set

Creates a hierarchy of clusters and therefore does not require prespecifying the number of clusters

Results can be easily visualized with a **dendrogram**

## Dendrogram

### Agglomerative Clustering

Agglomerative clustering (AGNES) works in a bottom-up manner

Each observation is initially considered as a single-element cluster (leaf)

At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (nodes)

This procedure is iterated until all points are a member of just one single big cluster (root)
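The bottom-up merging above can be sketched with SciPy's `linkage` function, which returns the full merge history. The data values here are hypothetical, chosen to form two obvious groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data (hypothetical values): two tight groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])

# Each row of Z records one merge: the indices of the two clusters joined,
# the distance between them, and the size of the resulting cluster
Z = linkage(X, method="complete")
print(Z.shape)  # 6 observations -> 5 merges, so Z has shape (5, 4)
```

The final row of `Z` is the root merge, where all observations end up in one big cluster.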

## Divisive hierarchical clustering

Divisive hierarchical clustering (DIANA) works in a top-down manner

DIANA is like the reverse of AGNES

Begins with the root, in which all observations are included in a single cluster

At each step of the algorithm, the current cluster is split into two clusters that are considered most heterogeneous

The process is iterated until all observations are in their own cluster

## AGNES and DIANA

## How to measure the dissimilarity?

**Maximum or complete linkage clustering**: computes all pairwise dissimilarities between the elements in two clusters and considers the largest of these dissimilarities as the distance

**Minimum or single linkage clustering**: considers the smallest of these pairwise dissimilarities as the linkage criterion

**Mean or average linkage clustering**: considers the average of these pairwise dissimilarities as the distance

**Centroid linkage**: computes the dissimilarity between the centroid for cluster 1 (a mean vector of length 𝑝, one element for each variable) and the centroid for cluster 2

**Ward's minimum variance method**: minimizes the total within-cluster variance; at each step, the pair of clusters with the smallest between-cluster distance is merged
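All of these linkage criteria are available in SciPy, and running them on the same (hypothetical, randomly generated) data shows how the choice changes the dendrogram's merge heights:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # hypothetical data

# The same data merged under different linkage criteria; the root height
# (the distance at the final merge) differs by method
heights = {}
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    heights[method] = Z[-1, 2]
    print(method, round(heights[method], 3))
```

Single linkage always produces the lowest merge heights, since it uses the smallest pairwise dissimilarity at every step.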

## What do they look like?

## How to define the number of clusters?

We need to choose the number of clusters manually, by deciding where to cut the dendrogram

Methods similar to those used for k-means apply (such as the elbow method and the silhouette method)
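Cutting the dendrogram can be sketched with SciPy's `fcluster`, which turns a chosen number of clusters into a label for every observation (the data points here are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical points: two tight pairs plus one outlier
X = np.array([[0.0, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.1],
              [10.0, 0.0]])
Z = linkage(X, method="ward")

# Cutting the tree into a chosen number of clusters labels every observation
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

With `criterion="distance"` instead, the cut is made at a chosen height rather than at a chosen cluster count.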

## Key Problem

Dendrograms are often misinterpreted

Conclusions about the proximity of two observations should not be drawn from their closeness along the horizontal axis or from the vertical connections

The height of the branch where an observation joins a cluster of observations below it indicates the distance between that observation and the cluster it is joined to

## Example

## Potential questions

Should the observations or features first be standardized in some way?

What dissimilarity measure should be used?

What type of linkage should be used?

Where should we cut the dendrogram in order to obtain clusters?

## HC Summary

HC offers some benefits over k-means

Produces nice visualizations

Still requires selecting a method

Each method has its own biases

The data structure and the method should match

We still need to decide where to cut the dendrogram

## PCA

A commonly used data reduction technique

Seeks to find linear combinations of the predictors, known as principal components, which capture the most possible variance

The first PC captures the most possible variance

Primary advantage - PCA creates components that are uncorrelated

Without proper guidance, PCA can generate components that summarize characteristics of the data that are irrelevant

## PCA plot

## PCA and dimensions

If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number of dimensions in the dataset

For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC

Since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction as the original vectors. However, they will generally still lie on the same axes.
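These unit vectors can be sketched with a plain SVD in NumPy, which is one standard way to compute PCA. The data here are hypothetical correlated points:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D data (hypothetical): the second column roughly follows the first
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])
Xc = X - X.mean(axis=0)  # PCA works on zero-centered data

# The rows of Vt are the unit vectors defining the principal component axes
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]
print(np.linalg.norm(pc1), pc1 @ pc2)  # unit length, orthogonal axes
```

Note that `-pc1` spans exactly the same axis as `pc1`, which is the sign instability described above.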

## Transform data before PCA

Because PCA seeks linear combinations of predictors that maximize variability, it will naturally first be drawn to summarizing predictors that have more variation

PC weights will be larger for the higher variability predictors on the first few components

Real data usually contain predictors on different scales, and predictors may have skewed distributions

It is recommended to transform data before applying PCA

Enables PCA to find the underlying relationships in the data without being influenced by the original measurement scales
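The effect of scale can be sketched with two hypothetical, correlated predictors measured in very different units. Without standardization the first PC is essentially just the high-variance predictor; after z-scoring, both contribute equally:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two correlated predictors on very different scales (hypothetical units)
income = rng.normal(50_000, 10_000, size=500)           # dollars
age = income / 1_000 - 10 + rng.normal(0, 3, size=500)  # years

X = np.column_stack([income, age])

def first_pc(M):
    # First right-singular vector of the centered data = first PC direction
    Mc = M - M.mean(axis=0)
    return np.linalg.svd(Mc, full_matrices=False)[2][0]

# Raw data: the first PC is dominated by the high-variance predictor
print(np.round(first_pc(X), 3))

# After z-scoring, both predictors contribute equally
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.round(first_pc(Z), 3))
```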

## PCA Loadings

Each component is a linear combination of the predictors, and the coefficient for each predictor in that combination is called its **loading**

Loadings close to zero indicate that the predictor variable did not contribute much to that component

Loadings represent each feature's influence on the associated PC
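A sketch of reading loadings, using loadings in the sense above (the eigenvector coefficients of the correlation matrix). The three predictors are hypothetical: two share a latent factor, one is unrelated noise:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
# Three hypothetical predictors: a and b share a latent factor, c is unrelated
f = rng.normal(size=n)
a = f + 0.1 * rng.normal(size=n)
b = f + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([a, b, c])

# Eigenvectors of the correlation matrix hold the loadings:
# column j gives each predictor's coefficient on PC j+1
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort PCs by decreasing eigenvalue
loadings = eigvecs[:, order]
print(np.round(loadings[:, 0], 2))  # a and b load heavily on PC1, c near zero
```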

## Loadings Example

## How many PCs to select?

Eigenvalue criterion

Proportion of variance explained criterion

Scree plot criterion

## Eigenvalue Criterion

The sum of the eigenvalues is equal to the number of variables entered into the PCA

The eigenvalues will range from greater than one to near zero

An eigenvalue of 1 means that the principal component would explain about one variable’s worth of the variability

The rationale for using the eigenvalue criterion is that each component should explain at least one variable’s worth of the variability, and therefore, the eigenvalue criterion states that only components with eigenvalues greater than 1 should be retained
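The criterion can be sketched on hypothetical standardized data with four predictors forming two correlated pairs, so roughly two variables' worth of variability sits in the first two components:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
# Four hypothetical predictors: two correlated pairs
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([f1 + 0.3 * rng.normal(size=n),
                     f1 + 0.3 * rng.normal(size=n),
                     f2 + 0.3 * rng.normal(size=n),
                     f2 + 0.3 * rng.normal(size=n)])

# Eigenvalues of the correlation matrix, largest first
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
print(np.round(eigvals, 2), round(eigvals.sum(), 2))  # eigenvalues sum to p = 4

# Eigenvalue criterion: keep only components whose eigenvalue exceeds 1
keep = int((eigvals > 1).sum())
print(keep)
```

Each correlated pair yields one eigenvalue near 2 and one near 0, so the criterion retains two components.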

## Eigenvalue Criterion Plot

## Proportion of Variance Explained Criterion

The proportion of variance explained (PVE) identifies the optimal number of PCs to keep based on the total variability that we would like to account for

What amount of variability is reasonable? This varies by application and the data being used

When the PCs are used for descriptive purposes only, the proportion of variability explained may be lower than otherwise

When the PCs are to be used as derived features for models downstream, then the PVE should be as much as can conveniently be achieved, given any constraints
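The PVE rule can be sketched on hypothetical data where one latent factor drives all five predictors, so a single PC already clears a 90% target:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
# Five hypothetical predictors driven mostly by one latent factor
f = rng.normal(size=n)
X = np.column_stack([f + 0.3 * rng.normal(size=n) for _ in range(5)])

# Proportion of variance explained by each PC, and its cumulative sum
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
pve = eigvals / eigvals.sum()
cum_pve = np.cumsum(pve)
print(np.round(cum_pve, 3))

# Keep the smallest number of PCs whose cumulative PVE reaches a target, e.g. 90%
k = int(np.searchsorted(cum_pve, 0.90)) + 1
print(k)
```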

## Proportion of Variance Explained Criterion Plot

## Scree Plot Criterion

A scree plot shows the eigenvalues or PVE for each individual PC

Most scree plots look broadly similar in shape, starting high on the left, falling rather quickly, and then flattening out at some point

This is because the first component usually explains much of the variability

## Scree Plot

## Cons of PCA

PCA can be highly affected by outliers

PC directions are linear and traditional PCA does not perform as well in very high dimensional space where complex nonlinear patterns often exist

## t-SNE

A popular method for exploring high-dimensional data

Introduced in 2008 - “Visualizing Data using t-SNE” by van der Maaten and Hinton

Able to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions

The goal - take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane

t-SNE is non-linear (and PCA is linear)

## t-SNE Epsilon and Perplexity

Epsilon - the learning rate of t-SNE's gradient-based optimization

Perplexity tells how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has

Changing perplexity has a complex effect on plots

“The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50”
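A minimal run, assuming scikit-learn is installed, with hypothetical high-dimensional data containing two well-separated groups and a perplexity inside the typical 5-50 range:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is available

rng = np.random.default_rng(6)
# Hypothetical 50-dimensional data: two well-separated groups of 40 points
X = np.vstack([rng.normal(0, 1, size=(40, 50)),
               rng.normal(8, 1, size=(40, 50))])

# Perplexity is roughly a guess at each point's number of close neighbors
# and must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per input observation
```

Rerunning with perplexity 5 or 50 typically produces visibly different maps of the same data, which is why plots at several perplexities are worth comparing.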

## t-SNE Perplexity

If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early

Unfortunately, there’s no fixed number of steps that yields a stable result

Different data sets can require different numbers of iterations to converge

## Also

Cluster sizes in a t-SNE plot mean nothing

Distances between clusters might not mean anything

Random noise doesn’t always look random

## Interactive t-SNE Visualizations

## t-SNE Video

## t-SNE Weaknesses

Unclear how t-SNE performs on general dimensionality reduction tasks

The relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data

t-SNE is not guaranteed to converge to a global optimum of its cost function

## Sources

**Hands-On Machine Learning with R** (Boehmke & Greenwell)

**Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems** (Géron)

**Applied Predictive Modeling** (Kuhn & Johnson)

**An Introduction to Statistical Learning: With Applications in R** (James et al.)

**Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions** (Taddy)

**The Hundred-Page Machine Learning Book** (Burkov)