Data Mining Principles TA Session 3 (January 29, 2021)


  • HC (Hierarchical Clustering)

  • PCA (Principal Component Analysis)

  • t-SNE (t-distributed stochastic neighbor embedding)

Hierarchical Clustering

  • HC is an alternative approach to k-means clustering for identifying groups in a data set

  • Creates a hierarchy of clusters and therefore does not require to prespecify the number of clusters

  • Results can be easily visualized with a dendrogram


Aglomerative Clustering

  • Agglomerative clustering (AGNES) works in a bottom-up manner

  • Each observation is initially considered as a single-element cluster (leaf)

  • At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes

  • This procedure is iterated until all points are a member of just one single big cluster (root)

Divisive hierarchical clustering

  • Divisive hierarchical clustering (DIANA) works in a top-down manner

  • DIANA is like the reverse of AGNES

  • Begins with the root, in which all observations are included in a single cluster

  • At each step of the algorithm, the current cluster is split into two clusters that are considered most heterogeneous

  • The process is iterated until all observations are in their own cluster


How to measure the dissimilarity?

  • Maximum or complete linkage clustering - computes all pairwise dissimilarities between the elements in two clusters, and considers the largest value of these dissimilarities as the distance

  • Minimum or single linkage clustering - … considers the smallest of these dissimilarities as a linkage criterion

  • Mean or average linkage clustering - … considers the average of these dissimilarities as the distance

  • Centroid - computes the dissimilarity between the centroid for cluster 1 (a mean vector of length 𝑝, one element for each variable) and the centroid for cluster 2

  • Ward’s minimum variance method - minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance are merged

How do they look like?

How to define the number of clusters?

  • We need to do it manually

  • Again similar methods (such as Elbow method, Silhouette method)

Key Problem

  • Dendrograms are often misinterpreted

  • Conclusions about the proximity of two observations should not be implied by their relationship on the horizontal axis nor by the vertical connections

  • The height of the branch between an observation and the clusters of observations below them indicate the distance between the observation and that cluster it is joined to


Potential questions

  • Should the observations or features first be standardized in some way?

  • What dissimilarity measure should be used?

  • What type of linkage should be used?

  • Where should we cut the dendrogram in order to obtain clusters?

HC Summary

  • HC offers some benefits over k-means

  • Produce nice visualizations

  • Still require to select a method

  • Each method has own biases

  • Data structure and method should match

  • Need to decide where to cut the cluster


  • A commonly used data reduction technique

  • Seeks to find linear combinations of the predictors, known as principal components, which capture the most possible variance

  • The first PC captures the most possible variance

  • Primary advantage - PCA creates components that are uncorrelated

  • Without proper guidance, PCA can generate components that summarize characteristics of the data that are irrelevant

PCA plot

PCA and dimensions

  • If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number of dimensions in the dataset

  • For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC

  • Since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction as the original vectors. However, they will generally still lie on the same axes.

Transform data before PCA

  • PCA seeks linear combinations of predictors that maximize variability, it will naturally first be drawn to summarizing predictors that have more variation

  • PC weights will be larger for the higher variability predictors on the first few components

  • Data usually have predictors are on different scales and predictors may have skewed distributions

  • It is recommended to transform data before applying PCA

  • Enables PCA to find the underlying relationships in the data without being influenced by the original measurement scales

PCA Loadings

  • Each component is a linear combination of the predictors and the coefficient for each predictor is called the loadings

  • Loadings close to zero indicate that the predictor variable did not contribute much to that component

  • Loadings represent each features influence on the associated PC

Loadings Example

How many PC to select?

  • Eigenvalue criterion

  • Proportion of variance explained criterion

  • Scree plot criterion

Eigenvalue Criterion

  • The sum of the eigenvalues is equal to the number of variables entered into the PCA

  • The eigenvalues will range from greater than one to near zero

  • An eigenvalue of 1 means that the principal component would explain about one variable’s worth of the variability

  • The rationale for using the eigenvalue criterion is that each component should explain at least one variable’s worth of the variability, and therefore, the eigenvalue criterion states that only components with eigenvalues greater than 1 should be retained

Eigenvalue Creterion Plot

Proportion of Variance Explained Criterion

  • The proportion of variance explained (PVE) identifies the optimal number of PCs to keep based on the total variability that we would like to account for

  • What amount of variability is reasonable? This varies by application and the data being used

  • When the PCs are being used for descriptive purposes only then the proportion of variability explained may be lower than otherwise

  • When the PCs are to be used as derived features for models downstream, then the PVE should be as much as can conveniently be achieved, given any constraints

Proportion of Variance Explained Criterion Plot

Scree Plot Criterion

  • A scree plot shows the eigenvalues or PVE for each individual PC

  • Most scree plots look broadly similar in shape, starting high on the left, falling rather quickly, and then flattening out at some point

  • This is because the first component usually explains much of the variability

Scree Plot

Cons of PCA

  • PCA can be highly affected by outliers

  • PC directions are linear and traditional PCA does not perform as well in very high dimensional space where complex nonlinear patterns often exist


  • A popular method for exploring high-dimensional data

  • Introduced in 2008 - “Visualizing Data using t-SNE” by van der Maaten and Hinton

  • Able to create compelling two-dimensonal “maps” from data with hundreds or even thousands of dimensions

  • The goal - take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane

  • t-SNE is non-linear (and PCA is linear)

t-SNE Epsilon and Perplexity

  • Epsilon - learning rate

  • Perplexity tells how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has

  • Changing perplexity has a complex effect on plots

  • “The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50”

t-SNE Perplexity

  • If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early

  • Unfortunately, there’s no fixed number of steps that yields a stable result

  • Different data sets can require different numbers of iterations to converge


  • Cluster sizes in a t-SNE plot mean nothing

  • Distances between clusters might not mean anything

  • Random noise doesn’t always look random

T-SNE Weakness

  • Unclear how t-SNE performs on general dimensionality reduction tasks

  • The relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data

  • t-SNE is not guaranteed to converge to a global optimum of its cost function


  • Hands-On Machine Learning with R (Boehmke & Greenwell)

  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)

  • Applied Predictive Modeling (Johnson & Kuhn)

  • An Introduction to Statistical Learning: With Applications in R (James et al.)

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • The Hundred-Page Machine Learning Book (Burkov)