Data Mining Principles TA Session 3 (January 29, 2021)
Agenda
HC (Hierarchical Clustering)
PCA (Principal Component Analysis)
t-SNE (t-distributed stochastic neighbor embedding)
Hierarchical Clustering
HC is an alternative approach to k-means clustering for identifying groups in a data set
Creates a hierarchy of clusters and therefore does not require prespecifying the number of clusters
Results can be easily visualized with a dendrogram
Dendrogram
Agglomerative Clustering
Agglomerative clustering (AGNES) works in a bottom-up manner
Each observation is initially considered as a single-element cluster (leaf)
At each step of the algorithm, the two clusters that are most similar are combined into a new, bigger cluster (node)
This procedure is iterated until all points are a member of just one single big cluster (root)
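A minimal bottom-up sketch in Python, assuming SciPy and Matplotlib are available; the toy data and parameter choices are illustrative, not from the session:

```python
# Illustrative AGNES-style sketch: each observation starts as its own leaf,
# and the closest clusters are merged until one root cluster remains.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))        # 30 toy observations with 4 features

Z = linkage(X, method="ward")       # bottom-up merge history (the hierarchy)
dendrogram(Z)                       # leaves = observations, root = one big cluster
plt.show()
```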
Divisive hierarchical clustering
Divisive hierarchical clustering (DIANA) works in a top-down manner
DIANA is like the reverse of AGNES
Begins with the root, in which all observations are included in a single cluster
At each step of the algorithm, the current cluster is split into two clusters that are considered most heterogeneous
The process is iterated until all observations are in their own cluster
AGNES and DIANA
How to measure the dissimilarity?
Maximum or complete linkage clustering - computes all pairwise dissimilarities between the elements in two clusters, and considers the largest value of these dissimilarities as the distance
Minimum or single linkage clustering - … considers the smallest of these dissimilarities as a linkage criterion
Mean or average linkage clustering - … considers the average of these dissimilarities as the distance
Centroid - computes the dissimilarity between the centroid for cluster 1 (a mean vector of length 𝑝, one element for each variable) and the centroid for cluster 2
Ward’s minimum variance method - minimizes the total within-cluster variance; at each step, the pair of clusters whose merger yields the smallest increase in within-cluster variance is merged
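A hedged sketch of how these criteria map onto SciPy's `linkage` methods; the toy data are made up, and the printed merge heights are only meant to show that each criterion measures cluster dissimilarity differently:

```python
# Comparing linkage criteria on the same toy data; each criterion defines
# cluster-to-cluster dissimilarity differently, so the merge heights differ.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    print(f"{method:>8}: height of final merge = {Z[-1, 2]:.2f}")
```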
What do they look like?
How to define the number of clusters?
We need to choose the number of clusters (i.e., where to cut the dendrogram) manually
Similar methods as for k-means apply (e.g., the elbow method or the silhouette method); see the sketch below
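A hedged sketch of the silhouette approach applied to a hierarchical tree; the toy data and the range of k are illustrative:

```python
# Cutting the tree into k clusters for several k and scoring each cut
# with the silhouette coefficient; pick the k with the highest score.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(5, 1, size=(20, 2))])   # two well-separated toy groups

Z = linkage(X, method="ward")
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, round(silhouette_score(X, labels), 3))
```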
Key Problem
Dendrograms are often misinterpreted
Conclusions about the proximity of two observations should not be drawn from their closeness along the horizontal axis or from the vertical connections
The height (on the vertical axis) at which an observation is joined to a cluster of observations below it indicates the distance between that observation and that cluster
Example
Potential questions
Should the observations or features first be standardized in some way?
What dissimilarity measure should be used?
What type of linkage should be used?
Where should we cut the dendrogram in order to obtain clusters?
HC Summary
HC offers some benefits over k-means
Produces nice visualizations (dendrograms)
Still requires selecting a dissimilarity measure and a linkage method
Each method has its own biases
The structure of the data and the chosen method should match
Need to decide where to cut the dendrogram
PCA
A commonly used data reduction technique
Seeks to find linear combinations of the predictors, known as principal components, which capture the most possible variance
The first PC captures the most possible variance
Primary advantage - PCA creates components that are uncorrelated
Without proper guidance, PCA can generate components that summarize characteristics of the data that are irrelevant
PCA plot
PCA and dimensions
If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number of dimensions in the dataset
For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC
Since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: if you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction as the original vectors. However, they will generally still lie on the same axes.
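A small sketch, assuming scikit-learn's PCA, showing that the rows of `components_` are the zero-centered unit vectors described above (the toy data are illustrative):

```python
# The rows of pca.components_ are unit vectors, one per principal component;
# their signs may flip if the data are perturbed and PCA is rerun.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.5])   # unequal variances

pca = PCA(n_components=2).fit(X)
print(pca.components_)                              # one unit vector per row
print(np.linalg.norm(pca.components_, axis=1))      # each row has norm 1
```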
Transform data before PCA
Because PCA seeks linear combinations of predictors that maximize variability, it will naturally be drawn first to summarizing predictors that have more variation
PC weights will be larger for the higher-variability predictors on the first few components
Data usually contain predictors on different scales, and some predictors may have skewed distributions
It is recommended to transform data before applying PCA
Enables PCA to find the underlying relationships in the data without being influenced by the original measurement scales
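A hedged sketch of the usual transform-then-PCA pipeline in scikit-learn; the two-scale toy predictors are made up to show the effect:

```python
# Without scaling, the large-scale predictor dominates the first component;
# standardizing first puts predictors on a comparable footing.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(0, 1, 200),      # small-scale predictor
                     rng.normal(0, 1000, 200)])  # large-scale predictor

raw = PCA(n_components=2).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

print(raw.explained_variance_ratio_)                         # ~[1, 0]: scale-driven
print(scaled.named_steps["pca"].explained_variance_ratio_)   # ~[0.5, 0.5]
```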
PCA Loadings
Each component is a linear combination of the predictors, and the coefficients of the predictors in that combination are called the loadings
Loadings close to zero indicate that the predictor variable did not contribute much to that component
Loadings represent each feature's influence on the associated PC
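A sketch of how loadings can be inspected with scikit-learn; the feature names are invented for illustration, and note that some texts additionally scale these coefficients by the square root of the eigenvalues:

```python
# Loadings table: rows = original features, columns = principal components;
# values near zero mean the feature contributes little to that component.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["age", "income", "spend", "visits"])  # hypothetical features

pca = PCA().fit(StandardScaler().fit_transform(X))
loadings = pd.DataFrame(pca.components_.T,
                        index=X.columns,
                        columns=[f"PC{i + 1}" for i in range(pca.n_components_)])
print(loadings.round(2))
```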
Loadings Example
How many PC to select?
Eigenvalue criterion
Proportion of variance explained criterion
Scree plot criterion
Eigenvalue Criterion
The sum of the eigenvalues is equal to the number of variables entered into the PCA
The eigenvalues will range from greater than one to near zero
An eigenvalue of 1 means that the principal component would explain about one variable’s worth of the variability
The rationale is that each retained component should explain at least one variable's worth of the variability; therefore, the eigenvalue criterion states that only components with eigenvalues greater than 1 should be retained
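A minimal sketch of the eigenvalue (Kaiser) criterion, assuming standardized data and scikit-learn's PCA; the toy data and threshold of 1 follow the rule stated above:

```python
# With standardized predictors, explained_variance_ gives the eigenvalues;
# keep only components whose eigenvalue exceeds 1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
X[:, 5] = X[:, 0] + 0.2 * rng.normal(size=200)   # add some correlated structure

pca = PCA().fit(StandardScaler().fit_transform(X))
eigenvalues = pca.explained_variance_
print(eigenvalues.round(2))
print("components retained:", int((eigenvalues > 1).sum()))
```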
Eigenvalue Criterion Plot
Proportion of Variance Explained Criterion
The proportion of variance explained (PVE) identifies the optimal number of PCs to keep based on the total variability that we would like to account for
What amount of variability is reasonable? This varies by application and the data being used
When the PCs are being used for descriptive purposes only, the proportion of variability explained may be lower than otherwise
When the PCs are to be used as derived features for models downstream, then the PVE should be as much as can conveniently be achieved, given any constraints
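A sketch of the PVE criterion; the 90% target below is an arbitrary illustrative threshold, not a recommendation from the session:

```python
# Cumulative proportion of variance explained; keep the smallest number of
# components whose cumulative PVE reaches the chosen target.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = StandardScaler().fit_transform(rng.normal(size=(200, 8)))

pca = PCA().fit(X)
cum_pve = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cum_pve >= 0.90)) + 1
print(cum_pve.round(3))
print("components to reach 90% PVE:", n_keep)
```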
Proportion of Variance Explained Criterion Plot
Scree Plot Criterion
A scree plot shows the eigenvalues or PVE for each individual PC
Most scree plots look broadly similar in shape, starting high on the left, falling rather quickly, and then flattening out at some point
This is because the first component usually explains much of the variability
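A sketch of how such a scree plot can be drawn, assuming Matplotlib and scikit-learn (toy data only):

```python
# Scree plot: eigenvalue of each component in order; the "elbow" where the
# curve flattens suggests how many components to keep.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = StandardScaler().fit_transform(rng.normal(size=(200, 8)))
pca = PCA().fit(X)

plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```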
Scree Plot
Cons of PCA
PCA can be highly affected by outliers
PC directions are linear and traditional PCA does not perform as well in very high dimensional space where complex nonlinear patterns often exist
t-SNE
A popular method for exploring high-dimensional data
Introduced in 2008 - “Visualizing Data using t-SNE” by van der Maaten and Hinton
Able to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions
The goal - take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane
t-SNE is non-linear (and PCA is linear)
t-SNE Epsilon and Perplexity
Epsilon - learning rate
Perplexity tells how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has
Changing perplexity has a complex effect on plots
“The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50”
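A hedged sketch with scikit-learn's TSNE, where `learning_rate` plays the role of epsilon; the parameter values and toy data are illustrative:

```python
# Embedding 50-dimensional points into a 2D "map"; perplexity balances
# attention between local and global structure, learning_rate is epsilon.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 50))          # 500 toy points in 50 dimensions

emb = TSNE(n_components=2,
           perplexity=30,               # typical values: 5 to 50
           learning_rate=200,           # the "epsilon" / step size
           init="pca",
           random_state=8).fit_transform(X)
print(emb.shape)                        # (500, 2) map coordinates
```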
t-SNE Perplexity
If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early
Unfortunately, there’s no fixed number of steps that yields a stable result
Different data sets can require different numbers of iterations to converge
Also
Cluster sizes in a t-SNE plot mean nothing
Distances between clusters might not mean anything
Random noise doesn’t always look random
Interactive t-SNE Visualizations
t-SNE Video
t-SNE Weaknesses
Unclear how t-SNE performs on general dimensionality reduction tasks
The relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data
t-SNE is not guaranteed to converge to a global optimum of its cost function
Sources
Hands-On Machine Learning with R (Boehmke & Greenwell)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (Geron)
Applied Predictive Modeling (Johnson & Kuhn)
An Introduction to Statistical Learning: With Applications in R (James et al.)
Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)
The Hundred-Page Machine Learning Book (Burkov)