K-means is one of the most common algorithms for unsupervised clustering.
July 10, 2023 • 5 min read
The steps involved in K-means Clustering are:
- Choose the number of clusters K
- Initialize the centroids
- Assign each data point to the closest centroid (based on a distance measure, e.g. Euclidean)
- Update each centroid by taking the mean of the points assigned to its cluster
- Repeat steps 3 and 4 until convergence, i.e. no discernible change in the centroids is observed (a code sketch of this loop follows the list)
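To make these steps concrete, here is a minimal NumPy sketch of the loop (the `kmeans` function and its arguments are my own illustration, not a library API; empty clusters are not handled, for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (note: a cluster left with no points would produce NaN here)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```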
Tips
- Always scale the data for algorithms that use distance measures.
- We initialize centroids randomly, so different initializations may lead to different clusters, and the algorithm can fall into a local optimum rather than the global optimum. It is therefore better to run the algorithm multiple times (iterating over different initializations) and select the run with the smallest within-cluster sum of squared distances.
- There is an initialization scheme, K-means++, that does not place all centroids purely at random: the initial centroids should be far away from each other. The algorithm starts by choosing a centroid c0 uniformly at random from the data points; thereafter, the probability of a data point x being chosen as the next centroid is proportional to the squared distance from x to its nearest existing centroid. In this way, K-means++ always tends to select centroids that are far away from the existing ones, which often gives a significant improvement over plain K-means (a sketch of this seeding follows the list).
- To make K-means less sensitive to outliers, use an alternative distance measure such as the L1 distance (the sum of the absolute values of the differences), or use K-Medoids, which restricts cluster centers to actual data points.
- To adapt to non-spherical clusters, you can use a radial notion of similarity, transform the data into polar coordinates, use kernel methods, or switch to agglomerative clustering.
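Since K-means++ is just a seeding scheme, it is short enough to sketch directly. This is a minimal NumPy version under the same caveats as the previous snippet (illustrative names, not a library API):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: spread the initial centroids far apart."""
    rng = np.random.default_rng(seed)
    # Choose the first centroid uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

In practice you rarely write this yourself: scikit-learn's `KMeans` uses k-means++ initialization by default, and its `n_init` parameter implements the multiple-restart tip above.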
How to choose K?
- Elbow Method - for each value of K, calculate the within-cluster sum of squares, usually called WCSS (Within-Cluster Sum of Squares). In the plot of WCSS vs. K, find the K where WCSS starts to plateau, i.e. the elbow (a sketch of this follows the list).
- Gap statistic
- Methods that change the objective of K-means by adding a penalty term (AIC, BIC) to discourage more complex models
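A short sketch of the elbow method with scikit-learn, whose `inertia_` attribute is exactly the WCSS (the `plot_elbow` helper and its labels are my own):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    # inertia_ is the within-cluster sum of squared distances (WCSS)
    wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, k_max + 1)]
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("K")
    plt.ylabel("WCSS")
    plt.show()  # look for the K where the curve starts to flatten
```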
Assumptions of K-means
K-means clustering assumes that we have K spherical, equally sized clusters, and it is sensitive to outliers (because of its use of Euclidean distances).
- K-means clustering is limited to linear cluster boundaries: its fundamental model assumption (points are closer to their own cluster center than to any other) means the algorithm will often be ineffective if the clusters have complicated geometries.
- All clusters are of the same size.
- Clusters have the same extent in every direction (they are isotropic). This assumption would not hold in a dataset where different measurements are in different units, which is another reason to scale the data first.
- Clusters have similar numbers of points assigned to them.
- The data are continuous; K-means is not well suited to categorical features.
Evaluation
- Rand Index / Adjusted Rand Index - only applicable if you have ground-truth labels
- Visual evaluation - e.g. with the GGobi tool
- Silhouette coefficient - checks that data points within a cluster are relatively close to each other and data points in different clusters are relatively far apart (see the example after this list)
- Split the data into two sets and perform clustering on each, then compare the clusterings found across the two datasets (a stability check)
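The Rand index and silhouette bullets map directly onto scikit-learn metrics. A small self-contained example on synthetic data (the toy dataset is purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data with known ground-truth labels
X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Silhouette needs no ground truth; closer to 1 = tight, well-separated clusters
print("silhouette:", silhouette_score(X, labels))
# Adjusted Rand index requires ground truth; 1.0 = perfect match, ~0 = random
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
```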