“Unsupervised learning is the way most of the people will learn in the future. You have this model of how the world works in your head and you’re refining it to predict what you think is going to happen in the future”- Mark Zuckerberg
Unsupervised learning is a class of machine learning techniques in which the data contains only input variables and no output variables (labels). The goal of these algorithms is to find interesting structure in the data.
Unsupervised learning can further be classified into clustering and association techniques.
Clustering, as the name suggests, is the task of dividing data into groups such that the data points within a group are more similar to each other than to the points outside it.
Document classification, fraud detection in insurance and banking, and customer segmentation for marketing are major applications of clustering.
If you are a Data Scientist, or you are in the process of becoming one, this blog post is for you! It covers the top clustering algorithms every Data Scientist should know.
Top Clustering Algorithms for Data Scientists:
1. K-Means Clustering
2. Mean Shift Clustering
3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
4. Agglomerative Hierarchical Clustering
The K-Means clustering algorithm is used to find ‘K’ different groups in the data. The algorithm works iteratively, assigning each data point to one of the ‘K’ groups based on feature similarity.
Rather than requiring the groups to be defined beforehand, K-Means discovers them organically within the data.
How does this algorithm work?
The steps involved in K-Means clustering are as follows:
1. The number of groups to be formed (K) and the data are provided to the algorithm.
2. The algorithm initializes K centroids, either randomly or by selecting K random data points from the dataset.
3. Each centroid defines one cluster. Each data point is assigned to its nearest centroid, based on its Euclidean distance from every centroid.
4. The centroids are then recalculated as the mean of all data points assigned to each centroid's cluster.
5. Steps 3 and 4 are repeated until a stopping criterion is met (the centroids no longer change between iterations, the distance between the data points and their centroids is minimized, or the maximum number of iterations is reached).
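The steps above can be sketched in a few lines of Python. This is a minimal example using scikit-learn's `KMeans` (the library choice, the toy data, and the parameter values are our own, not from the post):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# n_clusters is the 'K' from step 1; n_init restarts the random
# centroid initialization several times and keeps the best result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # final centroids, one per cluster
print(kmeans.labels_[:5])       # cluster assignments of the first five points
```

The `fit` call runs the assign-then-recompute loop from steps 3–5 internally until the centroids stop moving or the iteration limit is reached.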
But how do we decide the best value for K?
One method to decide the best value for the number of clusters (K) is the elbow method. The idea is to run the K-Means clustering algorithm for a range of values of K and, for each value, calculate the sum of squared errors (SSE). A line chart of the SSE for each value of K is then plotted. If the line chart resembles an arm, the "elbow" of the arm is the best value of K. As K increases the SSE approaches zero, so the idea is not to find the value of K where the SSE is zero or minimal, but the value of K beyond which the SSE stops diminishing rapidly.
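A quick sketch of the elbow method, again assuming scikit-learn (whose `inertia_` attribute is exactly the SSE described above); the three-cluster toy data is our own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three true clusters
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
                  for c in ([0, 0], [4, 0], [2, 4])])

# Compute the SSE (inertia) for K = 1..7
sse = {}
for k in range(1, 8):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

# The SSE drops sharply up to K=3, then flattens: the elbow is at K=3
for k, v in sse.items():
    print(k, round(v, 1))
```

In practice you would plot `sse` with matplotlib and read the elbow off the chart rather than eyeballing the printed numbers.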
The Mean Shift clustering algorithm is a non-parametric algorithm built upon the concept of kernel density estimation (KDE). Let us first understand what kernel density estimation is.
Kernel Density Estimation is a method to estimate the distribution of the data. It’s a technique that lets you create a smooth curve given a set of data.
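A minimal KDE sketch using SciPy's `gaussian_kde` (the library and the bimodal toy data are our own illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Samples drawn from two modes: a narrow one at -2 and a wider one at 3
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 300),
                          rng.normal(3, 0.8, 300)])

# Fit a smooth density curve to the raw samples
kde = gaussian_kde(samples)

grid = np.linspace(-5, 6, 200)
density = kde(grid)                # smooth estimate of the distribution
print(grid[np.argmax(density)])    # near -2, the taller, narrower mode
```

The local maxima of this smooth curve are exactly the density peaks that Mean Shift, described next, moves points towards.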
Essentially, Mean Shift supposes that all points are samples from some underlying probability density function (estimated, for example, by KDE), with regions of high sample density corresponding to the local maxima of this distribution. To find these local maxima, the algorithm lets the points attract each other and gravitate towards areas of higher density; one can show that they eventually coalesce at a series of points close to the local maxima of the distribution. Data points that converge to the same local maximum are considered members of the same cluster.
The Mean Shift algorithm has primarily been applied to problems in computer vision, where it has been used for image segmentation, clustering, and video tracking. Applying it to big-data problems can be challenging because the algorithm becomes relatively slow at that scale. However, research is presently underway to speed up its convergence, which should enable its application to larger data sets.
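The coalescing behaviour described above can be seen with scikit-learn's `MeanShift` (a sketch under our own assumptions; the bandwidth parameter controls the width of the KDE kernel):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Two dense blobs, i.e. two local maxima of the underlying density
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=[0, 0], scale=0.3, size=(60, 2)),
                  rng.normal(loc=[3, 3], scale=0.3, size=(60, 2))])

# Estimate a reasonable kernel bandwidth from the data itself
bandwidth = estimate_bandwidth(data, quantile=0.3)
ms = MeanShift(bandwidth=bandwidth).fit(data)

print(ms.cluster_centers_)   # points coalesce near the two density maxima
print(len(set(ms.labels_)))  # number of clusters, found automatically
```

Note that, unlike K-Means, no number of clusters is passed in; the count falls out of how many density maxima the points converge to.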
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN algorithm, as its name suggests, forms clusters based on the density and proximity of the data points. DBSCAN can identify outliers: observations that do not belong to any cluster. Since DBSCAN also determines the number of clusters itself, it is very useful when we do not know in advance how many clusters the data might contain.
Two important parameters for DBSCAN are:
1. Neighbourhood (eps): the maximum distance between two points for them to be considered neighbours. If the distance between two points is less than or equal to eps, the points are neighbours.
2. Minimum points: the minimum number of points required to form a dense region.
Since the clustering depends entirely on these two parameters (eps and the minimum number of points), choosing their values correctly is very important.
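Both parameters map directly onto scikit-learn's `DBSCAN` as `eps` and `min_samples` (a sketch with our own toy data and parameter choices):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one lone outlier far from everything
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(loc=[0, 0], scale=0.2, size=(40, 2)),
                  rng.normal(loc=[3, 3], scale=0.2, size=(40, 2)),
                  [[10.0, 10.0]]])

# eps = neighbourhood radius, min_samples = minimum points for a dense region
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

print(set(db.labels_))  # two cluster labels, plus -1 marking the outlier
```

Points that DBSCAN cannot place in any dense region get the label `-1`, which is how the outlier detection mentioned above surfaces in practice.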
Agglomerative clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity. The algorithm starts by treating each object as a singleton cluster. Pairs of clusters are then successively merged until all objects belong to one big cluster. The result is a tree-based representation of the objects, called a dendrogram.
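The bottom-up merging can be sketched with scikit-learn's `AgglomerativeClustering` (our own example; the `ward` linkage criterion, which merges the pair of clusters whose union increases within-cluster variance the least, is one of several available linkage choices):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated blobs of 30 points each
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(loc=[0, 0], scale=0.3, size=(30, 2)),
                  rng.normal(loc=[4, 4], scale=0.3, size=(30, 2))])

# Each point starts as its own singleton cluster; merging stops
# once only n_clusters clusters remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(data)

print(agg.labels_[:5], agg.labels_[-5:])  # one label per blob
```

To draw the full dendrogram rather than cut the tree at a fixed number of clusters, SciPy's `scipy.cluster.hierarchy.linkage` and `dendrogram` functions are the usual tools.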
These are a few commonly used clustering algorithms. There is a plethora of clustering algorithms out there, and each has its own pros and cons.
As the saying goes, a picture is worth a thousand words, so let's look at a visualization comparing the outputs of the different clustering algorithms.