Comprehensive Overview of Clustering Algorithms in Machine Learning
Nowadays, data science, particularly machine learning (ML), is gaining increasing popularity. Tasks such as making recommendations, predicting behavior and future choices, and grouping individuals by preferences can all be addressed effectively with ML algorithms. The strong interest in this area stems from its successful application to a wide range of practical problems: data science is used profitably in health care, retail, telecommunications, real estate, and many other fields.
One of the most promising and important branches of data science is clustering. Clustering is an ML technique that groups objects into clusters based on their similarity, so that the most similar objects end up in the same cluster while dissimilar objects are assigned to different clusters. Clustering is used fruitfully in retail and many other areas, for example to segment customers by purchasing behavior.
In this article, we consider some of the most popular families of ML clustering algorithms. Note that, in general, it is not known in advance which algorithm is best suited to a given problem, so it is good practice to apply several approaches and then select the most appropriate one.
Centroid-based clustering
In this approach, one must specify the number of clusters k in advance. The aim of the algorithm is to determine k cluster centers and to group the objects nearest to each center into a cluster. One of the most popular representatives of this family is k-means. The algorithm follows this scheme:
(a) randomly choose coordinates for k cluster centers;
(b) assign each point to its nearest center, forming clusters;
(c) recalculate new cluster centers as averages of the coordinates of the objects in the cluster.
Steps (b) and (c) are repeated until the cluster centers no longer change their positions, as illustrated in the sketch below.
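The following is a minimal NumPy sketch of this scheme, not a production implementation; the function name is illustrative, and initializing the centers from randomly chosen data points is a common variant of step (a):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (a) pick k random data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # (b) assign each point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # (c) recalculate centers as the average of each cluster's points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers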
Density-based clustering
In this approach, the algorithm identifies regions with a high density of points, separated from other parts by empty or sparse areas. These dense regions constitute clusters, while points not assigned to any cluster are considered noise. One of the most commonly used algorithms from this category is DBSCAN (density-based spatial clustering of applications with noise). It requires two parameters: the minimum number of points in a cluster, N, and the maximum distance between points in a cluster, d.
The process begins with the selection of an arbitrary unvisited point, for which the algorithm retrieves all points within the distance d. If the number of such points is greater than or equal to N, they are assigned to a cluster; otherwise, the selected point is labeled as noise and a new unvisited point is chosen. For every point assigned to a cluster, the algorithm then finds the points within distance d of it, and these new points are also assigned to the cluster. This procedure continues until there are no more unvisited points in the cluster. Once a cluster is formed, the algorithm selects a random unvisited point and attempts to form another cluster around it if possible. The algorithm stops when all points are either assigned to clusters or labeled as noise.
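In scikit-learn, d and N correspond to the eps and min_samples parameters of the DBSCAN class. Here is a short sketch with illustrative sample data:

from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])  # the last point is an outlier

# eps plays the role of d, min_samples the role of N
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)  # noise points are labeled -1, e.g. [0 0 0 1 1 -1]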
Connectivity-based (hierarchical) clustering
In this approach, a hierarchy of objects is constructed (see the dendrogram in Fig. 1). There are two possible ways to create such a structure: agglomerative and divisive. In the agglomerative approach, all points are initially considered as individual clusters. At each step, the algorithm merges the two nearest clusters into one. This process is repeated until all points are united in a single cluster. Conversely, the divisive approach starts with all points forming a single cluster, which is then recursively split into smaller ones. A short example of the agglomerative approach is given after the figure.
Fig. 1. Dendrogram (a tree-like representation) of hierarchical relations between objects.
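Below is a short sketch of the agglomerative approach using scikit-learn's AgglomerativeClustering class; the sample data is illustrative:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [9, 8], [8, 9]])

# merge clusters bottom-up until two clusters remain
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(agg.labels_)  # e.g. [0 0 0 1 1 1] (cluster numbering may vary)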
Distribution-based clustering
In the previous cases, the main metric for building clusters was distance; in this case, it is probability. In this family of algorithms, clusters are formed by objects that most likely belong to the same distribution. Each cluster is characterized by its center, and the probability of a point belonging to a cluster decreases as the point moves further away from the cluster center.
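A common representative of this family is the Gaussian mixture model. Here is a short sketch using scikit-learn's GaussianMixture class, with illustrative sample data:

from sklearn.mixture import GaussianMixture
import numpy as np

X = np.array([[1, 2], [1, 3], [2, 2],
              [9, 9], [10, 9], [9, 10]])

# fit a mixture of two Gaussians and assign each point to the
# component under which it is most probable
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))        # hard cluster labels
print(gm.predict_proba(X))  # membership probabilities per cluster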
Example of implementing clustering with machine learning
In practice, clustering can be implemented effectively in Python, particularly with the scikit-learn library. On the scikit-learn web page, in the clustering section, you can find implementations of various ML clustering algorithms. The section also provides comparisons of the algorithms along with their mathematical descriptions.
Let's take a look at k-means clustering using scikit-learn. The first step involves importing necessary libraries and initializing points for analysis.
from sklearn.cluster import KMeans
import numpy as np

# two visually separated groups of points: one around x = 1-2, one around x = 9-10
X = np.array([[1, 3], [1, 2], [1, 1], [2, 2], [2, 4], [2, 0],
              [10, 1], [10, 3], [10, 2], [9, 1], [9, 2], [9, 0]])
The next step is to apply the algorithm by creating an instance of the KMeans class and fitting it to our data. As mentioned earlier, the k-means algorithm requires the number of clusters, n_clusters, which in this case is set to 2. Other parameters are left at their default values.
kmeans = KMeans(n_clusters=2).fit(X)
The created instance has several interesting attributes, such as labels_ (which contains the cluster labels of the data points) and cluster_centers_ (which contains the coordinates of the cluster centers). Let's have a look at these attributes for the considered case.
Input:
print('labels: ', kmeans.labels_)
print('centers: ', kmeans.cluster_centers_)
Output:
labels: [0 0 0 0 0 0 1 1 1 1 1 1]
centers: [[1.5 2. ]
 [9.5 1.5]]
One can also use the method predict(X) to label each sample in X.
Input:
print('predicted labels: ', kmeans.predict([[0, 0], [12, 3]]))
Output:
predicted labels: [0 1]
An instance of the KMeans class offers a variety of other interesting methods and attributes, which can be explored on the scikit-learn web page. Have fun playing with different clustering algorithms, and select the most appropriate one for your problem.
Conclusion
We've explored the most popular families of clustering algorithms, delved into the mathematical descriptions of some of them, and used the powerful Python library scikit-learn to implement the k-means clustering algorithm. Undoubtedly, clustering is immensely interesting from a practical standpoint and can be applied profitably to enhance your business. In the age of AI, the future is now: don't wait, embrace the best practices to refine your business strategy and stay ahead of the curve.