
Clustering Performance Evaluation in Scikit-Learn
Clustering is a fundamental unsupervised learning technique that aims to discover patterns or groupings in unlabeled data. It plays a crucial role in various domains such as data mining, pattern recognition, and customer segmentation. However, once clustering algorithms are applied, it becomes essential to evaluate their performance and assess the quality of the resulting clusters.
Clustering performance evaluation is a critical step in understanding the effectiveness and reliability of clustering algorithms. It involves quantifying the quality of the obtained clusters and providing insights into their consistency and separability. By evaluating clustering results, practitioners can make informed decisions about algorithm selection, parameter tuning, and interpretability of the discovered clusters.
In this article, we will explore the concept of clustering performance evaluation using the Scikit-Learn library in Python.
To illustrate the concept of clustering performance evaluation, let's consider an example where we perform clustering on a dataset.
Consider the code shown below.
Example
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate random points
features, targets = make_blobs(n_samples=500, centers=5, random_state=42, shuffle=False)

# Create the scatter plot
plt.scatter(features[:, 0], features[:, 1])

# Customize plot appearance
plt.title("Random Points Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Display the plot
plt.show()
Output
(A scatter plot of the 500 randomly generated points is displayed.)
K-Means
In the example below, we will make use of the k-means algorithm.
Consider the code shown below.
Example
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Generate sample data
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)

# Perform clustering using k-means algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Evaluate clustering performance using metrics
silhouette = silhouette_score(X, y_pred)
calinski_harabasz = calinski_harabasz_score(X, y_pred)
davies_bouldin = davies_bouldin_score(X, y_pred)

# Plot the clustering results
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', label='Centroids')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

# Print the evaluation scores
print(f"Silhouette Score: {silhouette:.3f}")
print(f"Calinski-Harabasz Index: {calinski_harabasz:.3f}")
print(f"Davies-Bouldin Index: {davies_bouldin:.3f}")
Output
(A scatter plot of the four clusters with the centroids marked as red crosses is displayed, followed by the three printed evaluation scores.)
Performance Evaluation Indices
Silhouette Score
The Silhouette Score is a widely used metric to evaluate the quality of clustering results. It measures how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a higher value indicates better clustering performance. A value close to 1 suggests that data points are well-clustered and properly separated, while a value close to -1 indicates that data points may have been assigned to the wrong clusters. In the code, the Silhouette Score is calculated using the silhouette_score() function.
Consider the code shown below.
Example
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Perform clustering using K-means algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Calculate the Silhouette Score
silhouette = silhouette_score(X, y_pred)

# Print the Silhouette Score
print("Silhouette Score:", silhouette)
Output
Silhouette Score: 0.7911042588289479
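Because a higher Silhouette Score indicates better-defined clusters, the score is often used for parameter tuning, such as choosing the number of clusters. The snippet below is a minimal sketch, using the same synthetic data as above, that compares the score for several candidate values of k; the value of k with the highest score is typically the most natural choice for the data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate the same sample data as above
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Compare the Silhouette Score for several candidate numbers of clusters
for k in range(2, 7):
   kmeans = KMeans(n_clusters=k, random_state=42)
   y_pred = kmeans.fit_predict(X)
   print(f"k={k}: Silhouette Score = {silhouette_score(X, y_pred):.3f}")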
Calinski-Harabasz Index
The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is another performance evaluation metric for clustering. It measures the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index value indicates better clustering performance, with a higher separation between clusters and lower variance within clusters. In the code, the Calinski-Harabasz Index is calculated using the calinski_harabasz_score() function.
Consider the code shown below.
Example
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Perform clustering using K-means algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Calculate the Calinski-Harabasz Index
calinski_harabasz = calinski_harabasz_score(X, y_pred)

# Print the Calinski-Harabasz Index
print("Calinski-Harabasz Index:", calinski_harabasz)
Output
Calinski-Harabasz Index: 5742.035759058726
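For reference, the K-Means example earlier also computed the Davies-Bouldin Index with the davies_bouldin_score() function. This metric measures the average similarity between each cluster and its most similar cluster; unlike the two metrics above, lower values indicate better clustering. The following is a minimal sketch of how it can be computed on the same data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Perform clustering using K-means algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

# Calculate the Davies-Bouldin Index (lower is better)
davies_bouldin = davies_bouldin_score(X, y_pred)
print("Davies-Bouldin Index:", davies_bouldin)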
Conclusion
In conclusion, evaluating the performance of clustering algorithms is crucial to assess their effectiveness in grouping data points. In this article, we explored two commonly used performance evaluation metrics: the Silhouette Score and the Calinski-Harabasz Index.
The Silhouette Score measures the quality and separation of clusters by considering the average distance between samples within the same cluster and samples in other clusters. A higher Silhouette Score indicates better clustering performance, with well-separated and distinct clusters.
The Calinski-Harabasz Index evaluates the clustering performance by considering the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski-Harabasz Index suggests better clustering performance, with higher separation between clusters and lower variance within clusters.
By utilising these evaluation metrics, we can quantitatively assess the quality of clustering results and make informed decisions about the choice of clustering algorithms and parameter settings.
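As an illustration of such a decision, the sketch below (an assumed comparison, not taken from the examples above) applies both metrics to K-Means and Agglomerative Clustering on the same synthetic data; the algorithm with the better scores would be the more suitable choice.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Generate sample data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Candidate clustering algorithms to compare
models = {
   "K-Means": KMeans(n_clusters=4, random_state=42),
   "Agglomerative": AgglomerativeClustering(n_clusters=4),
}

# Evaluate each algorithm with both metrics
for name, model in models.items():
   y_pred = model.fit_predict(X)
   print(f"{name}: Silhouette = {silhouette_score(X, y_pred):.3f}, "
         f"Calinski-Harabasz = {calinski_harabasz_score(X, y_pred):.3f}")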