Silhouette Score
The Silhouette Score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It ranges from -1 to 1, where:
A score of 1 indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters.
A score of -1 indicates that the object is poorly matched to its own cluster and well-matched to neighboring clusters.
A score of 0 indicates that the object is on or near the decision boundary between two clusters.
The Silhouette Score is calculated using the following formula:
`silhouette = (b - a) / max(a, b)`
where:
`a` is the average distance of the object to all other objects in its own cluster (cohesion).
`b` is the average distance of the object to all objects in the nearest neighboring cluster, i.e., the smallest such average over all clusters the object does not belong to (separation).
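As a worked illustration of the formula, the sketch below scores a single object by hand. The 1-D values and the cluster names `A`, `B`, and `C` are made up for this example, and absolute difference stands in for whatever distance measure is actually used.

```python
# Toy 1-D data: three hypothetical clusters (values chosen for illustration).
clusters = {
    "A": [1.0, 2.0, 3.0],
    "B": [8.0, 9.0],
    "C": [20.0],
}

point, own = 1.0, "A"  # the object being scored and the cluster it belongs to

# a: mean distance to the other members of the object's own cluster.
others = list(clusters[own])
others.remove(point)
a = sum(abs(point - q) for q in others) / len(others)

# b: mean distance to the members of each other cluster, keeping the smallest.
b = min(
    sum(abs(point - q) for q in members) / len(members)
    for name, members in clusters.items()
    if name != own
)

silhouette = (b - a) / max(a, b)
print(a, b, silhouette)  # a = 1.5, b = 7.5, silhouette = 0.8
```

Here cluster `B` is the nearest neighbor (mean distance 7.5 versus 19 for `C`), so only `B` contributes to `b`.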
To evaluate a clustering model using the Silhouette Score, you can:
Calculate the Silhouette Score for each object in the dataset.
Calculate the average Silhouette Score across all objects.
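A higher average Silhouette Score (closer to 1) indicates compact, well-separated clusters. An average near 0 suggests overlapping clusters, and a negative average suggests that many objects may be assigned to the wrong cluster.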
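The two steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `silhouette_samples`, the toy points, and both label lists are assumptions made for the example, Euclidean distance is assumed, and at least two clusters are required.

```python
import math

def silhouette_samples(points, labels):
    """Return one silhouette value per point (Euclidean distance assumed)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            # Singleton cluster: a is undefined, so 0 is used by convention.
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_labels = [0, 0, 1, 1]  # groups each tight pair together
bad_labels = [0, 1, 0, 1]   # splits each tight pair across clusters

good_scores = silhouette_samples(points, good_labels)
bad_scores = silhouette_samples(points, bad_labels)
mean_good = sum(good_scores) / len(good_scores)
mean_bad = sum(bad_scores) / len(bad_scores)
print(mean_good, mean_bad)
```

On this toy data the sensible labeling averages about 0.80 and the poor one is negative. In practice a library routine such as `silhouette_score` in `sklearn.metrics` would normally replace the hand-written loop.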
Calinski-Harabasz Index
The Calinski-Harabasz Index (also called the Variance Ratio Criterion) is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom. It ranges from 0 to infinity, where:
A higher value indicates a better clustering solution, with well-separated and dense clusters.
A lower value indicates a worse clustering solution, with poorly separated or sparse clusters.
The Calinski-Harabasz Index is calculated using the following formula:
`CH = (B / W) * ((N - K) / (K - 1))`
where:
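`B` is the between-cluster dispersion: the sum, over all clusters, of the cluster size times the squared distance between the cluster centroid and the overall centroid.
`W` is the within-cluster dispersion: the sum of squared distances between each object and the centroid of its cluster.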
`N` is the total number of objects in the dataset.
`K` is the number of clusters.
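The formula can be sketched in plain Python as shown below. This is an illustrative sketch under stated assumptions: `B` and `W` are computed as sums of squared Euclidean distances to centroids, the function name `calinski_harabasz`, the toy points, and both label lists are invented for the example, and the function assumes at least two clusters and a non-zero `W`.

```python
def calinski_harabasz(points, labels):
    """CH = (B / W) * ((N - K) / (K - 1)), with B and W as sums of squares."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    def centroid(group):
        return [sum(p[d] for p in group) / len(group) for d in range(dim)]

    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    overall = centroid(points)
    b = w = 0.0
    for group in clusters.values():
        c = centroid(group)
        b += len(group) * sq_dist(c, overall)   # between-cluster term
        w += sum(sq_dist(p, c) for p in group)  # within-cluster term
    return (b / w) * ((n - k) / (k - 1))

# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
ch_good = calinski_harabasz(points, [0, 0, 1, 1])  # B = 100, W = 4
ch_bad = calinski_harabasz(points, [0, 1, 0, 1])   # B = 4, W = 100
print(ch_good, ch_bad)
```

With the sensible labeling the index is 50, while splitting each tight pair across clusters drops it to about 0.08. For real datasets, `calinski_harabasz_score` in `sklearn.metrics` computes the same quantity.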
To evaluate a clustering model using the Calinski-Harabasz Index, you can:
Calculate the Calinski-Harabasz Index for the clustering solution.
Compare the index value to a reference value computed on the same dataset (e.g., the index value for a different number of clusters, or for a random clustering solution), since the index has no absolute scale.
A higher Calinski-Harabasz Index value relative to that reference indicates denser, better-separated clusters.
Example in Python: