Really, what is the Hopkins statistic?
Well, you know it is a method to check the clustering tendency of a dataset for unsupervised machine learning. That's right! 😊
In unsupervised machine learning, a big issue is that clustering methods will return clusters even if the data does not contain any clusters.
Right, because that is what it is supposed to do.
So, before choosing a partitioning method (K-Means, etc.) or hierarchical clustering to split the data into clusters of somewhat similar objects, a question arises: does the dataset contain any inherent clusters at all?
The answer to that question is the Hopkins statistic.
Let's see the Hopkins statistic with Python 3!
Contents:
- Required Python packages
- Steps for data preparation
- Statistical methods
- Summary
- References
Required Python packages:
from sklearn.neighbors import NearestNeighbors
Nearest-neighbour methods find a predefined number of training samples closest in distance to a new point and predict the label from these. The standard Euclidean distance is the most common choice.
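For instance, a minimal sketch of how NearestNeighbors is queried (the toy points and the query point below are made up purely for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# three toy 2-D points, purely illustrative
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])

# build the index and ask for the 2 training points closest to a query point
nbrs = NearestNeighbors(n_neighbors=2).fit(points)
distances, indices = nbrs.kneighbors([[0.1, 0.1]], return_distance=True)
# distances[0] holds the Euclidean distances, indices[0] the matching row numbers in points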
from random import sample
sample() is a built-in function of Python's random module that returns a list of a given length with items chosen, without replacement, from a sequence such as a list, tuple, string, or range.
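A tiny illustration (the range and sample size here are arbitrary):

from random import sample

# pick 3 distinct items from the sequence, without replacement
print(sample(range(0, 10), 3))  # the 3 values change on every run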
from numpy.random import uniform
Draw samples from a uniform distribution.
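In the Hopkins function below, the low and high bounds are the per-column minima and maxima of the data, so each artificial point falls inside the data's range. A small illustration with made-up bounds:

from numpy.random import uniform

# four values drawn uniformly between 0 and 1
print(uniform(0.0, 1.0, 4))

# with array bounds, each entry is drawn within its own [low, high] range
print(uniform([0.0, 10.0], [1.0, 20.0], 2))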
from math import isnan
This method checks whether a given value is NaN (not a number) or a valid number.
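For instance:

from math import isnan

print(isnan(float('nan')))  # True
print(isnan(1.5))           # False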
Along with these packages, we use pandas and NumPy for reading the .csv file and for numerical operations.
For the analysis, we are using the iris dataset, available at https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

import pandas as pd
Iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
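As an optional sanity check (not part of the original walkthrough), you can peek at what was loaded before preprocessing:

# optional: confirm the shape and column names before preprocessing
print(Iris.shape)
print(Iris.columns.tolist())
print(Iris.head())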
Steps for Data Preparation:
1. Dropping the categorical column "species"
Iris = Iris.drop(['species'], axis=1)
2. Standardising values using “StandardScaler”
import sklearn
from sklearn.preprocessing import StandardScaler
# instantiate
scaler = StandardScaler()
# fit_transform
Iris_scaled = scaler.fit_transform(Iris)
3. Defining the Hopkins function
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan

def hopkins(X):
    d = X.shape[1]     # number of columns (dimensions)
    n = len(X)         # number of rows
    m = int(0.1 * n)   # sample size: 10% of the rows

    # nearest-neighbour index built on the real data
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    # indices of m randomly chosen real points
    rand_X = sample(range(0, n, 1), m)

    ujd = []  # distances from artificial uniform points to the nearest real point
    wjd = []  # distances from sampled real points to their nearest neighbour
    for j in range(0, m):
        # one artificial point drawn uniformly within the per-column min/max of X
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][0])  # nearest real point (the artificial point is not in X)
        # sampled real point: its nearest match is itself, so take the second distance
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H
4. Applying the Hopkins function to "Iris_scaled"
But first, you need to convert Iris_scaled back to a pandas DataFrame and define the column names again, because fit_transform returns a plain NumPy array.
Iris_scaled = pd.DataFrame(Iris_scaled)
Iris_scaled.columns =['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
Now, we can apply the Hopkins function to the scaled dataset
hopkins(Iris_scaled)
Let me tell you what happens here.
So, the hopkins function compares the spread of the data points in "Iris_scaled" with a random scatter of points that has no cluster structure, using NearestNeighbors distances.
Also, we may get a slightly different value each time, because rand_X picks a different random subset of points on every run.
The Hopkins test tells us how far our data is from such a random scatter.
And yes, we can run the Hopkins test before or after scaling; changing the scale of the axes does not change the relative spread of the points.
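Since the result fluctuates with the random sampling, a simple option (a minimal sketch, assuming the hopkins function and the Iris_scaled DataFrame defined above) is to run it several times and look at the average:

import numpy as np

# run the statistic several times and average, since each run samples different points
scores = [hopkins(Iris_scaled) for _ in range(20)]
print(np.mean(scores), np.std(scores))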
Statistical methods:
- The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.
Let's define the null and the alternative hypothesis:
- Null hypothesis: the data set D is uniformly distributed (i.e., no meaningful clusters)
- Alternative hypothesis: the data set D is not uniformly distributed (i.e., contains meaningful clusters)
So, the Hopkins statistic is calculated as follows with Python 3:
1. Sample uniformly m of the n real points (p1, p2, …, pm) from D (in the code above, m is 10% of n):
rand_X = sample(range(0, n, 1), m)
2. Compute the distance xi from each sampled real point to its nearest neighbour: for each point pi ∈ D, find its nearest neighbour pj; then compute the distance between pi and pj and denote it xi = dist(pi, pj).
wjd = []
for j in range(0, m):
    w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
    wjd.append(w_dist[0][1])
3. Generate a simulated data set (random D) drawn from a random uniform distribution with m points (q1, q2, …, qm) and the same variation as the original real data set D.
4. Compute the distance yi from each artificial point to the nearest real data point: for each point qi ∈ random D, find its nearest neighbour qj in D; then compute the distance between qi and qj and denote it yi = dist(qi, qj). In the code, steps 3 and 4 happen inside the same loop: each iteration draws one uniform point within the min/max range of the real data and measures its distance to the closest real point.
ujd = []
for j in range(0, m):
    u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 2, return_distance=True)
    ujd.append(u_dist[0][0])
5. Calculate the Hopkins statistic (H) as the sum of the artificial nearest-neighbour distances divided by the sum of the real and artificial nearest-neighbour distances (using means instead of sums gives the same ratio): H = Σyi / (Σxi + Σyi).
H = sum(ujd) / (sum(ujd) + sum(wjd))
We can conduct the Hopkins Statistic test iteratively, using 0.5 as the threshold to reject the alternative hypothesis.
That is, if H < 0.5, then it is unlikely that D has statistically significant clusters.
If the value of the Hopkins statistic is close to 1, then we can reject the null hypothesis and conclude that the dataset D is significantly clusterable.
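To see the two extremes side by side, here is a small sketch that compares H on purely uniform data with H on strongly clustered data; make_blobs, the sample sizes, and the random seed are my own choices for illustration, not part of the original analysis:

import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

# purely random scatter: H should land near 0.5
uniform_df = pd.DataFrame(np.random.uniform(0, 1, size=(150, 4)))
print('uniform data :', hopkins(uniform_df))

# strongly clustered data: H should move towards 1
blobs, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=42)
print('clustered data:', hopkins(pd.DataFrame(blobs)))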
Summary:
In this article, we described how to assess clustering tendency using the Hopkins statistic. After showing that the data is clusterable, the next step is to determine the optimal number of clusters in the data. This will be described in the next article.
Please feel free to propose improvements 😊
References
Han, Jiawei, Micheline Kamber, and Jian Pei. 2012. Data Mining: Concepts and Techniques. 3rd ed. Boston: Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-381479-1.00016-2.
Lawson, Richard G., and Peter C. Jurs. 1990. “New Index for Clustering Tendency and Its Application to Chemical Problems.” Journal of Chemical Information and Computer Sciences 30 (1): 36–41. http://pubs.acs.org/doi/abs/10.1021/ci00065a010.