# python – Optimize K-Means for a large number of clusters

I am writing Python code for KMeans clustering.

The aim of this post is to find out how I can optimize the code below when the number of clusters is very large. I am dealing with data with tens of millions of samples and scores of features. The number of clusters I am talking about is on the order of 500,000. I have checked the correctness of the code by comparing its results with those of scikit-learn's implementation of KMeans clustering.

I am aware that I can use scikit-learn for KMeans clustering. But I am writing my own custom Python code for a few reasons, including but not limited to the following.

• To be able to change distance functions for my specific work assignments: from Euclidean distance to Huang distance.
• To use this custom code for various other tasks like anomaly detection.
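For the first point, one way (not shown in my code below) to keep the distance function swappable is to pass it in as a callable. This is only a sketch: `KMeansBase` and `metric` are illustrative names, and the Euclidean function matches the `distance` method in my code.

```python
import numpy as np


def euclidean(a, b):
    # Same formula as the distance() method in the class below.
    return np.sqrt(np.sum(np.subtract(a, b) ** 2))


class KMeansBase:
    """Sketch: take the metric as a constructor argument so that a
    Huang-style distance (or any other callable) can be swapped in."""

    def __init__(self, arr, k, metric=euclidean):
        self.arr = np.asarray(arr)
        self.k = k
        self.distance = metric  # every internal call goes through this


d = KMeansBase([[0, 0], [3, 4]], k=1).distance([0, 0], [3, 4])
# d == 5.0
```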

Below is my code:

```python
import numpy as np
from collections import defaultdict


class KMeans:
    def __init__(self, arr, k):
        self.arr = arr
        self.k = k

    # Main driver function
    def loop_over_centers(self, max_iter):
        # Choose random centroids.
        random_centers = np.random.choice(len(self.arr), self.k, replace=False)

        # Iterate up to a maximum number of iterations.
        for iteration in range(max_iter):
            cluster_dict = defaultdict(list)

            if iteration == 0:
                chosen_centers = random_centers

            # Assign each sample to its nearest center.
            for i in range(len(self.arr)):
                distance_list = []
                for center in chosen_centers:
                    distance_list.append(self.distance(self.arr[center], self.arr[i]))

                cluster_dict[chosen_centers[np.argmin(distance_list)]].append(i)

            chosen_centers = self.update_center(cluster_dict)

        return cluster_dict

    # Distance between two arrays
    def distance(self, arr1, arr2):
        return np.sqrt(np.sum(np.subtract(arr1, arr2) ** 2))

    # Gives new centroids with each new iteration
    def update_center(self, cluster_dict):
        cluster_center = []

        for cluster in cluster_dict:
            cluster_list = []

            # Find samples corresponding to each cluster.
            for sample in cluster_dict[cluster]:
                cluster_list.append(self.arr[sample])

            # Mean of samples in a given cluster.
            cluster_mean = np.mean(np.array(cluster_list), axis=0).tolist()

            distance_from_mean = []

            # Find which sample is closest to the mean of the cluster.
            for sample in cluster_dict[cluster]:
                distance_from_mean.append(self.distance(cluster_mean, self.arr[sample]))

            # Build an updated list of cluster centers.
            cluster_center.append(cluster_dict[cluster][np.argmin(distance_from_mean)])

        return cluster_center


arr = np.array([(12, 3, 5), (1, 6, 7), (8, 92, 1), (6, 98, 2),
                (5, 90, 3), (29, 5, 6), (11, 4, 4), (30, 6, 5)])

testKMeans = KMeans(arr, 3)
testKMeans.loop_over_centers(1000)
```

In a real-world case, `arr` will have tens of millions of samples and I would like to have about 500,000 cluster centroids. At present the performance of the code is okay when `arr` has tens of millions of samples and the number of cluster centroids is small (10-20). But it is very slow when the number of clusters is very large (hundreds of thousands).

How can I improve the runtime performance of this code for a large number of cluster centers? Thanks.
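One direction I have been considering (sketched below, not part of my code above): the nested Python loop over samples and centers in the assignment step dominates the runtime, and it can be replaced with a chunked, vectorized distance computation using the identity ‖x − c‖² = ‖x‖² − 2x·c + ‖c‖², so NumPy computes all sample-center distances in bulk. The names `assign_chunked` and `chunk` are illustrative only.

```python
import numpy as np


def assign_chunked(samples, centers, chunk=4096):
    """Assign each sample to its nearest center without a Python-level
    loop over centers, processing `chunk` rows at a time to bound memory."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    c_sq = np.sum(centers ** 2, axis=1)  # ||c||^2 for every center, shape (k,)
    labels = np.empty(len(samples), dtype=int)
    for start in range(0, len(samples), chunk):
        block = samples[start:start + chunk]
        # ||x||^2 - 2 x.c + ||c||^2 for every pair in the block, shape (block, k)
        d2 = (np.sum(block ** 2, axis=1)[:, None]
              - 2.0 * block @ centers.T
              + c_sq[None, :])
        labels[start:start + chunk] = np.argmin(d2, axis=1)
    return labels


pts = np.array([[0.0, 0.0], [10.0, 10.0], [0.2, 0.1]])
ctr = np.array([[0.0, 0.0], [10.0, 10.0]])
assign_chunked(pts, ctr)  # -> array([0, 1, 0])
```

Note this only vectorizes the assignment step; with 500,000 centers a full distance matrix for each chunk is still large, so `chunk` would have to be tuned against available memory.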
