I am writing Python code for KMeans clustering.

The aim of this post is to find out **how I can make the code below optimal when the number of clusters is very large.** I am dealing with **data with tens of millions of samples and scores of features**. The **number of clusters that I am talking about is of the order of 500,000**. I have checked the correctness of the code by comparing its results with scikit-learn's implementation of KMeans clustering.

I am aware that I can use scikit-learn for KMeans clustering. However, I am writing my own custom Python code for several reasons, including but not limited to the following:

- To be able to swap distance functions for my specific work assignments: from Euclidean distance to Huang distance.
- To use this custom code for various other tasks like anomaly detection.
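To make the first point concrete, here is a minimal sketch of the kind of pluggable-metric hook I have in mind (the names `euclidean` and `nearest_center` are illustrative, not part of the code under review):

```python
import numpy as np

def euclidean(a, b):
    # Euclidean distance between two 1-D sequences.
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def nearest_center(sample, centers, metric=euclidean):
    # Index of the center closest to `sample` under the given metric;
    # swapping the metric only means passing a different callable.
    return int(np.argmin([metric(sample, c) for c in centers]))
```

Any other metric with the same `(a, b) -> float` signature could be passed in without touching the clustering loop.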

Below is my code:

```
import numpy as np
from collections import defaultdict


class KMeans:
    def __init__(self, arr, k):
        self.arr = arr
        self.k = k

    # Main driver function
    def loop_over_centers(self, max_iter):
        # Choose random centroids.
        random_centers = np.random.choice(len(self.arr), self.k, replace=False)
        # Iterate till a certain max number of iterations.
        for iteration in range(max_iter):
            cluster_dict = defaultdict(list)
            if iteration == 0:
                chosen_centers = np.array((0, 3, 2))  # np.array((4, 5, 2))  # random_centers
            # Assign each sample to its nearest current center.
            for i in range(len(self.arr)):
                distance_list = []
                for center in chosen_centers:
                    distance_list.append(self.distance(self.arr[center], self.arr[i]))
                cluster_dict[chosen_centers[np.argmin(distance_list)]].append(i)
            chosen_centers = self.update_center(cluster_dict)
        return cluster_dict

    # Distance between two arrays
    def distance(self, arr1, arr2):
        return np.sqrt(np.sum(np.subtract(arr1, arr2) ** 2))

    # Gives new centroids with each new iteration
    def update_center(self, cluster_dict):
        cluster_center = []
        for cluster in cluster_dict:
            cluster_list = []
            # Find samples corresponding to each cluster.
            for sample in cluster_dict[cluster]:
                cluster_list.append(self.arr[sample])
            # Mean of samples in a given cluster.
            cluster_mean = np.mean(np.array(cluster_list), axis=0).tolist()
            distance_from_mean = []
            # Find which sample is closest to mean of cluster.
            for sample in cluster_dict[cluster]:
                distance_from_mean.append(self.distance(cluster_mean, self.arr[sample]))
            # Build an updated list of cluster centers.
            cluster_center.append(cluster_dict[cluster][np.argmin(distance_from_mean)])
        return cluster_center


arr = np.array([(12, 3, 5), (1, 6, 7), (8, 92, 1), (6, 98, 2), (5, 90, 3), (29, 5, 6), (11, 4, 4), (30, 6, 5)])
testKMeans = KMeans(arr, 3)
testKMeans.loop_over_centers(1000)
```
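As an aside on the update step: `update_center` picks, for each cluster, the sample closest to the cluster mean. The per-cluster means themselves can be accumulated without Python loops; below is a rough sketch using `np.add.at` (the function name, and leaving empty clusters as NaN rows, are my own choices, not part of the code under review):

```python
import numpy as np

def cluster_means(X, labels, k):
    # Per-cluster means in one pass; empty clusters come out as NaN rows.
    counts = np.bincount(labels, minlength=k).astype(np.float64)  # (k,)
    sums = np.zeros((k, X.shape[1]))
    np.add.at(sums, labels, X)  # accumulate each row into its cluster's sum
    return sums / counts[:, None]
```

This replaces the two inner loops over `cluster_dict[cluster]` with array operations.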

In a real-world case, `arr` will have tens of millions of samples and I would like to have about 500,000 cluster centroids. At present the performance of the code is acceptable when `arr` has tens of millions of samples and the number of cluster centroids is small (10-20), but it is very slow when the number of clusters is very large (hundreds of thousands).
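For what it is worth, the per-sample distance loop is where the factor of k bites. The assignment step can in principle be batched with matrix products, so that no Python loop runs per sample; below is a rough sketch of what I mean (the function name and chunk size are arbitrary choices of mine):

```python
import numpy as np

def assign_labels(X, centers, chunk=1024):
    # Nearest-center index for each row of X, computed chunk by chunk so
    # the (chunk, k) distance matrix stays bounded in memory.
    X = np.asarray(X, dtype=np.float64)
    C = np.asarray(centers, dtype=np.float64)
    c_norms = (C ** 2).sum(axis=1)  # (k,) squared norms of the centers
    labels = np.empty(len(X), dtype=np.int64)
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]  # (m, d)
        # Squared distance is ||x||^2 - 2 x.c + ||c||^2; ||x||^2 is
        # constant per row, so it can be dropped for the argmin.
        d2 = c_norms - 2.0 * block @ C.T  # (m, k)
        labels[start:start + chunk] = d2.argmin(axis=1)
    return labels
```

Even so, with k around 500,000 each (chunk, k) block is large, which is why I am asking how to structure this properly.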

**How can I improve the runtime performance of this code for a large number of cluster centers?** Thanks.