python – Optimize K-Means for a large number of clusters

I am writing Python code for KMeans clustering.

The aim of this post is to find out how to optimize the code below when the number of clusters is very large. I am dealing with data with tens of millions of samples and scores of features. The number of clusters I am talking about is on the order of 500,000. I have checked the correctness of the code by comparing its results with those from scikit-learn’s implementation of KMeans clustering.

I am aware that I can use scikit-learn for KMeans clustering. However, I am writing my own custom Python code for several reasons, which include but are not limited to the following:

  • To be able to swap out the distance function for my specific work assignments: from Euclidean distance to Huang distance.
  • To use this custom code for various other tasks like anomaly detection.
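For illustration, one way to keep the distance function swappable is to pass a callable into the nearest-center lookup. This is a sketch, not my current code; `matching_dissimilarity` here is the simple mismatch count used in Huang's k-prototypes work, shown only as an example of a non-Euclidean metric:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def matching_dissimilarity(a, b):
    # Count of positions where the two vectors differ
    # (the categorical part of Huang's measure).
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def nearest_center(sample, centers, dist=euclidean):
    # The metric is injected, so swapping distance functions
    # requires no change to the clustering loop itself.
    distances = [dist(sample, c) for c in centers]
    return int(np.argmin(distances))

print(nearest_center([1, 2, 3], [[0, 0, 0], [1, 2, 4]]))  # -> 1
```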

Below is my code:

import numpy as np
from collections import defaultdict

class KMeans:
    def __init__(self, arr, k):
        self.arr = np.asarray(arr)
        self.k = k

    # Main driver function
    def loop_over_centers(self, max_iter):

        # Choose random initial centroids (indices into self.arr).
        chosen_centers = np.random.choice(len(self.arr), self.k, replace=False)

        # Iterate up to a maximum number of iterations.
        for iteration in range(max_iter):
            cluster_dict = defaultdict(list)

            # Assign each sample to its nearest center.
            for i in range(len(self.arr)):
                distance_list = []
                for center in chosen_centers:
                    distance_list.append(self.distance(self.arr[center], self.arr[i]))

                cluster_dict[chosen_centers[np.argmin(distance_list)]].append(i)

            chosen_centers = self.update_center(cluster_dict)

        return cluster_dict

    # Distance between two arrays
    def distance(self, arr1, arr2):
        return np.sqrt(np.sum(np.subtract(arr1, arr2) ** 2))

    # Gives new centroids with each new iteration
    def update_center(self, cluster_dict):
        cluster_center = []

        for cluster in cluster_dict:
            # Find samples corresponding to each cluster.
            cluster_list = [self.arr[sample] for sample in cluster_dict[cluster]]

            # Mean of samples in a given cluster.
            cluster_mean = np.mean(np.array(cluster_list), axis=0)

            # Find which sample is closest to the mean of the cluster.
            distance_from_mean = [self.distance(cluster_mean, self.arr[sample])
                                  for sample in cluster_dict[cluster]]

            # Build an updated list of cluster centers.
            cluster_center.append(cluster_dict[cluster][np.argmin(distance_from_mean)])

        return cluster_center
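Note that the update step above is a medoid-style update: it picks the member sample closest to the cluster mean rather than using the mean itself as the new center. In isolation, that step can be sketched for one cluster as (a standalone sketch, not part of the class):

```python
import numpy as np

def medoid_update(points):
    # points: (n, d) array of the samples assigned to one cluster.
    mean = points.mean(axis=0)
    # Squared distances from each member to the cluster mean.
    d2 = np.sum((points - mean) ** 2, axis=1)
    # Index (within this cluster) of the member closest to the mean.
    return int(np.argmin(d2))

pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.1, 0.0]])
print(medoid_update(pts))  # -> 2 (mean is about (1.03, 0); [1.1, 0] is closest)
```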

arr = np.array([[12, 3, 5], [1, 6, 7], [8, 92, 1], [6, 98, 2],
                [5, 90, 3], [29, 5, 6], [11, 4, 4], [30, 6, 5]])

testKMeans = KMeans(arr, 3)
clusters = testKMeans.loop_over_centers(10)

In a real-world case, arr will have tens of millions of samples and I would like to have about 500,000 cluster centroids. At present the performance of the code is acceptable when arr has tens of millions of samples and the number of cluster centroids is small (10-20), but it is very slow when the number of clusters is in the hundreds of thousands.
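For context on where the time goes: the inner loop computes one pairwise sample-center distance per Python iteration. For comparison, the same assignment step can be expressed as a single NumPy broadcast (a sketch on toy data, not part of my current code):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.random((1000, 5))   # toy stand-in for the real data
centers = rng.random((20, 5))

# (n_samples, 1, d) - (1, n_centers, d) broadcasts to
# (n_samples, n_centers, d); summing over the last axis gives
# all pairwise squared distances at once.
sq_dists = np.sum((samples[:, None, :] - centers[None, :, :]) ** 2, axis=2)
labels = np.argmin(sq_dists, axis=1)  # nearest center per sample
print(labels.shape)  # (1000,)
```

With 500,000 centers, the intermediate (n_samples, n_centers, d) array would not fit in memory in one piece, so the computation would need to be chunked; the sketch only shows the shape of the vectorized assignment step.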

How can I improve the runtime performance of this code for a large number of cluster centers? Thanks.