python – Pandas replace rare values in a pipeline

A common preprocessing step in machine learning consists of replacing rare values in the data with a label such as “rare”, so that subsequent learning algorithms do not try to generalize from a value with few occurrences.
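As a minimal sketch of the idea on a single column (the threshold of 2 and the label "rare" are arbitrary choices for illustration, and the variable names are hypothetical):

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "a", "c"])

# Count how often each distinct value occurs.
counts = s.value_counts()

# Map every element to its own frequency and flag the infrequent ones.
is_rare = s.map(counts) < 2

# Replace flagged elements with the placeholder label.
result = s.mask(is_rare, "rare")
print(result.tolist())
```

Here "b" and "c" each occur once, so both are replaced, while "a" is kept.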

Pipelines make it possible to describe a sequence of preprocessing steps and learning algorithms as a single object that takes raw data, processes it, and outputs a prediction. scikit-learn expects each step to expose a specific interface (fit / transform or fit / predict). I wrote the following class to take care of this task so that it can be run inside a pipeline. (More details about the motivation can be found here: pandas replace rare values)

Is there a way to improve this code in terms of performance or reusability?

class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"

        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)


if __name__ == "__main__":
    import pandas as pd

    sample_train = pd.DataFrame(
        [{"a": 1, "s": "a"}, {"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    rssfe = RemoveScarceValuesFeatureEngineer(2)
    print(sample_train)
    print(rssfe.fit_transform(sample_train, None))
    print(20*"=")

    sample_test = pd.DataFrame([{"a": 1, "s": "a"}, {"a": 1, "s": "b"}])
    print(sample_test)
    print(rssfe.transform(sample_test))
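
For context, the fit / transform convention mentioned above could be wired into an actual scikit-learn `Pipeline` roughly as follows. This is only a sketch under assumptions: it relies on scikit-learn's `BaseEstimator` and `TransformerMixin` base classes, and the names `RareValueReplacer`, `min_count`, and `label` are hypothetical and not part of the class above. Unlike the class above, it also treats values never seen during `fit` as having a count of zero and works on a copy of the input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RareValueReplacer(BaseEstimator, TransformerMixin):
    """Hypothetical step: replace values seen fewer than min_count times."""

    def __init__(self, min_count=2, label="RARE_VALUE"):
        self.min_count = min_count
        self.label = label

    def fit(self, X, y=None):
        # Learn per-column frequencies from the training data only.
        self.value_counts_ = {col: X[col].value_counts() for col in X.columns}
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for col, counts in self.value_counts_.items():
            # Values absent from the training data map to NaN;
            # they are treated here as having a count of 0.
            freq = X[col].map(counts).fillna(0)
            X[col] = X[col].mask(freq < self.min_count, self.label)
        return X


train = pd.DataFrame({"a": [1, 1, 1], "s": ["a", "a", "b"]})
test = pd.DataFrame({"a": [1, 1], "s": ["a", "c"]})

pipeline = Pipeline([("rare", RareValueReplacer(min_count=2))])
train_out = pipeline.fit_transform(train)
test_out = pipeline.transform(test)
print(train_out)
print(test_out)
```

`TransformerMixin` supplies `fit_transform`, so it does not need to be written by hand, and `BaseEstimator` provides `get_params` / `set_params` so the step can take part in a grid search.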