python – Correct structure of a data science project: keeping as functions versus object-oriented framework

I’m working on a data science project that involves ingesting, cleaning, and aggregating timeseries equipment data, primarily using Pandas. I began the project by just writing various functions from the ground up, but I’m now at a point where I need to make critical decisions about the codebase structure before it gets out of hand.

I don’t have much experience creating an object-oriented framework, but I suspect that this would be the correct way to go in my case. I want to follow best practices and introduce OOP, but only if it actually provides utility.

Here’s a general overview of what the codebase does (with references to the dummy classes I created in the dummified code that follows):

  1. Ingests data from a specified dir (GetData class)

  2. Performs data cleaning (GetData class)

  3. Creates a set of new variables at the same granular timestamp level. What makes this tricky is that there are functions that create one variable for the entire dataset, but also functions that create one value for each piece of a specific equipment type (PreppedData class)

  4. Group raw data by different temporal aggregations – year, year/month, year/week, year/dayofyear/hour (GroupedObject class)

  5. Concatenate both the raw data and the grouped raw data into a list and then deploy the same set of functions on that list (AnalyzedData class)

  6. Write outputs from step 5 to Excel tabs. I don’t have a class for this – I’m not sure whether this should just be a side effect of every detailed function
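For step 4, a minimal sketch of the temporal groupings, assuming the raw data carries a `DatetimeIndex` (the column name here is made up):

```python
import numpy as np
import pandas as pd

# Dummy hourly timeseries with a DatetimeIndex (the "power" column is hypothetical)
idx = pd.date_range("2023-01-01", periods=100, freq="h")
raw = pd.DataFrame({"power": np.random.rand(100)}, index=idx)

# Temporal aggregations built from index attributes, as in step 4
by_month = raw.groupby([raw.index.year, raw.index.month])
by_hour = raw.groupby([raw.index.year, raw.index.dayofyear, raw.index.hour])

monthly_mean = by_month.mean()
```

Passing a list of index attributes to `groupby` produces a MultiIndex of (year, month) or (year, dayofyear, hour) keys, so the same downstream functions can be applied at each level.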

The AnalyzedData class is a bit complex because its functions need to know whether the object is at the granular level (and thus doesn’t require an apply statement to perform an operation on the groups) or not.
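On that dispatch point (and the type check in calc_custom_weighted_mean_overall), `isinstance` is the usual way to test for a groupby object. A minimal sketch, where `weighted_mean` and `calc` are hypothetical stand-ins for the helpers:

```python
import pandas as pd
# pandas 2.0+ also exposes this publicly as pandas.api.typing.DataFrameGroupBy
from pandas.core.groupby.generic import DataFrameGroupBy

def weighted_mean(df, var_name, weight_var):
    # Weighted mean of one column
    return (df[var_name] * df[weight_var]).sum() / df[weight_var].sum()

def calc(obj, var_name, weight_var):
    # Dispatch: a groupby needs apply, a plain DataFrame does not
    if isinstance(obj, DataFrameGroupBy):
        return obj.apply(weighted_mean, var_name=var_name, weight_var=weight_var)
    return weighted_mean(obj, var_name, weight_var)
```

With a plain DataFrame this returns a scalar; with a groupby it returns one weighted mean per group, so the caller gets the same operation at either level.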

import glob

import pandas as pd

class GetData:
    def __init__(self, path, imp_format):
        df_lst = []
        if imp_format == 'csv':
            files = glob.glob(path + '*.csv')
            for f in files:
                df_lst.append(pd.read_csv(f))  # etc.
        elif imp_format == 'excel':
            # similar process
            pass
        self.comb_data = pd.concat(df_lst, axis=0)

    def clean_data(self):
        pass

class PreppedData:
    # methods to take raw data and create new fields with same time aggregation
    def method_1(self):
        # returns 1 new variable for each timestamp
        pass

    def method_2(self, equipment_type_var_names):
        # returns 1 value for each timestamp for each specific equipment type,
        # so timestamp in index, and columns would be discrete pieces of equipment
        pass
    # etc.
class GroupedObject:
    def __init__(self, input_data):
        hourly = input_data.groupby(
            [input_data.index.year, input_data.index.dayofyear, input_data.index.hour])
        daily = input_data.groupby(
            [input_data.index.year, input_data.index.dayofyear])
        # etc.
        self.data_lst = (input_data, hourly, daily)  # etc.

class AnalyzedData:
    # calculates weighted mean of var_name with weight_var as the weight
    def weighted_mean_helper(self, df, var_name, weight_var):
        pass

    # calculates weighted mean of every variable in var_lst with weight_var as the weight
    def weighted_mean_apply_helper(self, df, var_lst, weight_var):
        pass

    def calc_custom_weighted_mean_overall(self, input_list, mean_var, weight_var):
        result_lst = []
        for i in input_list:
            # isinstance is the way to check whether the object is a df groupby
            if isinstance(i, pd.core.groupby.generic.DataFrameGroupBy):
                temp = i.apply(self.weighted_mean_helper,
                               var_name=mean_var, weight_var=weight_var)
            else:
                temp = self.weighted_mean_helper(i, mean_var, weight_var)
            result_lst.append(temp)
        return result_lst

    def calc_custom_weighted_mean_by_equipment(self, input_list,
                                               equipment_type_var_names, weight_var):
        result_lst = []
        for i in input_list:
            if isinstance(i, pd.core.groupby.generic.DataFrameGroupBy):
                temp = i.apply(self.weighted_mean_apply_helper,
                               var_lst=equipment_type_var_names, weight_var=weight_var)
                # consolidate all results into a dataframe
                res_cons = ...  # etc.
            else:
                for j in equipment_type_var_names:
                    temp = self.weighted_mean_helper(i, j, weight_var)
        return result_lst
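For the consolidation step (res_cons), one option, assuming each aggregation level yields a Series keyed by equipment, is pd.concat with a dict of results; all names below are made up:

```python
import pandas as pd

# Hypothetical per-aggregation-level results (e.g. weighted means per equipment)
results = {
    "raw": pd.Series({"pump": 1.5, "fan": 2.0}),
    "daily": pd.Series({"pump": 1.4, "fan": 2.1}),
}

# One DataFrame: rows = aggregation level, columns = equipment
res_cons = pd.concat(results, axis=1).T
```

Passing a dict to `pd.concat` uses the keys as labels, which keeps track of which aggregation level each row came from.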

This structure seems roughly OK to me (except for the GetData class, which could instead serve as a parent class for subclasses specific to each datasource’s API), but I’m not able to articulate to myself why I shouldn’t just keep this structure in function form without adding classes, similar to how I have it now.