I’m working on a data science project that involves data ingestion, cleaning and various aggregations and functions of timeseries equipment data primarily using Pandas. I began the project just writing various functions from the ground up, but am now at a point where I need to make critical decisions about the codebase structure before it gets out of hand.
I don’t have much experience creating an object-oriented framework, but I suspect it would be the right approach in my case. I want to follow best practices, and introduce OOP only if it actually provides utility.
Here’s a general overview of what the codebase does (with references to the dummy classes I created in the simplified code that follows):
1. Ingests data from a specified dir (GetData class)
2. Performs data cleaning (GetData class)
3. Creates a set of new variables at the same granular timestamp level. What makes this tricky is that there are functions that create 1 variable for the entire dataset, but also functions that create 1 value for each piece of a specific equipment type (PreppedData class)
4. Groups raw data by different temporal aggregations: year, year/month, year/week, year/dayofyear/hour (GroupedObject class)
5. Concatenates both the raw data and grouped raw data into a list and then deploys the same set of functions on that list (AnalyzedData class)
6. Writes outputs from step 5 to Excel tabs. I don’t have a class for this, and I’m not sure whether this should just be a side effect in every detailed function
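To make the derived-variable and grouping steps more concrete, here is a minimal sketch in plain function form. The column names (`pump_a`, `pump_b`, `flow`) and the function names are made up for illustration and are not from my real data; the only assumption is that the frame has a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 hourly readings for two pieces of equipment.
idx = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(48), unit="h")
raw = pd.DataFrame(
    {
        "pump_a": np.arange(1, 49, dtype=float),
        "pump_b": np.arange(1, 49, dtype=float) * 2,
        "flow": np.ones(48),
    },
    index=idx,
)


def add_total(df, equipment_cols):
    """One new variable for the whole dataset: row-wise sum across equipment."""
    return df.assign(total=df[equipment_cols].sum(axis=1))


def share_by_equipment(df, equipment_cols):
    """One value per timestamp per piece of equipment: share of the row total."""
    total = df[equipment_cols].sum(axis=1)
    return df[equipment_cols].div(total, axis=0)


def group_by_time(df):
    """Group the same frame at several temporal levels, keyed by a label."""
    i = df.index
    return {
        "yearly": df.groupby(i.year),
        "monthly": df.groupby([i.year, i.month]),
        "daily": df.groupby([i.year, i.dayofyear]),
        "hourly": df.groupby([i.year, i.dayofyear, i.hour]),
    }


equipment = ["pump_a", "pump_b"]
with_total = add_total(raw, equipment)
shares = share_by_equipment(raw, equipment)
groups = group_by_time(raw)
```

Returning a dict keyed by aggregation level, rather than a bare list, would also give the Excel step an obvious tab name for each result.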
The AnalyzedData class is a bit complex because we need functions that know whether the object is at the granular level (and thus doesn’t require an apply statement to perform an operation on the groups) or not.
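A minimal sketch of the kind of dispatch I have in mind, assuming `isinstance` against pandas' `DataFrameGroupBy` class is an acceptable way to tell the two cases apart. The function names and the toy columns `x` and `w` are hypothetical.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy


def weighted_mean(df, var_name, weight_var):
    """Weighted mean of one column, using another column as the weight."""
    weights = df[weight_var]
    return (df[var_name] * weights).sum() / weights.sum()


def weighted_mean_any(obj, var_name, weight_var):
    """Apply per group for grouped input, or directly for a plain DataFrame."""
    if isinstance(obj, DataFrameGroupBy):
        return obj.apply(weighted_mean, var_name=var_name, weight_var=weight_var)
    return weighted_mean(obj, var_name, weight_var)


# Toy data: two readings on one day, one on the next.
df = pd.DataFrame(
    {"x": [1.0, 3.0, 10.0], "w": [1.0, 1.0, 2.0]},
    index=pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-02 00:00"]),
)

overall = weighted_mean_any(df, "x", "w")                           # scalar
by_day = weighted_mean_any(df.groupby(df.index.dayofyear), "x", "w")  # Series, one row per day
```

With this shape the granular and grouped cases share one helper, and only the thin wrapper needs to know which kind of object it received.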
    import glob

    import pandas as pd
    from pandas.core.groupby import DataFrameGroupBy


    class GetData:
        def __init__(self, imp_format, path):
            df_lst = []
            if imp_format == 'csv':
                files = glob.glob(path + '*.csv')
                for i in files:
                    df = pd.read_csv(i)  # etc.
                    df_lst.append(df)
            elif imp_format == 'excel':
                pass  # similar process
            comb_data = pd.concat(df_lst, axis=0)
            self.data = comb_data

        def clean_data(self):
            pass


    class PreppedData:
        # methods to take raw data and create new fields with same time aggregation
        def method_1(self):
            # returns 1 new variable for each timestamp
            pass

        def method_2(self, equipment_type_var_names):
            # returns 1 value for each timestamp for each specific equipment type,
            # so timestamp in index, and columns would be discrete pieces of equipment
            pass

        # etc.


    class GroupedObject:
        def __init__(self, input_data):
            idx = input_data.index
            # groupby takes a list of keys, not separate positional arguments
            hourly = input_data.groupby([idx.year, idx.dayofyear, idx.hour])
            daily = input_data.groupby([idx.year, idx.dayofyear])
            # etc.
            self.data_lst = [input_data, hourly, daily]  # etc.


    class AnalyzedData:
        # calculates weighted mean of var_name with weight_var as the weight
        @staticmethod
        def weighted_mean_helper(df, var_name, weight_var):
            pass

        # calculates weighted mean of every variable in the var_lst list
        # with weight_var as the weight
        @staticmethod
        def weighted_mean_apply_helper(df, var_lst, weight_var):
            pass

        @staticmethod
        def calc_custom_weighted_mean_overall(input_list, mean_var, weight_var):
            result_lst = []
            for i in input_list:
                # isinstance tells grouped input apart from a plain DataFrame
                if isinstance(i, DataFrameGroupBy):
                    temp = i.apply(AnalyzedData.weighted_mean_helper,
                                   var_name=mean_var, weight_var=weight_var)
                else:
                    temp = AnalyzedData.weighted_mean_helper(i, mean_var, weight_var)
                result_lst.append(temp)
            return result_lst

        @staticmethod
        def calc_custom_weighted_mean_by_equipment(input_list, equipment_type_var_names, weight_var):
            result_set = []
            for i in input_list:
                if isinstance(i, DataFrameGroupBy):
                    temp = i.apply(AnalyzedData.weighted_mean_apply_helper,
                                   var_lst=equipment_type_var_names, weight_var=weight_var)
                    # consolidate all results into a dataframe
                    res_cons = ...  # ###
                    result_set.append(res_cons)
                else:
                    for j in equipment_type_var_names:
                        temp = AnalyzedData.weighted_mean_helper(i[[j, weight_var]], j, weight_var)
                        result_set.append(temp)
            return result_set
This structure seems roughly OK, except for the GetData class, which could serve as a parent class for subclasses that are specific to the API of each data source. Even so, I’m not able to articulate to myself why I shouldn’t just keep this structure in function form without adding classes, similar to how I have it now.
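For reference, this is roughly what I picture for the parent-class idea: a sketch only, where `DataSource`, `CsvSource`, `read_one`, and `load` are names I made up, and the demo CSV files are throwaway data written to a temporary directory.

```python
import glob
import os
import tempfile
from abc import ABC, abstractmethod

import pandas as pd


class DataSource(ABC):
    """Hypothetical parent: owns the shared find-files, read, concatenate logic."""

    pattern = "*"

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def read_one(self, filename):
        """Read a single file into a DataFrame; format-specific."""

    def load(self):
        files = sorted(glob.glob(os.path.join(self.path, self.pattern)))
        return pd.concat([self.read_one(f) for f in files], axis=0)


class CsvSource(DataSource):
    """Hypothetical subclass: only the per-file reader differs."""

    pattern = "*.csv"

    def read_one(self, filename):
        return pd.read_csv(filename, index_col=0, parse_dates=True)


# Demo with two small throwaway CSV files.
with tempfile.TemporaryDirectory() as tmp:
    for name, day in [("a.csv", "2023-01-01"), ("b.csv", "2023-01-02")]:
        frame = pd.DataFrame(
            {"value": [1.0, 2.0]},
            index=pd.to_datetime([f"{day} 00:00", f"{day} 01:00"]),
        )
        frame.to_csv(os.path.join(tmp, name))
    combined = CsvSource(tmp).load()
```

Here inheritance earns its place only because `load` is shared and `read_one` varies; the rest of my pipeline does not obviously have that shape, which is part of why I’m unsure classes help there.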