I had a set of Python scripts for automating custom ETL jobs: the source data comes in heterogeneous forms, the transformations are simple but specific to each source, and the destinations are different tables in a SQL database.
I started with custom functions for the extraction (typically CSV files, but each one with its own quirks), the transformation (the source-specific logic) and the loading into the database (a custom SQL query for each target table).
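To make that concrete, here is a minimal sketch of what one of those per-source function triples might look like; the source name ("sales"), the field names, and the delimiter are invented for illustration, and each real source had its own variant of these three functions with roughly 80% shared structure:

```python
import csv

def extract_sales(path):
    """Read the source CSV; this particular source uses ';' as delimiter."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter=";"))

def transform_sales(rows):
    """Source-specific cleanup: rename fields, cast decimal-comma amounts."""
    return [
        {"sale_date": r["date"], "amount": float(r["amount"].replace(",", "."))}
        for r in rows
    ]

def load_sales(rows, conn):
    """Insert into the target table with a query written just for it."""
    conn.executemany(
        "INSERT INTO sales (sale_date, amount) VALUES (:sale_date, :amount)",
        rows,
    )
    conn.commit()
```

Multiply this by n sources and you get the duplication described below.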
Then I started “complicating things to make the whole thing easier”: I created a DB class with methods for inserting arbitrary data into an arbitrary database, including the possibility to map fields in the source to target fields in the database… adding post-load queries to update statistics… So I’ve travelled between two extremes:
- Each individual load is more or less 80% the same code, so I end up with n copies of roughly the same 80%. The code is very flexible but bulky, and I don’t feel that it’s “the right way” to do things.
- A huge, complex class with a lot of properties and methods that reduces each job to n calls to methods on this class. The code is much more compact (except for the monstrous DB class that does almost all the work) but more rigid, and that doesn’t feel like “the right way” either.
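For contrast with the first sketch, here is a heavily condensed, hypothetical version of the second extreme (the real class is much bigger; the `load` signature, `field_map` and `post_load` parameters are my invention to illustrate the configuration-driven style):

```python
class DB:
    """One generic loader driven by configuration instead of per-source code."""

    def __init__(self, conn):
        self.conn = conn

    def load(self, table, rows, field_map, post_load=()):
        # Rename source fields to target columns via the mapping.
        mapped = [{dst: r[src] for src, dst in field_map.items()} for r in rows]
        cols = list(field_map.values())
        sql = "INSERT INTO {} ({}) VALUES ({})".format(
            table, ", ".join(cols), ", ".join(":" + c for c in cols)
        )
        self.conn.executemany(sql, mapped)
        for query in post_load:  # e.g. statistics updates after the insert
            self.conn.execute(query)
        self.conn.commit()
```

With this, each ETL job collapses into a single call like `db.load("sales", rows, {"date": "sale_date", "amount": "amount"}, post_load=[...])`, which is the compact-but-rigid situation I’m describing.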
Somewhere in the middle there must be a sweet spot. Do you have any hints or advice on how to know when to stop encapsulating? Feel free to use the same example (an ETL of heterogeneous data) in your explanation. Thanks.