I have the following code to do a majority vote for data in a dataframe:
```python
def vote(df, systems):
    test = df.drop_duplicates(subset=('begin', 'end', 'case', 'system'))
    n = int(len(systems) / 2)
    data = []
    for row in test.itertuples():
        # get all matches
        fx = test.loc[(test.begin == row.begin) &
                      (test.end == row.end) &
                      (test.case == row.case)]
        fx = fx.loc[fx.system.isin(systems)]
        # keep if in a majority of systems
        if len(set(fx.system.tolist())) > n:
            data.append(fx)
    out = pd.concat(data, axis=0, ignore_index=True)
    out = out.drop_duplicates(subset=('begin', 'end', 'case'))
    return out[['begin', 'end', 'case']]
```
The data look like:
```
systems = ['A', 'B', 'C', 'D', 'E']

df =
begin,end,system,case
0,9,A,0365
10,14,A,0365
10,14,B,0365
10,14,C,0365
28,37,A,0366
38,42,A,0366
38,42,B,0366
53,69,C,0366
56,60,B,0366
56,60,C,0366
56,69,D,0366
64,69,E,0366
83,86,B,0367
```
The expected output should be:
```
out =
begin,end,case
10,14,0365
56,69,0366
```
In other words, if a `begin, end, case` triple appears in a majority of the systems, we accumulate it and return the triples as a dataframe.
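As a point of comparison, the per-row counting above can be expressed without the inner filters at all, using a single `groupby` over the key columns. This is a sketch, not a drop-in replacement: it implements only the exact-match majority rule (counting distinct systems per `(begin, end, case)` key), and the function name `vote_vectorized` is my own.

```python
import pandas as pd

def vote_vectorized(df, systems):
    # Strict majority threshold, as in vote(): more than half the systems.
    n = len(systems) // 2
    # Restrict to the systems of interest and drop duplicate votes.
    test = df[df.system.isin(systems)].drop_duplicates(
        subset=['begin', 'end', 'case', 'system'])
    # Count distinct systems per (begin, end, case) key in one pass.
    counts = test.groupby(['begin', 'end', 'case'])['system'].nunique()
    # Keep only keys voted for by a majority of systems.
    return counts[counts > n].reset_index()[['begin', 'end', 'case']]
```

This avoids re-filtering the frame once per row, which is where the quadratic cost in the original loop comes from.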
The algorithm works perfectly fine, but since the dataframe has hundreds of thousands of rows, it takes quite a while to process.
One optimization I can think of, but am unsure how to implement, is in the `itertuples` iteration: if, for the first instance of a filter set `begin, end, case`, there are matches in

```python
fx = test.loc[(test.begin == row.begin) &
              (test.end == row.end) &
              (test.case == row.case) &
              (test.system.isin(systems))]
```
then it would be beneficial not to iterate over the other rows in the `itertuples` iterable that match this filter. For example, after the first instance of `10,14,A,0365` there is no need to check the next two rows, since they have already been evaluated. However, since the iterable is already fixed, I cannot think of a way to skip them.