performance – Non-equi self join with basic Python libraries

There is no reason to open the `product.csv` file twice.

The first time you open the file, you are reading and storing the entire file in memory. Instead of opening and reading the file a second time line by line, you could simply loop over the `list_csv`. This:

```
with open(r'product.csv', 'r') as file2:
    with open('product_p.csv', "w") as csvfile_write:
        ecriture = csv.writer(csvfile_write, delimiter=';',
                              quotechar='"', quoting=csv.QUOTE_ALL)
        ...
```

becomes:

```
with open('product_p.csv', "w") as csvfile_write:
    ecriture = csv.writer(csvfile_write, delimiter=';',
                          quotechar='"', quoting=csv.QUOTE_ALL)
    for row in list_csv:
        ...
```

With one million rows in your file, your $$O(N^2)$$ algorithm is visiting one trillion row pairings. With each row pairing, you convert strings to integers up to 4 times with `int(row[1]) >= int(comp[1]) and int(row[2]) <= int(comp[2])`. That is four trillion string-to-integer conversions!

Yeouch!
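To get a feel for the cost, here is a rough `timeit` comparison (the literal values are invented for illustration, and absolute timings vary by machine) of re-converting two strings on every comparison versus comparing integers that were converted once, up front:

```
import timeit

# Converts both strings to int on every comparison, like the inner loop does.
print(timeit.timeit("int('123456') >= int('1')", number=1_000_000))

# Compares integers that were converted once, before the loop.
print(timeit.timeit("a >= b", setup="a, b = 123456, 1", number=1_000_000))
```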

If instead, as you read the file the first time, you converted the three strings into integer values and stored the already-converted values in `list_csv`, validating in the process, your later code would be cleaner and faster.

```
with open(r'product.csv', 'r') as file1:
    list_csv = []
    for row in csv.reader(file1, delimiter=';'):
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            list_csv.append((product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

# No int(...) or try-except required beyond this point
```

`try: ... except: pass` is bad, bad, bad.

What exception is expected? `except ValueError:` is better. It lets the reader know what type of exception might occur, giving context.

It is also for your benefit. What if you accidentally wrote `... int(cmp[2]) ...` inside `try: ... except: pass`? You would get one trillion `NameError` exceptions raised and suppressed. After a long, long period of time, the program would complete and write an empty result file, leaving you with no clue what went wrong. Narrowing the type of exception caught to a `ValueError` would immediately reveal the issue.

It would be much better to actually handle the error, instead of suppressing it. Print out a message explaining what the problem was: what row failed to convert properly. If there are known valid reasons why the conversion might fail, check that the failure was expected, and report if the failure was not of the expected variety.
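A minimal sketch of what that could look like, assuming (purely for the sake of the sketch) that a header row on line 1 is the only failure you expect:

```
import csv

list_csv = []
with open(r'product.csv', 'r') as file1:
    reader = csv.reader(file1, delimiter=';')
    for line_number, row in enumerate(reader, start=1):
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
        except ValueError:
            # A header on line 1 is an expected failure; anything else is reported.
            if line_number != 1:
                print(f"Skipping malformed row {line_number}: {row!r}")
            continue
        list_csv.append((product, pos_min, pos_max, count_pos))
```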

What is `comp[1]`? `row[2]`? `comp[3]`? It is very difficult to tell from the code.

```
...
from typing import NamedTuple

class Product(NamedTuple):
    product: str
    pos_min: int
    pos_max: int
    count_pos: int

...

with open(r'product.csv', 'r') as file1:
    products = []
    for row in csv.reader(file1, delimiter=';'):
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            products.append(Product(product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

for row in products:
    ...
    for comp in products:
        if row.pos_min >= comp.pos_min and row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))

...
```

This is immediately more readable. We’re not comparing some arbitrary thing at indices 1, 2, and 0; we’re comparing named things.

Speaking of names …

I’ve changed `list_csv` to `products`. It was read from a CSV file, but the elements are no longer comma-separated values. The commas are long gone. It’s just a list of products.

`row` and `comp` could use better names too.

I’ve also replaced:

```
for k in range(len(list_csv)):
    comp = list_csv[k]
```

with a loop directly over the values of the list. There is no need for the `k` index variable. The replacement is simply:
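```
for comp in list_csv:
    ...
```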

With the above changes, your algorithm is still the same, which means it is still $$O(N^2)$$.

Can we improve it? Sorting is $$O(N \log N)$$, which is better than $$O(N^2)$$, so it is an easy thing to try. What can we sort on? Would sorting on `pos_min` help us?

Yes, it would, but we need to be careful.

With products sorted on `pos_min`, `row.pos_min >= comp.pos_min` would be true for all `comp` items occurring earlier than `row` in the `products` list. We might be tempted to loop `for row_idx in range(len(products))` and `for comp_idx in range(row_idx)`, but we’d introduce a bug:

Several items sorted after `row` might still have a `pos_min` equal to `row.pos_min`, so we can’t just loop up to the current row. But we can stop when we reach the first `pos_min` value larger than `row.pos_min`.

```
products.sort(key=lambda p: p.pos_min)

for row in products:
    limit = row.pos_min
    for comp in products:
        if comp.pos_min > limit:
            break
        if row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))

...
```

One trillion row-pairings just dropped to one-half trillion row-pairings. It is still $$O(N^2)$$, but we cut the time in half.
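To make the early exit concrete, here is a self-contained sketch on toy data (the product names, positions, and the `defaultdict` used for `res` are all invented for illustration; your real code builds them from the CSV):

```
from collections import defaultdict
from typing import NamedTuple

class Product(NamedTuple):
    product: str
    pos_min: int
    pos_max: int
    count_pos: int

# Toy rows, invented for illustration.
products = [
    Product('A', 1, 10, 5),
    Product('B', 2, 9, 3),
    Product('C', 2, 8, 7),
    Product('D', 4, 6, 1),
]

products.sort(key=lambda p: p.pos_min)

res = defaultdict(list)
for row in products:
    limit = row.pos_min
    for comp in products:
        if comp.pos_min > limit:
            break  # every later comp has comp.pos_min > row.pos_min
        if row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))

# Expected: B is inside A; C is inside A and B; D is inside A, B and C.
for product, matches in sorted(res.items()):
    print(product, matches)
```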