Non-equi self join with basic Python libraries

There is no reason to open the product.csv file twice.

The first time you open the file, you read it and store its entire contents in memory. Instead of opening and reading the file a second time, line by line, you can simply loop over list_csv. This:

with open(r'product.csv', 'r') as file2:
    my_reader2 = csv.reader(file2, delimiter=';') 
    with open('product_p.csv', "w") as csvfile_write:
        ecriture = csv.writer(csvfile_write, delimiter=';',
                                quotechar='"', quoting=csv.QUOTE_ALL)
        for row in my_reader2:
            ...

becomes:

with open('product_p.csv', "w") as csvfile_write:
    ecriture = csv.writer(csvfile_write, delimiter=';',
                          quotechar='"', quoting=csv.QUOTE_ALL)
    for row in list_csv:
        ...

With one million rows in your file, your $O(N^2)$ algorithm visits one trillion row pairings. For each pairing, you convert strings to integers up to four times with int(row[1]) >= int(comp[1]) and int(row[2]) <= int(comp[2]). That is four trillion string-to-integer conversions!

Yeouch!

If instead, as you read the file the first time, you converted the three numeric strings into integer values and stored the already-converted values in list_csv, validating in the process, your later code would be cleaner and faster.

with open(r'product.csv', 'r') as file1:
    my_reader1 = csv.reader(file1, delimiter=';')
    list_csv = []
    for row in my_reader1:
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            list_csv.append((product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

    # No int(...) or try-except required beyond this point

A bare try: ... except: pass is bad, bad, bad.

What exception is expected? except ValueError: is better. It lets the reader know what type of exception might occur, giving context.

It is also for your benefit. What if you accidentally wrote ... int(cmp[2]) ... inside try: ... except: pass? You would get one trillion NameError exceptions raised and suppressed. After a long, long period of time, the program would complete and write an empty result file, leaving you with no clue what went wrong. Narrowing the type of exception caught to a ValueError would immediately reveal the issue.

It would be much better to actually handle the error, instead of suppressing it. Print out a message explaining what the problem was: what row failed to convert properly. If there are known valid reasons why the conversion might fail, check that the failure was expected, and report if the failure was not of the expected variety.
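
For example, here is a minimal sketch of what that reporting could look like in the reading loop. It assumes, hypothetically, that a header line whose first field is the literal text 'product' is the only failure we expect; adjust to whatever your file actually contains:

    for row in my_reader1:
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            list_csv.append((product, pos_min, pos_max, count_pos))
        except ValueError:
            # Expected failure: the header line (hypothetical first field 'product').
            if row and row[0] == 'product':
                continue
            # Anything else is a surprise: report it instead of silently dropping it.
            print(f"Could not convert row {my_reader1.line_num}: {row!r}")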

Giving names to your data structure elements can help readability.

What is comp[1]? row[2]? comp[3]? It is very difficult to tell from the code.

Consider instead:

...
from typing import NamedTuple

class Product(NamedTuple):
    product: str
    pos_min: int
    pos_max: int
    count_pos: int

...

with open(r'product.csv', 'r') as file1:
    my_reader = csv.reader(file1, delimiter=';')
    products = []
    for row in my_reader:
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            products.append(Product(product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

    for row in products:
        ...
        for comp in products:
            if row.pos_min >= comp.pos_min and row.pos_max <= comp.pos_max and row.product != comp.product:
                res[row.product].append((comp.product, comp.count_pos))

        ...

This is immediately more readable. We’re not comparing some arbitrary thing at indices 1, 2, and 0; we’re comparing named things.

Speaking of names …

I’ve changed list_csv to products. It was read from a CSV file, but the elements are no longer comma-separated values. The commas are long gone. It’s just a list of products.

row and comp could use better names too.

I’ve also replaced:

            for k in range(len(list_csv)):
                comp = list_csv[k]

with a loop directly over the values of the list (shown below). There is no need for the k index variable.
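
That is, the inner loop becomes just the iteration already used in the example above:

            for comp in products:
                ...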

With the above changes, your algorithm is still the same, which means it is still $O(N^2)$.

Can we improve it? Sorting is $O(N \log N)$, which is better than $O(N^2)$, so it is an easy thing to try. What can we sort on? Would sorting on pos_min help us?

Yes, it would, but we need to be careful.

With products sorted on pos_min, row.pos_min >= comp.pos_min holds for every comp item occurring earlier than row in the products list. We might be tempted to loop for row_idx in range(len(products)), and for comp_idx in range(row_idx), but that would introduce a bug:

Several items sorted after row might still have a pos_min equal to row.pos_min (if two products share, say, pos_min = 5, whichever sorts first would never be paired with the other), so we can’t just loop up to the current row. But we can stop when we reach the first pos_min value larger than row.pos_min.


products.sort(key=lambda product: product.pos_min)

for row in products:
    limit = row.pos_min
    for comp in products:
        if comp.pos_min > limit:
            break
        if row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))

    ...

One trillion row-pairings just dropped to one-half trillion row-pairings. It is still $O(N^2)$, but we cut the time in half.