There is no reason to open the `product.csv` file twice.
The first time you open the file, you are reading and storing the entire file in memory. Instead of opening and reading the file a second time, line by line, you can simply loop over the in-memory `list_csv`. That is, replace:
```python
with open(r'product.csv', 'r') as file2:
    my_reader2 = csv.reader(file2, delimiter=';')
    with open('product_p.csv', "w") as csvfile_write:
        ecriture = csv.writer(csvfile_write, delimiter=';',
                              quotechar='"', quoting=csv.QUOTE_ALL)
        for row in my_reader2:
            ...
```
with:

```python
with open('product_p.csv', "w") as csvfile_write:
    ecriture = csv.writer(csvfile_write, delimiter=';',
                          quotechar='"', quoting=csv.QUOTE_ALL)
    for row in list_csv:
        ...
```
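As a minimal runnable sketch of the single-read approach (using `io.StringIO` with hypothetical sample data in place of the real files):

```python
import csv
import io

# Hypothetical sample data standing in for product.csv.
sample = "A;1;10;5\nB;2;8;3\n"

# Read the input once, storing every row in memory ...
with io.StringIO(sample) as file1:
    my_reader = csv.reader(file1, delimiter=';')
    list_csv = [row for row in my_reader]

# ... then write from the in-memory list; no second read is needed.
out = io.StringIO()
ecriture = csv.writer(out, delimiter=';', quotechar='"',
                      quoting=csv.QUOTE_ALL)
for row in list_csv:
    ecriture.writerow(row)
```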
With one million rows in your file, your $O(N^2)$ algorithm is visiting one trillion row pairings. For each pairing, you convert strings to integers up to 4 times with `int(row[1]) >= int(comp[1]) and int(row[2]) <= int(comp[2])`. That is four trillion string-to-integer conversions!
If instead, as you read the file the first time, you converted the three strings into integer values and stored the already-converted values in `list_csv`, validating in the process, your later code would be cleaner and faster:
```python
with open(r'product.csv', 'r') as file1:
    my_reader1 = csv.reader(file1, delimiter=';')
    list_csv = []
    for row in my_reader1:
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            list_csv.append((product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

# No int(...) or try-except required beyond this point
```
`try: ... except: pass` is bad, bad, bad.
What exception is expected?
`except ValueError:` is better. It lets the reader know what type of exception might occur, giving context.
It is also for your benefit. What if you accidentally wrote `... int(cmp[2]) ...` inside a `try: ... except: pass`? You would get one trillion `NameError` exceptions raised and suppressed. After a long, long time, the program would complete and write an empty result file, leaving you with no clue what went wrong. Narrowing the type of exception caught to a `ValueError` would immediately reveal the issue.
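A tiny sketch of the difference, using the misspelled `cmp` as the hypothetical typo:

```python
def bare_except(row):
    try:
        return int(cmp[2])      # typo: 'cmp' is undefined -> NameError
    except:                     # bare except silently swallows the NameError
        return None

def narrow_except(row):
    try:
        return int(cmp[2])      # same typo ...
    except ValueError:          # ... but only ValueError is caught,
        return None             # so the NameError propagates immediately
```

Calling `bare_except` quietly returns `None` a trillion times; calling `narrow_except` raises `NameError` on the very first row.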
It would be much better to actually handle the error instead of suppressing it. Print a message explaining what the problem was: which row failed to convert properly. If there are known valid reasons why the conversion might fail, check that the failure was expected, and report it if the failure was not of the expected variety.
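One possible shape for that, assuming a header line is the only expected failure (the `load_products` helper and its header check are illustrative, not from the original code):

```python
import csv

def load_products(lines):
    """Parse ';'-separated rows, reporting any unexpected bad row."""
    products = []
    for lineno, row in enumerate(csv.reader(lines, delimiter=';'), start=1):
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
        except ValueError:
            if row and row[0] == 'product':   # expected: the header line
                continue
            print(f"Skipping malformed row {lineno}: {row!r}")
            continue
        products.append((product, pos_min, pos_max, count_pos))
    return products
```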
Giving names to your data structure elements can help readability. What is `comp[3]`? It is very difficult to tell from the code.
```python
...

from typing import NamedTuple

class Product(NamedTuple):
    product: str
    pos_min: int
    pos_max: int
    count_pos: int

...

with open(r'product.csv', 'r') as file1:
    my_reader = csv.reader(file1, delimiter=';')
    products = []
    for row in my_reader:
        try:
            product = row[0]
            pos_min, pos_max, count_pos = map(int, row[1:])
            products.append(Product(product, pos_min, pos_max, count_pos))
        except ValueError:
            pass

for row in products:
    ...
    for comp in products:
        if row.pos_min >= comp.pos_min and row.pos_max <= comp.pos_max \
                and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))
    ...
```
This is immediately more readable. We’re not comparing some arbitrary thing at indices 1, 2, and 0; we’re comparing named things.
Speaking of names …
I've renamed `list_csv` to `products`. It was read from a CSV file, but the elements are no longer comma-separated values; the commas are long gone. It's just a list of products.
`row` and `comp` could use better names too.
I’ve also replaced:

```python
for k in range(len(list_csv)):
    comp = list_csv[k]
```

with a loop directly over the values of the list. There is no need for the `k` index variable.
With the above changes, your algorithm is still the same, which means it is still $O(N^2)$.
Can we improve it? Sorting is $O(N \log N)$, which is better than $O(N^2)$, so it is an easy thing to try. What can we sort on? Would sorting on `pos_min` help us?

Yes, it would, but we need to be careful.
With products sorted on `pos_min`, `row.pos_min >= comp.pos_min` would be true for all `comp` items occurring earlier than `row` in the `products` list. We might be tempted to loop `for row_idx in range(len(products))` and `for comp_idx in range(row_idx)`, but we'd introduce a bug:
Several items sorted after `row` might still have a `pos_min` equal to `row.pos_min`, so we can't just loop up to the current row. But we can stop when we reach the first `pos_min` value larger than `row.pos_min`:

```python
products.sort(key=...)

for row in products:
    limit = row.pos_min
    for comp in products:
        if comp.pos_min > limit:
            break
        if row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))
    ...
```
One trillion row-pairings just dropped to one-half trillion row-pairings. It is still $O(N^2)$, but we cut the time in half.
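A runnable sketch of the sorted early-exit loop, using a hypothetical handful of products and assuming `res` is a `defaultdict(list)`:

```python
from collections import defaultdict
from typing import NamedTuple

class Product(NamedTuple):
    product: str
    pos_min: int
    pos_max: int
    count_pos: int

# Hypothetical sample data.
products = [Product('A', 1, 10, 5), Product('B', 2, 8, 3),
            Product('C', 2, 9, 7), Product('D', 5, 6, 1)]

products.sort(key=lambda p: p.pos_min)

res = defaultdict(list)
for row in products:
    limit = row.pos_min
    for comp in products:
        if comp.pos_min > limit:
            break                 # every later comp has a larger pos_min
        if row.pos_max <= comp.pos_max and row.product != comp.product:
            res[row.product].append((comp.product, comp.count_pos))
```

For example, `D` (range 5..6) is contained in the ranges of `A`, `B`, and `C`, while `A` (range 1..10) is contained in none.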