I have the following code snippet, which basically does the following:

In the face of a 2d numpy array `arr`

, to calculate `sum_arr`

as follows:

$$ sum _arr[i, j] = begin {cases}

arr[i, j] + min (sum _arr[i – 1, j-1:j+2]), & i> 0 \ arr[i, j], & i = 0 end {cases} $$

(reasonable indexes for `j - 1: j + 2`

of course everything within `0`

and `w`

)

Here is my implementation:

```
import numpy as np
h, w = 1000, 1000 # form of the 2d array
arr = np.arange (h * w) .reshape ((h, w))
sum_arr = arr.copy ()
def min_parent (i, j):
min_index = j
if j> 0:
if sum_arr[i - 1, j - 1] <sum_arr[i - 1, min_index]:
min_index = j - 1
if j <w - 1:
if sum_arr[i - 1, j + 1] <sum_arr[i - 1, min_index]:
min_index = j + 1
return (i - 1, min_index)
for i, j in np.ndindex ((h-1, w)):
sum_arr[i + 1, j] + = sum_arr[min_parent(i + 1, j)]
```

And here's the problem: This code snippet takes way too long to run only for 1e6 operations (on average about 5 seconds on my computer).

What is a better way to do this?