python calls to the Constant Time function that iterates through a Numpy array results in very slow code

I have the following code snippet, which basically does the following:
In the face of a 2d numpy array arr, to calculate sum_arr as follows:
$$ sum _arr[i, j] = begin {cases}
arr[i, j] + min (sum _arr[i – 1, j-1:j+2]), & i> 0 \ arr[i, j], & i = 0 end {cases} $$

(reasonable indexes for j - 1: j + 2 of course everything within 0 and w)

Here is my implementation:

import numpy as np

h, w = 1000, 1000 # form of the 2d array
arr = np.arange (h * w) .reshape ((h, w))

sum_arr = arr.copy ()

def min_parent (i, j):
min_index = j
if j> 0:
if sum_arr[i - 1, j - 1] <sum_arr[i - 1, min_index]:
min_index = j - 1
if j <w - 1:
if sum_arr[i - 1, j + 1] <sum_arr[i - 1, min_index]:
min_index = j + 1
return (i - 1, min_index)


for i, j in np.ndindex ((h-1, w)):
sum_arr[i + 1, j] + = sum_arr[min_parent(i + 1, j)]

And here's the problem: This code snippet takes way too long to run only for 1e6 operations (on average about 5 seconds on my computer).

What is a better way to do this?