performance – Specialized Normalization Running Slow in Pytorch


I have a tensor of shape z = (38, 38, 7, 7, 21) = (x_pos, y_pos, grid_i, grid_j, class_num), and I wish to normalize it according to the formula:
enter image description here

I have produced a working example of what I mean here, and the problem is that it is extremely slow, approximately 2-3 seconds for each grid entry (of which there are 49, so 49*3 seconds = 147 seconds, which is way too long, considering I need to do this with thousands of image feature maps).
Any optimizations or obvious problems very much appreciated. This is part of a Pytorch convolutional neural network architecture, so I am using torch tensors and tensor ops.

import torch
def normalizeScoreMap(score_map):
    for grid_i in range(7):
        for grid_j in range(7):
            for x in range(38):
                for y in range(38):
                    grid_sum = torch.tensor(0.0).cuda()
                    for class_num in range(21):
                        grid_sum += torch.pow(score_map(x)(y)(grid_i)(grid_j)(class_num), 2)
                    grid_normalizer = torch.sqrt(grid_sum)
                    for class_num in range(21):
                        score_map(x)(y)(grid_i)(grid_j)(class_num) /= grid_normalizer
    return score_map

random_score_map = torch.rand(38,38,7,7,21).cuda()
score_map = normalizeScoreMap(random_score_map)

Edit: For reference I have an i9-9900K CPU and a nvidia 2080 GPU, so my hardware is quite good. I would be willing to try multi-threading but I am looking for more obvious problems/optimizations.