In the Mask R-CNN paper, the optimizer is described as follows for training on the MS COCO 2014/2015 dataset for instance segmentation (I believe this is the dataset; correct me if I'm wrong):

> We train on 8 GPUs (so effective minibatch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt (45), we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.

I’m trying to write an optimizer and learning rate scheduler in PyTorch for a similar application, to match this description.

For the optimizer I have:

```
import torch

def get_Mask_RCNN_Optimizer(model, learning_rate=0.02):
    # SGD with the momentum and weight decay quoted from the paper
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=0.0001)
    return optimizer
```

For the learning rate scheduler I have:

```
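def get_MASK_RCNN_LR_Scheduler(optimizer, step_size):
    # gamma=0.1 multiplies the learning rate by 0.1 every step_size calls to scheduler.step()
    # (verbose is deprecated in recent PyTorch releases; scheduler.get_last_lr() shows the current rate)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=0.1)
    return scheduler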
```
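One thing I'm unsure about with `StepLR` is that it keeps decaying every `step_size` steps, whereas the paper mentions a single drop at the 120k iteration. Below is a minimal sketch of an alternative using `MultiStepLR`, assuming the drop is a division by 10 and that `scheduler.step()` is called once per training iteration rather than once per epoch. The function name and the `milestone` parameter are my own, not from the paper.

```python
import torch

def get_mask_rcnn_multistep_scheduler(optimizer, milestone=120_000):
    # Hypothetical helper: a single decay at `milestone` scheduler steps.
    # Assumes "decreased by 10" means the learning rate is multiplied by 0.1.
    return torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[milestone], gamma=0.1
    )
```

In the training loop I would then call `optimizer.step()` followed by `scheduler.step()` after every minibatch, for 160k iterations in total.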

When the authors say “decreased by 10”, do they mean divided by 10? Or do they literally mean subtracting 10, in which case the learning rate would be negative, which seems odd/wrong? Any insights appreciated.
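To make the two readings concrete, here is a quick check in plain Python. It only compares the arithmetic of each interpretation with the numbers from the quote; the function names are hypothetical and it does not claim to reproduce the authors' actual schedule.

```python
def lr_divided(iteration, base_lr=0.02, drop_at=120_000, factor=10.0):
    """Reading 1: the rate is divided by `factor` from iteration `drop_at` on."""
    return base_lr / factor if iteration >= drop_at else base_lr

def lr_subtracted(iteration, base_lr=0.02, drop_at=120_000, amount=10.0):
    """Reading 2: `amount` is literally subtracted from iteration `drop_at` on."""
    return base_lr - amount if iteration >= drop_at else base_lr

for it in (0, 119_999, 120_000, 159_999):
    # Reading 2 goes negative after the drop, which is what makes it look wrong to me
    print(it, lr_divided(it), lr_subtracted(it))
```

The first reading corresponds to `gamma=0.1` in the scheduler above.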