Strange behavior with Adam optimizer when training for too long
I'm trying to train a single perceptron (1000 input units, 1 output, no hidden layers) on 64 randomly generated data points. I'm using PyTorch with the Adam optimizer:

```python
import torch
from torch.autograd import Variable

torch.manual_seed(545345)
N, D_in, D_out = 64, 1000, 1

x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out))

model = torch.nn.Linear(D_in, D_out)
loss_fn = torch.nn.MSELoss(size_average=False)

optimizer = torch.optim.Adam(model.parameters())
for t in xrange(5000):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Initially, the loss quickly decreases, as expected:

```
0, 91.74887084960938
1, 76.85824584960938
2, 63.434078216552734
3, 51.46927261352539
4, 40.942893981933594
5, 31.819372177124023
```

Around 300 iterations, the error reaches near zero:

```
300, 2.1734419819452455e-12
301, 1.90354676465887e-12
302, 2.3347573874232808e-12
```

This goes on for a few thousand iterations. However, after training for too long, the error starts to increase again:

```
4997, 0.002102422062307596
4998, 0.0020302983466535807
4999, 0.0017039275262504816
```

Why is this happening?
This small instability at the end of convergence is a feature of Adam (and RMSProp), due to how they estimate mean gradient magnitudes over recent steps and divide by them.

Adam maintains exponentially decaying moving averages of recent gradients and of their squares. The square-gradient average is used (via its square root) to divide the gradient average when deciding the current step. However, when your gradient becomes and stays very close to zero, the squared-gradient average becomes so small that it either accumulates large rounding errors or is effectively zero, which can introduce instability. For instance, after a long run of near-zero gradients in one dimension, a relatively small jump in that gradient (say from 1e-10 to 1e-5, caused by changes in other parameters) will make the step size jump around before settling again.

This actually makes Adam less stable and worse for your problem than more basic gradient descent, assuming you want to get as numerically close to zero loss as floating-point calculations allow. In practice on deep learning problems you don't get this close to convergence, and with some regularisation techniques such as early stopping you don't want to anyway, so it is usually not a practical concern on the types of problem that Adam was designed for.

You can actually see this occurring for RMSProp in a comparison of different optimisers (RMSProp is the black line): watch the very last steps, just as it reaches the target.

You can make Adam more stable, and able to get closer to true convergence, by reducing the learning rate, e.g.

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```

It will take longer to optimise. With lr=1e-5 you need to train for 20,000+ iterations before you see the instability, and the instability is less dramatic, with values hovering around 1e-7.
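The mechanism above can be sketched for a single parameter. This is a minimal illustration of the Adam update rule (variable names follow the Adam paper; nothing here comes from the PyTorch API), showing how a long run of near-zero gradients collapses the squared-gradient average and inflates the next step:

```python
import math

# Minimal single-parameter sketch of the Adam update rule (illustrative,
# not from any library). m tracks the gradient, v tracks its square.
def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)  # the parameter update
    return step, m, v

# Simulate a gradient stuck near zero for a long time: v collapses to ~1e-20.
m, v = 0.0, 0.0
for t in range(1, 1001):
    step, m, v = adam_step(1e-10, m, v, t)

# A modest bump in the gradient (1e-10 -> 1e-5) now produces a step on the
# order of 1e-3, vastly larger than the lr * grad = 1e-8 step that plain
# SGD would take, because the denominator sqrt(v_hat) is still tiny.
jump, m, v = adam_step(1e-5, m, v, 1001)
print(jump)
```

This is the "step size jumping around" behaviour: the denominator adapts only slowly (at the beta2 rate) after the gradient changes, so the first few steps after the jump are disproportionately large.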
The reason is exactly as described in the other answer, which also makes the great suggestion of using a smaller learning rate to avoid this problem around small gradients. I can think of a couple of other approaches:

- You can clip the gradients with an upper/lower bound, but this does not guarantee convergence and may result in training freezing by getting trapped in a local minimum and never getting out of it.
- Train with a higher batch size, more epochs, and a decayed learning rate. I do not have any practical proof that increasing the batch size results in better gradients, but from what I have observed when facing problems similar to yours, doing so has almost always helped.

I am sure there are other methods, such as cyclical learning rates, that try to find an optimal learning rate based on statistics.
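As a sketch of how these suggestions could be combined in PyTorch for the setup in the question (the StepLR schedule and clipping threshold here are illustrative guesses, not tuned values):

```python
import torch

# Same problem as in the question, with gradient-norm clipping and a
# step-decayed learning rate added. Schedule and max_norm are illustrative.
torch.manual_seed(545345)
N, D_in, D_out = 64, 1000, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Linear(D_in, D_out)
loss_fn = torch.nn.MSELoss(reduction='sum')  # same as size_average=False

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 500 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)

for t in range(2000):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    # Bound the size of any single gradient step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()

print(loss.item())
```

Note that the decayed learning rate is doing most of the work here: late in training it plays the same role as the other answer's lr=1e-5, shrinking the steps just as the gradients approach zero.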