An optimizer helps a neural network learn by moving its weights step-by-step. What key information does an optimizer like Adam use from the loss function to know which way to move?
An optimizer like Adam primarily uses the gradient of the loss function with respect to the network's weights to determine which way to move. The loss function is a mathematical measure of how well the neural network is performing, quantifying the discrepancy between its predictions and the actual target values. The gradient is a vector that points in the direction of the steepest increase of this loss function. To minimize the loss, the optimizer moves the weights (the adjustable parameters within the network) in the opposite direction of the gradient, taking a step downhill on the loss landscape. The negative gradient therefore gives the optimizer the locally steepest direction for reducing the error.
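For concreteness, here is a minimal sketch of that gradient-following idea in plain NumPy, using a made-up quadratic loss, an arbitrary target vector, and an illustrative learning rate (none of these names or values come from the question itself):

```python
import numpy as np

# Illustrative quadratic loss: L(w) = mean((w - target)^2), with made-up targets.
target = np.array([1.0, -2.0, 0.5])

def loss(w):
    return np.mean((w - target) ** 2)

def grad(w):
    # Analytic gradient of the mean squared error above.
    return 2.0 * (w - target) / w.size

w = np.zeros(3)        # the "weights", starting at zero
learning_rate = 0.1    # illustrative step size

for _ in range(100):
    g = grad(w)                    # direction of steepest increase of the loss
    w = w - learning_rate * g      # step the opposite way: downhill

print(loss(w))   # the loss shrinks as w approaches the target
```

Plain gradient descent like this uses only the current gradient; everything that follows is about how Adam enriches that signal.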
Beyond the raw gradient, Adam refines each update by incorporating information from past gradients. It maintains two exponentially weighted averages, known as moment estimates, to guide its steps.
The first key piece of information Adam uses is an exponentially weighted average of past gradients, referred to as the first moment estimate. This acts like momentum, giving inertia to the weight updates. It helps the optimizer maintain a consistent direction of movement, smoothing out erratic gradient fluctuations and accelerating convergence along directions where successive gradients agree, which can also help it push through shallow local minima in the loss landscape.
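As a small sketch of how that first moment behaves, the update below uses the commonly cited default decay rate β1 = 0.9 and a made-up sequence of noisy scalar gradients:

```python
beta1 = 0.9   # commonly used default decay rate for the first moment
m = 0.0       # first moment estimate, initialized at zero

noisy_gradients = [1.0, 0.7, 1.3, -0.2, 1.1, 0.9]   # made-up gradient readings
for g in noisy_gradients:
    m = beta1 * m + (1.0 - beta1) * g   # exponentially weighted average of past gradients
    print(round(m, 3))                  # changes smoothly despite the noisy inputs
```

Even when an individual gradient flips sign (the -0.2 above), the running average keeps pointing in the prevailing direction, which is exactly the momentum effect described here.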
The second key piece of information Adam uses is an exponentially weighted average of past squared gradients, referred to as the second moment estimate. This is crucial for providing an adaptive learning rate for each individual weight. By observing the average magnitude of past gradients (through their squares), Adam can adjust the step size for each parameter: it takes larger steps for weights that have historically had small or infrequent gradients, and smaller steps for weights that have consistently experienced large gradients. This allows for more nuanced and efficient updates tailored to each specific weight, preventing oscillations in steep areas and accelerating progress in flat areas.
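A matching sketch for the second moment, again with the commonly cited default β2 = 0.999 and made-up gradients, shows how consistently large gradients build up a large estimate and therefore shrink the effective step size:

```python
beta2 = 0.999   # commonly used default decay rate for the second moment
v = 0.0         # second moment estimate, initialized at zero

gradients = [0.5, -0.6, 0.4, -0.5]   # made-up gradients for a single weight
for g in gradients:
    v = beta2 * v + (1.0 - beta2) * g ** 2   # exponentially weighted average of squared gradients

# The per-weight step is scaled by roughly 1 / sqrt(v): a weight with
# consistently large gradients accumulates a larger v and thus takes smaller steps.
print(v)
```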
Both the first and second moment estimates are initialized at zero, which biases them toward zero during the first few training steps, when only a handful of gradients have been averaged. Adam therefore applies a bias correction, dividing each estimate by a factor that depends on the step count, so that they better represent the true moments from the outset.
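A one-step numerical sketch (with the usual defaults β1 = 0.9, β2 = 0.999) shows why this correction matters: after the very first gradient, the raw estimates are far smaller than the gradient itself, and dividing by (1 − β^t) restores the scale:

```python
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0        # both moment estimates start at zero
g = 1.0                # suppose the very first gradient is 1.0

m = beta1 * m + (1 - beta1) * g        # 0.1   -- strongly biased toward zero
v = beta2 * v + (1 - beta2) * g ** 2   # 0.001 -- even more so

t = 1                                  # step counter, starting at 1
m_hat = m / (1 - beta1 ** t)           # ~1.0  -- corrected back to the gradient's scale
v_hat = v / (1 - beta2 ** t)           # ~1.0
print(m_hat, v_hat)
```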
Finally, Adam combines these bias-corrected first and second moment estimates into a single update. The first moment estimate dictates the direction of the step, much as momentum guides motion, while dividing by the square root of the second moment estimate (plus a tiny constant for numerical stability) adaptively scales the size of that step for each individual parameter, resulting in a precise and efficient update to the weights.
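Putting the pieces together, a minimal self-contained Adam step might look like the sketch below. The function name, the quadratic toy loss, and the enlarged learning rate are illustrative choices rather than anything from the question; β1 = 0.9, β2 = 0.999, and ε = 1e-8 are the values commonly used with Adam:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g              # first moment: smoothed direction
    v = beta2 * v + (1 - beta2) * g ** 2         # second moment: per-weight gradient scale
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # direction from m_hat, scale from sqrt(v_hat)
    return w, m, v

# Toy usage on the quadratic loss from the first sketch (learning rate raised for speed).
target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
for t in range(1, 501):
    g = 2.0 * (w - target) / w.size   # gradient of mean((w - target)^2)
    w, m, v = adam_step(w, g, m, v, t, lr=0.1)
print(w)   # each component ends up close to its target value
```

The key line is the last one inside `adam_step`: the bias-corrected first moment supplies the direction, and the square root of the bias-corrected second moment rescales that direction separately for every weight.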