Describe the fundamental difference between stochastic gradient descent (SGD) and the Adam optimizer.
The fundamental difference between Stochastic Gradient Descent (SGD) and the Adam optimizer lies in how they update a model's parameters during training.

SGD uses a single, global learning rate for all parameters. It computes the gradient of the loss with respect to the parameters on a single randomly selected data point (or a small mini-batch) and moves the parameters in the direction opposite to that gradient, scaled by the learning rate.

Adam, in contrast, adapts the effective step size for each parameter individually. It maintains an exponentially decaying average of past gradients (the first moment, which acts like momentum) and an exponentially decaying average of past squared gradients (the second moment). After correcting both averages for their bias toward zero at initialization, Adam updates each parameter by the first-moment estimate divided by the square root of the second-moment estimate (plus a small epsilon for numerical stability), scaled by the learning rate. As a result, parameters whose gradients have been small and consistent take relatively large steps, while parameters whose gradients have been large or volatile take smaller, more stable steps. This per-parameter adaptation often lets Adam converge faster and more reliably than plain SGD, especially on complex, non-convex loss landscapes.

In summary, SGD applies one global learning rate to every parameter, while Adam adjusts the effective step size of each parameter individually based on the history of its gradients.
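To make the contrast concrete, here is a minimal NumPy sketch of one update step for each optimizer. The function names (sgd_step, adam_step), the toy gradient values, and the loop are illustrative assumptions rather than code from any particular library; the hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used Adam defaults.

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """Plain SGD: one global learning rate scales every parameter's gradient."""
    return params - lr * grads

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes derived from gradient history."""
    m = beta1 * m + (1 - beta1) * grads        # first moment: decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grads**2     # second moment: decaying mean of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy usage: one parameter with consistently small gradients, one with large gradients.
params = np.array([1.0, 1.0])
m = np.zeros_like(params)
v = np.zeros_like(params)
for t in range(1, 4):
    grads = np.array([0.01, 10.0])             # hypothetical gradient values
    params, m, v = adam_step(params, grads, m, v, t)
print(params)
```

In this toy run, Adam moves both parameters by comparable amounts because each gradient is normalized by the square root of its own second-moment estimate, whereas a plain SGD step would move the second parameter roughly a thousand times farther than the first, in proportion to the raw gradient magnitudes.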