Question

When training a CNN, why does adding a Batch Normalization layer before the activation function help reduce the internal covariate shift?

Accepted Answer

Internal covariate shift refers to the change in the distribution of a layer's inputs caused by updates to the parameters of earlier layers during training. As the network learns, the weights in earlier layers change, so the activations fed into subsequent layers shift in range and distribution. Later layers must keep adapting to new input statistics, which slows training. Batch Normalization addresses this by normalizing a layer's pre-activation values to zero mean and unit variance, using statistics computed over each mini-batch (in a CNN, per channel across the batch and spatial dimensions), and then applying a learnable scale and shift. Placed before the activation function, this step means the mean and variance of the activation's inputs are set by those two learned parameters instead of moving with every weight change in the earlier layers. The activation function therefore receives values in a consistent range, which keeps them from drifting into saturated regions such as the flat tails of a sigmoid or tanh, where gradients are close to zero. As a result, gradients stay informative during backpropagation, which permits higher learning rates and faster convergence, because the network does not have to keep readjusting to shifting input distributions.
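The effect described above can be illustrated with a minimal sketch in plain Python. It assumes a single channel represented as a flat list of pre-activation values; in a real CNN the same statistics would be taken per channel over the batch and spatial dimensions. The function names, the sample values, and the "scale by 4, shift by 10" drift are hypothetical choices made only for illustration, not part of any library.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's pre-activations over a mini-batch,
    then apply the scale (gamma) and shift (beta)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid; close to zero in the flat tails.
    s = sigmoid(x)
    return s * (1.0 - s)


# Hypothetical pre-activations for one channel (assumed values).
before = [-1.0, 0.0, 0.5, 2.0]
# Assumed drift: an earlier layer's update scales and shifts its outputs.
after = [4.0 * v + 10.0 for v in before]

norm_before = batch_norm(before)
norm_after = batch_norm(after)

# Without normalization the drifted values sit in the sigmoid's flat tail,
# so the local gradients are tiny.
raw_grads = [sigmoid_grad(v) for v in after]
# With normalization before the activation, the gradients stay usable.
bn_grads = [sigmoid_grad(v) for v in norm_after]

print("normalized before drift:", [round(v, 3) for v in norm_before])
print("normalized after drift: ", [round(v, 3) for v in norm_after])
print("largest raw gradient:   ", max(raw_grads))
print("smallest BN gradient:   ", min(bn_grads))
```

In this toy setting the normalized values are nearly identical before and after the drift, which is the sense in which the activation function sees a stable input range. With the default `gamma=1.0` and `beta=0.0` the output has roughly zero mean and unit variance; during training these two parameters would be learned.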
