After a convolution, a layer takes the biggest number from each small region of the feature map. What two main things does this kind of pooling help the network do?
This kind of pooling is called max pooling, and it helps the network in two main ways.

First, it performs dimensionality reduction. After a convolution layer processes input data and generates feature maps (grids highlighting detected features), max pooling reduces the spatial size of these maps by keeping only the most prominent activation from each small, non-overlapping region. For example, a 2x2 area of values in a feature map is replaced by a single value: the maximum of that area. Because the maps shrink, subsequent layers have fewer inputs to process, which reduces computation and, for any fully connected layers that follow, the number of parameters (the adjustable weights and biases the network must learn). Fewer parameters also help reduce overfitting, a phenomenon in which the network learns the training data too specifically and performs poorly on new, unseen data.

Second, max pooling introduces a degree of translational invariance. Because only the maximum value of a local region is kept, the exact position of a detected feature within that region becomes less critical. If a feature, such as an edge or a corner, shifts slightly within the pooling window, the maximum value indicating its presence often stays the same or very similar. This makes the network more robust to minor shifts and small local distortions in the input, allowing it to recognize a feature regardless of its precise location within a small area. This tolerance to small shifts helps the network generalize to new, varied examples.
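A minimal NumPy sketch can make both effects concrete. The function name max_pool_2x2 and the example values are illustrative, not taken from any particular library:

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 over a 2D feature map.

    Assumes the height and width are both even.
    """
    h, w = feature_map.shape
    # Split the map into non-overlapping 2x2 windows, then keep
    # only the largest value in each window.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Dimensionality reduction: a 4x4 map shrinks to 2x2, so later
# layers see a quarter as many values.
fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
print(max_pool_2x2(fm))
# [[4 2]
#  [2 8]]

# Translation tolerance: shifting a strong activation by one pixel,
# as long as it stays inside the same 2x2 window, leaves the pooled
# output unchanged.
a = np.zeros((4, 4)); a[0, 0] = 9.0   # feature in the top-left corner
b = np.zeros((4, 4)); b[1, 1] = 9.0   # same feature, shifted by one pixel
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

Note that the second property only holds for shifts that stay within a pooling window; larger shifts change which window the feature falls into, which is why max pooling gives tolerance to small displacements rather than full translation invariance.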