
During the Transformer forward pass, why does the Pre-LN configuration typically result in more stable gradient flow during the initial stages of training compared to the Post-LN configuration?



In a Transformer, a residual connection adds a sublayer's input to its output, written as x + Sublayer(x). Layer Normalization (LN) standardizes the mean and variance of each input vector across its features, which keeps activations within a stable numerical range. In the Post-LN configuration, normalization is applied after the residual addition, giving LN(x + Sublayer(x)), so the summed output is normalized before it is passed to the next re....
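The two placements of normalization can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward sublayer, and `layer_norm` omits the learned scale and shift that real implementations usually include. The Pre-LN form shown, x + Sublayer(LN(x)), is the standard definition and is assumed here to be what the question refers to.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [0.5 * v + 1.0 for v in x]

def post_ln_block(x, sublayer):
    # Post-LN: LN(x + Sublayer(x)). The residual sum itself is normalized,
    # so every signal reaching the next block has passed through LN.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)

def pre_ln_block(x, sublayer):
    # Pre-LN: x + Sublayer(LN(x)). Only the sublayer's input is normalized;
    # the skip path carries x forward unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the Pre-LN block, `pre_out` minus `x` is exactly the sublayer's contribution, so an identity path runs from input to output. In the Post-LN block, the output is always re-standardized, so no such untouched path exists.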



