Neural network training involves floating-point numbers, which consist of a sign bit, an exponent, and a fraction (mantissa). The exponent determines the range of numbers a format can represent, while the mantissa determines the precision, i.e. the density of representable numbers. FP32 uses 32 bits: a sign bit, an 8-bit exponent, and a 23-bit mantissa. Standard FP16 uses 16 bits: a sign bit, a 5-bit exponent, and a 10-bit mantissa. Because FP16's 5-bit exponent covers a far narrower range (normal values span roughly 6.1e-5 to 65504), small values such as gradients can underflow to zero during training.
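To make the bit layouts concrete, here is a small sketch that decomposes FP32 and FP16 values into their sign, exponent, and mantissa fields using Python's standard `struct` module (the `fp32_fields`/`fp16_fields` helper names are ours, not from any library):

```python
import struct

def fp32_fields(x: float):
    """Decompose a float32 into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent, bias 127
    mantissa = bits & 0x7FFFFF       # 23-bit mantissa
    return sign, exponent, mantissa

def fp16_fields(x: float):
    """Decompose a float16 into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">H", struct.pack(">e", x))[0]
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F   # 5-bit exponent, bias 15
    mantissa = bits & 0x3FF          # 10-bit mantissa
    return sign, exponent, mantissa

print(fp32_fields(1.0))  # (0, 127, 0): stored exponent equals the bias
print(fp16_fields(1.0))  # (0, 15, 0)

# A gradient-sized value that FP32 represents fine underflows to zero
# when round-tripped through FP16:
print(struct.unpack(">e", struct.pack(">e", 1e-8))[0])  # 0.0
```

The round trip through `">e"` (half precision) shows the range problem directly: 1e-8 is far below FP16's smallest subnormal (about 6e-8), so it silently becomes 0.0.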