
Why does using BF16 instead of FP16 for mixed precision training prevent numerical overflow issues during the accumulation of gradients?



Floating-point numbers are represented using three components: a sign bit, an exponent (which determines the range), and a mantissa (which determines the precision). FP16 uses 5 bits for the exponent and 10 bits for the mantissa, while BF16 uses 8 bits for the exponent and 7 bits for the mantissa. The overflow issue in deep learning occurs because the gradients during backpropagation often involv....
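The range difference implied by those bit counts can be checked with a minimal sketch. It assumes the usual IEEE-style layout (biased exponent, implicit leading 1, all-ones exponent reserved for infinity/NaN). The helper names `max_finite`, `fits_fp16`, and `to_bf16_truncated` are hypothetical, and the BF16 conversion simply truncates a float32 to its top 16 bits, whereas real hardware typically rounds instead.

```python
import struct


def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value for a format with the given field widths.

    Assumes an IEEE-style layout: bias = 2**(exp_bits - 1) - 1, and the
    all-ones exponent is reserved for infinity/NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** bias


def fits_fp16(x: float) -> bool:
    """True if x can be stored as a finite FP16 value.

    Uses the struct module's 'e' (half-precision) format, which raises
    OverflowError when the value is too large.
    """
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bf16_truncated(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the float32 encoding.

    Simplification (an assumption of this sketch): truncation rather than
    round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


fp16_max = max_finite(exp_bits=5, man_bits=10)  # 65504.0
bf16_max = max_finite(exp_bits=8, man_bits=7)   # roughly 3.39e38

print("FP16 max finite:", fp16_max)
print("BF16 max finite:", bf16_max)

# A value just above the FP16 range: it overflows in FP16 but stays
# finite (with coarser precision) in the emulated BF16.
value = 70000.0
print("fits in FP16:", fits_fp16(value))
print("as truncated BF16:", to_bf16_truncated(value))
```

The last two lines illustrate the trade-off: a value of 70000 cannot be held in FP16 at all, while the emulated BF16 keeps it finite at the cost of a few hundred units of precision.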



