Model quantization reduces the bit-width of the numerical values that represent a neural network's parameters, typically moving from a high-precision format such as 32-bit floating point to a lower-precision format such as 8-bit integers. A neural network's inference speed is often bottlenecked by hardware memory bandwidth, which is the rate at which data can be transferred fr....
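
To make the bit-width reduction concrete, here is a minimal sketch in plain Python. It assumes symmetric per-tensor quantization, which is only one of several possible schemes and is not something the text above specifies. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen for illustration.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumption: a single scale for the whole tensor, with no zero point.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto 127; use 1.0 if every value is zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights stored as 32-bit floats (4 bytes each).
weights = array("f", [0.5, -1.0, 0.25, 0.0])
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = q_weights.itemsize * len(q_weights)
print(f"fp32: {fp32_bytes} bytes, int8: {int8_bytes} bytes")
print("restored:", restored)
```

Each stored value shrinks from four bytes to one, so roughly a quarter of the data has to cross the memory bus for the same number of parameters. The cost is a rounding error of at most half a quantization step per value in this scheme.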