Platform and SoM Knowledge Pool

What you will need to know about model quantization.

Written by Satoru Kumashiro | Nov 24, 2025 5:59:59 PM

What is AI model quantization?

AI model quantization is a technique that intentionally reduces the precision of the numbers used by a neural network. Below are some pros and cons of quantizing AI models.

Pros
  • Reduced model size
  • Reduced computational resource and memory requirements
  • Reduced power consumption during AI inference
  • Faster inference

Cons
  • Reduced accuracy
  • An additional quantization step in the workflow
  • Knowledge required for quantized-model optimization
  • Not all models are suitable for quantization


How does it work conceptually?

In a neural network there are nodes (neurons), which typically have multiple inputs and one output. The output is called the activation. Each input has a weight that signifies its importance, and the activation function also takes a bias as a parameter. A neuron's output can be an input to other neurons.
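As a concrete illustration, here is a minimal sketch of a single neuron in Python (the input values, weights, and bias below are made-up numbers, and ReLU is just one common choice of activation function):

    import numpy as np

    def relu(z):
        # A common activation function: negative values become zero.
        return max(z, 0.0)

    x = np.array([0.5, -1.2, 0.3])   # inputs (activations of other neurons)
    w = np.array([0.8, 0.1, -0.4])   # one weight per input
    b = 0.2                          # bias

    # The neuron's output (its activation): f(w . x + b)
    activation = relu(np.dot(w, x) + b)
    print(activation)                # ~0.36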

When FP32 is used for an AI model, the weights, biases, and activations are represented by 32-bit floating-point values. Typically, the weights and activations are quantized, whereas the biases are sometimes kept at their original precision.

When the number of parameters is mentioned to describe the model size, it primarily means the number of weights and biases. So when an AI model has 30 million parameters, each parameter needs 4 bytes of memory in FP32, which means about 120 MB of memory can be required just to hold the parameters during inference. In general, arithmetic on floating-point values also takes more time than arithmetic on integer values.
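A quick back-of-the-envelope check of that arithmetic, reusing the 30-million-parameter example from above:

    # Approximate memory needed just to store the model parameters.
    num_params = 30_000_000          # example: a 30-million-parameter model

    fp32_bytes = num_params * 4      # 4 bytes per FP32 parameter
    int8_bytes = num_params * 1      # 1 byte per INT8 parameter

    print(f"FP32: {fp32_bytes / 1e6:.0f} MB")   # FP32: 120 MB
    print(f"INT8: {int8_bytes / 1e6:.0f} MB")   # INT8: 30 MB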

Quantization is a technique to convert higher-precision values (e.g., FP32) into lower-precision values (e.g., INT8). As a conceptual example, consider an original parameter that can be represented with 6 bits being quantized to a 2-bit parameter. The original parameter has more granularity; when it is quantized to the lower precision, that granularity is lost, but there are far fewer possible values to store and compute with.
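Here is a minimal sketch of that idea in Python, using simple symmetric (scale-only) quantization; real toolchains use more elaborate schemes, so treat this only as an illustration of the concept:

    import numpy as np

    def quantize(x, num_bits=8):
        # Symmetric quantization: map floats onto signed num_bits integers.
        qmax = 2 ** (num_bits - 1) - 1
        scale = np.abs(x).max() / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        return q.astype(np.int64), scale

    def dequantize(q, scale):
        # Map the integers back to (approximate) float values.
        return q * scale

    # Made-up weights. At 2 bits there are only four representable
    # values (-2, -1, 0, 1), so most of the granularity is lost.
    w = np.array([-1.0, -0.4, -0.1, 0.2, 0.5, 1.0])
    q, scale = quantize(w, num_bits=2)
    print(q)                         # [-1  0  0  0  0  1]
    print(dequantize(q, scale))      # coarse approximation of w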

Types of quantization

Static quantization
  Description: Both weights and activations are quantized statically, before inference.
  Pros: Fast inference; less accuracy degradation than the dynamic method.
  Cons: Requires calibration data.

Dynamic quantization
  Description: The weights are quantized statically, while the activations are quantized dynamically during inference.
  Pros: No calibration data required; suitable for models for which calibration is difficult to perform.
  Cons: Inference time may not improve as much as desired.

Quantization-aware training
  Description: Quantization is integrated into the training process.
  Pros: Best inference accuracy of the three approaches.
  Cons: The most complex approach to get started with.
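To make the difference more concrete, here is a minimal sketch of dynamic quantization using PyTorch's built-in API. This is purely illustrative; it is not the Qualcomm SDK workflow used for the EP-200Q, and the model below is a made-up example rather than a trained network:

    import torch

    # A small made-up model; in practice this would be a trained network.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    )

    # Dynamic quantization: Linear-layer weights are converted to INT8 now,
    # while activations are quantized on the fly during inference.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 128)
    print(quantized(x).shape)        # torch.Size([1, 10])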


Which quantization is supported for the EP-200Q?

Static quantization is supported by the development kit from Silex. Our development kit for AI application development is based on Qualcomm's SDK.

The required calibration data can be a subset of the training data set that was used to generate the original AI model.
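For example, a calibration set might simply be a small random sample of the training data, as in the generic sketch below (the array shapes and sample count are made-up; the input format actually expected by the SDK is not covered here):

    import numpy as np

    # Made-up stand-in for the original training images.
    train_images = np.random.rand(1_000, 32, 32, 3).astype(np.float32)

    # Draw a few hundred representative samples for calibration.
    rng = np.random.default_rng(seed=0)
    idx = rng.choice(len(train_images), size=200, replace=False)
    calibration_set = train_images[idx]
    print(calibration_set.shape)     # (200, 32, 32, 3)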

Contact us to learn more.
