Silex Knowledge Pool

Tips for quantizing AI models to INT8 for the EP-200Q

AI model quantization tips

AI models often concatenate outputs from multiple nodes. Combining values with different scales is fine as long as they are kept in their original format (32-bit floating point in an FP32 model). However, it can cause problems when the model is quantized.

Here is how such models can be quantized properly.

First step: Know your model.

If you created the model to be quantized, you probably already know how it is structured. However, if you are starting from a publicly available model and do not know its details, you first need to understand how the model is structured. In that case, Netron is a useful tool to visualize your model, although not all model formats are supported. Here is a screenshot of part of a model displayed in Netron.
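
If you prefer to launch it from a script, Netron also ships as a Python package. A minimal sketch (the file name "model.onnx" is a placeholder for your own model):

```python
# Minimal sketch: render a model graph in the browser with Netron.
# Install first with: pip install netron
import netron

# Starts a local web server and opens the graph of the given file;
# "model.onnx" is a placeholder for your own model file.
netron.start("model.onnx")
```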

Second step: Understand the output tensor shape and the values it contains.

As an example, some object detection and pose estimation models concatenate XY coordinate values and score values into a single output. As long as these values are handled as floating-point values, the post-processing application can interpret them correctly. However, when such a model is quantized by a tool, problems can arise: the tool treats all the values equally because it does not know which ones are XY coordinates and which ones are scores. The coordinate values scale with the input image size, whereas the score values range from 0 to 1. If they are mixed together and quantized with the same parameters, the score values will not be quantized properly. Say the coordinates range from 0 to 640; that range is mapped to 0 to 255. A score value between 0 and 1 quantized with that same scale then always rounds to 0, so the score information is lost.
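
To make the effect concrete, here is a small sketch with made-up numbers (NumPy, simple asymmetric uint8 quantization). Scores collapse to 0 when they share one scale with 0-640 coordinates, but keep their resolution when quantized with their own scale:

```python
import numpy as np

def quantize_uint8(x, x_min, x_max):
    """Affine-quantize x to uint8 using the range [x_min, x_max]."""
    scale = (x_max - x_min) / 255.0
    q = np.round((x - x_min) / scale)
    return np.clip(q, 0, 255).astype(np.uint8)

# Made-up example: XY coordinates in [0, 640] concatenated with scores in [0, 1]
coords = np.array([12.0, 327.5, 639.0])
scores = np.array([0.03, 0.56, 0.97])
concat = np.concatenate([coords, scores])

# Quantized together: the 0-640 range dominates, every score rounds to 0
q_all = quantize_uint8(concat, 0.0, 640.0)
print(q_all[3:])                          # -> [0 0 0]  (score information lost)

# Quantized separately: scores keep their resolution
print(quantize_uint8(scores, 0.0, 1.0))   # -> [  8 143 247]
```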

Third step: Separate concatenated nodes accordingly.

The simplest way to avoid this quantization problem is to modify the model so that each output value gets its own independent output node.
To make this modification, use a model editing tool compatible with your AI model format. For ONNX models, for example, a GUI tool called ONNX modifier is convenient.
Qualcomm also provides a tool to expose specific nodes as individual output nodes. Qualcomm's tool is built in when the development environment is set up with the Dockerfile provided by Silex.
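
For illustration only, here is a sketch of the same edit done programmatically with the onnx Python package. The tensor names ("xy", "score", "output"), the shapes, and the assumption that a single Concat node produces the final output are hypothetical; check the real names in Netron and adapt accordingly:

```python
import onnx
from onnx import helper, TensorProto

model = onnx.load("model.onnx")          # placeholder path
graph = model.graph

# Hypothetical names and shapes of the tensors feeding the final
# Concat node; look up the real ones in Netron first.
split_outputs = {
    "xy":    [1, 17, 2],
    "score": [1, 17, 1],
}

# Expose each tensor as its own independent graph output.
for name, shape in split_outputs.items():
    graph.output.append(
        helper.make_tensor_value_info(name, TensorProto.FLOAT, shape))

# Drop the original concatenated output and the node producing it.
old_name = "output"                      # assumed original output name
kept = [o for o in graph.output if o.name != old_name]
del graph.output[:]
graph.output.extend(kept)
for node in [n for n in graph.node if old_name in n.output]:
    graph.node.remove(node)

onnx.checker.check_model(model)
onnx.save(model, "model_split.onnx")
```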

Fourth step: Quantize the model and add cache information.

For the EP-200Q, quantization can be done through the Docker environment provided by Silex. Please refer to the "AI Application Migration Guide", which is available to customers under NDA.

As written in this article, the quantization process for the EP-200Q supports static quantization. Using calibration data that resembles the data expected during operation helps prevent accuracy degradation due to quantization. When an AI model is quantized, the minimum and maximum values of the model's inputs and outputs are computed from the calibration data, and the quantization parameters are determined so that this range can be represented. Therefore, if the expected data is unknown, include data in the calibration set that covers the minimum and maximum values of the input and output ranges.
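
As a simplified illustration of why the calibration range matters (plain asymmetric uint8 quantization; the actual tooling in the Silex Docker environment may differ in its details):

```python
import numpy as np

def calibration_params(batches):
    """Derive a uint8 scale/zero-point from the observed min/max."""
    t_min = min(float(b.min()) for b in batches)
    t_max = max(float(b.max()) for b in batches)
    scale = (t_max - t_min) / 255.0
    zero_point = int(np.clip(round(-t_min / scale), 0, 255))
    return scale, zero_point

# If the calibration data only covers [0, 0.5] but real inputs reach 1.0,
# everything above 0.5 saturates at 255 -- hence the advice to cover the
# full expected input/output range with the calibration data.
batches = [np.random.uniform(0.0, 0.5, size=(8, 3)) for _ in range(4)]
scale, zp = calibration_params(batches)
quantized = np.clip(round(1.0 / scale) + zp, 0, 255)
print(f"scale={scale:.5f}, zero_point={zp}, q(1.0)={quantized}")
```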

Tips for the calibration data

Example of calibration data for a pose estimation model used in a patient room:

A dataset containing various patient poses in different positions in the room, such as standing, walking, and lying down.