The transition from cloud AI inference to on-device AI inference is moving rapidly. On-device AI offers several key advantages over cloud AI, but there are also trade-offs to consider.
Pros:
- Lower latency, since inference runs locally instead of making a round trip to the cloud
- Inference keeps working even without a network connection
- Data can stay on the device, which helps with privacy and reduces bandwidth costs

Cons:
- Edge hardware resources (compute, memory, storage, power) are limited compared with a data center
- Models often need to be quantized or converted, which can cost accuracy
- Keeping deployed models up to date requires physical access or an over-the-air (OTA) update mechanism
Unlike data centers, which host AI services for a wide range of needs such as generative AI or agentic AI and have massive GPU, TPU, memory, storage, and cooling resources, edge devices have limited hardware. It is therefore important to define what you want to achieve with on-device AI, so that you can design an edge device with the optimal components to satisfy both the performance requirements and the cost target.
| Type of model | Slow response | Fast response |
| --- | --- | --- |
| Low accuracy | An AI model with fewer parameters; a quantized model can be used. | An AI model with fewer parameters; a quantized model can be used. |
| High accuracy | An AI model with more parameters. In general, FP32 models achieve better accuracy than quantized formats such as INT8, INT16, or FP16. | An AI model with more parameters. Depending on the available hardware resources, a quantized model may be used, with some compromise in accuracy. |
| System requirement | Slow response | Fast response |
| --- | --- | --- |
| Low accuracy | A lower-end application processor, or even a microcontroller with an embedded NPU, can be used. The memory requirement is relatively low. | A higher-end processor with an NPU is recommended. The memory requirement is relatively low. |
| High accuracy | A processor that can handle FP32 models could be the best fit. The memory requirement is high. | A powerful processor with a GPU or a high-performance NPU is required. The memory requirement is high. |
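The memory requirements in these tables can be sanity-checked with a simple rule of thumb: weight storage is roughly the parameter count multiplied by the bytes per parameter for the chosen precision, with activations and runtime buffers coming on top. Below is a minimal sketch using an illustrative 1-billion-parameter model; the numbers are for intuition only, not a sizing guide for any specific model.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Rough lower bound on weight storage; activations and runtime buffers add more."""
    return num_params * bytes_per_param / 1e9

# Illustrative example: the same 1-billion-parameter model at different precisions.
for precision, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{precision}: ~{weight_memory_gb(1e9, nbytes):.0f} GB of weights")
```

This is why quantization matters on the edge: moving from FP32 to INT8 cuts weight storage by roughly 4x, which can make the difference between a model fitting in device memory or not.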
When running AI inference on a device, some compromises may be necessary due to other factors such as thermal design, cost targets, and power consumption.
TOPS (Tera Operations Per Second): This metric indicates how many operations a processor can perform in one second. In general, a processor with a higher TOPS rating delivers better performance, but it is not the only parameter that matters. TOPS figures are generally measured with an INT8 model, and in some cases the reported value is for a sparse model while in others it is for a dense model, so figures from different vendors are not always directly comparable.
IPS (Inferences Per Second): For a given model, this metric is a more accurate representation of AI performance than TOPS. IPS often varies between processors even when they show similar TOPS numbers. Some vendors also report a metric such as IPS/TOPS, which represents how efficiently an AI engine handles inference.
IPW (Inferences Per Watt): When designing a device within a power constraint, IPW is a good metric for comparing options to see which solution provides the best performance within a given power budget.
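To make these metrics concrete, here is a minimal sketch comparing two hypothetical processors. All figures are invented for illustration; in practice, IPS and power draw come from benchmarking your own model on the actual hardware.

```python
# Hypothetical candidates: (name, rated TOPS, measured inferences/sec, measured watts)
candidates = [
    ("Processor A", 8.0, 420.0, 3.5),
    ("Processor B", 12.0, 450.0, 6.0),
]

for name, tops, ips, watts in candidates:
    ips_per_tops = ips / tops   # efficiency: how well the engine uses its raw compute
    ipw = ips / watts           # performance delivered within the power budget
    print(f"{name}: IPS={ips:.0f}, IPS/TOPS={ips_per_tops:.1f}, IPW={ipw:.1f}")
```

In this made-up example, Processor B has the higher TOPS rating but lower IPS/TOPS and IPW, which illustrates why a raw TOPS number alone can be misleading.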
It's important to note that models intended for on-device AI inference may need to be quantized or converted into a format suitable for the selected processor. Not all models can be quantized or converted; in that case, a new model needs to be created based on the dataset you have.
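As a minimal sketch of what such a conversion can look like, the following uses PyTorch's dynamic quantization to store the weights of a toy FP32 model as INT8. The network here is a hypothetical stand-in; the right workflow depends on the toolchain of the target processor and may use a different framework entirely.

```python
import torch
import torch.nn as nn

# Toy FP32 network standing in for a real model (hypothetical architecture).
model_fp32 = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Dynamic quantization: Linear weights are stored as INT8 and activations are
# quantized on the fly at inference time. This shrinks the model and can speed
# up inference, at some cost in accuracy.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # which layer types to quantize
    dtype=torch.qint8,
)

# The quantized model keeps the same inference API as the FP32 original.
with torch.no_grad():
    out = model_int8(torch.randn(1, 128))
```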
To process AI inference on the device, the model must first be deployed to it. Next, it must be decided whether the model programmed onto the device will remain static or require periodic updates. In the latter case, it is necessary to determine how the model will be updated: either by physically accessing the device to load the new model, or through an over-the-air (OTA) update via the network.
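Below is a minimal sketch of what an OTA model update might look like, assuming a hypothetical HTTPS endpoint that publishes the latest model file alongside its SHA-256 checksum. A production updater would add versioning, authentication, and rollback on top of this.

```python
import hashlib
import os
import urllib.request

MODEL_URL = "https://updates.example.com/model/latest.tflite"  # hypothetical endpoint
SHA_URL = MODEL_URL + ".sha256"                                # hypothetical checksum file
MODEL_PATH = "/opt/app/model.tflite"                           # where the device loads from

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def ota_update() -> bool:
    """Download the published model and swap it in atomically if it verifies."""
    expected = fetch(SHA_URL).decode().strip()
    blob = fetch(MODEL_URL)
    if hashlib.sha256(blob).hexdigest() != expected:
        return False  # corrupted or tampered download; keep the current model
    tmp = MODEL_PATH + ".tmp"
    with open(tmp, "wb") as f:
        f.write(blob)
    os.replace(tmp, MODEL_PATH)  # atomic swap: a power loss never leaves a half-written model
    return True
```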
If the model needs to be updated periodically, network connectivity becomes one of the key features of your product, and if the device is wirelessly connected, reliable connectivity is especially important.
If you plan to continue training the model in the cloud using data collected by the device, a stable connection must be maintained at all times; otherwise, training may not proceed as expected because too little data reaches the cloud.
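One common way to tolerate intermittent connectivity in this scenario is to buffer collected samples locally and upload them whenever the link is available. The sketch below assumes a hypothetical ingest endpoint and an on-disk queue, so samples collected during an outage are not lost.

```python
import json
import os
import urllib.request

QUEUE_PATH = "/var/lib/app/pending_samples.jsonl"   # local on-disk buffer
INGEST_URL = "https://ingest.example.com/samples"   # hypothetical cloud endpoint

def enqueue(sample: dict) -> None:
    """Append a collected sample to the on-disk queue."""
    with open(QUEUE_PATH, "a") as f:
        f.write(json.dumps(sample) + "\n")

def flush() -> None:
    """Try to upload all queued samples; keep the rest if the network drops."""
    if not os.path.exists(QUEUE_PATH):
        return
    with open(QUEUE_PATH) as f:
        lines = f.readlines()
    sent = 0
    for line in lines:
        req = urllib.request.Request(
            INGEST_URL,
            data=line.encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=10)
            sent += 1
        except OSError:
            break  # connectivity lost; retry the remainder on the next flush
    with open(QUEUE_PATH, "w") as f:
        f.writelines(lines[sent:])  # keep only the samples not yet uploaded
```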
On-device AI inference is becoming increasingly popular, thanks to advances in hardware capabilities and the availability of smaller models with improved accuracy.
On-device AI inference offers several advantages over cloud-based AI inference, but its disadvantages must also be weighed in order to select the most suitable architecture for the application's requirements.
Silex has been working closely with key partners to provide guidance to customers exploring embedded system development with on-device AI.