Platform and SoM Knowledge Pool

Which architecture is best for your edge AI device?

Written by Satoru Kumashiro | Nov 24, 2025 5:59:59 PM

Several options are available to enable an edge AI device

As mentioned in our previous article (Why On-Device AI Is the Future of Inference), AI inference on edge devices is becoming increasingly popular thanks to the variety of products now available from many vendors.

On the other hand, with so many options available, it can be harder to determine the product architecture and components that meet your requirements. For example:

  • AI inference on a SoC with an integrated NPU/accelerator
  • AI inference on a SoC with an integrated GPU/accelerator
  • AI inference on a discrete AI accelerator
  • AI inference on a discrete GPU

The goal of this article is to help developers decide which of these architectures best fits their product.

1. AI inference on a SoC with an integrated NPU/accelerator

The architecture described below is primarily intended for compact embedded devices that need to be cost-effective.

  • In general, an integrated NPU runs int8 or int16 quantized models; some advanced SoCs also accept FP16 models for better accuracy (see the quantization sketch after this list).
  • A lightweight AI model should be considered, as the memory available in an embedded device may be limited.
  • This is the natural choice when developing a compact form-factor embedded device, as the SoC already contains the functional blocks necessary to run AI inference.
  • Even some advanced microcontrollers are starting to integrate NPUs, so there is a wide range of choices depending on the accuracy and performance requirements.
  • This architecture is the primary choice when AI inference is handled on each device.
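A typical workflow for targeting an int8-only NPU is post-training quantization. The following is a minimal sketch using TensorFlow Lite full-integer quantization; the Keras model, the random calibration data, and the output file name are placeholders, and most NPU vendors then pass the resulting .tflite file through their own compiler or delegate.

```python
# Minimal sketch: post-training full-integer (int8) quantization with
# TensorFlow Lite, producing a model that an int8-only NPU can consume.
# The Keras model and random calibration data below are placeholders.
import tensorflow as tf

def representative_dataset():
    # Yield a handful of calibration samples shaped like the real input.
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3))]

model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 ops so the whole graph can run on the NPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```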

2. AI inference on a SoC with an integrated GPU/accelerator

When your application requires high-accuracy AI inference but still needs to fit a constrained form factor, a SoC with an embedded GPU and AI accelerator is a viable option.

  • The GPU and accelerator can handle FP32 models (see the sketch after this list).
  • There are multiple memory configurations available, ranging from 4 GB to 128 GB of RAM.
  • A lightweight AI model suits a system with less RAM, while a large AI model can be handled by a system with more RAM.
  • The power consumption of this architecture is generally higher than that of option #1 above.
  • This architecture is the primary choice when AI inference runs on an edge computer acting as a centralized unit for multiple devices.
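As a concrete illustration of FP32 inference on such a SoC, the sketch below uses ONNX Runtime and selects a GPU execution provider when one is available. The model path and the provider names are assumptions; the actual runtime (CUDA, ROCm, OpenVINO, or a vendor-specific stack) depends on the SoC.

```python
# Minimal sketch: running an FP32 ONNX model with ONNX Runtime, preferring a
# GPU execution provider when one is available. "model_fp32.onnx" and the
# provider names are placeholders that depend on the SoC vendor's runtime.
import numpy as np
import onnxruntime as ort

available = ort.get_available_providers()
preferred = [p for p in ("CUDAExecutionProvider", "ROCMExecutionProvider")
             if p in available]

session = ort.InferenceSession(
    "model_fp32.onnx",
    providers=preferred + ["CPUExecutionProvider"],  # CPU fallback
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy_input})
print("Output shape:", outputs[0].shape)
```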

3. AI inference on a discrete AI accelerator

  • This architecture can be a stepping stone for retrofitting existing hardware, or part of a development strategy that enables multiple products to share the same main SoC (see the sketch after this list).
  • A discrete AI accelerator typically plugs into a USB or PCI Express interface via an M.2 slot, a mini-PCIe slot, or another industry-standard connector.
  • Higher-end discrete AI accelerators can also support FP32 models, which typically results in higher accuracy.
  • As it is a discrete component used in conjunction with a main processor, it takes more space on the embedded device.
  • It allows developers to adopt a modular development strategy and, when a high-end accelerator is used, delivers higher AI processing capability than an NPU/accelerator integrated into a lower-performance SoC.
  • This architecture can be an alternative to option #1 above.
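To illustrate the modular idea, the sketch below keeps the application-facing inference call identical whether a discrete accelerator is plugged in or not, using the TensorFlow Lite delegate mechanism. The delegate library name libaccel_delegate.so is a hypothetical placeholder for whatever shared library the accelerator vendor ships.

```python
# Minimal sketch: one inference path that optionally offloads to a discrete
# accelerator via a TensorFlow Lite delegate and falls back to the CPU when
# no accelerator is present. "libaccel_delegate.so" is a hypothetical name
# for a vendor-supplied delegate library.
from typing import Optional

import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

def make_interpreter(model_path: str,
                     delegate_path: Optional[str] = None) -> Interpreter:
    if delegate_path:
        try:
            # Offload supported ops to the discrete accelerator.
            return Interpreter(model_path,
                               experimental_delegates=[load_delegate(delegate_path)])
        except (OSError, ValueError):
            pass  # Accelerator or delegate not available; use the CPU path.
    return Interpreter(model_path)

def infer(interpreter: Interpreter, frame: np.ndarray) -> np.ndarray:
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], frame.astype(inp["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

interpreter = make_interpreter("model_int8.tflite", "libaccel_delegate.so")
```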

4. AI inference on a discrete GPU

  • This architecture is primarily for large computing systems that require both high-performance graphics processing and high AI performance.
  • It can use soldered-down GPU(s) or a graphics board, depending on the system design.
  • GPU vendors offer SDKs optimized for AI processing.
  • This architecture offers more scalability, since performance can be enhanced through GPU parallel computing (see the sketch after this list).
  • This architecture can provide more than 1,000 TOPS with sparse int8 models.
  • This architecture is an option when the system needs to be based on the x86_64 architecture but also needs high AI performance.
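As a sketch of the scalability point, the following distributes inference batches round-robin across however many discrete GPUs PyTorch can see. The ResNet-50 model and batch shapes are placeholders; production deployments would more likely use a serving framework or a vendor SDK such as TensorRT.

```python
# Minimal sketch: round-robin data-parallel inference across multiple
# discrete GPUs with PyTorch. The model and batch sizes are placeholders.
import torch
import torchvision

num_gpus = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)] or [torch.device("cpu")]

# One model replica per GPU (or a single CPU replica if no GPU is found).
replicas = [torchvision.models.resnet50(weights=None).eval().to(d) for d in devices]

@torch.no_grad()
def infer(batches):
    outputs = []
    for i, batch in enumerate(batches):
        replica = replicas[i % len(replicas)]      # spread batches over GPUs
        device = next(replica.parameters()).device
        outputs.append(replica(batch.to(device)).cpu())
    return outputs

batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]
print([tuple(out.shape) for out in infer(batches)])
```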

Summary

| Architecture | Use Case | Model Support | Memory Range | Power Consumption | Form Factor | Scalability |
|---|---|---|---|---|---|---|
| 1. SoC with integrated NPU/accelerator | Compact embedded devices with cost efficiency | int8, int16; FP16 in some models | Limited (embedded memory) | Low | Very compact | Limited, device-level |
| 2. SoC with integrated GPU/accelerator | Edge computers requiring higher accuracy in a moderate form factor | FP32 and others | 4 GB–128 GB RAM | Moderate to high | Compact to mid-size | Moderate, centralized edge |
| 3. Discrete AI accelerator | Modular upgrades or a shared SoC across products | int8, int16; FP32 in some models | Depends on host system | Moderate | Larger footprint (USB/PCIe) | High, modular and flexible |
| 4. Discrete GPU | High-performance systems with intensive AI and graphics needs | FP32 and others; sparse int8 (>1,000 TOPS) | Large (system dependent) | High | Large (soldered GPU or graphics board) | Very high, supports parallelism |
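As a rough illustration of how these trade-offs translate into a decision, the helper below maps a few product requirements onto the four options. The flags and the TOPS threshold are illustrative assumptions drawn from the table above, not vendor guidance.

```python
# Illustrative sketch only: a rough mapping from product requirements to the
# four architecture options summarized above. The flags and the TOPS
# threshold are assumptions for illustration, not vendor guidance.
from dataclasses import dataclass

@dataclass
class Requirements:
    needs_fp32: bool        # accuracy-driven FP32 models
    retrofit_existing: bool # adding AI to an existing design
    tops_required: float    # rough throughput target (TOPS)

def pick_architecture(req: Requirements) -> str:
    if req.tops_required > 1000:
        return "4. Discrete GPU"
    if req.retrofit_existing:
        return "3. Discrete AI accelerator"
    if req.needs_fp32:
        return "2. SoC with integrated GPU/accelerator"
    return "1. SoC with integrated NPU/accelerator"

print(pick_architecture(Requirements(needs_fp32=False,
                                     retrofit_existing=False,
                                     tops_required=4)))
```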