Platform and SoM Knowledge Pool

Which architecture is best for your edge AI device?

Written by Satoru Kumashiro | Nov 24, 2025 5:59:59 PM

Several options are available to enable an edge AI device

As mentioned in our previous article (Why On-Device AI Is the Future of Inference), AI inference on edge devices is becoming increasingly popular thanks to the variety of products now available from many vendors.

On the other hand, with so many options available, it can be harder to determine the product architecture and components that meet your requirements. For example:

  • AI inference on a SoC with an integrated NPU/accelerator
  • AI inference on a SoC with an integrated GPU/accelerator
  • AI inference on a discrete AI accelerator
  • AI inference on a discrete GPU

The goal of this article is to help developers decide which of these architectures best fits their product.

1. AI inference on a SoC with an integrated NPU/accelerator

The architecture described below is primarily intended for compact embedded devices that need to be cost-effective.

  • In general, an integrated NPU runs int8 or int16 quantized models; some advanced SoCs also accept FP16 models for better accuracy (see the quantization sketch after this list).
  • A lightweight AI model should be considered, as the memory available in an embedded device may be limited.
  • This is the natural choice when developing a compact form-factor embedded device, as the SoC already contains the functional blocks necessary to run AI inference.
  • Even some advanced microcontrollers are starting to integrate NPUs, so there is a wide range of choices depending on the accuracy and performance requirements.
  • This architecture is the primary choice when AI inference is handled on each device.
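A typical workflow for targeting an int8-only NPU is post-training quantization. The following is a minimal sketch using TensorFlow Lite full-integer quantization; the Keras model, the random calibration data, and the output file name are placeholders, and most NPU vendors then pass the resulting .tflite file through their own compiler or delegate.

```python
# Minimal sketch: post-training full-integer (int8) quantization with
# TensorFlow Lite, producing a model that an int8-only NPU can consume.
# The Keras model and random calibration data below are placeholders.
import tensorflow as tf

def representative_dataset():
    # Yield a handful of calibration samples shaped like the real input.
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3))]

model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to int8 ops so the whole graph can run on the NPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```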

2. AI inference on a SoC with an integrated GPU/accelerator

When your application requires high-accuracy AI inference but still needs to fit a constrained form factor, a SoC with an embedded GPU and AI accelerator is a viable option.

  • The GPU and accelerator can handle FP32 models (see the sketch after this list).
  • There are multiple memory configurations available, ranging from 4 GB to 128 GB of RAM.
  • A lightweight AI model suits a system with less RAM, while a large AI model can be handled by a system with more RAM.
  • The power consumption of this architecture is generally higher than that of option #1 above.
  • This architecture is the primary choice when AI inference runs on an edge computer acting as a centralized unit for multiple devices.
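As a concrete illustration of FP32 inference on such a SoC, the sketch below uses ONNX Runtime and selects a GPU execution provider when one is available. The model path and the provider names are assumptions; the actual runtime (CUDA, ROCm, OpenVINO, or a vendor-specific stack) depends on the SoC.

```python
# Minimal sketch: running an FP32 ONNX model with ONNX Runtime, preferring a
# GPU execution provider when one is available. "model_fp32.onnx" and the
# provider names are placeholders that depend on the SoC vendor's runtime.
import numpy as np
import onnxruntime as ort

available = ort.get_available_providers()
preferred = [p for p in ("CUDAExecutionProvider", "ROCMExecutionProvider")
             if p in available]

session = ort.InferenceSession(
    "model_fp32.onnx",
    providers=preferred + ["CPUExecutionProvider"],  # CPU fallback
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy_input})
print("Output shape:", outputs[0].shape)
```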

3. AI inference on a discrete AI accelerator

  • This architecture can be a stepping stone for retrofitting existing hardware, or part of a development strategy that enables multiple products to share the same main SoC (see the sketch after this list).
  • A discrete AI accelerator typically plugs into a USB or PCI Express interface via an M.2 slot, a mini-PCIe slot, or another industry-standard connector.
  • Higher-end discrete AI accelerators can also support FP32 models, which typically results in higher accuracy.
  • As it is a discrete component used in conjunction with a main processor, it takes more space on the embedded device.
  • It allows developers to adopt a modular development strategy and, when a high-end accelerator is used, delivers higher AI processing capability than an NPU/accelerator integrated into a lower-performance SoC.
  • This architecture can be an alternative to option #1 above.
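To illustrate the modular idea, the sketch below keeps the application-facing inference call identical whether a discrete accelerator is plugged in or not, using the TensorFlow Lite delegate mechanism. The delegate library name libaccel_delegate.so is a hypothetical placeholder for whatever shared library the accelerator vendor ships.

```python
# Minimal sketch: one inference path that optionally offloads to a discrete
# accelerator via a TensorFlow Lite delegate and falls back to the CPU when
# no accelerator is present. "libaccel_delegate.so" is a hypothetical name
# for a vendor-supplied delegate library.
from typing import Optional

import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

def make_interpreter(model_path: str,
                     delegate_path: Optional[str] = None) -> Interpreter:
    if delegate_path:
        try:
            # Offload supported ops to the discrete accelerator.
            return Interpreter(model_path,
                               experimental_delegates=[load_delegate(delegate_path)])
        except (OSError, ValueError):
            pass  # Accelerator or delegate not available; use the CPU path.
    return Interpreter(model_path)

def infer(interpreter: Interpreter, frame: np.ndarray) -> np.ndarray:
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], frame.astype(inp["dtype"]))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

interpreter = make_interpreter("model_int8.tflite", "libaccel_delegate.so")
```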

4. AI inference on a discrete GPU

  • This architecture is primarily for large computing systems that require both high-performance graphics processing and high AI performance.
  • It can use soldered-down GPU(s) or a graphics board, depending on the system design.
  • GPU vendors offer SDKs optimized for AI processing.
  • This architecture offers more scalability, since performance can be enhanced through GPU parallel computing (see the sketch after this list).
  • This architecture can provide more than 1,000 TOPS with sparse int8 models.
  • This architecture is an option when the system needs to be based on the x86_64 architecture but also needs high AI performance.
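As a sketch of the scalability point, the following distributes inference batches round-robin across however many discrete GPUs PyTorch can see. The ResNet-50 model and batch shapes are placeholders; production deployments would more likely use a serving framework or a vendor SDK such as TensorRT.

```python
# Minimal sketch: round-robin data-parallel inference across multiple
# discrete GPUs with PyTorch. The model and batch sizes are placeholders.
import torch
import torchvision

num_gpus = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)] or [torch.device("cpu")]

# One model replica per GPU (or a single CPU replica if no GPU is found).
replicas = [torchvision.models.resnet50(weights=None).eval().to(d) for d in devices]

@torch.no_grad()
def infer(batches):
    outputs = []
    for i, batch in enumerate(batches):
        replica = replicas[i % len(replicas)]      # spread batches over GPUs
        device = next(replica.parameters()).device
        outputs.append(replica(batch.to(device)).cpu())
    return outputs

batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]
print([tuple(out.shape) for out in infer(batches)])
```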

Summary

| Architecture | Use Case | Model Support | Memory Range | Power Consumption | Form Factor | Scalability |
|---|---|---|---|---|---|---|
| 1. SoC with integrated NPU/accelerator | Compact embedded devices with cost efficiency | int8, int16; FP16 in some models | Limited (embedded memory) | Low | Very compact | Limited, device-level |
| 2. SoC with integrated GPU/accelerator | Edge computers requiring higher accuracy in a moderate form factor | FP32 and others | 4 GB–128 GB RAM | Moderate to high | Compact to mid-size | Moderate, centralized edge |
| 3. Discrete AI accelerator | Modular upgrades or a shared SoC across products | int8, int16; FP32 in some models | Depends on host system | Moderate | Larger footprint (USB/PCIe) | High, modular and flexible |
| 4. Discrete GPU | High-performance systems with intensive AI and graphics needs | FP32 and others; sparse int8 (>1,000 TOPS) | Large (system dependent) | High | Large (soldered GPU or graphics board) | Very high, supports parallelism |
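As a rough illustration of how these trade-offs translate into a decision, the helper below maps a few product requirements onto the four options. The flags and the TOPS threshold are illustrative assumptions drawn from the table above, not vendor guidance.

```python
# Illustrative sketch only: a rough mapping from product requirements to the
# four architecture options summarized above. The flags and the TOPS
# threshold are assumptions for illustration, not vendor guidance.
from dataclasses import dataclass

@dataclass
class Requirements:
    needs_fp32: bool        # accuracy-driven FP32 models
    retrofit_existing: bool # adding AI to an existing design
    tops_required: float    # rough throughput target (TOPS)

def pick_architecture(req: Requirements) -> str:
    if req.tops_required > 1000:
        return "4. Discrete GPU"
    if req.retrofit_existing:
        return "3. Discrete AI accelerator"
    if req.needs_fp32:
        return "2. SoC with integrated GPU/accelerator"
    return "1. SoC with integrated NPU/accelerator"

print(pick_architecture(Requirements(needs_fp32=False,
                                     retrofit_existing=False,
                                     tops_required=4)))
```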