The edge NPU track is getting increasingly crowded

Publisher: EEWorld News | Updated: 2024-07-03 | Source: EEWORLD

As large models inject new vitality into AI, demand for edge AI is also rising. Several major processor IP vendors are expanding their edge AI NPU offerings to offload work from the CPU, improving efficiency and reducing power consumption.


Recently, Ceva announced the Ceva-NeuPro-Nano NPU, expanding its Ceva-NeuPro edge AI NPU product line.


TinyML (Tiny Machine Learning) is a technology for running machine learning models on resource-constrained microcontrollers and edge devices. Its goal is to run efficient machine learning algorithms on devices with little power, memory, and compute, supporting real-time data processing and decision-making.


The growing demand for efficient, specialized AI in IoT devices is driving rapid growth of the TinyML market. ABI Research forecasts that by 2030, more than 40% of TinyML shipments will run on dedicated TinyML hardware rather than general-purpose MCUs.


What does TinyML entail?


TinyML has the following four characteristics:


Ultra-low power: Suitable for battery-powered or energy-harvesting devices, with power consumption typically in the milliwatt range.

Small memory footprint: Usually runs on microcontrollers with only a few KB to a few hundred KB of RAM and Flash.

Real-time processing: Supports real-time data processing and response, suitable for Internet of Things (IoT) devices.

Embedded applications: Widely used in smart homes, wearables, the industrial IoT, and other fields.


The above four characteristics are also challenges for TinyML, including resource constraints, power consumption management, real-time requirements, and model accuracy.


To execute AI on an edge MCU, more efficient algorithm design is needed, and models must be compressed and quantized.


For example, design lightweight architectures suited to embedded systems, such as small convolutional neural networks (CNNs) or recurrent neural networks (RNNs), and use simple activation functions and efficient arithmetic operations to reduce computational overhead.
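To make this concrete, here is a minimal sketch of such a lightweight model in Keras. The input shape, layer widths, and class count are illustrative assumptions (sized for something like an audio-feature spectrogram), not figures from this article:

```python
import tensorflow as tf

# A deliberately small CNN for TinyML-scale inputs.
def build_tiny_cnn(input_shape=(49, 10, 1), num_classes=4):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, strides=2, padding="same",
                               activation="relu", input_shape=input_shape),
        # A depthwise-separable convolution needs far fewer MACs and
        # weights than a standard convolution of the same width.
        tf.keras.layers.SeparableConv2D(16, 3, padding="same",
                                        activation="relu"),
        # Global pooling avoids a large dense layer over flattened maps.
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_tiny_cnn()
model.summary()  # only a few hundred parameters at these sizes
```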


As for model compression and quantization, several techniques can be combined:

Model pruning: Removing unimportant parameters from the model to reduce compute and storage requirements.

Weight sharing: Reducing the number of distinct parameters by letting weights share values.

Quantization: Converting floating-point numbers to fixed-point formats, such as 8-bit or 16-bit integers, to reduce memory and compute requirements (a sketch follows below).
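For quantization specifically, a common route is post-training quantization with the TensorFlow Lite converter. A minimal sketch, assuming the `model` from the earlier example; the random calibration generator is a stand-in for real input samples:

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Stand-in calibration data: in practice, yield ~100 real samples
    # so the converter can pick good int8 scaling factors.
    for _ in range(100):
        yield [np.random.rand(1, 49, 10, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer quantization: weights and activations become int8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```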


Currently, there are several ways to run TinyML workloads. One is to execute them on a low-power compute core in the MCU itself, such as an Arm Cortex-M. Another is to offload the CPU with a dedicated hardware accelerator: several major IP suppliers have launched dedicated TinyML accelerator IPs, and some MCU manufacturers have developed their own NPUs, DSPs, or similar accelerators.


The typical development flow starts with data collection and preprocessing, followed by model training and optimization on a high-performance computing system; model compression and quantization techniques then convert the final pruned model into a format the embedded processor can consume before deployment.
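The last step of that flow, getting the quantized model into the firmware image, often amounts to emitting the `.tflite` bytes as a C array (what `xxd -i` does). A minimal sketch, assuming the `model_int8.tflite` file from the previous example:

```python
# Emit the serialized model as a C array so the firmware can link it
# into flash; an embedded runtime such as TFLite Micro reads it in place.
with open("model_int8.tflite", "rb") as f:
    data = f.read()

lines = ["const unsigned char g_model[] = {"]
for i in range(0, len(data), 12):
    chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
    lines.append(f"  {chunk},")
lines.append("};")
lines.append(f"const unsigned int g_model_len = {len(data)};")

with open("model_data.cc", "w") as f:
    f.write("\n".join(lines) + "\n")
```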


Current mainstream TinyML IP suppliers


In fact, several major CPU IP suppliers provide NPU IP, including Arm, Cadence, Synopsys, VeriSilicon, and Ceva. Some target MCUs, while others target SoCs, depending on constraints such as processing performance, power consumption, area, and cost.


Arm’s Ethos


The Arm Ethos-U65 is an advanced micro neural processing unit (microNPU) designed for AI in embedded devices. It inherits the high energy efficiency of the Ethos-U55 while doubling its performance, and extends microNPU use to Arm Cortex-A, Cortex-R, and Neoverse-based systems.


Key features of the Ethos-U65 include:


Excellent performance and energy efficiency: Achieves 1 TOPS in a 16 nm process while delivering a 2x performance improvement in minimal area.


Flexible integration: Supports a wide range of operating systems and DRAM, and suits bare-metal or RTOS-based systems on Cortex-M.


Support for complex AI models: Handles demanding workloads, especially those requiring wide AXI interfaces and DRAM support, with performance improvements of up to 150%.


Energy efficient: ML workloads consume up to 90% less energy than previous Cortex-M generations.


Future-proof: Runs compute-heavy operators such as convolutions, LSTMs, and RNNs on the microNPU, and automatically falls back to the Cortex-M for other operators.


Offline optimization: Improves performance and reduces system memory requirements by up to 90% by compiling and optimizing neural networks offline (see the sketch below).
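Arm's publicly available offline compiler for the Ethos-U family is Vela, distributed as the `ethos-u-vela` Python package; the article does not name the tool, so this pairing is an inference. A minimal sketch of driving it from a Python build script, reusing the hypothetical `model_int8.tflite` from earlier:

```python
import subprocess

# Vela consumes an int8-quantized .tflite and writes an optimized copy
# (with Ethos-U custom operators) into ./output/ by default.
subprocess.run(
    [
        "vela",
        "model_int8.tflite",
        "--accelerator-config", "ethos-u65-256",  # Ethos-U65, 256 MACs/cycle
    ],
    check=True,
)
```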


These capabilities let the Ethos-U65 serve a variety of high-performance, low-power embedded devices, such as smart cameras, environmental sensors, industrial automation, and mobile devices. A unified toolchain for developing, deploying, and debugging AI applications gives innovation a further boost.


VeriSilicon


VeriSilicon's Vivante VIP9000 processor family provides programmable, scalable solutions for real-time, low-power AI devices. Its patented neural network engine and tensor processing fabric deliver excellent inference performance with industry-leading power and area efficiency. The VIP9000 series scales from 0.5 TOPS to 20 TOPS, suiting applications from wearables, IoT, and smart homes to automobiles and edge servers.


The VIP9000 supports all popular deep learning frameworks and accelerates models through quantization, pruning, and model compression. Its programmable engine and tensor processing fabric handle a variety of data types and processing tasks. Through the ACUITY Tools SDK and various runtime frameworks, AI applications can be easily ported to the VIP9000 platform for efficient development and deployment.


Ceva


Ceva-NeuPro-Nano is a highly efficient, self-contained edge NPU designed for TinyML applications in AIoT devices. Its performance ranges from 10 GOPS to 200 GOPS, supporting always-on use in battery-powered devices such as hearables, wearables, home audio, smart home, and smart factory products. It runs independently, handling code execution and memory management without a host CPU/DSP. It supports 4-, 8-, 16-, and 32-bit data types, with native Transformer computation, sparsity acceleration, and fast quantization. Ceva-NetSqueeze technology reduces memory usage by 80%. The Ceva NeuPro-Studio AI SDK works seamlessly with open-source AI inference frameworks such as TFLM and µTVM, covering voice, vision, and sensing use cases. Two configurations, Ceva-NPN32 and Ceva-NPN64, cover a wide range of application needs with optimal power efficiency and small silicon area.


Cadence


Cadence's Tensilica Neo NPU is a high-performance, low-power neural processing unit (NPU) designed for embedded AI applications spanning sensing, audio, voice/speech recognition, vision, and radar. The Neo NPU is highly scalable, with single-core performance ranging from 256 to 32K 8x8-bit MACs per cycle, up to 80 TOPS, and can scale further through multicore configurations to serve everything from ultra-low-power IoT devices to high-performance AR/VR and automotive systems.


The Neo NPU supports Int4, Int8, Int16, and FP16 data types, with mixed-precision computing that balances performance and accuracy. Its architecture supports a variety of neural network topologies, including classic and generative AI networks, and can offload the main processor. Built-in compression/decompression reduces system memory usage and bandwidth consumption.


The Neo NPU runs at typical clock frequencies of up to 1.25 GHz, providing excellent compute performance in a 7 nm process. It integrates with Cadence's NeuroWeave SDK, which offers a unified software development environment, simplifies model deployment and optimization, and provides efficient, flexible AI solutions for a variety of embedded AI applications.


Synopsys


The Synopsys ARC NPX6 NPU IP family is the industry's highest-performance neural processing unit (NPU) IP, designed to meet the real-time computing needs of AI applications with ultra-low power consumption. The family includes ARC NPX6 and NPX6FS, supports the latest complex neural network models, including generative AI, and provides up to 3,500 TOPS of performance for intelligent SoC designs.


A single instance of the ARC NPX6 NPU IP delivers up to 250 TOPS in a 5 nm process, which sparsity features can raise to 440 TOPS. With multiple NPU instances integrated, performance can reach 3,500 TOPS. The ARC NPX6 scales from 1K to 96K MACs and is compatible with CNNs, RNNs/LSTMs, and emerging networks such as Transformers. It supports INT4/INT8/INT16 resolutions, with optional BF16 and FP16.


The ARC NPX6FS NPU IP is designed for functional safety and meets the ISO 26262 ASIL D standard for automotive and other safety-critical applications. It features dual-core lockstep processors and self-checking safety monitors to meet mixed-criticality and virtualization requirements.


The Synopsys ARC MetaWare MX development toolkit includes a compiler, debugger, neural network software development kit (SDK), virtual platform SDK, runtimes and libraries, and advanced simulation models. The toolkit automatically partitions algorithms across MAC resources for efficient processing, simplifying development.
