Hardware Conversion of Convolutional Neural Networks: What is Machine Learning? - Part 3

Publisher: EE小广播 | Last updated: 2023-06-13 | Source: EEWORLD | Author: Ole Dreessen, Field Applications Engineer

Summary


This three-part series explores the characteristics and applications of convolutional neural networks (CNNs), which are used primarily for pattern recognition and object classification. As the third article in the series, this installment explains how to convert a CNN to hardware, focusing on the benefits of using artificial intelligence (AI) microcontrollers with CNN hardware accelerators to implement AI applications at the edge of the Internet of Things (IoT). The first two articles in the series are "Introduction to Convolutional Neural Networks: What is Machine Learning? - Part 1" and "Training Convolutional Neural Networks: What is Machine Learning? - Part 2".


Introduction


AI applications typically consume large amounts of energy and run on server farms or expensive field-programmable gate arrays (FPGAs). The challenge is to increase computing power while keeping power consumption and cost low. Powerful intelligent edge computing is now transforming AI applications: compared with traditional firmware-based AI computing, edge AI built on hardware CNN accelerators delivers far higher speed and computing power, ushering in a new era of computing performance. Intelligent edge computing lets sensor nodes make decisions locally, unconstrained by the data rates of 5G and Wi-Fi networks, and thus enables emerging applications that were previously impractical. For example, sensor-level smoke/fire detection or environmental data analysis in remote areas is now a reality, and such applications can run on battery power for years. This article shows how these capabilities are achieved by examining the hardware conversion of a CNN on an AI microcontroller with a dedicated CNN accelerator.


AI microcontroller with ultra-low power convolutional neural network accelerator


The MAX78000 is an AI microcontroller system-on-chip with an ultra-low-power CNN accelerator, enabling neural network computing in resource-constrained edge devices and IoT applications. Its application scenarios include object detection and classification, audio processing, sound classification, noise cancellation, facial recognition, time-series processing of health data such as heart rate, multi-sensor analysis, and predictive maintenance.


Figure 1 shows the block diagram of the MAX78000. It has an Arm® Cortex®-M4F core with a floating-point unit that runs at up to 100 MHz. To provide sufficient storage for applications, the device is equipped with 512 kB of flash and 128 kB of SRAM, and it offers multiple external interfaces such as I²C, SPI, UART, and I²S for audio. It also integrates a 60 MHz RISC-V core that can act as an intelligent direct memory access (DMA) engine, copying data between the various peripheral modules and memory (flash and SRAM). The RISC-V core can also perform the sensor-data pre-processing required by the AI accelerator.



Figure 1. MAX78000 block diagram


Because the RISC-V core can pre-process the sensor data, the Arm core can remain in deep sleep mode during that time. Inference results can also trigger the Arm core, via interrupt, to perform actions in the main application, transmit sensor data wirelessly, or notify the user.


A distinguishing feature of the MAX7800x family of microcontrollers, setting it apart from standard microcontroller architectures, is its dedicated hardware accelerator unit for CNN inference. The CNN accelerator supports complete CNN model architectures together with all required parameters (weights and biases) and is equipped with 64 parallel processors and an integrated memory, of which 442 kB stores parameters and 896 kB stores input data. Because the models and parameters reside in SRAM, both can be adjusted via firmware, and the network can even be adapted in real time. The device supports model weight widths of 1, 2, 4, or 8 bits, and the memory can hold up to 3.5 million parameters. Thanks to this local storage, the microcontroller does not have to fetch parameters over the bus for each successive math operation, an approach that would be costly in latency and power. The CNN accelerator supports networks of 32 or 64 layers, depending on the pooling functions used, and the programmable image input/output size is up to 1024 × 1024 pixels per layer.
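To get a feel for these limits, the following sketch checks whether the quantized parameters of a hypothetical model fit into the accelerator's 442 kB parameter memory. The layer sizes and weight widths below are invented for illustration and do not describe a real MAX78000 model.

```python
# Hypothetical sketch: does a quantized CNN's parameter set fit in the
# MAX78000 accelerator's 442 kB parameter memory? Layer sizes are made up.

PARAM_MEMORY_BYTES = 442 * 1024  # 442 kB of on-accelerator parameter storage

# (parameter count, weight width in bits) per layer -- illustrative only.
# Supported weight widths on the MAX78000 are 1, 2, 4, or 8 bits.
layers = [
    (3 * 3 * 3 * 16, 8),   # conv1: 3x3 kernel, 3 -> 16 channels, 8-bit weights
    (3 * 3 * 16 * 32, 4),  # conv2: 4-bit weights
    (3 * 3 * 32 * 64, 2),  # conv3: 2-bit weights
    (64 * 10, 8),          # fully connected classifier, 8-bit weights
]

def total_bytes(layers):
    """Total parameter storage in bytes for mixed weight widths."""
    return sum(count * bits / 8 for count, bits in layers)

needed = total_bytes(layers)
print(f"{needed:.0f} bytes needed, fits: {needed <= PARAM_MEMORY_BYTES}")
```

Mixing widths per layer, as shown here, is how a model with millions of parameters can still fit in a few hundred kilobytes.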


CNN Hardware Conversion: Power Consumption and Inference Speed Comparison


CNN inference is a complex computational task built on large matrix operations. The Arm Cortex-M4F microcontroller is powerful enough to run CNN inference in the firmware of an embedded system, but this approach has drawbacks: in firmware-based inference, the compute instructions and associated parameters must be fetched from memory, and intermediate results written back, for every operation, which costs substantial power and adds latency.
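The memory-traffic cost of firmware-based inference can be pictured with a naive 2-D convolution loop, in which every multiply-accumulate (MAC) fetches one weight and one input value over the memory bus. A minimal plain-Python sketch with illustrative sizes:

```python
# Naive 2-D convolution as a firmware loop would execute it: each
# multiply-accumulate needs a weight fetch and an input fetch over the
# memory bus. Sizes are illustrative.

def conv2d_naive(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1     # "valid" output size
    fetches = 0
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    # one weight fetch + one input fetch per MAC
                    acc += kernel[ky][kx] * image[y + ky][x + kx]
                    fetches += 2
            out[y][x] = acc
    return out, fetches

image = [[1] * 8 for _ in range(8)]    # 8x8 input
kernel = [[1] * 3 for _ in range(3)]   # 3x3 kernel
out, fetches = conv2d_naive(image, kernel)
print(fetches)  # 6*6 output positions * 9 MACs * 2 fetches = 648
```

Even this tiny example needs hundreds of bus accesses; a real model multiplies that by channels and layers, which is exactly the traffic a hardware accelerator with local parameter storage avoids.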


Table 1 compares the CNN inference speed and power consumption of three different solutions. The model used was developed on MNIST, the handwritten-digit training set, and classifies the digits in visual input data. To determine the differences in power consumption and speed, the inference time required by each of the three solutions was measured.


Table 1. CNN inference time and power consumption for handwritten digit recognition, based on the MNIST dataset



Solution 1 performs the inference on a MAX32630 with an integrated Arm Cortex-M4F processor running at 96 MHz. Solution 2 uses the MAX78000's CNN hardware accelerator: its inference (the time from data input to result output) is 400 times faster than Solution 1, and each inference requires only 1/1100 of the energy. Solution 3 uses an MNIST network optimized for low power, which minimizes the energy spent per inference. Although its accuracy drops from 99.6% to 95.6%, it is much faster, taking only 0.36 ms per inference, with an inference power consumption of just 1.1 µW. Two AA alkaline batteries (about 6 Wh of energy in total) could thus power the application for five million inferences (ignoring the power consumption of the rest of the system).


These figures illustrate how the computing power of a hardware accelerator benefits applications that cannot use, or be connected to, a continuous power supply. The MAX78000 is one such product: it supports AI processing at the edge without demanding large amounts of power, network connectivity, or long inference times.


Examples of using the MAX78000 AI microcontroller


The MAX78000 supports a variety of applications; this article discusses a few of these use cases. One use case is a battery-powered camera that must detect whether a cat is in its field of view and, if so, open a cat door via a digital output to let the cat into the house.


Figure 2 shows an example block diagram of this design. In this design, the RISC-V core periodically turns on the image sensor and loads image data into the CNN accelerator of the MAX78000. If the system determines that the probability of a cat appearing is higher than a preset threshold, the cat door is opened and the system returns to standby mode.
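The control flow just described could be sketched as below. The names capture_image(), run_cnn_inference(), and open_cat_door(), as well as the threshold value, are hypothetical placeholders for the camera driver, the accelerator call, and the GPIO output; they are not part of the MAX78000 SDK.

```python
# Hypothetical control loop for the smart pet door. All hardware-facing
# functions are placeholders passed in by the caller.

import time

CAT_THRESHOLD = 0.9  # assumed probability threshold for "cat detected"

def pet_door_loop(capture_image, run_cnn_inference, open_cat_door,
                  period_s=1.0, iterations=None):
    """Periodically wake, run inference, and open the door when a cat is seen."""
    n = 0
    while iterations is None or n < iterations:
        frame = capture_image()            # RISC-V core reads the image sensor
        p_cat = run_cnn_inference(frame)   # CNN accelerator classifies the frame
        if p_cat > CAT_THRESHOLD:
            open_cat_door()                # digital output opens the door
        n += 1
        time.sleep(period_s)               # system returns to standby

# Demo with stub hardware functions:
opened = []
pet_door_loop(lambda: "frame",             # fake camera frame
              lambda f: 0.95,              # fake inference: cat very likely
              lambda: opened.append(True), # record each door opening
              period_s=0.0, iterations=3)
print(len(opened))  # 3
```

On the real device the loop body would run on the RISC-V core while the Arm core sleeps, waking it only when action is needed.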



Figure 2. Smart pet door block diagram


Development Environment and Evaluation Kits


The development process of edge AI applications can be divided into the following stages:


Stage 1: AI - network definition, training, and quantization


Stage 2: Arm firmware - import the network and parameters generated in Stage 1 into a C/C++ application, then create and test the firmware


The first stage of the development process involves modeling, training, and evaluating AI models. Developers can use open source tools such as PyTorch and TensorFlow at this stage. The MAX78000 GitHub page also provides comprehensive resources to help users build and train AI networks using PyTorch while considering its hardware specifications. The page also provides some simple AI networks and applications, such as facial recognition (Face ID), for user reference.


Figure 3 shows a typical AI development flow using PyTorch. The first step is to model the network. Note that the MAX7800x hardware does not support every PyTorch data operation, so the ai8x.py file provided by Analog Devices, which contains the PyTorch modules and operators tailored to the MAX78000, must first be included in the project. On that basis, the network can be built, trained with the training data, evaluated, and quantized. This step produces a checkpoint file that serves as input to the final synthesis step, which converts the network and its parameters into a form suitable for the CNN hardware accelerator. It is worth noting that although any PC (laptop, server, etc.) can be used for training, training without a CUDA graphics card can take a very long time: even small networks may need days or weeks.
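The quantization step in this flow can be illustrated with a minimal post-training scheme that maps floating-point weights onto a signed 8-bit integer grid. This plain-Python sketch only conveys the idea; the actual flow uses ADI's ai8x training and quantization tools from the MAX78000 GitHub page.

```python
# Minimal post-training weight quantization to signed 8-bit integers.
# Illustrative only -- not ADI's ai8x tooling.

def quantize(weights, bits=8):
    """Map float weights to integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original float weights."""
    return [v * scale for v in q]

w = [0.51, -1.27, 0.003, 0.9]
q, scale = quantize(w)
print(q)                      # integer weights as stored on-chip
print(dequantize(q, scale))   # approximate float reconstruction
```

The checkpoint produced by the real flow plays a similar role: it holds the already-quantized parameters that the synthesis step then lays out for the accelerator's memory.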


The second stage of the development process is to create the application firmware with a mechanism to write data to the CNN accelerator and read the results.



Figure 3. AI development process


The files created in the first stage are integrated into the C/C++ project via #include directives. Open-source tools such as the Eclipse IDE and the GNU toolchain can serve as the microcontroller development environment, and the software development kit provided by ADI (Maxim Micros SDK for Windows) contains all the components and configurations required for development, including peripheral drivers and sample code, to simplify application development.


A project that compiles and links successfully can be evaluated on the target hardware. ADI offers two hardware platforms to choose from: the MAX78000EVKIT, shown in Figure 4, and the slightly smaller MAX78000FTHR evaluation board, shown in Figure 5. Both boards are equipped with a VGA camera and a microphone.

