I have been studying ST's AI tools recently and came across a very interesting website: https://stedgeai-dc.st.com/home
In day-to-day development we have to train a model, run it on a development board, measure its performance, then modify the model or adjust parameters and run it again. This is cumbersome and requires both a development board and a project; with nothing but a model file, there is nothing we can do. This website solves that problem. We can upload the model, adjust parameters and optimize it, and then benchmark it (this is not a software simulation: ST keeps real, physical development boards in the cloud, and your model actually runs on one of them). Finally, you can also download the generated source code or firmware.
Next, I will show you how to use this website.
1. Login
The first page you see when you enter the website is a welcome page; the pictures and the three-step workflow below illustrate well what this site does. Click "START NOW" (you will be asked to log in to your ST account).
2. Select the model
The first step is to provide a model. You can upload your own or select one from the ST Model Zoo.
Since I don't have a model of my own yet, I chose one from the ST Model Zoo. Models can be filtered by three criteria: target board type, use case, and network type.
I chose the gesture-recognition model for the VL53L5CX, a multi-zone ToF sensor. Click "Import" to use this model.
Click "Start" to begin working with the model.
Next, the website analyzes the model. This takes about ten seconds, which is reasonably quick.
3. Choose a platform
Here we need to select the X-CUBE-AI version and the chip platform. I chose the latest 9.0.0 for X-CUBE-AI (the version installed in my STM32CubeMX is also 9.0.0, although 9.1.0 was already out when I wrote this article). Since I chose 9.0.0, the only selectable chip platform is STM32; but we are targeting STM32 anyway, so it doesn't matter that the other platforms are unavailable.
4. Quantize the Model
This step is optional. Quantizing the model reduces the size of the neural network and saves memory.
After quantization, the model becomes significantly smaller (of course some accuracy is lost, which is why you keep tuning parameters later to balance size against performance).
If you do not want to quantize, just click "Go next" to proceed to the next step. If you have quantized the model and want to use the quantized version, click "Select".
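For context on why the file shrinks: quantization stores weights as 8-bit integers instead of 32-bit floats, roughly a 4x reduction, and the accuracy loss comes from the rounding this implies. Below is a minimal C sketch of the affine int8 scheme commonly used for post-training quantization; the scale and zero-point values are per-tensor calibration results, and the names here are purely illustrative.

#include <stdint.h>
#include <math.h>

/* Affine (asymmetric) int8 quantization: q = round(x / scale) + zero_point.
 * 'scale' and 'zero_point' come from calibration; values are illustrative. */
static int8_t quantize_f32_to_i8(float x, float scale, int32_t zero_point)
{
    int32_t q = (int32_t)lrintf(x / scale) + zero_point;
    if (q < -128) q = -128;   /* clamp to the int8 range */
    if (q >  127) q =  127;
    return (int8_t)q;
}

/* Dequantization recovers only an approximation of the original value;
 * the difference is the quantization error that costs accuracy. */
static float dequantize_i8_to_f32(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)(q - zero_point);
}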
5. Optimize the Model
Here you can optimize the model by choosing one of three objectives: balanced, memory-saving, or faster inference. It is very simple and clear. The input and output buffer options on the right are selected by default.
Click "Optimize" to start the optimization. Again it is very fast, taking only a few seconds.
The performance figures after optimization are then displayed. I chose the balanced mode; you can see that both flash and RAM usage decreased to some extent. Not by much, but it helps.
After the model is optimized, you can start benchmarking. Click “Go to benchmark”
6. Benchmark
Select a development board and start the run. This process is not a software simulation: real development boards sit in the cloud. The cloud server generates the code, compiles it, downloads it to the board, and only after running it on the board is the result obtained. So it takes a while, typically several minutes.
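Incidentally, the on-board timing these benchmarks produce (and that the generated test firmware reports later in the log as "SysTick + DWT") is based on the Cortex-M7 DWT cycle counter. Here is a minimal sketch of enabling and reading that counter with standard CMSIS register names; the measurement helper is illustrative, not ST's actual implementation.

#include "stm32h7rsxx.h"   /* assumption: CMSIS device header, adjust to your part */

/* Enable the DWT cycle counter (counts CPU clock cycles). */
static void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the trace block */
    DWT->CYCCNT = 0U;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */
}

/* Example: at 600 MHz, 86088 cycles works out to about 0.143 ms. */
static uint32_t time_cycles(void (*fn)(void))
{
    uint32_t t0 = DWT->CYCCNT;
    fn();
    return DWT->CYCCNT - t0;
}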
I happened to hit a queue when I ran it; there were two people in total and I was second. I didn't expect this platform to be so popular.
Perhaps something was wrong with the website: I waited a long time and nothing happened, so I will skip this page for now.
7. Results
The benchmark results are displayed in detail on the results page, and historical results are kept here as well.
8. Generate Project
Here you can generate C code, an STM32CubeMX project, an STM32CubeIDE project, or ELF firmware. After downloading, we can run the same code that just ran in the cloud on our own development board and re-test it.
First select the development board, then select what you need (I chose the STM32CubeIDE project here), and wait a moment for the download.
9. Run the test code on the local development board
The download is a compressed archive and needs to be extracted first.
Double-click ".project" to open STM32CubeIDE, compile the Boot project and download it, then compile the APP project and download it. Now you can run the test code.
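The generated project already routes printf to the ST-LINK virtual COM port. For reference, if you later integrate the network into a project of your own, the usual GCC/newlib retarget looks like the sketch below; the HAL header and the UART handle name are assumptions, so check which USART your board wires to the ST-LINK VCP.

#include <stdio.h>
#include <stdint.h>
#include "stm32h7rsxx_hal.h"        /* assumption: adjust to your device family */

extern UART_HandleTypeDef huart4;   /* assumption: the UART wired to the ST-LINK VCP */

/* With GCC/newlib, _write() is the low-level syscall behind printf();
 * forwarding it to the UART makes printf() appear on the virtual COM port. */
int _write(int fd, char *ptr, int len)
{
    (void)fd;
    HAL_UART_Transmit(&huart4, (uint8_t *)ptr, (uint16_t)len, HAL_MAX_DELAY);
    return len;
}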
The results are output over the ST-LINK virtual COM port. The output is as follows:
#
# AI system performance 7.1
#
Compiled with GCC 12.3.1
STM32 device configuration...
Device : DevID:0x0485 (STM32H7[R,]Sxx) RevID:0x1003
Core Arch. : M7 - FPU used
HAL version : 0x01000000
SYSCLK clock : 600 MHz
HCLK clock : 300 MHz
FLASH conf. : ACR=0x00000036 - latency=6
CACHE conf. : $I/$D=(True,True)
Timestamp : SysTick + DWT (delay(1)=1.000 ms)
AI platform (API 1.1.0 - RUNTIME 9.0.0)
Discovering the network(s)...
Found network "network"
Creating the network "network"..
Initializing the network
Network informations...
model name : network
model signature : 0x1e108c42827f4c62598744246d259703
model datetime : Tue Oct 15 15:57:19 2024
compile datetime : Oct 16 2024 00:03:47
tools version : 9.0.0
complexity : 8520 MACC
c-nodes : 5
map_activations : 1
[0] @0x20000000/1024
map_weights : 6
[0] @0x70010680/576
[1] @0x70010660/32
[2] @0x7000E260/9216
[3] @0x7000E1E0/128
[4] @0x7000DDE0/1024
[5] @0x7000DDC0/32
n_inputs/n_outputs : 1/1
I[0] (1,8,8,2)128/float32 @0x20000200/512
O[0] (1,1,1,8)8/float32 @0x20000180/32
Running PerfTest on "network" with random inputs (16 iterations)...
................
Results for "network", 16 inferences @600MHz/300MHz (complexity: 8520 MACC)
duration : 0.143 ms (average)
CPU cycles : 86088 (average)
CPU Workload : 0% (duty cycle = 1s)
cycles/MACC : 10.10 (average for all layers)
used stack : 984 bytes
used heap : 0:0 0:0 (req:allocated,req:released) max=0 cur=0 (cfg=3)
observer res : 104 bytes used from the heap (5 c-nodes)
Inference time by c-node
kernel : 0.140ms (time passed in the c-kernel fcts)
user : 0.000ms (time passed in the user cb)
c_id type id time (ms)
---------------------------------------------------
0 OPTIMIZED_CONV2D 3 0.113 81.15 %
1 DENSE 6 0.020 14.35 %
2 NL 6 0.000 0.53 %
3 DENSE 7 0.002 2.00 %
4 NL 7 0.002 1.97 %
-------------------------------------------------
0.140 ms
As you can see, with the configuration above this model takes only about 0.140 ms per inference on this board: 86,088 CPU cycles per run, roughly 10.1 cycles per MACC for this 8,520-MACC model. Truly worthy of the H7 name; the high clock speed is impressive.
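As a closing note on what the generated code actually calls: the log matches a network named "network" with one float32 input of shape (1,8,8,2) and one float32 output of shape (1,1,1,8). The sketch below shows the usual pattern for driving the X-CUBE-AI runtime from application code; the macro and function names are generated per project, so treat this as a template and check your own network.h / network_data.h for the exact identifiers.

#include "network.h"        /* generated by X-CUBE-AI */
#include "network_data.h"

static ai_handle network = AI_HANDLE_NULL;

/* Activation (scratch) buffer; the log shows 1024 bytes at 0x20000000. */
AI_ALIGNED(32)
static ai_u8 activations[AI_NETWORK_DATA_ACTIVATIONS_SIZE];

/* One float32 input (1,8,8,2) and one float32 output (1,1,1,8), as in the log. */
AI_ALIGNED(32) static ai_float in_data[AI_NETWORK_IN_1_SIZE];
AI_ALIGNED(32) static ai_float out_data[AI_NETWORK_OUT_1_SIZE];

static int net_init(void)
{
    /* Create and initialize the instance; the weights stay where the code
     * generator placed them (external flash at 0x7000xxxx in the log). */
    ai_error err = ai_network_create_and_init(
        &network, (const ai_handle[]){ activations }, NULL);
    return (err.type == AI_ERROR_NONE) ? 0 : -1;
}

static int net_run(void)
{
    ai_buffer *inputs  = ai_network_inputs_get(network, NULL);
    ai_buffer *outputs = ai_network_outputs_get(network, NULL);

    inputs[0].data  = AI_HANDLE_PTR(in_data);    /* fill in_data beforehand */
    outputs[0].data = AI_HANDLE_PTR(out_data);

    /* ai_network_run() returns the number of batches processed (1 on success). */
    return (ai_network_run(network, inputs, outputs) == 1) ? 0 : -1;
}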