STM32's "GPU" - DMA2D example detailed explanation

Last updated: 2021-07-28


Source: This article was written by RT-Thread community member 梦芊.

Preface

GPU, or graphics processing unit, is the core of modern graphics cards. In the era without GPU, all graphics drawing was done by CPU, which needed to calculate the border, color and other data of the graphics, and was responsible for writing the data to the video memory. Simple graphics were not a problem, but with the development of computers (especially the development of games), the graphics and images that needed to be displayed became more and more complex, and the CPU became increasingly unable to cope. So later, GPU came into being, saving the CPU from the heavy graphics calculation tasks and greatly accelerating the display speed of graphics.

The MCU has a similar development history. In the early use scenarios of MCU, there was rarely a need for graphic display. Even if there was, it was just a simple display device such as 12864, which did not require much computing and could be handled well by the MCU's CPU. However, with the development of embedded graphics, MCUs have to undertake more and more graphics computing and display tasks, and the display resolution and color of embedded systems have also soared. Gradually, the MCU's CPU began to be unable to cope with these calculations. Therefore, starting with STM32F429, a GPU-like peripheral began to be added to the STM32 MCU. ST calls it Chrom-ART Accelerator, also called DMA2D (this article will use this name). DMA2D can provide acceleration in many 2D drawing situations, perfectly fitting the function of "GPU" in modern graphics cards.

This "GPU" provides only 2D acceleration, and its functions are very simple compared with the GPU in a PC. Even so, it already meets the graphics acceleration requirements of most embedded projects. As long as DMA2D is used well, we can build smooth and attractive UI effects on a microcontroller.

This article uses examples to introduce the role that DMA2D can play in embedded graphics development. The goal is to help readers quickly grasp the most basic concepts of DMA2D and learn its most basic usage. To keep the content approachable, this article does not analyze the advanced functions and features of DMA2D in depth (such as the DMA2D architecture or a full register description). If you need to study DMA2D in more detail, you can refer to the "STM32H743 Chinese Programming Manual" after reading this article.

Before reading this article, you need to have a certain understanding of the TFT liquid crystal controller (LTDC) in STM32 and basic graphics knowledge (such as frame buffer, pixels, color format and other concepts).

Besides ST, many other manufacturers' MCUs have peripherals with similar functions (such as the PxP that NXP designed for the RT series), but these are outside the scope of this article. Interested readers can explore them on their own.

Preparation

Hardware Preparation

You can use any STM32 development board with DMA2D peripherals to verify the examples in this article, such as STM32F429, STM32F746, STM32H750 and other MCU development boards. The development board used in this article is ART-Pi. ART-Pi is a development board officially produced by RT-Thread, which uses the powerful configuration of STM32H750XB+32MB SDRAM with a main frequency of up to 480MHz. In addition, it has an onboard debugger (ST-Link V2.1), which is very convenient to use and is particularly suitable for the verification of various technical solutions. It is the perfect hardware demonstration platform for this article.

The display can be any color TFT display. It is recommended to use a 16-bit or 24-bit color RGB interface display. This article uses a 3.5'' TFT LCD display with an interface of RGB666 and a resolution of 320x240 (QVGA). In LTDC, the color format used is RGB565.

Development environment preparation

The content and code presented in this article can be used in any development environment you like, such as RT-Thread Studio, MDK, IAR, etc.

Before starting the experiments in this article, you need a basic project that drives the LCD display through a framebuffer. Before running any of the code in this article, you also need to enable DMA2D.

DMA2D can be enabled with this macro (call it once during hardware initialization):

// The DMA2D peripheral clock must be enabled before DMA2D is used
__HAL_RCC_DMA2D_CLK_ENABLE();

Introduction to DMA2D

Let's first take a look at how ST describes DMA2D

It may seem a bit obscure at first glance, but in fact, it has the following functions:

Color fill (rectangular area)
Image (memory) copy
Color format conversion (such as YCbCr to RGB or RGB888 to RGB565)
Alpha Blend

The first two are memory-based operations, while the last two are computationally accelerated operations. Among them, transparency blending and color format conversion can be performed together with image copying, which brings greater flexibility.

As you can see, ST's positioning of DMA2D is just like its name, which is a DMA enhanced for image processing. In the actual development process, we will find that the use of DMA2D is very similar to the traditional DMA controller. In some non-graphics processing occasions, DMA2D can even replace the traditional DMA.

Note that the DMA2D accelerators differ slightly between ST's product lines. For example, the DMA2D of the STM32F4 series cannot convert between the ARGB and ABGR color formats. Therefore, when you need a particular function, it is best to check the programming manual to confirm that it is supported.

This article only introduces the common features of DMA2D on all platforms.

DMA2D working mode

Just as a traditional DMA has three working modes (memory to memory, peripheral to memory, and memory to peripheral), DMA2D, being a DMA as well, has the following four working modes:

Register to Memory
Memory to Memory
Memory to memory with pixel color format conversion
Memory to memory with pixel color format conversion and transparency blending

The first two modes are simple memory operations, while the last two perform color format conversion and/or transparency blending as needed during the memory copy.
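As a rough illustration, the four modes can be written as an enumeration whose values are the bit patterns placed in bits [17:16] of the CR register. The names `DemoDma2dMode` and `demo_cr_mode_bits` are hypothetical and do not come from any ST header. The values for memory to memory (00), blending (10), and register to memory (11) are the ones used later in this article; 01 for format conversion follows the same ordering in ST's reference manuals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values are the MODE field written to CR bits [17:16]. */
typedef enum {
    DEMO_M2M       = 0x0, /* memory to memory (plain copy)                        */
    DEMO_M2M_PFC   = 0x1, /* memory to memory with pixel format conversion        */
    DEMO_M2M_BLEND = 0x2, /* memory to memory with format conversion and blending */
    DEMO_R2M       = 0x3  /* register to memory (color fill)                      */
} DemoDma2dMode;

/* Build the value to write to DMA2D->CR for a given mode (all other bits 0). */
static inline uint32_t demo_cr_mode_bits(DemoDma2dMode mode)
{
    assert((unsigned)mode <= 0x3u);
    return (uint32_t)mode << 16;
}
```

On a target, the returned value would be written to `DMA2D->CR` before the other registers are configured.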

DMA2D and HAL libraries

In most cases, using the HAL library simplifies the code and improves portability. DMA2D is an exception. The main drawbacks of the HAL library are its deeply nested calls and the overhead of its various safety checks. For other peripherals this overhead has little impact. DMA2D, however, exists to accelerate computation, and its operations are called many times within one screen drawing cycle, so going through the HAL library can seriously reduce the acceleration that DMA2D provides.

Therefore, most of the time we do not use the HAL library functions to operate DMA2D. For efficiency, we operate the registers directly to get the most out of the accelerator.

Because the working mode changes frequently in most DMA2D use cases, the graphical DMA2D configuration in CubeMX is of little use.

DMA2D scene example

1. Color Fill

The following is a simple bar chart:

Let's think about how to draw it.

First, we need to fill the screen with white as the background of the pattern. This process cannot be ignored, otherwise the original pattern displayed on the screen will interfere with our subject. Then, the bar chart is actually composed of 4 blue rectangular blocks and a line segment, and the line segment can also be regarded as a special rectangle with a height of 1. Therefore, the drawing of this figure can be decomposed into a series of "rectangle filling" operations:

Fill a rectangle with white that is equal to the size of the screen
Fill the four data bars with blue
Fill a line segment with a height of 1 with black

The essence of drawing a rectangle of any size at any position in the canvas is to set the data of the corresponding pixel position in the memory area to the specified color. However, because the framebuffer is stored linearly in memory, unless the width of the rectangle coincides with the width of the display area, the address of the seemingly continuous rectangular area in memory is discontinuous.

The following figure shows a typical memory distribution. The numbers represent the memory address of each pixel in the frame buffer (the offset relative to the first address, ignoring the case where a pixel occupies multiple bytes). The blue area is the rectangle we want to fill. It can be seen that the memory address of the rectangular area is discontinuous.
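The address pattern described above can be checked with a little arithmetic. The sketch below assumes a hypothetical 8-pixel-wide screen and a 3-pixel-wide rectangle whose upper left corner is at (2, 1); the helper names `pixel_offset` and `row_gap` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in pixels, of (x, y) from the start of a linear framebuffer. */
static inline size_t pixel_offset(size_t x, size_t y, size_t screen_width)
{
    assert(x < screen_width);
    return y * screen_width + x;
}

/* Gap, in pixels, between the end of one rectangle row and the start of the next. */
static inline size_t row_gap(size_t screen_width, size_t rect_width)
{
    assert(rect_width <= screen_width);
    return screen_width - rect_width;
}
```

With these numbers the first row of the rectangle covers offsets 10 to 12 and the second row starts at offset 18, so 5 pixels lie between the two rows.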

This feature of the framebuffer prevents us from simply using efficient operations such as memset to fill a rectangular area. Usually, we use the following double loop to fill any rectangle, where xs and ys are the coordinates of the upper left corner of the rectangle on the screen, width and height represent the width and height of the rectangle, and color represents the color to be filled:

for (int y = ys; y < ys + height; y++) {
    for (int x = xs; x < xs + width; x++) {
        framebuffer[y][x] = color;
    }
}

Although the code is simple, when it actually runs, a large number of CPU cycles go to loop comparisons, address calculation, and increments, while the time actually spent writing to memory is only a small share. As a result, efficiency suffers.

This is where DMA2D's register-to-memory mode of operation comes into play. DMA2D can fill rectangular memory areas at extremely high speeds, even if these areas are not actually continuous in memory.

Still taking the situation demonstrated in this picture as an example, let's see how it is implemented:

First, because we are only filling memory and do not need to copy memory, we need to put DMA2D into register-to-memory mode. This is done by setting bits [17:16] of the DMA2D CR register to 11. The code is as follows:

DMA2D->CR = 0x00030000UL;

Then, we need to tell DMA2D the properties of the rectangle to be filled: where the area starts in memory, how many pixels wide it is, and how many pixels high it is.

The starting address of the area is the memory address of the pixel in the upper left corner of the rectangle (the red pixel in the figure); it is held in the OMAR register of DMA2D. The width and height of the rectangle are given in pixels and are held in the upper 16 bits (width) and lower 16 bits (height) of the NLR register. The specific code is as follows:

DMA2D->OMAR = (uint32_t)(&framebuffer[y][x]); // start address (first pixel) of the area to fill
DMA2D->NLR  = (uint32_t)(width << 16) | (uint16_t)height; // width and height of the rectangle
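The packing of width and height can be wrapped in a small helper that also checks the ranges. This is only a sketch: `demo_pack_nlr` is a hypothetical name, and the limits assume the field widths given in ST's reference manuals (a 14-bit pixels-per-line field in bits [29:16] and a 16-bit number-of-lines field in bits [15:0]); please check the manual for your part.

```c
#include <assert.h>
#include <stdint.h>

/* Pack width (pixels per line) and height (number of lines) in the NLR layout. */
static inline uint32_t demo_pack_nlr(uint32_t width, uint32_t height)
{
    assert(width  >= 1 && width  <= 0x3FFFu);  /* assumed 14-bit field */
    assert(height >= 1 && height <= 0xFFFFu);  /* assumed 16-bit field */
    return (width << 16) | height;
}
```

For example, a bar 20 pixels wide and 120 pixels high packs to 0x00140078.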

Next, because the address of the rectangle in memory is not continuous, we need to tell DMA2D how many pixels to skip after filling a row of data (that is, the length of the yellow area in the figure). This value is managed by the OOR register. There is a simple way to calculate the number of pixels to skip, that is, the width of the display area minus the width of the rectangle. The specific implementation code is as follows:

DMA2D->OOR = screenWidthPx - width; // line offset: the number of pixels to skip after each row
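To make the role of the line offset concrete, here is a software model of what a register-to-memory fill does for 16-bit pixels. It is a sketch for illustration only and says nothing about how the hardware is actually built; `soft_fill_rgb565` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a rectangle fill: write `width` pixels, skip
 * `line_offset` pixels, and repeat for `height` rows. */
static void soft_fill_rgb565(uint16_t *dst, size_t width, size_t height,
                             size_t line_offset, uint16_t color)
{
    assert(dst != NULL);
    size_t i = 0;                      /* index relative to the first pixel */
    for (size_t row = 0; row < height; row++) {
        for (size_t col = 0; col < width; col++) {
            dst[i++] = color;          /* one row of the rectangle */
        }
        i += line_offset;              /* jump to the start of the next row */
    }
}
```

Here `dst` plays the role of OMAR, `width` and `height` the role of NLR, and `line_offset` the role of OOR.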

Finally, we need to tell DMA2D which color to fill with and what the color format is. These are held in the OCOLR and OPFCCR registers, where the color format is given by one of the LTDC_PIXEL_FORMAT_XXX macros. The specific code is as follows:

DMA2D->OCOLR   = color;       // color used for the fill
DMA2D->OPFCCR  = pixelFormat; // color format, e.g. LTDC_PIXEL_FORMAT_RGB565 for RGB565

Now that everything is set up, DMA2D has acquired all the information needed to fill the rectangle. Next, we need to enable DMA2D transmission, which is achieved by setting bit 0 of the DMA2D CR register to 1:

DMA2D->CR |= DMA2D_CR_START; // start the DMA2D transfer; DMA2D_CR_START is a macro whose value is 0x01

After the DMA2D transfer starts, we only need to wait for it to finish. When the transfer completes, bit 0 of the CR register is cleared automatically, so we can wait for completion with the following code:

while (DMA2D->CR & DMA2D_CR_START) {} // wait for the DMA2D transfer to complete

Tips0: If you use an OS, you can enable the DMA2D transfer-complete interrupt. Create a semaphore and wait on it after starting the transfer, then release the semaphore in the DMA2D transfer-complete interrupt service routine. This way the CPU can do other work while DMA2D is running instead of busy-waiting.

Tips1: In practice, DMA2D fills memory so quickly that an OS task switch often costs more than the transfer itself, so even with an OS we may still choose to busy-wait :).

For the versatility of the function, the starting transfer address and row offset are calculated outside the function and passed in. The complete function code we extracted is as follows:

static inline void DMA2D_Fill(void *pDst, uint32_t width, uint32_t height,
                              uint32_t lineOff, uint32_t pixelFormat, uint32_t color) {

    /* DMA2D configuration */
    DMA2D->CR      = 0x00030000UL;                                  // register-to-memory mode
    DMA2D->OCOLR   = color;                                         // fill color; must match the configured color format
    DMA2D->OMAR    = (uint32_t)pDst;                                // start address of the area to fill
    DMA2D->OOR     = lineOff;                                       // line offset (pixels to skip), in pixels
    DMA2D->OPFCCR  = pixelFormat;                                   // color format
    DMA2D->NLR     = (uint32_t)(width << 16) | (uint16_t)height;    // width and height of the area, in pixels

    /* Start the transfer */
    DMA2D->CR   |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}

To facilitate code writing, we wrap a rectangle filling function for the screen coordinate system used:

void FillRect(uint16_t x, uint16_t y, uint16_t w, uint16_t h, uint16_t color){
    void* pDist = &(((uint16_t*)framebuffer)[y*320 + x]);
    DMA2D_Fill(pDist, w, h, 320 - w, LTDC_PIXEL_FORMAT_RGB565, color);
}

Finally, we try to use code to draw the example chart at the beginning of this section:

// Fill the background
FillRect(0,   0,   320, 240,  0xFFFF);
// Draw the data bars
FillRect(80,  80,  20,  120,  0x001f);
FillRect(120, 100, 20,  100,  0x001f);
FillRect(160, 40,  20,  160,  0x001f);
FillRect(200, 60,  20,  140,  0x001f);
// Draw the X axis
FillRect(40,  200, 240, 1,    0x0000);
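The color constants above are RGB565 values: 5 bits of red, 6 bits of green, and 5 bits of blue. A small helper, sketched below under the hypothetical name `rgb565`, shows how such values can be derived from ordinary 8-bit channels by keeping the top bits of each channel.

```c
#include <assert.h>
#include <stdint.h>

/* Pack 8-bit R, G, B into RGB565 by keeping the top 5/6/5 bits. */
static inline uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t value = (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
    assert((value >> 11) == (r >> 3));  /* red occupies bits 15..11 */
    return value;
}
```

White (255, 255, 255) gives 0xFFFF, pure blue (0, 0, 255) gives 0x001F, and black gives 0x0000, matching the constants used for the background, the bars, and the axis.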

Code running effect:

2. Image display (memory copy)

Suppose we are developing a game and want to display a dancing flame on the screen. Usually, the artist draws each frame of the flame first and then puts it into the same picture material, as shown below:

Then we display each frame of the image in turn at a certain interval, and we can achieve the effect of "dancing flames" on the screen.

We will now skip the process of loading the source file into memory, assuming that the source image is already in memory. Then let's consider how to display one of the frames on the screen. Usually, we would do this by first calculating the address of each frame's data in memory, and then copying the data of this frame to the corresponding position in the framebuffer. The code is similar to this:

#include <string.h> // memcpy

/**
 * Copy one frame of the material into the corresponding position in the framebuffer.
 * index is the index of the frame within the frame sequence.
 */
static void General_DisplayFrameAt(uint16_t index) {
    // Macros used:
    // #define FRAME_COUNTS     25  // number of frames
    // #define TILE_WIDTH_PIXEL 96  // width of each frame (equal to its height)
    // #define TILE_COUNT_ROW   5   // number of frames per row in the material

    // Compute the start address of the frame
    uint16_t *pStart = (uint16_t *) img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;

    // Line offset within the material
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;
    // Line offset within the framebuffer (320 is the screen width)
    uint32_t offlineDist = 320 - TILE_WIDTH_PIXEL;

    // Copy the data to the framebuffer, one row at a time
    uint16_t* pFb = (uint16_t*) framebuffer;
    for (int y = 0; y < TILE_WIDTH_PIXEL; y++) {
        memcpy(pFb, pStart, TILE_WIDTH_PIXEL * sizeof(uint16_t));
        pStart += offlineSrc + TILE_WIDTH_PIXEL;
        pFb += offlineDist + TILE_WIDTH_PIXEL;
    }
}
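The start-address calculation in the listing can be isolated into a helper and checked with the article's numbers (96-pixel tiles, 5 tiles per row). `tile_offset` is a hypothetical name used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Pixel offset of the top-left corner of frame `index` inside a sprite
 * sheet made of square tiles laid out row by row. */
static inline size_t tile_offset(size_t index, size_t tile_size, size_t tiles_per_row)
{
    assert(tile_size > 0 && tiles_per_row > 0);
    size_t sheet_width = tile_size * tiles_per_row;   /* pixels per sheet row */
    size_t tile_row    = index / tiles_per_row;
    size_t tile_col    = index % tiles_per_row;
    return tile_row * tile_size * sheet_width + tile_col * tile_size;
}
```

Frame 1 starts 96 pixels into the sheet, frame 5 (the first tile of the second row) starts at 96 * 96 * 5 = 46080, and frame 7 starts at 46080 + 2 * 96 = 46272.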

As you can see, this effect requires a large number of memory copy operations. In embedded systems, hardware DMA is the most efficient way to copy a large amount of data. However, hardware DMA can only move data at continuous addresses. Here, the data to be copied is not continuous in either the source image or the framebuffer, which adds overhead (the same problem as in the first section) and makes it impossible to use hardware DMA for an efficient copy.

So, although we achieved our goal, the efficiency was not high (or not the highest possible).

In order to move a piece of data from a material image to the frame buffer as quickly as possible, let's see how to use DMA2D to achieve this.

First, because this time we are going to copy data in the memory, we need to set the DMA2D working mode to "memory to memory mode", which is achieved by setting the [17:16] bits of the CR register of DMA2D to 00. The code is as follows:

DMA2D->CR      = 0x00000000UL;

Then we need to set the memory addresses of the source and target separately. Unlike in the first section, because the data source also has a memory offset, we need to set the data offset of the source and target locations at the same time.

DMA2D->FGMAR   = (uint32_t)pSrc; // source address
DMA2D->OMAR    = (uint32_t)pDst; // destination address
DMA2D->FGOR    = OffLineSrc;     // source line offset (pixels)
DMA2D->OOR     = OffLineDst;     // destination line offset (pixels)
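The effect of the two line offsets can be modeled in software. The sketch below copies a rectangle row by row, skipping a different number of pixels on the source side and the destination side; it only illustrates the addressing and is not a description of the hardware. `soft_copy_rgb565` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software model of a 2D copy with separate source and destination strides. */
static void soft_copy_rgb565(const uint16_t *src, uint16_t *dst,
                             size_t width, size_t height,
                             size_t src_line_offset, size_t dst_line_offset)
{
    assert(src != NULL && dst != NULL);
    size_t s = 0, d = 0;               /* indices relative to the start pixels */
    for (size_t row = 0; row < height; row++) {
        memcpy(&dst[d], &src[s], width * sizeof(uint16_t));
        s += width + src_line_offset;  /* next row in the source image */
        d += width + dst_line_offset;  /* next row in the framebuffer  */
    }
}
```

In this model `src` corresponds to FGMAR, `dst` to OMAR, `src_line_offset` to FGOR, and `dst_line_offset` to OOR.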

Then, you still need to set the width and height of the image to be copied, as well as the color format, which is the same as in the first section.

DMA2D->FGPFCCR = pixelFormat;
DMA2D->NLR     = (uint32_t)(xSize << 16) | (uint16_t)ySize;

In the same way, we start the DMA2D transfer and wait for it to complete:

/* Start the transfer */
DMA2D->CR   |= DMA2D_CR_START;

/* Wait for the DMA2D transfer to complete */
while (DMA2D->CR & DMA2D_CR_START) {}

Finally, the function we extracted is as follows:

static void DMA2D_MemCopy(uint32_t pixelFormat, void * pSrc, void * pDst, int xSize, int ySize, int OffLineSrc, int OffLineDst)
{
    /* DMA2D configuration */
    DMA2D->CR      = 0x00000000UL;      // memory-to-memory mode
    DMA2D->FGMAR   = (uint32_t)pSrc;    // source address
    DMA2D->OMAR    = (uint32_t)pDst;    // destination address
    DMA2D->FGOR    = OffLineSrc;        // source line offset (pixels)
    DMA2D->OOR     = OffLineDst;        // destination line offset (pixels)
    DMA2D->FGPFCCR = pixelFormat;       // color format
    DMA2D->NLR     = ((uint32_t)xSize << 16) | (uint16_t)ySize; // width and height (pixels)

    /* Start the transfer */
    DMA2D->CR   |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}

For convenience, we wrap a function that calls it:

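static void DMA2D_DisplayFrameAt(uint16_t index){
    // Start address of the frame inside the material
    uint16_t *pStart = (uint16_t *)img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;

    // Destination: the frame centered on the 320x240 framebuffer.
    // pDist and offlineDist were not defined in the original listing;
    // the two definitions below are an assumption.
    uint16_t *pDist = (uint16_t *)framebuffer
                    + ((240 - TILE_WIDTH_PIXEL) / 2) * 320 + (320 - TILE_WIDTH_PIXEL) / 2;
    uint32_t offlineDist = 320 - TILE_WIDTH_PIXEL;

    DMA2D_MemCopy(LTDC_PIXEL_FORMAT_RGB565, (void*) pStart, pDist, TILE_WIDTH_PIXEL, TILE_WIDTH_PIXEL, offlineSrc, offlineDist);
}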

Then play each frame in turn. The frame interval is set to 50 milliseconds, and the target address is set to the center of the framebuffer:

while(1){
    for(int i = 0; i < FRAME_COUNTS; i++){
        DMA2D_DisplayFrameAt(i);
        HAL_Delay(FRAME_TIME_INTERVAL);
    }
}

The final running effect:

3. Image gradient switching

Suppose we want to develop a picture viewing application. When switching between two pictures, direct switching will appear abrupt, so we need to add dynamic effects when switching. Gradient switching (fade in and fade out) is a very commonly used effect, and it looks good.

Let’s use these two pictures:

Here we need to understand the basic concept of alpha blending. Alpha blending requires a foreground and a background. The result of the blending is equivalent to looking through the foreground at the background. If the foreground is completely opaque, the background is completely invisible. Conversely, if the foreground is completely transparent, only the background can be seen. If the foreground is semi-transparent, the two are blended according to the transparency of the foreground color.

If alpha is the transparency of the foreground, where 1 means completely transparent and 0 means completely opaque, the blending formula is as follows, where A is the background color, B is the foreground color, and C is the result:

X(C)=(1-alpha)*X(B) + alpha*X(A)

Because color has three channels, RGB, we need to calculate all three channels and then combine them after the calculation is completed:

R(C)=(1-alpha)*R(B) + alpha*R(A)
G(C)=(1-alpha)*G(B) + alpha*G(A)
B(C)=(1-alpha)*B(B) + alpha*B(A)

In the program, for the sake of efficiency (floating-point operations are comparatively slow on the CPU), we do not use values in the range of 0 to 1. Usually, we use an 8-bit value ranging from 0 to 255 to represent transparency. Note that here the larger the value, the less transparent the color: 255 is completely opaque and 0 is completely transparent (which is why it is also called opacity). This gives the final formula:

outColor = ((int) (fgColor * alpha) + (int) (bgColor) * (256 - alpha)) >> 8;
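A worked example makes the integer formula easier to follow. The helper below applies it to a single channel; `blend_channel` is a hypothetical name, and the inputs are assumed to be small channel values such as the 5-bit or 6-bit fields of RGB565.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one color channel; alpha is the opacity of the foreground (0..255). */
static inline uint32_t blend_channel(uint32_t fg, uint32_t bg, uint32_t alpha)
{
    assert(alpha <= 255u);
    return (fg * alpha + bg * (256u - alpha)) >> 8;
}
```

With fg = 10, bg = 20, and alpha = 128 the result is (1280 + 2560) >> 8 = 15, exactly halfway. Because the weights are alpha and 256 - alpha, a fully opaque foreground (alpha = 255) still leaves 1/256 of the weight on the background, so a 5-bit value of 31 over a background of 0 comes out as 30. This small error is the price of replacing the division by 255 with a shift.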

Code that implements transparency blending for RGB565 pixels:

// Bit-field layout is implementation-defined. With GCC on little-endian Arm,
// the first field occupies the least significant bits, so blue comes first.
typedef struct{
    uint16_t b:5;
    uint16_t g:6;
    uint16_t r:5;
}RGB565Struct;

static inline uint16_t AlphaBlend_RGB565_8BPP(uint16_t fg, uint16_t bg, uint8_t alpha) {
    RGB565Struct *fgColor = (RGB565Struct*) (&fg);
    RGB565Struct *bgColor = (RGB565Struct*) (&bg);
    RGB565Struct outColor;

    // Blend each channel separately; alpha is the opacity of the foreground
    outColor.r = ((int) (fgColor->r * alpha) + (int) (bgColor->r) * (256 - alpha)) >> 8;
    outColor.g = ((int) (fgColor->g * alpha) + (int) (bgColor->g) * (256 - alpha)) >> 8;
    outColor.b = ((int) (fgColor->b * alpha) + (int) (bgColor->b) * (256 - alpha)) >> 8;

    return *((uint16_t*)&outColor);
}
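As an alternative sketch, the same blend can be written with shifts and masks, which avoids depending on the compiler's bit-field layout and on pointer casts between unrelated types. This is a variation for illustration, not the code used in the measurements; `blend_rgb565` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Blend two RGB565 pixels; alpha is the opacity of the foreground (0..255). */
static inline uint16_t blend_rgb565(uint16_t fg, uint16_t bg, uint8_t alpha)
{
    uint32_t inv = 256u - alpha;
    uint32_t r = (((fg >> 11) & 0x1Fu) * alpha + ((bg >> 11) & 0x1Fu) * inv) >> 8;
    uint32_t g = (((fg >> 5)  & 0x3Fu) * alpha + ((bg >> 5)  & 0x3Fu) * inv) >> 8;
    uint32_t b = ((fg & 0x1Fu) * alpha + (bg & 0x1Fu) * inv) >> 8;
    assert(r <= 0x1Fu && g <= 0x3Fu && b <= 0x1Fu);
    return (uint16_t)((r << 11) | (g << 5) | b);
}
```

Blending white (0xFFFF) over black (0x0000) at alpha = 128 gives 0x7BEF, a mid gray, and alpha = 0 returns the background unchanged.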

After understanding the concept of transparency blending and implementing transparency blending of a single pixel, let's see how to implement gradient switching of images.

Assuming that the entire gradient is completed within 30 frames, we need to open a buffer in memory that is equal to the size of the picture. Then we use the first picture (the picture currently displayed) as the background and the second picture (the picture to be displayed next) as the foreground. Then we set a transparency for the foreground, blend the transparency of each pixel, and temporarily store the blended result in the buffer. After the blending is completed, the data in the buffer is copied to the framebuffer to complete the display of one frame. Next, continue with the second frame, the third frame, and so on, gradually increasing the opacity of the foreground until the foreground color becomes opaque, which means that the gradient switching of the picture is completed.
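The per-frame opacity described above can be computed with a simple linear ramp. The helper below is a sketch under the hypothetical name `fade_opacity`; it maps the first frame to 0 (foreground invisible) and the last frame to 255 (foreground opaque).

```c
#include <assert.h>
#include <stdint.h>

/* Foreground opacity for frame `frame` of a fade lasting `frame_count` frames. */
static inline uint8_t fade_opacity(uint32_t frame, uint32_t frame_count)
{
    assert(frame_count >= 2 && frame < frame_count);
    return (uint8_t)(255u * frame / (frame_count - 1u));
}
```

For a 30-frame fade, frame 0 gives 0, frame 15 gives 131, and frame 29 gives 255.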

Each frame requires a blending operation on every pixel of the two pictures, which is a huge amount of computation. Leaving it to the CPU is unwise, so we hand this work to DMA2D as well.

This time we use the blending function of DMA2D, so we need to enable memory-to-memory mode with blending. The corresponding value of bits [17:16] of the CR register is 10, that is:

DMA2D->CR    = 0x00020000UL;                // memory-to-memory mode with blending

Then set the memory address and data transmission offset of the foreground, background and output data, and the width and height of the transmitted image respectively:

DMA2D->FGMAR = (uint32_t)pFg;               // foreground data address
DMA2D->BGMAR = (uint32_t)pBg;               // background data address
DMA2D->OMAR  = (uint32_t)pDst;              // output data address

DMA2D->FGOR  = offlineFg;                   // foreground line offset
DMA2D->BGOR  = offlineBg;                   // background line offset
DMA2D->OOR   = offlineDist;                 // output line offset

DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize; // image width and height (pixels)

Next, set the color formats. The foreground color format needs care: with a format such as ARGB, the alpha channel of the color data itself affects the blending result. Here we therefore configure DMA2D to ignore the foreground's own alpha channel when blending and to use the opacity we specify instead.

The background color format and the output color format are set as well:

DMA2D->FGPFCCR = pixelFormat                // foreground color format
        | (1UL << 16)                       // ignore the alpha channel in the foreground color data
        | ((uint32_t)opa << 24);            // foreground opacity

DMA2D->BGPFCCR = pixelFormat;               // background color format
DMA2D->OPFCCR = pixelFormat;                // output color format
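The three fields combined in FGPFCCR can be named to make the intent clearer. The sketch below assumes the layout described in ST's reference manuals (color mode in the low bits, a 2-bit alpha mode at bits [17:16], and an 8-bit alpha value at bits [31:24]); the names `DEMO_ALPHA_*` and `demo_pfccr` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the 2-bit alpha mode field. */
enum {
    DEMO_ALPHA_KEEP     = 0, /* use the alpha stored in each pixel          */
    DEMO_ALPHA_REPLACE  = 1, /* replace pixel alpha with the constant value */
    DEMO_ALPHA_MULTIPLY = 2  /* multiply pixel alpha by the constant value  */
};

/* Compose a foreground/background PFC control value. */
static inline uint32_t demo_pfccr(uint32_t color_mode, uint32_t alpha_mode, uint8_t alpha)
{
    assert(color_mode <= 0xFu && alpha_mode <= 2u);
    return color_mode | (alpha_mode << 16) | ((uint32_t)alpha << 24);
}
```

With color mode 2 (RGB565), the replace mode, and an opacity of 0x80, the result is 0x80010002, which is the same value the register write above produces for `opa = 0x80`.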

Tips0: Sometimes a picture that has its own transparency channel needs to be superimposed on the background. In that case, the alpha channel of the color data itself should not be disabled.

Tips1: In this mode we can not only blend colors but also convert color formats at the same time. The foreground, background, and output color formats can each be set as needed.

Finally, start the transfer:

/* Start the transfer */
DMA2D->CR   |= DMA2D_CR_START;

/* Wait for the DMA2D transfer to complete */
while (DMA2D->CR & DMA2D_CR_START) {}

The complete code is as follows:

void _DMA2D_MixColors(void* pFg, void* pBg, void* pDst,
        uint32_t offlineFg, uint32_t offlineBg, uint32_t offlineDist,
        uint16_t xSize, uint16_t ySize,
        uint32_t pixelFormat, uint8_t opa) {

    DMA2D->CR    = 0x00020000UL;                // memory-to-memory mode with blending

    DMA2D->FGMAR = (uint32_t)pFg;               // foreground data address
    DMA2D->BGMAR = (uint32_t)pBg;               // background data address
    DMA2D->OMAR  = (uint32_t)pDst;              // output data address

    DMA2D->FGOR  = offlineFg;                   // foreground line offset
    DMA2D->BGOR  = offlineBg;                   // background line offset
    DMA2D->OOR   = offlineDist;                 // output line offset

    DMA2D->NLR = ((uint32_t)xSize << 16) | ySize; // image width and height (pixels)

    DMA2D->FGPFCCR = pixelFormat                // foreground color format
            | (1UL << 16)                       // ignore the alpha channel in the foreground color data
            | ((uint32_t)opa << 24);            // foreground opacity

    DMA2D->BGPFCCR = pixelFormat;               // background color format
    DMA2D->OPFCCR  = pixelFormat;               // output color format

    /* Start the transfer */
    DMA2D->CR   |= DMA2D_CR_START;

    /* Wait for the DMA2D transfer to complete */
    while (DMA2D->CR & DMA2D_CR_START) {}
}

Now write the test code. This time no extra wrapper function is needed:

void DMA2D_AlphaBlendDemo(void){

    const uint16_t lcdXSize = 320, lcdYSize = 240;
    const uint8_t cnvFrames = 60; // the transition takes 60 frames
    const uint32_t interval = 33; // 33 ms per frame, about 30 frames per second
    uint32_t time = 0;

    // Compute the memory address of the output position (image centered on screen)
    uint16_t distX = (lcdXSize - DEMO_IMG_WIDTH) / 2;
    uint16_t distY = (lcdYSize - DEMO_IMG_HEIGHT) / 2;
    uint16_t* pFb = (uint16_t*) framebuffer;
    uint16_t* pDist = pFb + distX + distY * lcdXSize; // one row is lcdXSize pixels long
    uint16_t offlineDist = lcdXSize - DEMO_IMG_WIDTH;

    uint8_t nextImg = 1;
    uint16_t opa = 0;
    void* pFg = 0;
    void* pBg = 0;
    while(1){
        // Swap the foreground and background pictures
        if(nextImg){
            pFg = (void*)img_cat;
            pBg = (void*)img_fox;
        }
        else{
            pFg = (void*)img_fox;
            pBg = (void*)img_cat;
        }

        // Perform the transition
        for(int i = 0; i < cnvFrames; i++){
            time = HAL_GetTick();
            opa = 255 * i / (cnvFrames-1);
            _DMA2D_MixColors(pFg, pBg, pDist,
                    0,0,offlineDist,
                    DEMO_IMG_WIDTH, DEMO_IMG_HEIGHT,
                    LTDC_PIXEL_FORMAT_RGB565, opa);
            time = HAL_GetTick() - time;
            if(time < interval){
                HAL_Delay(interval - time);
            }
        }
        nextImg = !nextImg;
        HAL_Delay(5000);
    }
}

Final effect:

Performance comparison

In the previous section, we walked through three examples from embedded graphics development, covering both the traditional and the DMA2D implementation. Some readers will certainly ask how much faster the DMA2D implementation is than the traditional one. Let's measure it.

The common test conditions are as follows:

Framebuffer is placed in SDRAM, 320x240, RGB565
SDRAM runs at 100MHz, CL2, with a 16-bit bus width
MCU is STM32H750XB, main frequency 400MHz, I-Cache and D-Cache are enabled
The code and resources are on internal Flash, 64-bit AXI bus, with a speed of 200MHz.
GCC compiler (version: arm-atollic-eabi-gcc-6.3.1)
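The tests that follow repeat a drawing routine many times and compare the elapsed time. A minimal harness for this kind of measurement might look like the sketch below. It is not the author's test code: `measure_ms` is a hypothetical helper, and `fake_now` and `fake_draw` are host-side stand-ins so the sketch can run without hardware. On a target, the tick source could be `HAL_GetTick` and the routine under test could be one of the drawing functions from the previous chapter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void     (*DrawFn)(void);   /* routine under test      */
typedef uint32_t (*TickFn)(void);   /* millisecond tick source */

/* Run `draw` `iterations` times and return the elapsed ticks. */
static uint32_t measure_ms(DrawFn draw, uint32_t iterations, TickFn now_ms)
{
    assert(draw != NULL && now_ms != NULL);
    uint32_t start = now_ms();
    for (uint32_t i = 0; i < iterations; i++) {
        draw();
    }
    return now_ms() - start;        /* unsigned subtraction tolerates tick wrap-around */
}

/* Host-side stand-ins: each fake draw call "takes" 2 ms. */
static uint32_t fake_ticks;
static uint32_t fake_now(void)  { return fake_ticks; }
static void     fake_draw(void) { fake_ticks += 2; }
```

Running the same harness once with the software routine and once with the DMA2D routine gives two numbers that can be compared directly.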

Test item: Rectangle filling

Draw the chart from Section 1 of the previous chapter 10,000 times and record the results

Test Results:

Test item: memory copy

Draw 10,000 frames of the frame sequence from Section 2 of the previous chapter and record the results

Test Results:

Test item: Transparency blending

Fade between the two pictures from Section 3 of the previous chapter 100 times, 30 frames each time, for a total of 3,000 frames
The blended result is written directly to the framebuffer, without the intermediate buffer

Test Results:

Performance Test Summary

From the above test results, we can see that DMA2D has at least two advantages:

First, it is faster: in some projects, the speed of DMA2D can be up to 30 times faster than that of pure software implementation! This is the result of testing on the STM32H750 platform with a main frequency of up to 400MHz and L1-Cache. If it is tested on the STM32F4 platform without cache and with a lower main frequency, the gap will be further widened.

Second, the performance is more stable: the test results show that the DMA2D implementation is barely affected by the compiler optimization level. This means that whether you use IAR, GCC, or MDK, you get the same performance from DMA2D, and the same code will not perform very differently after being ported.

In addition to these two intuitive results, there is actually a third advantage, which is that code writing is easier. DMA2D has few registers and is relatively intuitive. In some cases, it is much more convenient to use than software implementation.

Conclusion

The three examples in this article are all situations that I often encounter in embedded graphics development. In fact, there are many more uses of DMA2D. If you are interested, you can refer to the relevant content in the "STM32H743 Chinese Programming Manual". I believe that with the foundation of this article, you will get twice the result with half the effort when reading the content.

Due to the limitations of the author's skills, the content in the article cannot be 100% correct. If there are any errors, please point them out. Thank you.
