Super streamlined series: an efficient memory pool with fixed execution time
1. Introduction
Let's start with a bug that is easy to introduce when developing in an RTOS environment, to explain why we designed our own memory pool wheel. Take FreeRTOS as an example: its heap management, malloc and free, is implemented with linked lists. The linked list is a global resource, and operations on it are not atomic; they involve many steps. Critical-section protection is therefore required when it is called from multiple contexts, otherwise an operation can be interrupted halfway, another operation can run, and the data structure is corrupted.
FreeRTOS implements this critical-section protection by suspending the scheduler. Suspending the scheduler only prevents preemption by other tasks; it cannot prevent preemption by interrupts, so FreeRTOS's malloc and free interfaces must not be called from interrupts. However, driver design sometimes requires dynamic memory management, many drivers are interrupt-driven, and sometimes the heap interface must be called inside an interrupt. Calling the FreeRTOS heap interface there directly will cause problems. Such problems are hard to find and may never show up in testing: only at some coincidental moment, when a heap call in a task is interrupted and the ISR calls the heap interface again, does corruption occur. This randomness makes debugging very difficult. Knowing the cause, similar errors can be avoided. But back to the question: what if we still need dynamic memory in an interrupt? The simplest way is to take the heap interface provided by FreeRTOS and change "suspend the scheduler" to "disable interrupts". This guarantees the critical section is never preempted, so there is no corruption, but it has a side effect: the interrupt-disabled time increases, and that time is not even deterministic. We will measure this in the next section. Heap management also suffers from problems such as fragmentation, so it is not well suited to the driver layer or to interrupts anyway. The other option is a memory pool, a common approach in middleware and drivers. This article implements a simple and efficient memory pool, but we are not satisfied with just that; since it is used in the driver layer and even in interrupts, we have higher requirements: it should execute very fast, and its execution time should be essentially constant. These two properties are important, and many systems must guarantee them.
2. Heap management interface execution time test
There are many ways to measure execution time; hardware timers are common. Here we use another method that is also common in practice: toggle an IO pin and view the waveform on an oscilloscope. Toggle the IO when entering the interface and toggle it again when leaving; the pulse width of the IO waveform is the execution time. This visualizes the measurement: changes in pulse density show the jitter of the execution time, and the pulse width shows the execution time itself. By using different IOs for malloc and free, we can also see the time distribution of allocation versus release, so this is a very good method.
I measured a certain platform; the execution times of FreeRTOS's malloc and free (note that execution time differs across hardware platforms) show large jitter, and the time is not short: 4.6 µs is already very long. Such a time may, for example, fail a timing analysis for interrupt context. We then compared this against our efficient memory pool implementation and found essentially no jitter and a very short execution time (the overhead of the IO-toggling function itself and the IO response time have not been subtracted). The assembly code also shows that very few instructions are executed.
3. Memory pool implementation
Conventional design
Divide a region of memory into n equally sized blocks, each with a field in front marking whether it is used and the rest as usable storage. Initially all blocks are free and every mark is 0. To allocate, search from the beginning; on finding a block marked 0, change the mark to 1 and take the block. To release, match the address and set the corresponding mark back to 0.
The mark occupies at least one byte and, because it is inserted before the usable block, it disturbs the block's alignment. This can be optimized by recording the mark information separately in a bitmap:
The usable blocks can then be laid out contiguously with whatever alignment the stored data structure requires. One bit of the bitmap corresponds to one block: 1 means the block is allocated, 0 means it is free. The bitmap also reduces the space overhead. The allocation and release algorithms are similar to before. To allocate, search the bitmap for a 0, starting from the lowest bit of the lowest byte, set it to 1, and compute the corresponding block address. To release, compute the bit index in the bitmap from the address and clear the corresponding bit to 0.
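This conventional bitmap scheme can be sketched as follows; the names and sizes here are illustrative, not the final API of this article, and the O(n) scan is the source of the time jitter discussed next:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal linear-search bitmap allocator over NUM_BLOCKS fixed-size blocks.
   Bit = 1 means allocated, bit = 0 means free. */
#define NUM_BLOCKS 16
#define BLOCK_SIZE 32

static uint8_t pool[NUM_BLOCKS * BLOCK_SIZE];
static uint8_t bitmap[(NUM_BLOCKS + 7) / 8];

void* simple_pool_alloc(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {          /* O(n) scan: jitter source */
        if ((bitmap[i >> 3] & (1u << (i & 7))) == 0) {
            bitmap[i >> 3] |= (uint8_t)(1u << (i & 7));  /* mark allocated */
            return &pool[i * BLOCK_SIZE];
        }
    }
    return 0;                                        /* pool exhausted */
}

void simple_pool_free(void* p)
{
    /* Map the address back to a bit index and clear it */
    int i = (int)(((uint8_t*)p - pool) / BLOCK_SIZE);
    bitmap[i >> 3] &= (uint8_t)~(1u << (i & 7));
}
```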
Efficient design with no time jitter
The bitmap implementation above is optimized for storage, but its execution time jitters because of the linear search. To meet the jitter-free requirement, the goal is an algorithm that finds the first free bit, starting from the lowest bit of the lowest byte, in fixed time. Many of our previous articles have mentioned the idea of contradiction, that is, looking at a problem from the perspective of its opposite, because in a contradiction one side often shrinks as the other grows. The time and space of an algorithm are such a pair: if time is constrained, see whether the problem can be approached from the space side. A common way of trading space for time is table lookup, which takes fixed time and is also efficient. In fact µC/OS-II uses exactly this algorithm to find the highest ready priority; we use it directly and briefly introduce its principle.
First, arrange the bitmap as a matrix: naturally, each byte is a row and its bits are the columns, with bits 0 to 7 corresponding to columns 0 to 7 and multiple bytes corresponding to multiple rows. Each bit corresponds to a block; here a bit of 1 indicates free and a bit of 0 indicates occupied (note that this convention is inverted relative to the conventional design above, so that the search looks for a 1). The first 1, scanning from the lowest bit of the lowest byte, marks the free block we are looking for. So how do we find this bit quickly? The general idea is to start from byte 0: if it is all 0, continue with byte 1, and so on until a non-zero byte is found; then locate the position of the first 1 within that byte, from bit 0 to bit 7. The index of this bit in the whole bitmap is then (byte number * 8 + bit number). The time therefore decomposes into two steps: first find the byte position, then find the bit position.
In fact, the bit position can be found in fixed time by table lookup. An 8-bit value has 256 possibilities; for each possibility we simply write down the position where the lowest 1 appears. So we construct a 256-byte array in which the value at a given index is the bit position of the first 1 in that index's binary representation.
For example:
00000000 has no 1 at all; the value at array index 0 is defined as 0 (the caller must detect the all-zero case separately).
00000001: the first 1 appears at bit 0, so the value at array index 1 is 0.
00000010: the first 1 appears at bit 1, so the value at array index 2 is 1.
00000011: the first 1 appears at bit 0, so the value at array index 3 is 0.
00000100: the first 1 appears at bit 2, so the value at array index 4 is 2.
This finally yields the following table:
static uint8_t const s_unmaptbl[256] = {
0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 /* 0xF0 to 0xFF */
};
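The table does not have to be written out by hand; a short loop can generate it, and can also be used to verify the listing above:

```c
#include <stdint.h>

/* Generate the lowest-set-bit position table: tbl[v] is the index of the
   lowest 1 bit in v, with tbl[0] defined as 0 (the all-zero case must be
   handled separately by the caller). */
void build_unmap_table(uint8_t tbl[256])
{
    for (int v = 0; v < 256; v++) {
        uint8_t pos = 0;
        for (int b = 0; b < 8; b++) {
            if (v & (1 << b)) {
                pos = (uint8_t)b;   /* first 1 found, from bit 0 upward */
                break;
            }
        }
        tbl[v] = pos;               /* v == 0 falls through with pos == 0 */
    }
}
```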
Now the position of the first 1 within a byte can be found with a single table lookup, in fixed time. What remains, as introduced before, is finding the first non-zero byte starting from byte 0. Think about it carefully: within a byte, 8 bits formed a row and we looked up the column of the first 1; finding the first non-zero byte is the same problem one level up, finding the row. So we record, for each byte, whether it is zero, using another bitmap. Our original bitmap has 8 bytes, that is 8 rows, so one extra byte with 8 bits suffices to record whether each of those 8 bytes is zero. This added bitmap is called grp: if byte i of the original bitmap is 0, then bit i of grp is 0; otherwise it is 1, meaning there is a 1 somewhere in that byte, i.e. a free block. Then the problem is simple: finding the lowest non-zero byte among the 8 bytes is just finding the first 1 from the lowest bit of grp, which is exactly the table lookup we introduced. In this way, finding the first 1 in the whole bitmap takes two table lookups: first look up grp to get the byte position, then look up that byte's value to get the bit position. The final index is "byte position * 8 + bit position".
The final algorithm is implemented as follows
y = s_unmaptbl[dev->grp];
x = s_unmaptbl[dev->tbl[y]];
index = (uint8_t)((y<<3) + x);
Here dev->grp is the grp byte described above, and dev->tbl is the bitmap byte array; y is the byte position, x is the bit position, and index is the resulting index into the bitmap.
Having found the bitmap index, allocation clears the corresponding bit to 0: index>>3 gives the byte index, and ANDing with ~(1u<<(index & 0x07)) clears the bit. If this byte becomes 0, the corresponding bit of the grp byte must also be cleared:
if(((dev->tbl[index>>3] &= (~(1u<<(index & 0x07)))) == 0))
{
dev->grp &= ~(1u<<(index>>3));
}
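The clear-and-propagate step can be exercised in isolation with a toy model (illustrative names; one grp byte over 8 tbl bytes, as in the article):

```c
#include <stdint.h>

/* Toy model: tbl[8] bitmap bytes plus one grp byte summarizing which tbl
   bytes are non-zero (bit i of grp set <=> tbl[i] != 0). */
typedef struct {
    uint8_t grp;
    uint8_t tbl[8];
} toy_pool;

/* Allocate: clear the bit; if the whole byte becomes 0, clear its grp bit */
void alloc_bit(toy_pool* p, uint8_t index)
{
    if ((p->tbl[index >> 3] &= (uint8_t)~(1u << (index & 0x07))) == 0) {
        p->grp &= (uint8_t)~(1u << (index >> 3));   /* byte now empty */
    }
}

/* Release: set the bit back and mark the byte non-empty in grp */
void free_bit(toy_pool* p, uint8_t index)
{
    p->grp |= (uint8_t)(1u << (index >> 3));
    p->tbl[index >> 3] |= (uint8_t)(1u << (index & 0x07));
}
```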
To release, set the bit at the index position of the bitmap back to 1:
dev->grp |= 1u<<(index>>3);
dev->tbl[index>>3] |= 1u<<(index & 0x07);
Here is an example. Bytes 2 to 7 of tbl are non-zero, so bits 2 to 7 of grp are all 1, while bits 0 and 1 are 0, giving grp = 0xFC. The first 1, scanning from the lowest bit of the lowest byte of tbl, is bit 3 of byte 2. The calculation goes as follows: grp = 0xFC = 252 is looked up first; the table value at index 252 is 2, locating byte 2. The value of tbl[2] is 0x08, and the table value at index 0x08 is 3. So the first 1 is bit 3 of byte 2, that is, index 2*8+3 = 19.
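This worked example can be checked directly in code with the two lookups (the table is reproduced so the snippet stands alone; first_free_index is an illustrative helper name, not the article's final API):

```c
#include <stdint.h>

/* Lowest-set-bit position table, as in the article */
static uint8_t const s_unmaptbl[256] = {
    0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
    6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
    7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
    6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0  /* 0xF0 to 0xFF */
};

/* Two fixed-time lookups: grp gives the byte (row), that byte's value gives
   the bit (column); the result is the index into the whole bitmap. */
uint8_t first_free_index(uint8_t grp, const uint8_t tbl[8])
{
    uint8_t y = s_unmaptbl[grp];
    uint8_t x = s_unmaptbl[tbl[y]];
    return (uint8_t)((y << 3) + x);
}
```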
The source code is as follows
mem_pool.c
static uint8_t const s_unmaptbl[256] = {
0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 /* 0xF0 to 0xFF */
};
int mem_pool_init(mem_pool_st* dev, void* buffer, void* tbl, uint32_t itemsize, uint32_t itemnum)
{
if(dev == (mem_pool_st*)0)
{
return -1;
}
if((buffer == (void*)0) || (tbl == (void*)0))
{
return -1;
}
/* One grp byte covers at most 8 tbl bytes, i.e. 64 blocks; itemnum must be a multiple of 8 */
if((itemnum == 0) || (itemnum > 64) || ((itemnum % 8) != 0))
{
return -1;
}
dev->buffer = buffer;
dev->itemnum = itemnum;
dev->itemsize = itemsize;
dev->tbl = tbl;
dev->grp = 0;
for(uint32_t i=0; i< (dev->itemnum+7)/8; i++)
{
dev->tbl[i] = (uint8_t)0xFF; /* all blocks initially free */
dev->grp |= 1u<<i; /* set grp bits only for tbl bytes that actually exist */
}
return 0;
}
uint8_t* mem_pool_malloc(mem_pool_st* dev)
{
/* Find the lowest-indexed free block */
if(dev == (mem_pool_st*)0)
{
return (void*)0;
}
if((dev->buffer == (void*)0) || (dev->tbl == (uint8_t*)0))
{
return (void*)0;
}
uint8_t y;
uint8_t x;
uint8_t index;
y = s_unmaptbl[dev->grp];
x = s_unmaptbl[dev->tbl[y]];
index = (uint8_t)((y<<3) + x);
if((index == 0) && (dev->tbl[0]==0))
{
/* Index 0 with tbl[0] == 0 means every bit of the bitmap is 0, i.e. no free block */
return (void*)0;
}
else
{
/* Clear the corresponding bit to mark the block as occupied */
if(((dev->tbl[index>>3] &= (~(1u<<(index & 0x07)))) == 0))
{
dev->grp &= ~(1u<<(index>>3));
}
/* Return the corresponding block address */
return dev->buffer + dev->itemsize*index;
}
}
int mem_pool_free(mem_pool_st* dev, uint8_t* buffer)
{
if(dev == (mem_pool_st*)0)
{
return -1;
}
if((dev->buffer == (void*)0) || (dev->tbl == (uint8_t*)0))
{
return -1;
}
if(buffer == (void*)0)
{
return -2;
}
uint8_t index;
index = (buffer - dev->buffer)/dev->itemsize;
if(((dev->grp & (1u<<(index>>3))) != 0) && ((dev->tbl[index>>3] & (1u<<(index & 0x07))) != 0))
{
/* Double free: the block is already free */
return -3;
}
/* Set the corresponding bit to 1 to mark the block as free */
dev->grp |= 1u<<(index>>3);
dev->tbl[index>>3] |= 1u<<(index & 0x07);
return 0;
}
mem_pool.h
#ifndef MEM_POOL_H
#define MEM_POOL_H

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct
{
uint32_t grp; /**< bitmap of non-zero tbl bytes */
uint32_t itemsize; /**< block size in bytes */
uint32_t itemnum; /**< total number of blocks, a multiple of 8, at most 64 */
uint8_t* buffer; /**< user-provided buffer of itemsize*itemnum bytes */
uint8_t* tbl; /**< user-provided bitmap buffer of (itemnum+7)/8 bytes */
} mem_pool_st;

int mem_pool_init(mem_pool_st* dev, void* buffer, void* tbl, uint32_t itemsize, uint32_t itemnum);
uint8_t* mem_pool_malloc(mem_pool_st* dev);
int mem_pool_free(mem_pool_st* dev, uint8_t* buffer);

#ifdef __cplusplus
}
#endif

#endif /* MEM_POOL_H */
4. Actual measurement
We replaced the malloc and free calls in one of our driver implementations with this memory pool, tested the functionality, and measured the timing jitter and execution time as described in section 2, confirming that the execution time is very short and there is essentially no jitter.
The initialization code is as follows
static dma_trans_t s_trans_buffer[TRANS_POOL_NUM];
static uint8_t s_trans_bitmap_buffer[(TRANS_POOL_NUM+7)/8];
static mem_group_info_t s_group_buffer[GROUP_POOL_NUM];
static uint8_t s_group_bitmap_buffer[(GROUP_POOL_NUM+7)/8];
mem_pool_st s_trans_mem_pool_dev;
mem_pool_st s_group_mem_pool_dev;
mem_pool_init(&s_trans_mem_pool_dev, s_trans_buffer, s_trans_bitmap_buffer, sizeof(dma_trans_t), TRANS_POOL_NUM);
mem_pool_init(&s_group_mem_pool_dev, s_group_buffer, s_group_bitmap_buffer, sizeof(mem_group_info_t), GROUP_POOL_NUM);
Allocation:
trans = (dma_trans_t *)mem_pool_malloc(&s_trans_mem_pool_dev);
Release:
mem_pool_free(&s_trans_mem_pool_dev, trans);
5. Conclusion
We applied the algorithm used in µC/OS-II to the memory pool wheel we designed and obtained an efficient memory pool with no time jitter. It is a very useful wheel. This also reminds us to pay attention to learning from others' ideas and implementations, making them our own and continuing to accumulate.
What we did above is use bitmap table lookup to achieve fixed time, and a two-dimensional bitmap to expand capacity. The example supports at most 64 blocks, that is, 8x8. What if we want to expand to more blocks? Naturally we could widen the two dimensions X/Y, changing tbl and grp from 8 bits to 16 bits each, but then the lookup table for a 16-bit value has 2^16 entries, that is, 64 KB, which is generally unacceptable in embedded systems. Instead, think again about how tbl was extended with grp: that was extending X with Y, going from one dimension to two. Suddenly everything brightens up, perhaps even with a little excitement: we can just as naturally add another dimension and go three-dimensional. To locate a point inside the volume, first find the plane, then the row within the plane, then the bit within the row. It is that simple and natural: change grp into an array, and add one vol byte on top, where each bit of vol records whether the corresponding grp byte is 0.

This is why we can only truly master a technology by understanding its essence. Seeing table lookup from the perspective of adding dimensions is the essence; once the essence is understood, the technique naturally extends to other scenarios, whereas getting entangled in technical details like the lookup table itself means the essence was never grasped, and it becomes hard to draw inferences from one case to others. This is why, when thinking about technical problems, we should think about the essence and even the philosophical meaning behind them; the principles behind them are the universal, essential ones.
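As a sketch of this three-dimensional extension (the article does not provide this code; the layout and names are hypothetical): one vol byte summarizes up to 8 grp bytes, each grp byte summarizes 8 tbl bytes, giving up to 8x8x8 = 512 blocks found with three fixed-time lookups. The lookup table is built at startup here instead of being written out:

```c
#include <stdint.h>

static uint8_t s_unmaptbl[256];

/* Build the lowest-set-bit position table (same contents as the article's
   static table; tbl[0] is 0 and the all-zero case is handled separately). */
void build_tbl(void)
{
    for (int v = 1; v < 256; v++) {
        uint8_t b = 0;
        while (!(v & (1 << b))) b++;
        s_unmaptbl[v] = b;
    }
    s_unmaptbl[0] = 0;
}

/* Hypothetical three-level layout: vol over grp[] over tbl[] */
typedef struct {
    uint8_t vol;     /* bit z set -> grp[z] is non-zero */
    uint8_t grp[8];  /* bit y of grp[z] set -> tbl[z*8+y] is non-zero */
    uint8_t tbl[64]; /* bit set -> block free */
} pool3d;

/* Three fixed-time lookups: plane, then row, then bit within the row */
uint16_t first_free_3d(const pool3d* p)
{
    uint8_t z = s_unmaptbl[p->vol];
    uint8_t y = s_unmaptbl[p->grp[z]];
    uint8_t x = s_unmaptbl[p->tbl[(z << 3) + y]];
    return (uint16_t)((((z << 3) + y) << 3) + x);
}
```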
We achieved the expansion by adding a dimension, a kind of dimensionality-reduction attack: a search over three levels becomes three table lookups, somewhat like "the essence of logarithms is to reduce multiplication and division to addition and subtraction." Perhaps beings in four-dimensional space look down on us three-dimensional beings in the same way.
I cannot help marveling at the wonders of the world; perhaps only the Creator could arrange it so. The ability to reduce dimensions is actually a common phenomenon: although we do not live in a high-dimensional world, a back door seems to have been left open for us, giving us the ability to analyze high dimensions by reducing them. Logarithmic reduction, projection reduction and the like are similar, letting us infer the whole picture from a side view. I wonder whether a system could be established, from a mathematically abstract perspective, to prove which kinds of spaces have dimension-reducible properties; that would found a mathematical system for analyzing high dimensions.