Super streamlined series: an efficient memory pool with fixed execution time
1. Introduction
Let's start with a bug that is easy to introduce when developing in an RTOS environment, to explain why we designed our own memory pool wheel. Take FreeRTOS as an example: its heap management, malloc and free, is implemented with linked lists. The linked list is a global resource, and operations on it are not atomic; they involve many steps. Critical-section protection is therefore required when it is called from multiple contexts, otherwise an operation can be interrupted halfway, another operation can run, and the data structure is corrupted.
FreeRTOS implements this critical-section protection by suspending the scheduler. Suspending the scheduler only prevents preemption by other tasks; it cannot prevent preemption by interrupts, so FreeRTOS's malloc and free interfaces must not be called from interrupts. However, driver design sometimes requires dynamic memory management, many drivers are interrupt-driven, and sometimes the heap interface must be called inside an interrupt. Calling the FreeRTOS heap interface there directly will cause problems. Such problems are hard to find and may never show up in testing: only at some coincidental moment, when a heap call in a task is interrupted and the ISR calls the heap interface again, does corruption occur. This randomness makes debugging very difficult. Knowing the cause, similar errors can be avoided. But back to the question: what if we still need dynamic memory in an interrupt? The simplest way is to take the heap interface provided by FreeRTOS and change "suspend the scheduler" to "disable interrupts". This guarantees the critical section is never preempted, so there is no corruption, but it has a side effect: the interrupt-disabled time increases, and that time is not even deterministic. We will measure this in the next section. Heap management also suffers from problems such as fragmentation, so it is not well suited to the driver layer or to interrupts anyway. The other option is a memory pool, a common approach in middleware and drivers. This article implements a simple and efficient memory pool, but we are not satisfied with just that; since it is used in the driver layer and even in interrupts, we have higher requirements: it should execute very fast, and its execution time should be essentially constant. These two properties are important, and many systems must guarantee them.
2. Heap management interface execution time test
There are many ways to measure execution time; hardware timers are common. Here we use another method that is also common in practice: toggle an IO pin and view the waveform on an oscilloscope. Toggle the IO when entering the interface and toggle it again when leaving; the pulse width of the IO waveform is the execution time. This visualizes the measurement: changes in pulse density show the jitter of the execution time, and the pulse width shows the execution time itself. By using different IOs for malloc and free, we can also see the time distribution of allocation versus release, so this is a very good method.
I measured a certain platform; the execution times of FreeRTOS's malloc and free (note that execution time differs across hardware platforms) show large jitter, and the time is not short: 4.6 µs is already very long. Such a time may, for example, fail a timing analysis for interrupt context. We then compared this against our efficient memory pool implementation and found essentially no jitter and a very short execution time (the overhead of the IO-toggling function itself and the IO response time have not been subtracted). The assembly code also shows that very few instructions are executed.
3. Memory pool implementation
Conventional design
Divide a region of memory into n equally sized blocks, each with a field in front marking whether it is used and the rest as usable storage. Initially all blocks are free and every mark is 0. To allocate, search from the beginning; on finding a block marked 0, change the mark to 1 and take the block. To release, match the address and set the corresponding mark back to 0.
The mark occupies at least one byte and, because it is inserted before the usable block, it disturbs the block's alignment. This can be optimized by recording the mark information separately in a bitmap:
The usable blocks can then be laid out contiguously with whatever alignment the stored data structure requires. One bit of the bitmap corresponds to one block: 1 means the block is allocated, 0 means it is free. The bitmap also reduces the space overhead. The allocation and release algorithms are similar to before. To allocate, search the bitmap for a 0, starting from the lowest bit of the lowest byte, set it to 1, and compute the corresponding block address. To release, compute the bit index in the bitmap from the address and clear the corresponding bit to 0.
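This conventional bitmap scheme can be sketched as follows; the names and sizes here are illustrative, not the final API of this article, and the O(n) scan is the source of the time jitter discussed next:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal linear-search bitmap allocator over NUM_BLOCKS fixed-size blocks.
   Bit = 1 means allocated, bit = 0 means free. */
#define NUM_BLOCKS 16
#define BLOCK_SIZE 32

static uint8_t pool[NUM_BLOCKS * BLOCK_SIZE];
static uint8_t bitmap[(NUM_BLOCKS + 7) / 8];

void* simple_pool_alloc(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {          /* O(n) scan: jitter source */
        if ((bitmap[i >> 3] & (1u << (i & 7))) == 0) {
            bitmap[i >> 3] |= (uint8_t)(1u << (i & 7));  /* mark allocated */
            return &pool[i * BLOCK_SIZE];
        }
    }
    return 0;                                        /* pool exhausted */
}

void simple_pool_free(void* p)
{
    /* Map the address back to a bit index and clear it */
    int i = (int)(((uint8_t*)p - pool) / BLOCK_SIZE);
    bitmap[i >> 3] &= (uint8_t)~(1u << (i & 7));
}
```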
Efficient design with no time jitter
The bitmap implementation above is optimized for storage, but its execution time jitters because of the linear search. To meet the jitter-free requirement, the goal is an algorithm that finds the first free bit, starting from the lowest bit of the lowest byte, in fixed time. Many of our previous articles have mentioned the idea of contradiction, that is, looking at a problem from the perspective of its opposite, because in a contradiction one side often shrinks as the other grows. The time and space of an algorithm are such a pair: if time is constrained, see whether the problem can be approached from the space side. A common way of trading space for time is table lookup, which takes fixed time and is also efficient. In fact µC/OS-II uses exactly this algorithm to find the highest ready priority; we use it directly and briefly introduce its principle.
First, arrange the bitmap as a matrix: naturally, each byte is a row and its bits are the columns, with bits 0 to 7 corresponding to columns 0 to 7 and multiple bytes corresponding to multiple rows. Each bit corresponds to a block; here a bit of 1 indicates free and a bit of 0 indicates occupied (note that this convention is inverted relative to the conventional design above, so that the search looks for a 1). The first 1, scanning from the lowest bit of the lowest byte, marks the free block we are looking for. So how do we find this bit quickly? The general idea is to start from byte 0: if it is all 0, continue with byte 1, and so on until a non-zero byte is found; then locate the position of the first 1 within that byte, from bit 0 to bit 7. The index of this bit in the whole bitmap is then (byte number * 8 + bit number). The time therefore decomposes into two steps: first find the byte position, then find the bit position.
In fact, the bit position can be found in fixed time by table lookup. An 8-bit value has 256 possibilities; for each possibility we simply write down the position where the lowest 1 appears. So we construct a 256-byte array in which the value at a given index is the bit position of the first 1 in that index's binary representation.
For example:
00000000 has no 1 at all; the value at array index 0 is defined as 0 (the caller must detect the all-zero case separately).
00000001: the first 1 appears at bit 0, so the value at array index 1 is 0.
00000010: the first 1 appears at bit 1, so the value at array index 2 is 1.
00000011: the first 1 appears at bit 0, so the value at array index 3 is 0.
00000100: the first 1 appears at bit 2, so the value at array index 4 is 2.
This finally yields the following table:
static uint8_t const s_unmaptbl[256] = {
0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 /* 0xF0 to 0xFF */
};
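The table does not have to be written out by hand; a short loop can generate it, and can also be used to verify the listing above:

```c
#include <stdint.h>

/* Generate the lowest-set-bit position table: tbl[v] is the index of the
   lowest 1 bit in v, with tbl[0] defined as 0 (the all-zero case must be
   handled separately by the caller). */
void build_unmap_table(uint8_t tbl[256])
{
    for (int v = 0; v < 256; v++) {
        uint8_t pos = 0;
        for (int b = 0; b < 8; b++) {
            if (v & (1 << b)) {
                pos = (uint8_t)b;   /* first 1 found, from bit 0 upward */
                break;
            }
        }
        tbl[v] = pos;               /* v == 0 falls through with pos == 0 */
    }
}
```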
Now the position of the first 1 within a byte can be found with a single table lookup, in fixed time. What remains, as introduced before, is finding the first non-zero byte starting from byte 0. Think about it carefully: within a byte, 8 bits formed a row and we looked up the column of the first 1; finding the first non-zero byte is the same problem one level up, finding the row. So we record, for each byte, whether it is zero, using another bitmap. Our original bitmap has 8 bytes, that is 8 rows, so one extra byte with 8 bits suffices to record whether each of those 8 bytes is zero. This added bitmap is called grp: if byte i of the original bitmap is 0, then bit i of grp is 0; otherwise it is 1, meaning there is a 1 somewhere in that byte, i.e. a free block. Then the problem is simple: finding the lowest non-zero byte among the 8 bytes is just finding the first 1 from the lowest bit of grp, which is exactly the table lookup we introduced. In this way, finding the first 1 in the whole bitmap takes two table lookups: first look up grp to get the byte position, then look up that byte's value to get the bit position. The final index is "byte position * 8 + bit position".
The final algorithm is implemented as follows
y = s_unmaptbl[dev->grp];
x = s_unmaptbl[dev->tbl[y]];
index = (uint8_t)((y<<3) + x);
Here dev->grp is the grp byte described above, and dev->tbl is the bitmap byte array; y is the byte position, x is the bit position, and index is the resulting index into the bitmap.
Having found the bitmap index, allocation clears the corresponding bit to 0: index>>3 gives the byte index, and ANDing with ~(1u<<(index & 0x07)) clears the bit. If this byte becomes 0, the corresponding bit of the grp byte must also be cleared:
if(((dev->tbl[index>>3] &= (~(1u<<(index & 0x07)))) == 0))
{
dev->grp &= ~(1u<<(index>>3));
}
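The clear-and-propagate step can be exercised in isolation with a toy model (illustrative names; one grp byte over 8 tbl bytes, as in the article):

```c
#include <stdint.h>

/* Toy model: tbl[8] bitmap bytes plus one grp byte summarizing which tbl
   bytes are non-zero (bit i of grp set <=> tbl[i] != 0). */
typedef struct {
    uint8_t grp;
    uint8_t tbl[8];
} toy_pool;

/* Allocate: clear the bit; if the whole byte becomes 0, clear its grp bit */
void alloc_bit(toy_pool* p, uint8_t index)
{
    if ((p->tbl[index >> 3] &= (uint8_t)~(1u << (index & 0x07))) == 0) {
        p->grp &= (uint8_t)~(1u << (index >> 3));   /* byte now empty */
    }
}

/* Release: set the bit back and mark the byte non-empty in grp */
void free_bit(toy_pool* p, uint8_t index)
{
    p->grp |= (uint8_t)(1u << (index >> 3));
    p->tbl[index >> 3] |= (uint8_t)(1u << (index & 0x07));
}
```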
To release, set the bit at the index position of the bitmap back to 1:
dev->grp |= 1u<<(index>>3);
dev->tbl[index>>3] |= 1u<<(index & 0x07);
Here is an example. Bytes 2 to 7 of tbl are non-zero, so bits 2 to 7 of grp are all 1, while bits 0 and 1 are 0, giving grp = 0xFC. The first 1, scanning from the lowest bit of the lowest byte of tbl, is bit 3 of byte 2. The calculation goes as follows: grp = 0xFC = 252 is looked up first; the table value at index 252 is 2, locating byte 2. The value of tbl[2] is 0x08, and the table value at index 0x08 is 3. So the first 1 is bit 3 of byte 2, that is, index 2*8+3 = 19.
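This worked example can be checked directly in code with the two lookups (the table is reproduced so the snippet stands alone; first_free_index is an illustrative helper name, not the article's final API):

```c
#include <stdint.h>

/* Lowest-set-bit position table, as in the article */
static uint8_t const s_unmaptbl[256] = {
    0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
    6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
    7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
    6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
    5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
    4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0  /* 0xF0 to 0xFF */
};

/* Two fixed-time lookups: grp gives the byte (row), that byte's value gives
   the bit (column); the result is the index into the whole bitmap. */
uint8_t first_free_index(uint8_t grp, const uint8_t tbl[8])
{
    uint8_t y = s_unmaptbl[grp];
    uint8_t x = s_unmaptbl[tbl[y]];
    return (uint8_t)((y << 3) + x);
}
```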
The source code is as follows
mem_pool.c
static uint8_t const s_unmaptbl[256] = {
0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x00 to 0x0F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x10 to 0x1F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x20 to 0x2F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x30 to 0x3F */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x40 to 0x4F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x50 to 0x5F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x60 to 0x6F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x70 to 0x7F */
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x80 to 0x8F */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0x90 to 0x9F */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xA0 to 0xAF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xB0 to 0xBF */
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xC0 to 0xCF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xD0 to 0xDF */
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, /* 0xE0 to 0xEF */
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0 /* 0xF0 to 0xFF */
};
int mem_pool_init(mem_pool_st* dev, void* buffer, void* tbl, uint32_t itemsize, uint32_t itemnum)
{
if(dev == (mem_pool_st*)0)
{
return -1;
}
if((buffer == (void*)0) || (tbl == (void*)0))
{
return -1;
}
/* One grp byte covers at most 8 tbl bytes, i.e. 64 blocks; itemnum must be a multiple of 8 */
if((itemnum == 0) || (itemnum > 64) || ((itemnum % 8) != 0))
{
return -1;
}
dev->buffer = buffer;
dev->itemnum = itemnum;
dev->itemsize = itemsize;
dev->tbl = tbl;
dev->grp = 0;
for(uint32_t i=0; i< (dev->itemnum+7)/8; i++)
{
dev->tbl[i] = (uint8_t)0xFF; /* all blocks initially free */
dev->grp |= 1u<<i; /* set grp bits only for tbl bytes that actually exist */
}
return 0;
}
uint8_t* mem_pool_malloc(mem_pool_st* dev)
{
/* Find the lowest-indexed free block */
if(dev == (mem_pool_st*)0)
{
return (void*)0;
}
if((dev->buffer == (void*)0) || (dev->tbl == (uint8_t*)0))
{
return (void*)0;
}
uint8_t y;
uint8_t x;
uint8_t index;
y = s_unmaptbl[dev->grp];
x = s_unmaptbl[dev->tbl[y]];
index = (uint8_t)((y<<3) + x);
if((index == 0) && (dev->tbl[0]==0))
{
/* Index 0 with tbl[0] == 0 means every bit of the bitmap is 0, i.e. no free block */
return (void*)0;
}
else
{
/* Clear the corresponding bit to mark the block as occupied */
if(((dev->tbl[index>>3] &= (~(1u<<(index & 0x07)))) == 0))
{
dev->grp &= ~(1u<<(index>>3));
}
/* Return the corresponding block address */
return dev->buffer + dev->itemsize*index;
}
}
int mem_pool_free(mem_pool_st* dev, uint8_t* buffer)
{
if(dev == (mem_pool_st*)0)
{
return -1;
}
if((dev->buffer == (void*)0) || (dev->tbl == (uint8_t*)0))
{
return -1;
}
if(buffer == (void*)0)
{
return -2;
}
uint8_t index;
index = (buffer - dev->buffer)/dev->itemsize;
if(((dev->grp & (1u<<(index>>3))) != 0) && ((dev->tbl[index>>3] & (1u<<(index & 0x07))) != 0))
{
/* Double free: the block is already free */
return -3;
}
/* Set the corresponding bit to 1 to mark the block as free */
dev->grp |= 1u<<(index>>3);
dev->tbl[index>>3] |= 1u<<(index & 0x07);
return 0;
}
mem_pool.h
#ifndef MEM_POOL_H
#define MEM_POOL_H

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct
{
uint32_t grp; /**< bitmap of non-zero tbl bytes */
uint32_t itemsize; /**< block size in bytes */
uint32_t itemnum; /**< total number of blocks, a multiple of 8, at most 64 */
uint8_t* buffer; /**< user-provided buffer of itemsize*itemnum bytes */
uint8_t* tbl; /**< user-provided bitmap buffer of (itemnum+7)/8 bytes */
} mem_pool_st;

int mem_pool_init(mem_pool_st* dev, void* buffer, void* tbl, uint32_t itemsize, uint32_t itemnum);
uint8_t* mem_pool_malloc(mem_pool_st* dev);
int mem_pool_free(mem_pool_st* dev, uint8_t* buffer);

#ifdef __cplusplus
}
#endif

#endif /* MEM_POOL_H */
4. Actual measurement
We replaced the malloc and free calls in one of our driver implementations with this memory pool, tested the functionality, and measured the timing jitter and execution time as described in section 2, confirming that the execution time is very short and there is essentially no jitter.
The initialization code is as follows
static dma_trans_t s_trans_buffer[TRANS_POOL_NUM];
static uint8_t s_trans_bitmap_buffer[(TRANS_POOL_NUM+7)/8];
static mem_group_info_t s_group_buffer[GROUP_POOL_NUM];
static uint8_t s_group_bitmap_buffer[(GROUP_POOL_NUM+7)/8];
mem_pool_st s_trans_mem_pool_dev;
mem_pool_st s_group_mem_pool_dev;
mem_pool_init(&s_trans_mem_pool_dev, s_trans_buffer, s_trans_bitmap_buffer, sizeof(dma_trans_t), TRANS_POOL_NUM);
mem_pool_init(&s_group_mem_pool_dev, s_group_buffer, s_group_bitmap_buffer, sizeof(mem_group_info_t), GROUP_POOL_NUM);
Allocation:
trans = (dma_trans_t *)mem_pool_malloc(&s_trans_mem_pool_dev);
Release:
mem_pool_free(&s_trans_mem_pool_dev, trans);
5. Conclusion
We applied the algorithm used in µC/OS-II to the memory pool wheel we designed and obtained an efficient memory pool with no time jitter. It is a very useful wheel. This also reminds us to pay attention to learning from others' ideas and implementations, making them our own and continuing to accumulate.
What we did above is use bitmap table lookup to achieve fixed time, and a two-dimensional bitmap to expand capacity. The example supports at most 64 blocks, that is, 8x8. What if we want to expand to more blocks? Naturally we could widen the two dimensions X/Y, changing tbl and grp from 8 bits to 16 bits each, but then the lookup table for a 16-bit value has 2^16 entries, that is, 64 KB, which is generally unacceptable in embedded systems. Instead, think again about how tbl was extended with grp: that was extending X with Y, going from one dimension to two. Suddenly everything brightens up, perhaps even with a little excitement: we can just as naturally add another dimension and go three-dimensional. To locate a point inside the volume, first find the plane, then the row within the plane, then the bit within the row. It is that simple and natural: change grp into an array, and add one vol byte on top, where each bit of vol records whether the corresponding grp byte is 0.

This is why we can only truly master a technology by understanding its essence. Seeing table lookup from the perspective of adding dimensions is the essence; once the essence is understood, the technique naturally extends to other scenarios, whereas getting entangled in technical details like the lookup table itself means the essence was never grasped, and it becomes hard to draw inferences from one case to others. This is why, when thinking about technical problems, we should think about the essence and even the philosophical meaning behind them; the principles behind them are the universal, essential ones.
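As a sketch of this three-dimensional extension (the article does not provide this code; the layout and names are hypothetical): one vol byte summarizes up to 8 grp bytes, each grp byte summarizes 8 tbl bytes, giving up to 8x8x8 = 512 blocks found with three fixed-time lookups. The lookup table is built at startup here instead of being written out:

```c
#include <stdint.h>

static uint8_t s_unmaptbl[256];

/* Build the lowest-set-bit position table (same contents as the article's
   static table; tbl[0] is 0 and the all-zero case is handled separately). */
void build_tbl(void)
{
    for (int v = 1; v < 256; v++) {
        uint8_t b = 0;
        while (!(v & (1 << b))) b++;
        s_unmaptbl[v] = b;
    }
    s_unmaptbl[0] = 0;
}

/* Hypothetical three-level layout: vol over grp[] over tbl[] */
typedef struct {
    uint8_t vol;     /* bit z set -> grp[z] is non-zero */
    uint8_t grp[8];  /* bit y of grp[z] set -> tbl[z*8+y] is non-zero */
    uint8_t tbl[64]; /* bit set -> block free */
} pool3d;

/* Three fixed-time lookups: plane, then row, then bit within the row */
uint16_t first_free_3d(const pool3d* p)
{
    uint8_t z = s_unmaptbl[p->vol];
    uint8_t y = s_unmaptbl[p->grp[z]];
    uint8_t x = s_unmaptbl[p->tbl[(z << 3) + y]];
    return (uint16_t)((((z << 3) + y) << 3) + x);
}
```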
We achieved the expansion by adding a dimension, a kind of dimensionality-reduction attack: a search over three levels becomes three table lookups, somewhat like "the essence of logarithms is to reduce multiplication and division to addition and subtraction." Perhaps beings in four-dimensional space look down on us three-dimensional beings in the same way.
I cannot help marveling at the wonders of the world; perhaps only the Creator could arrange it so. The ability to reduce dimensions is actually a common phenomenon: although we do not live in a high-dimensional world, a back door seems to have been left open for us, giving us the ability to analyze high dimensions by reducing them. Logarithmic reduction, projection reduction and the like are similar, letting us infer the whole picture from a side view. I wonder whether a system could be established, from a mathematically abstract perspective, to prove which kinds of spaces have dimension-reducible properties; that would found a mathematical system for analyzing high dimensions.