In existing block-based video coding and decoding systems, blocking artifacts always appear when the bit rate is low, and the new video coding standard H.264 is no exception. There are two main causes of these artifacts. First, after the block-based integer transform of the prediction residual, quantizing the transform coefficients with a large quantization step makes the block edges of the decoded, reconstructed image discontinuous. Second, errors introduced by the interpolation used in motion compensation leave blocking artifacts in the image reconstructed after the inverse transform at the decoder. If left untreated, these artifacts accumulate across reconstructed frames and seriously degrade image quality and compression efficiency. To solve this problem, the deblocking filter in H.264 uses a fairly complex adaptive filter that removes the artifacts effectively. How to optimize the deblocking filtering algorithm in real-time video decoding, reduce its computational complexity, and improve the quality of the reconstructed image has therefore become a key issue in H.264 decoding.
1 Deblocking filter of H.264
1.1 Filtering principle
A large quantization step causes a relatively large quantization error, which can turn the continuous grayscale variation across the boundary between adjacent blocks into a step change; subjectively this appears as a "pseudo-edge" blocking artifact. Deblocking restores these step-like grayscale changes to changes with very small or approximately continuous steps while keeping the total energy of the image unchanged, and at the same time it must damage the real edges of the image as little as possible.
1.2 Adaptive filtering process
In H.264, deblocking is performed macroblock by macroblock on 16×16 pixel macroblocks. Within a macroblock, the edges between the 4×4 sub-blocks are filtered first vertically and then horizontally, so that all edges in the entire reconstructed image (except the picture borders) are filtered. The edge layout is shown in Figure 1. A 16×16 luminance macroblock has 4 vertical edges and 4 horizontal edges, each divided into 16 pixel edges. The corresponding 8×8 chrominance macroblock has 2 vertical edges and 2 horizontal edges, each divided into 8 pixel edges. The pixel edge is the basic unit of filtering.
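As a rough illustration of this traversal order only (a minimal sketch; filter_block_edge() is a hypothetical placeholder, not a function from the reference code), the per-macroblock luminance loop can be pictured as:

    /* Sketch of the per-macroblock edge order described above (luminance).
       filter_block_edge() is a hypothetical placeholder for filtering the
       16 pixel edges along one 4x4-block edge; edges that coincide with a
       picture border are simply skipped in the real decoder. */
    static void filter_block_edge(int offset, int vertical)
    {
        (void)offset;
        (void)vertical;   /* placeholder body */
    }

    static void deblock_luma_macroblock(void)
    {
        int e;
        for (e = 0; e < 4; e++)      /* vertical edges at columns 0, 4, 8, 12 */
            filter_block_edge(4 * e, 1);
        for (e = 0; e < 4; e++)      /* then horizontal edges at rows 0, 4, 8, 12 */
            filter_block_edge(4 * e, 0);
    }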
1.2.1 Adaptability of the filter at two levels
The deblocking filter in H.264 has a good filtering effect because of its adaptability at the following two levels.
1) Adaptivity of the filter at the 4×4 sub-block level
Filtering is based on the pixel edges of each sub-block. A parameter BS (boundary strength) is defined for each pixel edge, and the strength of the filter and the pixels it involves are adjusted adaptively according to it. The pixel edge strength of a chrominance block is the same as that of the corresponding luminance pixel edge. Let P and Q be two adjacent 4×4 sub-blocks; the pixel edge strengths between them are obtained by the steps in Figure 2. The larger the BS value, the stronger the filtering applied on both sides of the corresponding edge. BS is set according to the cause of the blocking artifact: for example, blocking is more visible in sub-blocks coded in intra prediction mode, so a larger edge strength is assigned to the corresponding edges for stronger filtering. (A C sketch covering both adaptivity levels is given after item 2 below.)
2) Adaptivity of the filter at the pixel level
A good filtering effect can be achieved only by correctly distinguishing the false edges caused by quantization error and motion compensation from the real boundaries in the image. Usually, the pixel gradient across a real boundary is larger than that across a false boundary. The filter therefore separates the two by setting a threshold α on the difference between the grayscale values of the pixels on the two sides of the edge and a threshold β on the difference between adjacent pixels on the same side. The values of α and β depend mainly on the quantization step size. When the quantization step is large, the quantization error is large, blocking is obvious, and false boundaries are easily produced, so the thresholds become larger and the filtering conditions are relaxed; conversely, when the quantization step is small, the thresholds become smaller as well. This is where the adaptivity lies. The arrangement of the sample points is shown in Figure 3. If all conditions are met, filtering begins.
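The following is a minimal C sketch of these two decision levels, written from the rules summarized above rather than taken from the reference code (the field names and the exact branch order of Figure 2 are assumptions):

    #include <stdlib.h>   /* abs() */

    /* Per-4x4-sub-block information assumed to be available from the decoder
       (the field names are illustrative, not the JM8.6 structures). */
    typedef struct {
        int is_intra;      /* sub-block coded in intra mode                  */
        int has_coeffs;    /* sub-block has non-zero residual coefficients   */
        int ref_idx;       /* reference picture index                        */
        int mv_x, mv_y;    /* motion vector in quarter-pel units             */
    } BlockInfo;

    /* Sub-block-level adaptivity: boundary strength of the edge between P
       and Q (frame coding, baseline profile; a simplified form of the rules
       summarized in Figure 2). */
    static int boundary_strength(const BlockInfo *p, const BlockInfo *q,
                                 int is_mb_edge)
    {
        if (p->is_intra || q->is_intra)
            return is_mb_edge ? 4 : 3;
        if (p->has_coeffs || q->has_coeffs)
            return 2;
        if (p->ref_idx != q->ref_idx ||
            abs(p->mv_x - q->mv_x) >= 4 || abs(p->mv_y - q->mv_y) >= 4)
            return 1;
        return 0;
    }

    /* Pixel-level adaptivity: a pixel edge is filtered only when the step
       across the edge is small enough (relative to alpha and beta) to be a
       quantization artifact rather than a real image edge. */
    static int filter_on(int p1, int p0, int q0, int q1,
                         int alpha, int beta, int bs)
    {
        return bs > 0 &&
               abs(p0 - q0) < alpha &&
               abs(p1 - p0) < beta &&
               abs(q1 - q0) < beta;
    }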
In addition to these two levels of adaptivity, the strength of the filter can also be adjusted at the slice level through the offsets LoopFilterAlphaC0Offset and LoopFilterBetaOffset. For example, when the transmission bit rate is low, blocking is more visible; if the receiver wants an image of relatively good subjective quality, the encoder can set these two filter offsets in the header information to positive values, increasing α and β, strengthening the filtering, and improving the subjective quality by removing more of the blocking. Conversely, for high-resolution images, filtering can be weakened by transmitting negative offsets so as to preserve as much image detail as possible.
1.2.2 Filtering adjacent pixels according to the BS value of each pixel edge
If the current pixel edge meets the filtering conditions, a filter is selected according to the edge's BS value, and appropriate clipping is applied to the results to prevent the image from becoming blurred.
When BS is 1, 2, or 3, a 4-tap linear filter takes P1, P0, Q0, and Q1 as input and produces new values for P0 and Q0. If the pixels just inside the blocks also look like a false boundary, the values of P1 and Q1 are adjusted as well.
When BS is 4, the edge is a macroblock edge coded in intra mode, and a stronger filter should be used to improve image quality. For the luminance component, if the condition (|P0 − Q0| < (α >> 2) + 2) && (|P2 − P0| < β) holds, a 5-tap filter is applied to P0 and P2 and a 4-tap filter to P1; if it does not hold, only a weaker 3-tap filter is applied to P0, and P1 and P2 are left unchanged. For the chrominance component, a 3-tap filter is applied to P0 when the condition holds, and no pixel is modified otherwise. Q0, Q1, and Q2 are filtered in the same way as P0, P1, and P2.
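As a sketch of the two luminance filtering modes, following the standard filter equations (tC, the BS-dependent clipping bound, and the decision of which side to filter are assumed to come from the surrounding decoder code):

    static int clip3(int lo, int hi, int x) { return x < lo ? lo : (x > hi ? hi : x); }
    static int clip1(int x)                 { return clip3(0, 255, x); }

    /* BS = 1, 2, 3: 4-tap filter producing new P0 and Q0, with the
       adjustment clipped to +-tc to avoid over-smoothing. */
    static void normal_filter(int p1, int *p0, int *q0, int q1, int tc)
    {
        int delta  = clip3(-tc, tc,
                           (((*q0 - *p0) << 2) + (p1 - q1) + 4) >> 3);
        int new_p0 = clip1(*p0 + delta);
        int new_q0 = clip1(*q0 - delta);
        *p0 = new_p0;
        *q0 = new_q0;
    }

    /* BS = 4, strong branch on the P side, entered when
       |P0 - Q0| < (alpha >> 2) + 2  and  |P2 - P0| < beta.
       p[0..3] are P0..P3; out[0..2] are the new P0..P2. */
    static void strong_filter_p(const int p[4], int q0, int q1, int out[3])
    {
        out[0] = (p[2] + 2*p[1] + 2*p[0] + 2*q0 + q1 + 4) >> 3;   /* 5-tap */
        out[1] = (p[2] + p[1] + p[0] + q0 + 2) >> 2;              /* 4-tap */
        out[2] = (2*p[3] + 3*p[2] + p[1] + p[0] + q0 + 4) >> 3;   /* 5-tap */
    }

    /* BS = 4, weak branch: only P0 is changed (3-tap). */
    static int weak_filter_p0(int p1, int p0, int q1)
    {
        return (2*p1 + p0 + q1 + 2) >> 2;
    }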
2 Characteristics and structure of BF533
Our H.264 deblocking filter is implemented on the ADI Blackfin ADSP-BF533 processor. The Blackfin family of DSPs has the following main features:
a) Highly parallel computing units. The core of the Blackfin architecture is the DAU (data arithmetic unit), which includes two 16-bit MACs (multiply-accumulate units), two 40-bit ALUs (arithmetic logic units), one 40-bit barrel shifter, and four 8-bit video ALUs. In a single clock cycle, the two MACs can perform 16-bit × 16-bit multiplications on four independent 16-bit data operands. The 40-bit ALUs can operate on two 40-bit numbers or four 16-bit numbers. This architecture can flexibly perform 8-bit, 16-bit, and 32-bit data operations.
b) Dynamic power management. By changing the supply voltage and operating frequency, the processor can consume less power than comparable DSPs. The Blackfin architecture allows voltage and frequency to be adjusted independently, so the energy spent on each task is minimized and a good balance is struck between performance and power consumption. This makes it suitable for real-time video encoders/decoders, especially real-time motion video processing with strict power constraints.
c) High-performance address generators. The core has two DAGs (data address generators) that produce the addresses for the compound load/store operations used in advanced DSP filtering. Bit-reversed addressing, circular buffering, and other addressing modes are supported, which increases programming flexibility.
d) Hierarchical memory. The memory hierarchy shortens the core's access time to memory, giving maximum data throughput, lower latency, and less processing idle time.
e) Dedicated video instructions. The instruction set provides operations commonly used in video compression standards, such as those supporting the DCT (discrete cosine transform) and Huffman coding. These video instructions also remove the complex, error-prone communication between a host processor and a separate video codec chip. Together these features help shorten the time to market of end applications and reduce overall system cost.
The ADSP-BF533 used here sustains operation at 600 MHz and provides: a 4 GB unified address space; 80 kB of L1 instruction SRAM, 16 kB of which can be configured as a 4-way set-associative cache; two 32 kB banks of L1 data SRAM, half of which can be configured as cache; and a rich set of integrated peripherals and interfaces.
3 Optimized implementation of the H.264 deblocking filter on BF533
The optimization of the deblocking filter on the Blackfin BF533 is divided into three levels: system-level optimization, algorithm-level optimization, and assembly-level optimization.
3.1 System-level optimization
Enable the compiler optimization options of the DSP development platform: set optimization for maximum speed and turn on the Automatic Inlining and Interprocedural Optimization switches. These settings let the compiler make full use of the Blackfin BF533 hardware.
3.2 Algorithm-level optimization
The deblocking filter of the JM8.6 reference model was modified appropriately and ported to the original Blackfin BF533-based H.264 baseline profile decoder, and its time consumption was profiled on image sequences. For the Paris.cif, Mobile.cif, Foreman.cif, and Claire.cif sequences at a bit rate of about 400 kbit/s, the deblocking filter consumes roughly 1 600 M to 1 800 M clock cycles. Even after system-level optimization the computational load remains very large and the efficiency very low, which is a considerable burden for a processor whose sustained clock frequency is 600 MHz.
Analysis of the deblocking filter program in JM8.6 shows that the main reasons for its low efficiency are:
a) The logic between the functions in the algorithm is complex, with many conditional tests, jumps, and function calls;
b) The most time-consuming part, the function loops, contains a great deal of repeated computation, which sharply increases the computational load;
c) Much of the data used by the algorithm, such as motion vectors and image luminance and chrominance data, is stored in the slower off-chip SDRAM, and the frequent accesses during filtering increase the data-transfer time.
In view of these causes, the algorithm was improved as follows.
3.2.1 Simplifying the complex functions and loops in the original program
Code size and execution speed constrain each other. Code can often be kept very compact by relying on conditional tests, but the extra branching slows execution; conversely, removing the tests and expanding the code often reduces the instruction cycles consumed, at the cost of longer code. The deblocking filter code in JM8.6 is compact, with complicated relationships between its functions; here those relationships are simplified, trading an increase in code size for higher execution speed.
For the loops that take the most running time, rewriting the loop form and unrolling the loops several times effectively reduce the amount of computation, as the sketch below illustrates. Reducing the number of function calls and rewriting if-else statements are also effective optimizations.
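A rough illustration of the unrolling idea (not the actual JM8.6 loop; process_pixel_edge() is a hypothetical placeholder for the per-pixel-edge filtering work):

    /* Illustrative only, not the JM8.6 code. */
    static void process_pixel_edge(int i) { (void)i; }   /* placeholder */

    static void filter_edge_rolled(void)
    {
        int i;
        for (i = 0; i < 16; i++)             /* 16 loop tests and branches */
            process_pixel_edge(i);
    }

    static void filter_edge_unrolled(void)
    {
        int i;
        for (i = 0; i < 16; i += 4) {        /* 4 iterations instead of 16 */
            process_pixel_edge(i);
            process_pixel_edge(i + 1);
            process_pixel_edge(i + 2);
            process_pixel_edge(i + 3);
        }
    }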
3.2.2 Removing redundant code and repeated calculations from the reference code
a) The reference code used is the deblocking filter module of JM8.6, which can filter H.264 bitstreams of all profiles and levels, whereas our decoder is a baseline profile decoder and therefore only filters I and P frames. The parts of the reference code that handle B frames, SP/SI frames, field modes, and macroblock-adaptive frame/field coding can thus be removed.
b) When obtaining the filter strength BS and performing luminance/chrominance filtering, the program must know the availability of the macroblocks adjacent to the macroblock containing the current sub-block (that is, whether each of those macroblocks can be used), which it determines by calling the GetNeighbour function. Since the edges of a macroblock are filtered first vertically and then horizontally, this information is the same for every pixel edge along an edge, so it can be obtained once per edge rather than being re-evaluated inside the loop. Moreover, the filtering algorithm only needs the availability of the macroblocks above and to the left of the current macroblock, so the redundant calls that fetch the upper-left and upper-right neighbours can be removed. In addition, when the function that obtains the horizontal filtering strength calls getNeighbour, the parameter luma is the constant 1, xN takes the values [-1, 3, 7, 11], and yN lies in [0, 15]; under these conditions many of the if-else statements in getNeighbour are dead branches, and these redundant tests consume many clock cycles. Finally, the probability of each branch was analyzed and the most likely branch placed first, which also speeds up the function.
The streamlined GetNeighbour function has only a few statements, greatly reducing the amount of computation.
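The published listing is not reproduced in this copy of the article. Purely as an illustration of the idea (the names and structure below are assumptions, not the authors' code), a neighbour lookup restricted to the left and top macroblocks of a CIF frame might reduce to something like:

    /* Illustrative sketch only: availability of the left and top neighbours
       of macroblock (mb_x, mb_y) in a CIF frame (22 x 18 macroblocks), on
       the assumption that every macroblock inside the current picture is
       decoded and usable. */
    enum { MB_COLS = 22, MB_ROWS = 18 };

    typedef struct {
        int available;   /* 1 if the neighbour exists           */
        int mb_addr;     /* its macroblock address if available */
    } Neighbour;

    static Neighbour get_left_neighbour(int mb_x, int mb_y)
    {
        Neighbour n;
        n.available = (mb_x > 0);
        n.mb_addr   = n.available ? mb_y * MB_COLS + (mb_x - 1) : -1;
        return n;
    }

    static Neighbour get_top_neighbour(int mb_x, int mb_y)
    {
        Neighbour n;
        n.available = (mb_y > 0);
        n.mb_addr   = n.available ? (mb_y - 1) * MB_COLS + mb_x : -1;
        return n;
    }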
c) In the JM8.6 reference code, the BS values of the 4 × 16 = 64 pixel edges of a luminance macroblock (in each filtering direction) are derived one by one. Analysis of the BS derivation conditions shows that the four pixel edges lying between the same pair of sub-blocks on a vertical or horizontal edge have equal BS values. Therefore, for each edge only the BS values of the 1st, 5th, 9th, and 13th pixel edges need to be computed and then copied to the other pixel edges. Since the BS derivation sits inside a loop and involves many tests and calculations, this change greatly reduces the computational load (a sketch appears after item d below).
d) Many statements inside the loops of the reference code do not depend on the loop variables; moving them outside the loops avoids redundant computation.
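A sketch of improvement c), with hypothetical names (bs_of_pixel_edge() stands for the original per-pixel-edge BS derivation and is assumed to be defined elsewhere):

    /* Compute BS once per pair of 4x4 sub-blocks and copy it to the other
       three pixel edges of the group. */
    extern int bs_of_pixel_edge(int edge, int pixel_edge);

    static void get_bs_for_edge(int edge, int bs[16])
    {
        int group;
        for (group = 0; group < 4; group++) {
            int v = bs_of_pixel_edge(edge, 4 * group);   /* pixel edges 1, 5, 9, 13 */
            bs[4 * group + 0] = v;
            bs[4 * group + 1] = v;
            bs[4 * group + 2] = v;
            bs[4 * group + 3] = v;
        }
    }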
3.2.3 Using BPP block processing to reduce frequent off-chip data accesses
To solve the problem that frequent accesses to off-chip data slow the program down, BPP block processing is used. Three buffers are opened in the on-chip L1 memory to hold the luminance, chrominance U, and chrominance V data to be filtered. According to the pixel range that filtering one macroblock may touch, the 396 macroblocks of a CIF frame are divided into 4 categories:
Category A: the first macroblock, whose upper and left edges are both picture borders; the luminance data read in before filtering is 16×16 and the chrominance data two 8×8 blocks.
Category B: the remaining macroblocks of the first macroblock row, whose upper edge is a picture border; the luminance data read in is 16×20 and the chrominance data two 8×12 blocks.
Category C: the remaining macroblocks of the first macroblock column, whose left edge is a picture border; the luminance data read in is 20×16 and the chrominance data two 12×8 blocks.
Category D: all other macroblocks, whose upper and left neighbours both lie inside the picture; the luminance data read in is 20×20 and the chrominance data two 12×12 blocks.
During filtering, the luminance and chrominance data are first read from the off-chip frame store into the three on-chip filtering buffers, with the amount read determined by the macroblock category; filtering is then performed, and the results are written back to off-chip memory. On the one hand this reduces the time spent on frequent off-chip data accesses and improves the running speed; on the other hand, classifying the macroblocks removes some of the per-macroblock tests in the reference code and the pipeline stalls they cause, which also speeds up the program to some extent.
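A sketch of the classification and read sizes for a CIF frame (22 × 18 macroblocks); the structure and names are illustrative, and sizes are given as rows × columns as in the text above:

    /* Illustrative sketch: size of the region read into the on-chip
       filtering buffers for one macroblock, following the four categories
       above. */
    typedef struct {
        int luma_rows, luma_cols;       /* luminance region                */
        int chroma_rows, chroma_cols;   /* per chroma component (U and V)  */
    } ReadRegion;

    static ReadRegion read_region_for_mb(int mb_x, int mb_y)
    {
        ReadRegion r;
        int has_left = (mb_x > 0);   /* 4 extra columns from the left MB */
        int has_top  = (mb_y > 0);   /* 4 extra rows from the MB above   */

        r.luma_rows   = 16 + (has_top  ? 4 : 0);
        r.luma_cols   = 16 + (has_left ? 4 : 0);
        r.chroma_rows =  8 + (has_top  ? 4 : 0);
        r.chroma_cols =  8 + (has_left ? 4 : 0);
        return r;   /* A: 16x16, B: 16x20, C: 20x16, D: 20x20 (luma) */
    }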
3.3 Assembly-level optimization
The Blackfin BF533 core can be programmed in C or C++, but the assembly code generated automatically from C is not very efficient. Modules that are called frequently and consume a lot of time can therefore be rewritten by hand in efficient assembly to improve the running speed. The program is sped up mainly in the following ways:
a) Replace local variables with register variables. In C, local variables are often used in subroutines and functions for temporary storage. At run time the compiler allocates memory for every declared local variable, and accessing them therefore means accessing memory, which is very slow compared with register access. The data and pointer registers of the processor can be used instead of local variables that serve only as temporary storage, saving much of the delay caused by memory accesses. However, since the number of registers is quite limited compared with the number of local variables, the registers must be allocated sensibly and efficiently.
b) Replace software loops with hardware loops. Software loops refer to setting judgment conditions at the beginning or end of loops such as for or while to control the start, continuation, and end of the loop. The conditional judgment instructions of the software loop will dynamically select branches. Once a jump occurs, it will block the pipeline. Keeping the pipeline unblocked is the key factor to maintain efficient operation. The Blackfin processor has dedicated hardware to support two-level nested zero-overhead hardware loops. This method does not require judgment of conditional transfers. The DSP hardware automatically executes and ends the loop according to the predetermined number of loops, thereby ensuring the smooth flow of the pipeline and improving the speed.
c) Make full use of the data bus width. The external data bus of the BF533 is 32 bits wide, so 4 bytes can be accessed at a time. Making full use of this access width, especially when moving large amounts of data, by keeping each access at 4 bytes reduces the number of instruction cycles and thus improves execution speed (a C sketch appears after item e below).
d) Make efficient use of parallel and vector instructions. Parallel and vector instructions are a major feature of the Blackfin DSPs. Using parallel instructions exploits the SIMD structure of the Blackfin processor and the parallelism of its hardware resources to reduce the instruction count and thereby raise execution efficiency; with reasonable arrangement of the program, one parallel instruction can often replace two or three non-parallel instructions. Vector instructions make full use of the instruction width and apply the same operation to multiple data at the same time. For example, two 16-bit arithmetic or shift operations can be carried out by one 32-bit vector instruction, so work that originally took two cycles takes one: R3 = ABS R1 (V) computes the absolute values of two 16-bit operands in a single instruction cycle.
e) Allocate data storage space sensibly. The DSP's on-chip memory is fast but small, while its off-chip memory is larger but slow, so placing data well is very important for the running speed of the program. Frequently used data should be kept on-chip as far as possible, and rarely used data off-chip. When off-chip data must be accessed, it should be laid out contiguously and read into the on-chip buffers in large blocks at one time, to avoid the time wasted by frequent small off-chip reads.
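As a C sketch of point c) (illustrative only; in practice such copies are done in assembly or by DMA, and both buffers are assumed to be 4-byte aligned with a length that is a multiple of 4):

    #include <stdint.h>
    #include <stddef.h>

    /* Copy pixel data one 32-bit word (4 bytes) at a time instead of byte
       by byte, matching the 32-bit external data bus width. */
    static void copy_words(uint8_t *dst, const uint8_t *src, size_t n)
    {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;
        size_t i;
        for (i = 0; i < n / 4; i++)
            d[i] = s[i];
    }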
4 Results of Optimization
The optimization effect is measured by adding the deblocking filter C module from the JM8.6 reference code to the original decoder and comparing its cycle counts with those of the deblocking filter assembly module optimized at the three levels of system, algorithm, and assembly. The test image sequences are Claire.cif, Paris.cif, and Mobile.cif; the test data are shown in Table 1.
As can be seen from Table 1, compared with the unoptimized C code from JM8.6, the optimized deblocking filter assembly module is about 7 times more efficient.
5 Conclusion
This paper implements the deblocking filter of H.264 through optimization at three levels: system, algorithm, and assembly. In particular, improving the deblocking algorithm, classifying the macroblocks to be filtered, and making full use of assembly-level techniques such as parallel and vector instructions yield a good optimization result. Running on the original H.264 decoder, the optimized deblocking filter module consumes about 250 M clock cycles to filter a 25-frame image sequence at about 400 kbit/s, while the whole decoder consumes about 700 M clock cycles, so the decoder reaches a decoding speed of about 20 frames/s, basically meeting the requirement of quasi-real-time decoding.
Although this implementation is much better optimized than the reference module, profiling shows that there is still room for improvement in reading in the data to be filtered and writing back the filtered data, in the GetBs function that derives the BS values, and in the EdgeLoop function that performs the filtering. For the exchange of data between off-chip and on-chip memory, DMA can be used to move data while filtering proceeds, hiding the clock cycles spent on data transfers, and the efficiency of the assembly implementations of GetBs and EdgeLoop can be raised further. These two aspects are the next directions for improvement.