Introduction
In existing block-based video coding systems, blocking artifacts always appear at low bit rates, and the new video coding standard H.264 is no exception. There are two main causes of these artifacts. First, after the block-based integer transform of the prediction residual, quantizing the transform coefficients with a large quantization step introduces discontinuities at the block edges of the decoded, reconstructed image. Second, the error introduced by the interpolation operations in motion compensation leaves blocking artifacts in the reconstructed image after the inverse transform. If left untreated, these artifacts accumulate across reconstructed frames, seriously degrading image quality and compression efficiency. To address this problem, the deblocking filter in H.264 uses a relatively complex adaptive filter that removes the artifacts effectively. How to optimize the deblocking filter algorithm in real-time video decoding, reducing its computational complexity while improving the quality of the reconstructed image, has therefore become a key issue in H.264 decoding.

1 H.264 deblocking filter

1.1 Filtering principle

A large quantization step size causes a relatively large quantization error, which can turn the originally continuous grayscale transition between pixels at the border of adjacent blocks into a step change, perceived subjectively as a "pseudo-edge" blocking artifact. Deblocking restores these step-like grayscale changes to changes with very small, approximately continuous steps while keeping the total energy of the image unchanged; at the same time, it must minimize damage to real image edges.

1.2 Adaptive filtering process

In H.264, the deblocking filter operates on 16×16-pixel macroblocks.
Within a macroblock, the edges between the 4×4 sub-blocks are filtered first vertically and then horizontally, so that all edges (except image edges) in the entire reconstructed image are filtered. The edge layout is shown in Figure 1. A 16×16-pixel luminance macroblock has 4 vertical edges and 4 horizontal edges, each divided into 16 pixel edges. The corresponding 8×8-pixel chrominance macroblock has 2 vertical edges and 2 horizontal edges, each divided into 8 pixel edges. The pixel edge is the basic unit of filtering.
1.2.1 Adaptability of the filter at two levels
The deblocking filter in H.264 achieves a good filtering effect because of its adaptability at the following two levels.

1) Adaptability of the filter at the 4×4 sub-block level

Filtering is based on the pixel edges of each sub-block. A parameter BS (boundary strength) is defined for each pixel edge to adaptively adjust the strength of the filter and the pixels involved; the pixel edge strength of a chrominance block is the same as that of the corresponding luminance pixel edge. Let P and Q be two adjacent 4×4 sub-blocks; the strengths of the pixel edges between them are obtained by the steps in Figure 2. The larger the BS value, the stronger the filtering on both sides of the corresponding edge. BS is set according to the cause of the blocking artifact: for example, since sub-blocks coded in intra-frame prediction mode show more obvious artifacts, a larger pixel edge strength value is assigned to their edges so that they are filtered strongly.
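The decision steps of Figure 2 can be sketched in C as follows. The SubBlock structure and its field names are illustrative, not taken from the JM reference code; the decision order follows the usual H.264 boundary-strength rules for frame-coded baseline streams.

```c
#include <stdlib.h>

/* Hypothetical descriptor for a 4x4 sub-block; the field names are
 * illustrative, not from the JM reference code. */
typedef struct {
    int is_intra;    /* sub-block coded in intra mode                */
    int has_coeffs;  /* non-zero transform coefficients present      */
    int ref_frame;   /* reference frame index (inter blocks)         */
    int mv_x, mv_y;  /* motion vector, quarter-pel units             */
} SubBlock;

/* Boundary strength for the edge between P and Q, following the
 * usual H.264 decision order (baseline profile, frame coding). */
int boundary_strength(const SubBlock *p, const SubBlock *q, int mb_edge)
{
    if (p->is_intra || q->is_intra)
        return mb_edge ? 4 : 3;          /* intra: strongest filtering */
    if (p->has_coeffs || q->has_coeffs)
        return 2;                        /* residual on either side    */
    if (p->ref_frame != q->ref_frame ||
        abs(p->mv_x - q->mv_x) >= 4 ||   /* one integer sample or more */
        abs(p->mv_y - q->mv_y) >= 4)
        return 1;
    return 0;                            /* no filtering on this edge  */
}
```

The ordering matters: the intra tests come first because intra-coded edges show the strongest artifacts, matching the adaptive behavior described above.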
2) Adaptability of the filter at the pixel level
A good filtering effect can be achieved only by correctly distinguishing false edges, caused by quantization error and motion compensation, from real boundaries in the image. The grayscale jump across a real boundary is usually larger than that across a false one. The filter therefore distinguishes the two by setting a threshold α on the difference of the grayscale values of the pixels on the two sides of the edge, and a threshold β on the difference between adjacent pixels on the same side. The values of α and β depend mainly on the quantization step size. When the step size is large, the quantization error is also large, blocking artifacts are obvious, and false boundaries are easily produced, so the thresholds increase and the filtering conditions are relaxed; conversely, when the step size is small, the thresholds decrease. This is where the pixel-level adaptability lies. The placement of the sample points is shown in Figure 3. If all conditions are met, filtering begins.
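The per-edge test described above can be sketched as a small C predicate. The samples across the edge are p1 p0 | q0 q1; alpha and beta are passed in directly here rather than looked up from the standard's quantization-dependent tables.

```c
#include <stdlib.h>

/* One-dimensional view of the samples across an edge: p1 p0 | q0 q1.
 * alpha and beta come from the quantization-dependent tables in the
 * standard; here they are plain parameters for illustration. */
int edge_needs_filtering(int p1, int p0, int q0, int q1,
                         int alpha, int beta)
{
    /* A moderate jump across the edge combined with small differences
     * on each side indicates a quantization artifact, not a real edge. */
    return abs(p0 - q0) < alpha &&
           abs(p1 - p0) < beta &&
           abs(q1 - q0) < beta;
}
```

A genuine image edge produces a jump larger than alpha and is left alone, which is exactly the adaptive protection of real edges described above.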
In addition to these two levels of adaptability, the filter strength can also be adjusted at the slice level through the offsets LoopFilterAlphaC0Offset and LoopFilterBetaOffset. For example, at a low transmission bit rate the blocking artifacts are more obvious; if the receiver wants an image of relatively good subjective quality, the encoder can set these offsets in the slice header to positive values, increasing α and β, strengthening the filter, and improving the subjective quality of the image by removing the artifacts. Conversely, for high-resolution images, a negative offset can be transmitted to weaken the filter and preserve image detail as much as possible.

1.2.2 Filtering adjacent pixels according to the BS value of each pixel edge

If the current pixel edge meets the filtering conditions, the filter corresponding to its BS value is selected, and appropriate clipping operations are applied to prevent blurring of the image. When BS is 1, 2, or 3, a 4-tap linear filter takes P1, P0, Q0, and Q1 as input and produces new values of P0 and Q0; if a false boundary is detected further inside the blocks, the values of P1 and Q1 are also adjusted. When BS is 4, which corresponds to a macroblock edge in intra-frame coding mode, a stronger filter is used to enhance image quality. For the luminance component, if the condition (|P0-Q0| < (α>>2)+2) && (|P2-P0| < β) is met, a 5-tap filter is applied to P0 and P2 and a stronger 4-tap filter to P1; if the condition is not met, only a weaker 3-tap filter is applied to P0, and the values of P1 and P2 remain unchanged. For the chrominance component, a 3-tap filter is applied to P0 if the condition is met; otherwise no pixel values are modified. Q0, Q1, and Q2 are filtered in the same way as P0, P1, and P2.
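As a sketch of the BS = 4 luminance case described above, the following C routine applies the strong 5-/4-/5-tap filters to the P side when the edge condition holds, and the weak 3-tap filter otherwise. The filter equations follow the standard's strong-filter formulas; clipping and the BS = 1 to 3 path are omitted, and the sample layout p3 p2 p1 p0 | q0 q1 is assumed.

```c
#include <stdlib.h>

/* Strong (BS = 4) filtering of the P side of a luminance pixel edge.
 * Samples across the edge are p3 p2 p1 p0 | q0 q1; results are
 * written back through the pointers.  Clipping is omitted here. */
void strong_filter_p(int *p3, int *p2, int *p1, int *p0,
                     int q0, int q1, int alpha, int beta)
{
    if (abs(*p0 - q0) < ((alpha >> 2) + 2) && abs(*p2 - *p0) < beta) {
        /* 5-, 4- and 5-tap filters for p0, p1 and p2 respectively */
        int np0 = (*p2 + 2 * *p1 + 2 * *p0 + 2 * q0 + q1 + 4) >> 3;
        int np1 = (*p2 + *p1 + *p0 + q0 + 2) >> 2;
        int np2 = (2 * *p3 + 3 * *p2 + *p1 + *p0 + q0 + 4) >> 3;
        *p0 = np0; *p1 = np1; *p2 = np2;
    } else {
        /* weaker 3-tap filter; p1 and p2 stay unchanged */
        *p0 = (2 * *p1 + *p0 + q1 + 2) >> 2;
    }
}
```

The Q side is handled symmetrically, with the roles of the P and Q samples exchanged.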
2 Features and structure of the BF533

Our H.264 deblocking filter is implemented on ADI's Blackfin ADSP-BF533 processor. The Blackfin DSP family has the following main features:

a) Highly parallel computing units. The core of the Blackfin architecture is the DAU (data arithmetic unit), which contains two 16-bit MACs (multiply-accumulators), two 40-bit ALUs (arithmetic logic units), one 40-bit barrel shifter, and four 8-bit video ALUs. In a single clock cycle the two MACs can perform 16-bit by 16-bit multiplications on four independent data operands, and the 40-bit ALUs can operate on two 40-bit numbers or four 16-bit numbers, so the architecture flexibly supports 8-bit, 16-bit, and 32-bit data operations.

b) Dynamic power management. By varying its voltage and operating frequency, the processor can consume less power than comparable DSPs. The Blackfin architecture allows voltage and frequency to be adjusted independently, minimizing the energy consumed by each task and striking a good balance between performance and power. This makes it well suited to real-time video encoders/decoders, especially real-time motion video processing with strict power constraints.

c) High-performance address generation. Two DAGs (data address generators) produce the composite load/store addresses required by advanced DSP filtering operations, and support bit-reversed addressing, circular buffering, and other addressing modes that increase programming flexibility.

d) Hierarchical memory. The memory hierarchy shortens the core's access time to memory, achieving maximum data throughput, lower latency, and less processing idle time.

e) Dedicated video instructions. The instruction set provides operations commonly used in video compression standards, such as those supporting the DCT (discrete cosine transform) and Huffman coding.
These video instructions also eliminate the complex and error-prone communication between a host processor and a separate video codec chip. Together, these features help shorten time to market for end applications while reducing overall system cost. The ADSP-BF533 we use runs at a sustained 600 MHz and provides: a 4 GB unified address space; 80 kB of L1 instruction SRAM, of which 16 kB can be configured as a 4-way set-associative cache; two 32 kB banks of L1 data SRAM, half of which can be configured as cache; and a rich set of integrated peripherals and interfaces.

3 Optimized implementation of the H.264 deblocking filter on the BF533

The optimization of the deblocking filter on the Blackfin BF533 is divided into three levels: system-level, algorithm-level, and assembly-level.

3.1 System-level optimization

Turn on the compiler optimization options on the DSP platform, set optimization for maximum speed, and enable the Automatic Inlining and Interprocedural Optimization switches; these settings let the Blackfin BF533 hardware perform to its full potential.

3.2 Algorithm-level optimization

The deblocking filter of the JM8.6 reference model was suitably modified and ported to our existing H.264 baseline-profile decoder on the Blackfin BF533, and its time consumption was profiled on image sequences. For the Paris.cif, Mobile.cif, Foreman.cif, and Claire.cif sequences at a bit rate of about 400 kbit/s, the deblocking filter consumed roughly 1 600 M to 1 800 M clock cycles. Even after system-level optimization, the computational load remained very high and the efficiency very low, a considerable burden for a processor with a sustained clock rate of 600 MHz.
Analysis of the deblocking filter code in JM8.6 shows the main reasons for its inefficiency: a) the logic of the functions in the algorithm is complex, with many conditional tests, jumps, and function calls; b) the most time-consuming part, the inner function loops, contains a large amount of repeated computation, which sharply increases complexity; c) much of the data used by the algorithm, such as motion vectors and image luminance and chrominance data, is stored in the slower off-chip SDRAM, and the frequent accesses during filtering increase data-transfer time. The algorithm was improved as follows to address these costs.

3.2.1 Simplify the complex functions and loops in the original program

Code size and execution speed trade off against each other. Code can often be made very compact through conditional tests, but the added branching work slows it down; conversely, removing tests and unrolling the program often reduces the instruction cycles consumed, at the cost of longer code. The deblocking filter code in JM8.6 is short, so we simplified the relationships between its functions, trading an increase in code size for execution speed. For the most time-consuming loops, rewriting the loop form and unrolling multiple loops effectively reduces the computational complexity. Reducing the number of function calls and rewriting if-else statements are also effective optimizations.

3.2.2 Remove redundant code and repeated calculations from the reference code
a) The reference code, the deblocking filter module of JM8.6, can filter H.264 streams of all profiles and levels, but our decoder is baseline-profile and only performs filtering for I and P frames, so the parts of the reference code that handle B frames, SP/SI frames, field mode, and macroblock-adaptive frame/field mode can be removed.
b) When obtaining the filter strength BS and performing luminance/chrominance filtering, the program must obtain the availability of the macroblocks adjacent to the one containing the current sub-block (that is, whether each neighbouring macroblock can be used), which is done by calling the getNeighbour function. Since filtering proceeds edge by edge, vertically and then horizontally within the macroblock, this information is the same for every pixel edge of a given edge, so it can be obtained once per edge instead of being re-evaluated inside the loop. Moreover, the filtering algorithm only needs the availability of the macroblocks above and to the left of the current macroblock, so the redundant queries for the upper-left and upper-right macroblocks can be removed. In addition, when the function that obtains the horizontal filtering strength calls getNeighbour, the parameter luma is the constant 1, xN takes only the values -1, 3, 7, and 11, and yN lies in [0, 15]; under these conditions many of the if-else statements in getNeighbour are dead tests, and these redundant judgments consume many clock cycles. Finally, by analyzing the probability of each branch and testing the most probable branch first, the execution speed of the function is improved further.
The simplified getNeighbour function reduces to only a few statements, greatly reducing the amount of computation.
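Since the simplified listing is not reproduced in this text, here is a hedged C sketch of what such a reduced neighbour lookup might look like for vertical-edge filtering. The function name, the PixelPos structure, and its fields are illustrative, not the actual JM code.

```c
/* Hedged reconstruction of a reduced neighbour lookup: for
 * vertical-edge filtering the only interesting case is xN == -1
 * (the left neighbour); every other sample position (xN = 3, 7, 11)
 * stays inside the current macroblock.  Names are illustrative. */
typedef struct {
    int available;   /* neighbour macroblock usable for filtering */
    int mb_x, mb_y;  /* macroblock address of the neighbour       */
} PixelPos;

void get_neighbour_vert(int mb_x, int mb_y, int xN, PixelPos *pix)
{
    if (xN < 0) {                    /* sample left of the macroblock */
        pix->available = (mb_x > 0); /* unavailable at the image edge */
        pix->mb_x = mb_x - 1;
        pix->mb_y = mb_y;
    } else {                         /* xN in {3, 7, 11}: same MB */
        pix->available = 1;
        pix->mb_x = mb_x;
        pix->mb_y = mb_y;
    }
}
```

Because luma, xN, and yN are constrained as described above, all of the general-purpose tests of the original function collapse into this single branch.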
c) In the JM8.6 reference code, the BS values of the 16×4 = 64 pixel edges of a luminance macroblock (per filtering direction) are obtained one by one. Analysis of the BS decision conditions shows that the four pixel edges belonging to the same vertical or horizontal edge between two sub-blocks have equal BS values. Therefore, for each edge only the BS values of the 1st, 5th, 9th, and 13th pixel edges need to be computed and then copied to the remaining pixel edges. Since the BS computation sits inside a loop and involves many tests and calculations, this change greatly reduces the computational load.
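The improvement in c) can be sketched as follows: the strength is computed once per sub-block pair and copied to the other three pixel edges of the same edge. Here get_bs stands in for the real BS decision, and sample_bs is a dummy used only for illustration.

```c
/* Representative BS computation: instead of evaluating the strength
 * for each of the 16 pixel edges along one edge, compute it once per
 * 4x4 sub-block pair (the 1st, 5th, 9th, 13th pixel edges) and
 * replicate it to the other three pixel edges of that sub-block. */
void fill_edge_strengths(int bs[16], int (*get_bs)(int sub))
{
    int sub;
    for (sub = 0; sub < 4; sub++) {
        int s = get_bs(sub);            /* one decision per sub-block */
        bs[4 * sub]     = s;
        bs[4 * sub + 1] = s;
        bs[4 * sub + 2] = s;
        bs[4 * sub + 3] = s;
    }
}

/* Dummy strength function for illustration only: BS = sub-block index. */
static int sample_bs(int sub) { return sub; }
```

The expensive decision logic now runs 4 times per edge instead of 16, which is where the saving comes from.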
d) The loops in the reference code contain many statements that do not depend on the loop variables; moving them outside the loops avoids redundant computation.

3.2.3 Use BPP block processing to solve the problem of frequent off-chip data access

To keep frequent off-chip data accesses from limiting the program's running speed, BPP block processing is used. Three buffers are allocated in the on-chip L1 memory to hold the luminance, chrominance U, and chrominance V components to be filtered. Based on the pixel range that filtering each macroblock may touch, the 396 macroblocks of a CIF frame are divided into four categories:

Category A is the first macroblock, whose upper and left edges are both image edges; the luminance data read in before filtering is 16×16, and the chrominance data is two 8×8 blocks.
Category B is the rest of the first macroblock row, whose upper edge is an image edge; the luminance data read in is 16×20, and the chrominance data is two 8×12 blocks.
Category C is the rest of the first macroblock column, whose left edge is an image edge; the luminance data read in is 20×16, and the chrominance data is two 12×8 blocks.
Category D is all remaining macroblocks, whose upper and left edges both lie inside the current image; the luminance data read in is 20×20, and the chrominance data is two 12×12 blocks.

During filtering, the luminance and chrominance data are first read from the off-chip buffer into the three on-chip filter buffers, in the amount determined by the macroblock's category; filtering is then performed, and the result data are stored back to off-chip memory.
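The four-way classification can be sketched as a small C helper. The function name and the rows-by-columns output convention are illustrative; only the luminance read size is returned, since the chrominance sizes follow the same pattern.

```c
/* Classify a macroblock of a CIF frame into the four categories used
 * for block-wise prefetching, and report the luminance region
 * (rows x columns) to read in before filtering.  Macroblocks not on
 * the top row or left column need 4 extra rows/columns from their
 * upper and left neighbours. */
typedef enum { CAT_A, CAT_B, CAT_C, CAT_D } MbCategory;

MbCategory classify_mb(int mb_x, int mb_y, int *luma_h, int *luma_w)
{
    if (mb_x == 0 && mb_y == 0) { *luma_h = 16; *luma_w = 16; return CAT_A; }
    if (mb_y == 0)              { *luma_h = 16; *luma_w = 20; return CAT_B; }
    if (mb_x == 0)              { *luma_h = 20; *luma_w = 16; return CAT_C; }
    *luma_h = 20; *luma_w = 20; return CAT_D;
}
```

One call per macroblock replaces the scattered edge tests inside the filtering loops, which is also what reduces the pipeline stalls mentioned below.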
This method reduces the time spent on frequent off-chip accesses and improves running speed; in addition, classifying the macroblocks to be filtered reduces the pipeline stalls caused by the per-macroblock tests in the reference code, which also speeds up the program.

3.3 Assembly-level optimization

The Blackfin BF533 core supports C and C++, but the compiler's automatic translation of C into assembly is inefficient. Modules that are called frequently and run for a long time can therefore be rewritten by hand in efficient assembly language. The program is accelerated mainly in the following ways:

a) Replace local variables with register variables. In C, local variables are commonly used in subroutines and functions for temporary storage; at run time the compiler allocates memory for every declared local, and accessing a local means accessing memory, which is very slow compared with register access. The system's data and pointer registers can therefore replace locals that serve only as temporaries, largely avoiding the latency of memory accesses. However, since the registers available are quite limited relative to the number of locals, they must be used sparingly and efficiently.

b) Replace software loops with hardware loops. A software loop tests a condition at the top or bottom of a for or while loop to control starting, continuing, and ending the loop. The conditional branch is resolved dynamically, and every taken jump stalls the pipeline, while keeping the pipeline full is the key to efficient execution.
The Blackfin processor has dedicated hardware support for two-level nested zero-overhead hardware loops. No conditional branch is needed: the DSP hardware executes and terminates the loop according to a preset iteration count, keeping the pipeline full and improving speed.

c) Make full use of the data bus width. The external data bus of the Blackfin BF533 is 32 bits wide, so 4 bytes can be accessed per transfer. Exploiting the full access width, especially when moving large amounts of data, reduces the number of instruction cycles and so increases execution speed.

d) Use parallel and vector instructions efficiently. Parallel and vector instructions are a major feature of the Blackfin family. Parallel instructions exploit the processor's SIMD structure and the parallelism of its hardware resources to reduce the instruction count; with careful arrangement of the program, one parallel instruction can often replace two or three non-parallel instructions. Vector instructions use the full instruction width to apply the same operation to multiple data streams at once: two 16-bit arithmetic or shift operations, for example, can be implemented by a single 32-bit vector instruction, doing two cycles' work in one. For instance, R3 = ABS R1 (V) computes the absolute values of two 16-bit operands in a single instruction cycle.

e) Allocate data storage space sensibly.
On-chip and off-chip DSP data memories have complementary characteristics: on-chip memory is fast but small, while off-chip memory is large but slow. Sensible placement of data is therefore critical to running speed. Frequently used data should go on-chip where possible, and rarely used data off-chip. When off-chip data must be accessed, it should be laid out contiguously so that a large block can be read into the on-chip cache at once, avoiding the time wasted by repeated small off-chip reads.

4 Results of the optimized implementation

To measure the effect of the optimization, the deblocking filter C module from the JM8.6 reference code was added to the original decoder and its cycle counts were compared with those of the deblocking filter assembly module optimized at the system, algorithm, and assembly levels. The test image sequences are Claire.cif, Paris.cif, and Mobile.cif; the results are shown in Table 1.
As Table 1 shows, the optimized deblocking filter assembly module is about 7 times more efficient than the unoptimized C code from JM8.6.

5 Conclusion

This paper implements the H.264 deblocking filter through optimization at three levels: system, algorithm, and assembly. Good results were obtained in particular by improving the filtering algorithm, classifying the macroblocks to be filtered, and making full use of assembly-level techniques such as parallel and vector instructions. Running in the original H.264 decoder, the optimized deblocking filter module processes a 25-frame image sequence at about 400 kbit/s in roughly 250 M clock cycles, out of a total of about 700 M clock cycles for the whole decoder, giving a decoding speed of about 20 frames/s, which essentially meets the requirement of quasi-real-time decoding. Although this implementation is well optimized relative to the reference module, profiling shows that room for improvement remains in reading in the data to be filtered and writing back the filtered data, in the GetBs function that computes the BS values, and in the EdgeLoop function that performs the filtering. For the movement of data on and off chip, DMA can be used to transfer data while filtering proceeds, hiding the cycles consumed by data movement; the efficiency of the assembly implementations of GetBs and EdgeLoop can also be raised further. These two aspects are the next directions for improvement.
Keywords: H.264
Reference address: Realization and Optimization of Filter Based on ADSP-BF533 Processor
In existing block-based video coding systems, there are always block effects when the bit rate is low, and this is also true in the new video coding standard H.264. There are two main reasons for this block effect: first, after the block-based integer transform of the transformed residual coefficients, quantizing the transform coefficients with a large quantization step will cause discontinuity in the block edges of the decoded reconstructed image; second, the error caused by the interpolation operation in motion compensation causes the reconstructed image after the codec inverse transform to have a block effect. If not processed, the block effect will accumulate with the reconstructed frame, seriously affecting the image quality and compression efficiency. In order to solve this problem, the deblocking filter technology in H.264 uses a more complex adaptive filter to effectively remove this block effect. Therefore, how to optimize the deblocking filter algorithm in real-time video decoding, reduce the computational complexity, and improve the quality of the reconstructed image has become a key issue in H.264 decoding. 1 H.264 deblocking filter 1.1 Filtering principle A large quantization step size will cause a relatively large quantization error, which may turn the original grayscale continuity between pixels at the "border" of adjacent blocks into a "step" change, and subjectively there will be a "pseudo-edge" block effect. The method of deblocking is to restore these step-like step grayscale changes to grayscale changes with very small steps or approximately continuous steps while keeping the total energy of the image unchanged, and at the same time, it is necessary to minimize the damage to the real image edge. 1.2 Adaptive filtering process In H.264, the deblocking filter is performed in units of 16×16 pixel macroblocks. 
In the macroblock, the edges between each 4×4 sub-block are first vertically and then horizontally, so as to filter all edges (except image edges) in the entire reconstructed image. The specific edge diagram is shown in Figure 1. For a 16×16 pixel luminance macroblock, there are 4 vertical edges and 4 horizontal edges, and each edge is divided into 16 pixel edges. The corresponding 8×8 pixel chrominance macroblock has 2 vertical edges and 2 horizontal edges, and each edge is divided into 8 pixel edges. Pixel edge is the basic unit for filtering.
1.2.1 Adaptability of the filter at two levels
The deblocking filter in H.264 has a good filtering effect because of its adaptability at the following two levels. 1) Adaptability of the filter at the 4×4 sub-block level The filtering is based on the pixel edges in each sub-block. A parameter BS (edge strength) is defined for each pixel edge to adaptively adjust the strength of the filter and the pixels involved. The pixel edge strength of the chrominance block is the same as the corresponding luminance pixel edge strength. Assume that P and Q are two adjacent 4×4 sub-blocks, and the pixel edge strengths therein are obtained by the steps in Figure 2. The larger the value of BS, the stronger the filtering on both sides of the corresponding edge. This is set according to the reason for the block effect. For example, if the block effect of the sub-block using the intra-frame prediction mode is more obvious, a larger pixel edge strength value is set for the corresponding edge in the sub-block for strong filtering.
2) White adaptability of the filter at the pixel level
A good filtering effect can only be achieved by correctly distinguishing between false edges caused by quantization errors and motion compensation and real boundaries in the image. Usually, the pixel gradient difference on both sides of the real boundary is larger than the pixel gradient difference on both sides of the false boundary. Therefore, the filter determines the true and false boundaries by setting a threshold α for the gradient difference of the grayscale values of the pixels on both sides of the edge and a threshold β for the gradient difference of the grayscale values of adjacent pixels on the same side. The values of α and β are mainly related to the quantization step size. When the quantization step size is large, the quantization error is also large, the block effect is obvious, and false boundaries are easily generated. Therefore, the threshold value increases accordingly, and the filtering conditions are relaxed. On the contrary, when the quantization step size is small, the threshold value also decreases, which reflects the adaptability. The setting of the sampling points is shown in Figure 3. If all conditions are met, filtering begins.
In addition to these two adaptabilities, the intensity of the filter can also be adjusted by setting the coefficients LoopFilterAlphaC0Offset and LoopFilterBetaOffset at the slice level. For example, when the transmission bit rate is low, the block effect is more obvious. If the receiver wants an image with relatively good subjective quality, the encoder can set the filter offsets LoopFilterAlphaC0Offset and LoopFilterBetaOffset in the slice header information to positive values to increase α and β to strengthen the filter and improve the subjective quality of the image by removing the block effect. Or for high-resolution images, the filter can be weakened by transmitting a negative offset to keep the details of the image as much as possible. 1.2.2 Filtering adjacent pixels according to the BS value of each pixel edge If the current pixel edge meets the filtering conditions, the corresponding filter is selected according to its corresponding BS value for filtering and appropriate shearing operations are performed to prevent image blurring. When the BS value is 1, 2, or 3, a 4-tap linear filter is used to filter and adjust the input P1, P0, Q0, and Q1 to obtain new Q0 and P0. If there is a false boundary inside, the values of Q1 and P1 are further adjusted. When the BS value is 4, it corresponds to the macroblock edge in intra-frame coding mode, and a stronger filter should be used to enhance the image quality. For the luminance component, if the condition (| P0~Q0 | <((α》2)+2))&abs(P2-P0) is met, a 5-tap filter is selected to filter P0 and P2, and a stronger 4-tap filter is used to filter P1; if the condition is not met, only a weaker 3-tap filter is used to filter P0, while the values of P1 and P2 remain unchanged. For the chrominance component, if the above conditions are met, a 3-tap filter is applied to P0, and if the conditions are not met, all pixel values are not modified. The filtering operation for Q0, Q1, and Q2 is the same as that for P0, P1, and P2. 
2 Features and structure of BF533 Our H.264 deblocking filter is implemented on the Blackfin ADSP-BF533 processor of ADI. The Blackfin series DSP has the following main features: a) Highly parallel computing unit. The core of the Blackfin series DSP architecture is the DAU (data arithmetic unit), which includes two 16-bit MACs (multiplication accumulators), two 40-bit ALUs (arithmetic logic units), one 40-bit single barrel shifter, and four 8-bit video ALUs. Each MAC can perform 16-bit by 16-bit multiplication operations on four independent data operands in a single clock cycle. The 40-bit ALU can accumulate two 40-bit numbers or four 16-bit numbers. This architecture can flexibly perform 8-value, 16-bit, and 32-bit data operations. b) Dynamic power management. The processor can consume less power than other DSPs by changing the voltage and operating frequency. The Blackfin series DSP architecture allows independent adjustment of voltage and frequency, which minimizes the energy consumption of each task and has a good balance between performance and power consumption. It is suitable for the development of real-time video encoders/decoders, especially real-time motion video processing with strict requirements on power consumption. c) High-performance address generator. It has two DAGs (data address generators) for generating composite loading or storage units that support addresses for advanced DSP filtering operations. It supports bit-reversed addressing, circular buffering, and other addressing methods to increase programming flexibility. d) Hierarchical memory. Hierarchical memory shortens the core's access time to memory to achieve maximum data throughput, less latency, and reduced processing idle time. e) Unique video operation instructions. It provides operation instructions commonly used in video compression standards such as DCT (discrete cosine transform) and Huffman coding. 
These video instructions also eliminate the complex and error-prone communication between a host processor and a separate video codec. Together, these features help shorten time to market while reducing overall system cost. The ADSP-BF533 we use runs continuously at 600 MHz and has a 4 GB unified address space; 80 kB of L1 instruction SRAM, of which 16 kB can be configured as a 4-way set-associative cache; two 32 kB L1 data SRAMs, half of each configurable as cache; and a rich set of integrated peripherals and interfaces.

3 Optimized implementation of the H.264 deblocking filter on the BF533

The optimized implementation of the deblocking filter on the Blackfin BF533 proceeds at three levels: system-level optimization, algorithm-level optimization, and assembly-level optimization.

3.1 System-level optimization

Turn on the compiler optimization options of the DSP platform, set optimization for maximum speed, and enable the Automatic Inlining and Interprocedural Optimization switches; these settings let the hardware capabilities of the Blackfin BF533 be fully exploited.

3.2 Algorithm-level optimization

The deblocking filter of the JM8.6 reference model was suitably modified and ported to our original H.264 baseline-profile decoder on the Blackfin BF533, and its time consumption was profiled on image sequences. For the Paris.cif, Mobile.cif, Foreman.cif, and Claire.cif sequences at a bit rate of about 400 kbit/s, the deblocking filter consumed the equivalent of about 1 600 MHz to 1 800 MHz of processor load. Even after system-level optimization, the computational complexity remained very high and the efficiency very low, a considerable burden for the 600 MHz sustained operating frequency of the Blackfin BF533 processor.
Analysis of the deblocking filter program in JM8.6 shows the main reasons for its inefficiency: a) the logic of the functions in the algorithm is complex, with many judgments, jumps, and function calls; b) in the most time-consuming part, there is a great deal of repeated calculation inside the function loops, which sharply increases the computational complexity; c) much of the data used by the algorithm, such as motion vectors and image luminance and chrominance data, is stored in the slower off-chip SDRAM, and the frequent accesses during filtering add data-transfer time. In view of these causes, the algorithm was improved as follows.

3.2.1 Simplify the complex functions and loops in the original program

Instruction length and execution speed constrain each other. Code can often be made very compact through conditional judgments, but the added judgment work slows the machine down; conversely, removing judgments and expanding the program often reduces the instruction cycles consumed, at the cost of longer code. Accordingly, the deblocking filter code from JM8.6 was expanded and the relationships between its functions simplified, trading an increase in code length for execution speed. For the loops that consume the most running time, appropriately rewriting the loop form and unrolling loops effectively reduces the computational cost. Reducing the number of function calls and rewriting if-else statements are also effective optimizations.

3.2.2 Remove the large amount of redundant code and repeated calculation in the reference code
a) The reference code used is the deblocking filter module of JM8.6, which can filter H.264 streams of all profiles and levels. Since our decoder is based on the baseline profile, which involves only the filtering of I frames and P frames, the parts of the reference code that handle B frames, SP/SI frames, field mode, and frame/field-adaptive mode can be removed.
b) When deriving the filter strength BS and performing luminance/chrominance filtering, the program must obtain the availability information of the macroblocks adjacent to the macroblock containing the current sub-block (that is, whether each neighbouring macroblock can be used, determined by calling the getNeighbour function). Since filtering proceeds edge by edge, first vertically and then horizontally within the macroblock, this information is identical for all pixel edges of one edge, so it need be obtained only once per edge rather than re-judged inside the loop. Moreover, the filtering algorithm needs the availability of only the macroblocks above and to the left of the current macroblock, so the redundant retrieval of the upper-left and upper-right neighbours can be removed. In addition, when the function that derives the horizontal filter strength calls getNeighbour, the parameter luma is the constant 1, xN takes only the values -1, 3, 7, and 11, and yN lies in [0, 15]. Under these conditions many of the if-else statements in getNeighbour are dead judgments that waste many clock cycles. Finally, the probability of each branch was analyzed and the most probable branch placed first, which also improves the execution speed of the function.
The simplified getNeighbour function contains only a few statements, greatly reducing the amount of calculation.
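The original listing is not reproduced in this copy of the article; the fragment below is a hypothetical reconstruction based only on the constraints stated above (luma fixed at 1, xN in {-1, 3, 7, 11}, yN in [0, 15], raster-order macroblock addressing), so only the "left neighbour" test survives:

```c
/* Hypothetical reconstruction of the simplified getNeighbour for
 * horizontal filter-strength derivation; names and layout are
 * illustrative, not the paper's actual code. */
typedef struct {
    int available;  /* neighbouring macroblock usable?              */
    int mb_addr;    /* address of the macroblock containing (xN,yN) */
    int x, y;       /* position of the sample inside that MB        */
} PixelPos;

void get_neighbour_horiz(int cur_mb_addr, int mb_left_available,
                         int xN, int yN, PixelPos *pix)
{
    if (xN < 0) {                     /* only xN == -1 leaves the MB */
        pix->available = mb_left_available;
        pix->mb_addr   = cur_mb_addr - 1;  /* raster-order left MB  */
        pix->x         = xN + 16;          /* wrap into left MB     */
    } else {                          /* xN in {3, 7, 11}: inside   */
        pix->available = 1;
        pix->mb_addr   = cur_mb_addr;
        pix->x         = xN;
    }
    pix->y = yN;                      /* yN never leaves the MB     */
}
```

All the dead if-else branches of the general-purpose JM function (upper, upper-left, upper-right neighbours, field modes) disappear under these parameter constraints.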
c) In the JM8.6 reference code, the BS values of the 64 (4×16) pixel edges of a luminance macroblock are obtained one by one. Analysis of the BS derivation conditions shows that the four pixel edges lying on the same vertical or horizontal boundary between two sub-blocks have equal BS values. Therefore, for each edge only the BS values of the 1st, 5th, 9th, and 13th pixel edges need to be derived; they are then assigned to the corresponding remaining pixel edges. Since the BS derivation sits inside a loop and requires many judgments and calculations, this improvement greatly reduces the computational complexity.
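The replication step can be sketched as below; sub_bs stands in for the four per-sub-block-boundary BS values produced by the (expensive) derivation, and the helper name is illustrative, not from the paper:

```c
/* Expand four per-sub-block BS values into the 16 per-pixel-edge BS
 * values of one macroblock edge: the derivation runs 4 times instead
 * of 16, and each result is copied to the other three pixel edges of
 * its sub-block boundary. */
void fill_edge_bs(const int sub_bs[4], int bs[16])
{
    int i, j;
    for (i = 0; i < 4; i++)            /* one BS per 4x4 boundary   */
        for (j = 0; j < 4; j++)
            bs[4 * i + j] = sub_bs[i]; /* pixel edges 4i..4i+3 share it */
}
```

The saving comes from the left column of the loop: the judgment-heavy derivation now executes only on pixel edges 1, 5, 9, and 13 of each edge.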
d) Many statements inside the loops in the reference code do not depend on the loop parameters; these can be moved outside the loops to avoid redundant calculation.

3.2.3 Use BPP block processing to solve the problem of frequent off-chip data accesses

To solve the problem that frequent off-chip data accesses slow the program down, BPP block processing is used. Three regions are opened in the on-chip L1 memory to hold the luminance component, the chrominance U component, and the chrominance V component to be filtered. According to the range of pixels that filtering each macroblock may involve, the 396 macroblocks of a CIF frame are divided into 4 categories:

Category A is the first macroblock, whose upper and left edges are both image edges; the luminance data read in before filtering is 16×16, and the chrominance data is two 8×8 blocks.

Category B comprises the remaining macroblocks of the first macroblock row, whose upper edge is an image edge; the luminance data read in is 16×20, and the chrominance data is two 8×12 blocks.

Category C comprises the remaining macroblocks of the first macroblock column, whose left edge is an image edge; the luminance data read in is 20×16, and the chrominance data is two 12×8 blocks.

Category D comprises all other macroblocks, i.e. those whose upper and left edges both lie inside the current image; the luminance data read in is 20×20, and the chrominance data is two 12×12 blocks.

When filtering, the luminance and chrominance data are first read from the off-chip buffer, in amounts depending on the macroblock category, into the three on-chip filter buffers; filtering is then performed, and the result data are stored back to off-chip memory.
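The classification above reduces to two independent tests, sketched here with sizes given as rows × columns to match the 16×20 / 20×16 figures in the text; luma_read_size is an illustrative helper, not code from the paper (chrominance follows the same pattern with 8 or 12):

```c
/* Luminance read size for one macroblock at grid position
 * (mb_x, mb_y): a left neighbour adds 4 extra columns and an upper
 * neighbour adds 4 extra rows, because filtering rewrites up to
 * 4 samples on the neighbour's side of the shared edge. */
void luma_read_size(int mb_x, int mb_y, int *rows, int *cols)
{
    *cols = (mb_x == 0) ? 16 : 20;  /* category A/C : B/D */
    *rows = (mb_y == 0) ? 16 : 20;  /* category A/B : C/D */
}
```

Because the category is known before the transfer starts, the judgments move out of the per-pixel filtering loops and into a single block-copy setup per macroblock.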
This method reduces, to a certain extent, the time spent on frequent off-chip data accesses and improves running speed; in addition, by classifying the macroblocks to be filtered, it reduces the pipeline stalls caused by the judgments in the reference code, which also speeds up the program.

3.3 Assembly-level optimization

The core of the Blackfin BF533 processor supports C and C++, but the compiler's automatic translation of C into assembly is inefficient. Modules that are called frequently and consume much time can therefore be hand-coded in efficient assembly language to improve running speed. The program speed is improved mainly in the following ways:

a) Replace local variables with register variables. In C, local variables are often used in subroutines and functions to hold data temporarily. At run time the compiler allocates temporary memory for all declared local variables, and accessing them involves memory accesses, which are very slow compared with register accesses. The system's data registers and pointer registers can therefore replace local variables that serve only as temporary storage, greatly saving the delay caused by memory accesses. However, since the number of registers is quite limited relative to the number of local variables, registers must be used sensibly and efficiently.

b) Replace software loops with hardware loops. A software loop sets judgment conditions at the beginning or end of a for or while loop to control its start, continuation, and termination. The conditional judgment instructions of a software loop select branches dynamically; once a jump occurs it stalls the pipeline, and keeping the pipeline unblocked is the key to maintaining efficient operation.
The Blackfin processor has dedicated hardware support for two-level nested zero-overhead hardware loops. This method needs no conditional-branch judgments: the DSP hardware automatically executes and terminates the loop according to the predetermined loop count, keeping the pipeline flowing and improving speed.

c) Make full use of the data bus width. The external data bus of the Blackfin BF533 is 32 bits wide, so 4 bytes can be accessed at a time. Making full use of this bus width, especially when operating on large amounts of data, by transferring 4 bytes per access reduces the number of instruction cycles and thereby improves execution speed.

d) Use parallel and vector instructions efficiently. Parallel and vector instructions are a major feature of the Blackfin series. Parallel instructions exploit the SIMD structure of the Blackfin processor and the parallel processing capability of its hardware resources to reduce the instruction count; with reasonable arrangement of the program, one parallel instruction can often replace two or three non-parallel instructions. Vector instructions make full use of the instruction width by performing the same operation on multiple data streams at once: two 16-bit arithmetic or shift operations, for example, can be implemented by a single 32-bit vector instruction, completing in one clock cycle what originally took two. For instance, R3 = ABS R1 (V) implements the absolute value of two 16-bit data items in a single instruction cycle.

e) Allocate data storage space reasonably.
Given the access-speed and capacity characteristics of the DSP's on-chip and off-chip data storage, on-chip space is fast but small, while off-chip space is large but slow. Reasonably allocating data storage locations is therefore critical to the running speed of the program. Frequently used data should be placed on chip wherever possible, and infrequently used data off chip. When off-chip data must be accessed, the data should as far as possible be arranged contiguously so that a large block can be read into the on-chip cache in one operation, avoiding the time wasted by frequent off-chip reads.

4 Results of the optimized implementation

The optimization effect was tested by adding the deblocking filter C module from the reference code JM8.6 to the original decoder and comparing its cycle counts with those of the deblocking filter assembly module optimized at the system, algorithm, and assembly levels. The test image sequences were Claire.cif, Paris.cif, and Mobile.cif. The test data are shown in Table 1.
As can be seen from Table 1, compared with the unoptimized C code from JM8.6, the efficiency of the optimized deblocking filter assembly module increased by about 7 times.

5 Conclusion

This paper implements the H.264 deblocking filter through optimization at three levels: system, algorithm, and assembly. Good results are achieved in particular by improving the deblocking filter algorithm, classifying the macroblocks to be filtered, and making full use of assembly-level techniques such as parallel and vector instructions. Running on the original H.264 decoder, the optimized deblocking filter module filters a 25 frames/s image sequence at about 400 kbit/s using the equivalent of about 250 MHz of processor load, out of a total decoder load of about 700 MHz, so the decoder reaches a decoding speed of about 20 frames/s, basically meeting the requirement of quasi-real-time decoding. Although this implementation is well optimized compared with the reference module, time-consumption analysis shows there is still room for improvement in reading the data to be filtered and writing back the filtered data, in the GetBs function that obtains the BS value, and in the EdgeLoop function that performs the filtering. For the movement of data on and off the chip, DMA can be used to read and write data while filtering, offsetting the clock cycles consumed by data transfers; the efficiency of the assembly implementations of GetBs and EdgeLoop can also be improved further. These two aspects are the next directions for improvement.
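The DMA overlap proposed above is a classic ping-pong (double-buffer) scheme. The following is a minimal portable sketch of the idea, not the paper's implementation: dma_fetch stands in for an asynchronous Blackfin DMA transfer (here it completes immediately via memcpy), filter_mb is a trivial stand-in for the real deblocking filter, and all names are illustrative:

```c
#include <string.h>

#define MB_BYTES 16  /* toy macroblock size for the sketch */
#define NUM_MB   4

/* Stand-in for starting a DMA transfer of one macroblock's data;
 * on real hardware this would return immediately while the DMA
 * engine moves the data in the background. */
static void dma_fetch(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, MB_BYTES);
}

/* Stand-in for the deblocking filter: inverts each sample. */
static void filter_mb(unsigned char *mb)
{
    int i;
    for (i = 0; i < MB_BYTES; i++)
        mb[i] = (unsigned char)(255 - mb[i]);
}

/* Ping-pong processing: while macroblock n is filtered in one
 * on-chip buffer, macroblock n+1 is fetched into the other, so the
 * data movement overlaps the computation. */
void process_frame(const unsigned char *frame, unsigned char *out)
{
    unsigned char buf[2][MB_BYTES];
    int cur = 0, n;

    dma_fetch(buf[cur], frame);               /* prefetch MB 0 */
    for (n = 0; n < NUM_MB; n++) {
        if (n + 1 < NUM_MB)                   /* overlap: fetch MB n+1 */
            dma_fetch(buf[cur ^ 1], frame + (n + 1) * MB_BYTES);
        filter_mb(buf[cur]);                  /* filter MB n */
        memcpy(out + n * MB_BYTES, buf[cur], MB_BYTES);
        cur ^= 1;                             /* swap ping/pong buffers */
    }
}
```

With a real asynchronous DMA engine, the cycles spent moving each macroblock's data would hide behind the filtering of the previous one, which is exactly the saving the conclusion anticipates.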