Vectorization of H.264 decoder based on Loongson 3B

0 Introduction

Society has entered the information age, and traditional information carriers and communication methods can no longer meet the demand for information. Experiments have shown that people take in more information from pictures and video than from voice or abstract data. Video is intuitive, concrete and efficient, which makes video communication one of the key technologies of the information age.

Because video data is very large and storage resources are usually limited, video must be compressed to reduce storage consumption. In general, however, the more complex the compression algorithm, the higher the compression ratio and the slower the decoding during playback. As the compression ratio improves, the decoder therefore has to be optimized for its target platform as well. This article ports the ffmpeg decoder to Loongson 3B and vectorizes it, improving the decoder's performance on that platform.

1 Video Encoding/Decoding and Loongson 3B

1.1 Video Encoding/Decoding

Many mature compression coding/decoding methods exist today. H.261, MPEG-1, MPEG-2 and H.263 use first-generation compression coding methods, such as predictive coding, transform coding, entropy coding and motion compensation, while MPEG-4 and H.264 use second-generation methods, such as segmentation-based coding and model-based or object-based coding.

The main purpose of video compression coding is to reduce the storage that video occupies, while the goal of decoding technology is to raise decoding speed and so improve the smoothness of playback. Common software decoders for H.264 include CoreAVC, ffmpeg and JM. JM is the official H.264 reference codec; it implements a wide range of coding/decoding algorithms and has a clear code structure, which makes it well suited to research on video coding/decoding. CoreAVC is mainly a commercial decoder, and its decoding rate is more than 50% faster than that of ffmpeg. ffmpeg is an open source decoder with relatively good performance, and many open source projects, such as the mplayer player, use it directly or indirectly. Weighing performance against open source availability, this article chooses ffmpeg as the object of porting and vectorization.

1.2 Loongson 3B architecture

While being compatible with the MIPS64 instruction set, the Loongson 3B processor also implements vector extension instructions for multimedia applications, which greatly helps improve the performance of video encoding/decoding applications.

Loongson 3B provides 256-bit vector registers and vector extension instructions, including 256-bit vector memory-access instructions. A single vector instruction can operate on 32 bytes of data at once. This structure and instruction set make Loongson 3B well suited to large-scale operations on data of the same type, such as matrix multiplication, FFT, and video encoding/decoding.

However, ffmpeg does not support the Loongson 3B platform, so it must first be ported. Earlier work has ported ffmpeg to other platforms, and has ported and optimized it for the Loongson platform, with good results in each case.

2 ffmpeg porting based on Loongson 3B

2.1 ffmpeg porting

The ffmpeg decoder provides support for different target platforms, and the files related to these platforms are stored in directories named after the target platform. For example, the ffmpeg decoder implements support for the arm and sparc platforms, as well as the x86 platform.

Adding ffmpeg decoder support for Loongson 3B involves five main steps:

(1) Modify the configure script and add configuration options related to the Loongson architecture;

(2) Create a new godson folder for Loongson and store all files related to the Loongson architecture in it;

(3) Add the new files in the godson folder that need to be compiled to the Makefile;

(4) Add a new initialization function dsputil_init_godson, similar to dsputil_init;

(5) Add the declaration of the new function to the header file.
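Steps (4) and (5) can be pictured with a small, portable sketch. The struct DSPContextSketch, the add_clamped entry, the ARCH_GODSON switch and the one-argument signature of dsputil_init_godson are all assumptions made for illustration; the real ffmpeg context and initializer signatures differ, and the platform routine here is only a placeholder that reuses the C code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical switch; in a real port the configure script would define it. */
#ifndef ARCH_GODSON
#define ARCH_GODSON 1
#endif

/* Reduced, hypothetical stand-in for the decoder's table of DSP function
 * pointers. */
typedef struct DSPContextSketch {
    void (*add_clamped)(uint8_t *dst, const int16_t *residual, size_t n);
} DSPContextSketch;

/* Declaration that step (5) would place in a shared header. */
void dsputil_init_godson(DSPContextSketch *c);

/* Portable C routine installed by the generic initializer. */
static void add_clamped_c(uint8_t *dst, const int16_t *residual, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = dst[i] + residual[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Placeholder for the platform routine. A real port would use Loongson 3B
 * vector instructions here; this sketch reuses the C code so it runs on any
 * machine. */
static void add_clamped_godson(uint8_t *dst, const int16_t *residual, size_t n)
{
    add_clamped_c(dst, residual, n);
}

/* Step (4): platform initializer that overrides selected table entries. */
void dsputil_init_godson(DSPContextSketch *c)
{
    assert(c != NULL);
    c->add_clamped = add_clamped_godson;
}

/* Generic initializer: install the C defaults, then let the platform
 * initializer replace the entries it accelerates. */
void dsputil_init_sketch(DSPContextSketch *c)
{
    assert(c != NULL);
    c->add_clamped = add_clamped_c;
    if (ARCH_GODSON)
        dsputil_init_godson(c);
}
```

The decoder only ever calls through the table, so replacing an entry is enough to route a hot function to its platform-specific version without touching the callers.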

The ffmpeg porting work for Loongson 3B is relatively simple, so this article focuses on the vectorization work for Loongson 3B.

2.2 Performance comparison of ported ffmpeg

This section tests the performance of the ported ffmpeg decoder, comparing builds with and without the Loongson 3B vector extension instructions. The decoder was compiled with a GCC compiler that supports the Loongson 3B extended instruction set, with the -ftree-vectorize and -march=godson3b options enabled so that the extension instructions are used. The test case is the video "walk_vag_640x480_qp26.264", and the results are shown in Table 1.

The results in Table 1 show that the Loongson 3B vector extension instructions improve the performance of the ffmpeg decoder on Loongson 3B: the decoding time of the test video falls by about 466 s. However, the automatic vectorization capability of GCC is limited, so the gain remains modest. Manually vectorizing the ported ffmpeg decoder for the Loongson 3B instruction set is therefore the key task for improving performance further.

3 ffmpeg vectorization

3.1 ffmpeg oprofile test

The decoding of the video "Wendao Wudang 002.mkv" by ffmpeg was profiled with oprofile, and the results are shown in Table 2, which lists the functions called and the proportion of running time each accounts for. The vectorization work concentrates on the functions with the longest execution time and the largest share of running time in these results.

The functions above account for almost 60% of the execution time of the ffmpeg decoder, so vectorizing them can substantially improve the overall decoding speed of ffmpeg.

3.2 ffmpeg vectorization for Loongson 3B

3.2.1 Vectorization Method

The ffmpeg decoder is vectorized on Loongson 3B mainly by rewriting, with the Loongson 3B vector extension instructions, the functions that take the largest share of execution time in the oprofile results of Section 3.1. Additional optimization strategies are applied alongside vectorization to improve the vectorized functions further. The main ones are:

(1) Loop unrolling. Loop unrolling is a loop transformation that replicates the instructions of the loop body several times, enlarging the loop body and reducing the number of iterations. Note that loop unrolling by itself does not directly improve program performance.

The main purpose of loop unrolling is to expose parallelism between instructions or data. Vector extension instructions exploit the data parallelism in the unrolled loop, while instruction scheduling and software pipelining exploit its instruction-level parallelism.

(2) Instruction scheduling. After loop unrolling the loop body contains more instructions, so instructions that have no operand dependencies and do not compete for the same functional unit can be scheduled together. This makes full use of the Loongson 3B pipeline and speeds up the code on Loongson 3B.
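The two strategies above can be sketched in portable C. The example is a sum of absolute differences over bytes, chosen only for illustration and not taken from the decoder; the function names are hypothetical. The unrolled version keeps four independent accumulators and groups the loads ahead of the arithmetic, which is the source-level counterpart of scheduling independent instructions together. In C the compiler still decides the final instruction order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference loop: every iteration depends on the previous value of `sum`. */
unsigned sad_rolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* Unrolled by four. The four accumulators do not depend on each other, so
 * their updates can overlap in the pipeline. */
unsigned sad_unrolled4(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0); /* sketch assumes a multiple of the unroll factor */

    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        /* Differences first: four independent load/subtract pairs. */
        int d0 = a[i]     - b[i];
        int d1 = a[i + 1] - b[i + 1];
        int d2 = a[i + 2] - b[i + 2];
        int d3 = a[i + 3] - b[i + 3];

        /* Accumulation second: four independent dependency chains. */
        s0 += (unsigned)abs(d0);
        s1 += (unsigned)abs(d1);
        s2 += (unsigned)abs(d2);
        s3 += (unsigned)abs(d3);
    }
    return (s0 + s1) + (s2 + s3);
}
```

The four statements in each group also have exactly the shape that a single vector instruction replaces, which is why unrolling is usually the first step before introducing vector instructions.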

In addition to using the vector extension instructions of Loongson 3B and the above two optimization methods, other optimization methods can also be used according to the characteristics of specific functions, such as using logical operations and shift operations to replace multiplication operations. The vectorization optimization for each function is introduced in Section 3.2.2.

3.2.2 Vectorization of Specific Functions

Section 3.2.1 outlines some optimization methods used in vectorization. This section will focus on optimizing several functions that are important in the oprofile test.

For the functions in Table 2, we can classify them according to the function names. Functions with similar names can basically use similar optimization methods.

(1) Simple vectorization. For functions 1 and 2, this paper replaces multiplications with shift operations and uses saturating vector operations to handle the range clamping performed inside the loop. Function 2, however, accesses data that is not aligned, so additional vector instructions are needed to pack the data and write it back. Function 3 combines functions 1 and 2, so optimizing those two indirectly improves function 3 as well.

As for functions 4, 5, and 6, this paper only uses loop unrolling and instruction scheduling strategies for their inner loops to achieve good computing results.

Similarly, functions 11 and 12 can also be vectorized relatively intuitively, which will not be described in detail here.
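The shift-for-multiply and saturation ideas mentioned for functions 1 and 2 can be illustrated on the H.264 six-tap half-sample luma filter, whose taps are 1, -5, 20, 20, -5, 1 with a rounding term of 16 and a right shift by 5. Whether functions 1 and 2 are this filter is an assumption, since Table 2 is not reproduced here, and the function names below are hypothetical. The scalar clamp stands in for what a saturating vector instruction would do on every lane at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift right by 5 and clamp to 0..255. This is a scalar model of a
 * saturating conversion to unsigned bytes. `v` must already include the +16
 * rounding term. Negative values are clamped before shifting so the code
 * does not rely on the behaviour of >> on negative integers. */
static uint8_t shift5_clip_u8(int v)
{
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Reference version with multiplications. */
uint8_t sixtap_mul(const uint8_t p[6])
{
    assert(p != NULL);
    int v = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
    return shift5_clip_u8(v + 16);
}

/* Same filter with shifts and adds only:
 *   5 * x  = (x << 2) + x
 *   20 * x = (x << 4) + (x << 2)
 * Taps with equal weights are summed first to halve the work. */
uint8_t sixtap_shift(const uint8_t p[6])
{
    assert(p != NULL);
    int s05 = p[0] + p[5]; /* weight   1 */
    int s14 = p[1] + p[4]; /* weight  -5 */
    int s23 = p[2] + p[3]; /* weight  20 */
    int v = s05 - ((s14 << 2) + s14) + ((s23 << 4) + (s23 << 2));
    return shift5_clip_u8(v + 16);
}
```

Both versions return the same value for every input; the second form matters on vector units where shifts and adds are cheaper, or more widely available, than element-wise multiplication.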

(2) Indirect vectorization. For functions 7 and 8, which are more difficult to vectorize, this paper adopts the strategies of using masks and matrix transposition operations to indirectly achieve vectorization.

The C implementation of the h264_v_loop_filter_luma function contains many conditional statements. This article eliminates them by constructing a mask.

Mask construction is illustrated with the loop in Figure 1(a). Figure 1(b) shows the vector instructions that replace the loop, and Figure 1(c) shows the result of each step. First, a saturated subtraction is performed on the p vector (array) and the q vector, with negative results set to 0; the result vector is Vsub. Vsub is then compared with the zero vector to set the mask: where the comparison is true the mask value is 0xFF, and where it is false the mask value is 0. Finally, the mask is ANDed with the p vector to give the result of the loop.

Eliminating the conditional statements with a mask not only removes the time overhead of the branches but also indirectly vectorizes the loop and improves the function's performance. The same method can be used to improve functions 9 and 10.
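The mask sequence described above can be modelled lane by lane in portable C. Figure 1(a) is not reproduced in this text, so the shape of the original loop (keep p[i] where p[i] > q[i], otherwise write 0) and the direction of the comparison (Vsub not equal to zero) are assumptions, and the function names are hypothetical. Each of the three steps in the second function corresponds to one vector instruction applied to all lanes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed form of the original loop with a conditional in its body. */
void select_branchy(uint8_t *r, const uint8_t *p, const uint8_t *q, size_t n)
{
    assert(r != NULL && p != NULL && q != NULL);
    for (size_t i = 0; i < n; i++) {
        if (p[i] > q[i])
            r[i] = p[i];
        else
            r[i] = 0;
    }
}

/* Per-lane model of the three vector steps: saturated subtract, compare
 * against zero to build the mask, AND the mask with p. */
void select_masked(uint8_t *r, const uint8_t *p, const uint8_t *q, size_t n)
{
    assert(r != NULL && p != NULL && q != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Step 1: unsigned saturated subtraction, negative results become 0.
         * On a vector unit this is one instruction, not a branch. */
        uint8_t vsub = (uint8_t)(p[i] > q[i] ? p[i] - q[i] : 0);

        /* Step 2: compare with zero; true lanes get 0xFF, false lanes 0. */
        uint8_t mask = (uint8_t)(vsub != 0 ? 0xFF : 0x00);

        /* Step 3: AND keeps p where the mask is set and clears it elsewhere. */
        r[i] = (uint8_t)(p[i] & mask);
    }
}
```

Because every lane executes the same three operations regardless of the data, the control flow of the original loop disappears and the work maps directly onto vector instructions.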

Function 8 operates on consecutive data elements, so it cannot be vectorized directly. Matrix transposition repacks the data so that the corresponding vector operations can be applied.

In Figure 2(a), the original calculation operates on elements inside the p vector, so it cannot be vectorized as written. Vector instructions are used to transpose the p vector into q, where q0 holds the data labeled 1 in p, q1 holds the data labeled 2 in p, and so on. The transposed q vectors can then be processed with the vector instructions in Figure 2(b), and the result is the same as that of the original operation.
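The transposition idea can be sketched in portable C as follows. Figure 2 is not reproduced in this text, so the per-group formula, the group size of four and all names are hypothetical; only the data movement follows the description above. In the direct version each result mixes neighbouring elements of p. After transposition the same formula combines whole rows q0..q3 lane by lane, which is the form a vector instruction can execute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUPS 4 /* groups processed together, one per vector lane */
#define ELEMS  4 /* consecutive elements in each group */

/* Direct form: every output combines neighbouring elements inside p. */
void combine_direct(int16_t out[GROUPS], const uint8_t p[GROUPS * ELEMS])
{
    assert(out != NULL && p != NULL);
    for (int g = 0; g < GROUPS; g++) {
        const uint8_t *e = p + g * ELEMS;
        out[g] = (int16_t)(4 * (e[2] - e[1]) + (e[0] - e[3]));
    }
}

/* Repack p so that q[k] holds element k of every group. */
static void transpose_groups(uint8_t q[ELEMS][GROUPS],
                             const uint8_t p[GROUPS * ELEMS])
{
    for (int g = 0; g < GROUPS; g++)
        for (int k = 0; k < ELEMS; k++)
            q[k][g] = p[g * ELEMS + k];
}

/* Transposed form: the loop body applies one operation to lane g of the
 * rows q[0]..q[3], so the whole loop corresponds to a few vector
 * instructions working on all lanes at once. */
void combine_transposed(int16_t out[GROUPS], const uint8_t p[GROUPS * ELEMS])
{
    assert(out != NULL && p != NULL);
    uint8_t q[ELEMS][GROUPS];
    transpose_groups(q, p);
    for (int g = 0; g < GROUPS; g++)
        out[g] = (int16_t)(4 * (q[2][g] - q[1][g]) + (q[0][g] - q[3][g]));
}
```

The transposition itself costs extra instructions, so the approach pays off only when the repacked data is then used for enough vector work.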

The above transposition method is also used for the optimization of functions 13 to 15. The test results in Section 4.1 illustrate the optimization effects of each function.

4 Experimental Results

4.1 ffmpeg function speedup ratio

This paper tests each function after vectorization and compares it with the function before vectorization. The speedup ratio of each function after vectorization optimization is shown in Figure 3. The function numbers shown on the horizontal axis in the figure correspond to the functions in Table 2.

The speedup ratios in Figure 3 span a wide range: function 6 reaches about 23.9, while the last function reaches only about 1.2. The spread depends on the number of vector instructions used in the improved function and the proportion of code modified, and also on the operand type. In function 6 the loop operates on byte operands, so vector instructions alone could in theory give a speedup of 32. However, this paper vectorizes only the inner loop of the function, and the vectorized inner loop processes only 16 bytes at a time, so the 256-bit vector register is not fully used. The theoretical speedup is therefore 16, and it is the combination with loop unrolling and instruction scheduling that raises the measured speedup to about 23.9. Functions 4, 5 and 6 are of the same type, and each achieves about twice the speedup of the one before it. The reason is that after vectorization the inner loop of function 4 handles 4 bytes at a time, that of function 5 handles 8 bytes, and that of function 6 handles 16 bytes, so the theoretical speedups form a geometric progression with ratio 2, and the measured results agree with this analysis.
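The arithmetic behind these figures can be written out as follows. The symbols are introduced here only for illustration, and the last ratio is simply the measured value divided by the lane count; reading it as the share contributed by loop unrolling and instruction scheduling follows the explanation in the text rather than a separate measurement.

```latex
\begin{align*}
S_{\text{full}}  &= \frac{256\ \text{bits per register}}{8\ \text{bits per byte operand}} = 32, \\
S_{\text{inner}} &= 16 \quad \text{(bytes handled per vector operation in function 6)}, \\
\frac{S_{\text{measured}}}{S_{\text{inner}}} &= \frac{23.9}{16} \approx 1.49, \\
\frac{8\ \text{bytes}}{4\ \text{bytes}} &= \frac{16\ \text{bytes}}{8\ \text{bytes}} = 2
  \quad \text{(functions 4, 5 and 6)}.
\end{align*}
```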

Functions 7 and 8, introduced in Section 3.2.2, cannot be vectorized directly in their original form. This paper uses masks and matrix transposition so that they can use the Loongson 3B vector extension instructions. Although the performance improvement is modest, their speedup ratios are 3.2 and 5.5 respectively.

4.2 Comparison of vectorization on different platforms

This article also tests the ffmpeg decoder on other platforms. The two test videos are "Wendao Wudang 002.mkv" (video A) and "walk_vag_640x480_qp26.264" (video B). Video A is a 720p clip from the "Wendao Wudang" video, while video B was generated by encoding walk_vag.yuv (480p) with x264 at a qp value of 26. The comparison platforms are an AMD processor platform and an Intel processor platform.

From the test results in Table 3, we can see that for video A, the performance improvement on Loongson 3B is much higher than that on the other two platforms; and for video B, the performance improvement on Loongson 3B is also close to that on the other two platforms. The experimental results show that the vectorization of the ffmpeg decoder on Loongson 3B is of great help to the performance improvement, and when decoding some videos, the performance improvement is even higher than that of commercial processors with superior performance. By comparing with the results of GCC vectorization compilation in Table 1, we can also see that manually vectorizing the ffmpeg decoder has a greater performance improvement than using GCC vectorization.

5 Summary and Outlook

This paper ports the ffmpeg decoder to Loongson 3B and manually vectorizes it using the Loongson 3B vector extension instructions. The experimental results show that the manually vectorized ffmpeg decoder performs much better than the decoder compiled with GCC automatic vectorization, and that for some videos the performance improvement exceeds that obtained on the Intel and AMD platforms.

This article vectorizes the ffmpeg decoder for Loongson 3B only at the code level. Further gains require optimization at the level of the whole algorithm. In addition, because Loongson 3B is a multi-core processor, decoding on multiple cores is also worth considering.

Keywords: H.264 ffmpeg decoder Godson3B