JPEG (Joint Photographic Experts Group) is a still-image compression standard with a wide range of applications; it is currently widely used for image processing in cameras, printers, and similar devices. In these applications, designing a high-speed, efficient JPEG decoder has become an important research direction. As the real-time, performance, and scalability requirements of embedded systems keep rising, multi-core embedded processors are being used more and more.
1 JPEG decoding algorithm principle
JPEG compression is lossy. It exploits the characteristics of the human visual system and combines quantization with lossless entropy coding to remove both visual redundancy and the redundancy in the data itself, thereby achieving compression. The JPEG algorithm is divided into basic JPEG (the Baseline system) and extended JPEG (the Extended system); the Baseline system is by far the most widely used, and this article discusses decoding for the Baseline system. The block diagram of the JPEG decoding algorithm is shown in Figure 1.
(1) Color space transformation
The JPEG algorithm itself is independent of color space, so the "RGB to YUV" and "YUV to RGB" transformations are not part of the JPEG algorithm proper. However, since the output bitmap data generally must be in RGB, the color space transformation is included in the algorithm block diagram.
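For reference, the decoder-side transformation follows the usual JFIF equations: R = Y + 1.402(Cr - 128), G = Y - 0.344(Cb - 128) - 0.714(Cr - 128), B = Y + 1.772(Cb - 128). Below is a minimal fixed-point sketch in C, with the constants scaled by 2^16; it is an illustration, not the article's actual implementation, and it assumes arithmetic right shift of negative values, as on typical embedded compilers.

#include <stdint.h>

/* Clamp an intermediate result to the valid 8-bit pixel range. */
static uint8_t clamp_u8(int v)
{
    return (v < 0) ? 0 : (v > 255) ? 255 : (uint8_t)v;
}

/* Convert one YCbCr sample to RGB; constants are the JFIF factors
   rounded after scaling by 2^16. */
static void ycbcr_to_rgb(int y, int cb, int cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int u = cb - 128, v = cr - 128;
    *r = clamp_u8(y + ((91881 * v) >> 16));             /* 1.402        */
    *g = clamp_u8(y - ((22554 * u + 46802 * v) >> 16)); /* 0.344, 0.714 */
    *b = clamp_u8(y + ((116130 * u) >> 16));            /* 1.772        */
}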
(2) JPEG encoding and decoding unit
In JPEG, images are encoded and decoded in blocks. The entire image is divided into 8×8 data blocks called minimum coded units (MCUs), each corresponding to an 8×8 pixel array of the original image. The rows of blocks are processed from top to bottom, and the blocks within a row from left to right.
It is worth noting that, since the height and width of an image are not necessarily integer multiples of the MCU size, the rightmost columns or bottommost rows of the image must be padded to expand its width or height so that the entire image divides into an integer number of MCUs; when decoding, these duplicated rows and columns are discarded on output.
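The required number of MCUs follows from ceiling division; here is a small sketch in C (the names are ours, for illustration):

#define MCU_SIZE 8

/* Number of 8x8 MCUs needed to cover the image; partial blocks at the
   right and bottom edges are padded up to a full MCU by the encoder. */
static void mcu_grid(int width, int height,
                     int *mcus_per_row, int *mcu_rows)
{
    *mcus_per_row = (width  + MCU_SIZE - 1) / MCU_SIZE;  /* ceil(width/8)  */
    *mcu_rows     = (height + MCU_SIZE - 1) / MCU_SIZE;  /* ceil(height/8) */
}

For example, a 100×60 image yields a 13×8 MCU grid: the encoder pads 4 extra columns and 4 extra rows, and the decoder discards them when writing the bitmap.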
(3) Entropy decoding
During JPEG entropy encoding, the DC value of each block is first differentially encoded to exploit spatial correlation: the DC difference between adjacent blocks is coded, shortening the code length. The AC coefficients of the block are then scanned in zigzag order and coded with a hybrid method of run-length encoding followed by Huffman encoding, producing a one-dimensional binary code stream. Entropy encoding thus consists of differential coding of the DC part, plus zigzag scanning, run-length coding, and Huffman coding of the AC part; the corresponding entropy decoding is the inverse of this process. What the decoder receives is a data stream of variable-length codes (VLC) and variable-length integers (VLI). To recover the DCT coefficients from this stream, a Huffman decoding table must be generated from the Huffman coding principle and the code tables carried at each level, after which the DC and AC coefficients of the DCT are restored according to the decoding algorithm.
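Two pieces of this process can be shown concretely: the standard zigzag scan order and the DC predictor that undoes the differential coding. A sketch in C follows; the Huffman/VLI decoding itself is omitted, and the function names are illustrative.

/* Zigzag scan order: entry i gives the row-major position inside the
   8x8 block of the i-th coefficient in the code stream. */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* One DC predictor per color component, reset to 0 at the start of a scan. */
static int dc_pred[3];

/* diff is the DC difference recovered from the VLC/VLI stream. */
static void place_dc(int comp, int diff, int block[64])
{
    dc_pred[comp] += diff;
    block[0] = dc_pred[comp];
}

/* value is an AC coefficient at zigzag position i (1..63), after
   run-length decoding has skipped the zero runs. */
static void place_ac(int i, int value, int block[64])
{
    block[zigzag[i]] = value;
}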
(4) Inverse quantization
At the JPEG decoder, the quantization tables sent in the file are used to restore the quantized values. JPEG files generally contain two quantization tables: one for the luminance component and one for the chrominance components. Dequantization multiplies the coefficient matrix obtained from entropy decoding, element by element, by the corresponding quantization matrix:

F(u, v) = C(u, v) × Q(u, v)

where C(u, v) is the entropy-decoded output and Q(u, v) is the corresponding quantization matrix entry.
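A minimal sketch of this step in C, operating on one block of 64 coefficients in scan order:

/* Dequantization: multiply each entropy-decoded coefficient C(u,v) by
   the corresponding quantization table entry Q(u,v). */
static void dequantize(int coef[64], const unsigned short qtable[64])
{
    int i;
    for (i = 0; i < 64; i++)
        coef[i] *= qtable[i];
}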
(5) IDCT
Whether a JPEG decoder can meet the requirements of real-time applications depends largely on the computation speed of the 8×8 two-dimensional IDCT. In the encoding stage, the forward discrete cosine transform (FDCT) transforms the image from the spatial domain to the frequency domain; correspondingly, in the decoding stage, the inverse discrete cosine transform (IDCT) transforms the frequency-domain representation back to the spatial domain.
There are many algorithms for implementing the IDCT. The traditional method is the row-column method: first perform a one-dimensional IDCT on each row (or column), then a one-dimensional IDCT on each column (or row). There are also polynomial-transform and trigonometric-identity methods, which need about the same number of additions as the row-column method but only half the multiplications. Their problem is implementation complexity: on the target VLIW platform, such structures make it hard to raise instruction-level parallelism. Moreover, on the target platform a multiplication instruction takes about as long as an addition, which lowers the cost of the row-column implementation.
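A reference sketch of the row-column method in C (floating point for clarity; a real FR1000 implementation would use a fixed-point variant):

#include <math.h>

#define N  8
#define PI 3.14159265358979323846

/* 1-D 8-point IDCT: out[x] = sum over u of C(u)/2 * in[u] *
   cos((2x+1)u*pi/16), with C(0) = 1/sqrt(2) and C(u) = 1 otherwise. */
static void idct_1d(const double in[N], double out[N])
{
    for (int x = 0; x < N; x++) {
        double s = 0.0;
        for (int u = 0; u < N; u++) {
            double c = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            s += c * in[u] * cos((2 * x + 1) * u * PI / 16.0);
        }
        out[x] = s / 2.0;
    }
}

/* Row-column 2-D IDCT: 1-D IDCT on every row, then on every column. */
static void idct_2d(double block[N][N])
{
    double tmp[N], col[N];
    for (int r = 0; r < N; r++) {
        idct_1d(block[r], tmp);
        for (int c = 0; c < N; c++) block[r][c] = tmp[c];
    }
    for (int c = 0; c < N; c++) {
        for (int r = 0; r < N; r++) col[r] = block[r][c];
        idct_1d(col, tmp);
        for (int r = 0; r < N; r++) block[r][c] = tmp[r];
    }
}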
2 Multi-core processor architecture
FR1000 is a multi-core processor produced by Fujitsu and used mainly in embedded systems. It integrates four processor cores (processor elements) on one chip, and the cores share memory and other external devices. The four cores are called PM (PE0), PE1, PE2, and PE3. Each core is an independent VLIW (very long instruction word) architecture processor and is equipped with its own high-speed secondary cache to reduce the bottleneck when the cores access memory in parallel. The hardware structure of the FR1000 multi-core processor is shown in Figure 2.
FR1000 runs an independent real-time operating system (RTOS) on each processor core, and the cores communicate with each other through the parallel extension library (MP extended library). With this library, a task running on one core can communicate not only with tasks on the same core but also with tasks on other cores, so tasks can complete a given application through cooperative communication. By dividing the application into tasks that can run in parallel on different cores, data can be processed in parallel, improving system performance. The structural block diagram of the FR1000 system is shown in Figure 3.
The FR1000 architecture shows that, to improve the processing speed of graphics and multimedia data, the processor concentrates on expanding its ability to process data in parallel. This expansion has two aspects: on the one hand, each core uses the VLIW architecture (such a core can execute up to 8 instructions in parallel at a time; this instruction-level parallelism is mainly exploited by the compiler); on the other hand, the chip multi-processor (CMP) architecture lets the tasks an application has been divided into run in parallel on multiple cores (this task-level parallelism must be supported by the application and realized through appropriate task division).
3 Implementation of JPEG decoding algorithm on multi-core processor
Given the characteristics of the FR1000 processor, JPEG decoding must be divided into suitable tasks that can execute in parallel. An intuitive idea is to split the JPEG image into 4 parts and decode them on the 4 cores respectively. However, because the JPEG data stream is variable-length coded, it is difficult to split the existing stream into 4 independently decodable images (such a division costs too much time).
From the JPEG decoding principle described above, the basic unit of decoding is the MCU, so the MCU produced by the first step, entropy decoding, is the smallest unit that can be decoded in parallel. The key to parallel decoding of a JPEG image on a multi-core processor is therefore to distribute the image's MCUs to the cores in a load-balanced way.
This leads to two processing methods. The first uses a single MCU as the unit of task allocation: PM produces MCUs through entropy decoding and distributes them evenly to the processor cores (PEs), and each core writes its MCU to the corresponding position in the bitmap after decoding it. This has two advantages: ① good load balancing, with each core carrying almost the same load; ② entropy decoding and MCU decoding proceed in parallel. Its big problem is that inter-core communication costs too much time: the scheme is essentially a producer-consumer model, and every time the producer produces an MCU it must communicate with a consumer or update that consumer's input data. Actual tests showed that the resulting communication overhead was too large, accounting for more than 20% of the decoding program's running time. Another problem is memory access: because the cores write into the same memory area in an interleaved manner, write-back mode cannot be used for that area; if each core used write-back mode, the data in the cores' caches would become inconsistent with memory, causing errors.
The second method divides the image into blocks. Since the MCUs map onto the original bitmap from top to bottom and from left to right, the JPEG image is divided by height into 4 image blocks, the height of each block being an integer multiple of the MCU size. Each core then decodes one image block and writes its result into the designated memory area, so the blocks splice into the complete original bitmap.
The key to this method is how each core can quickly locate the part of the image it must decode. Since JPEG uses variable-length codes, no O(1) algorithm can locate a block through a fixed offset, but the entropy-decoding code can be modified to skip unnecessary decoding and reach the required region quickly. The locating process is really a matter of counting MCUs: the content of the MCUs need not be saved while positioning, only the number decoded, and because MCUs correspond one-to-one with the original bitmap, counting them locates the region to be processed. The specific conversion is as follows:
StartMCU(PEID) = (PEID - 1) × ⌈(Height / 8) / 4⌉ × ⌈Width / 8⌉

where Height is the image height, Width is the image width, and PEID is the number (1 to 4) of the processor core.
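A sketch of the corresponding range computation in C, assuming the 4-strip split described above (the handling of MCU-row counts not divisible by 4 is our assumption):

/* Range of MCU indices that core PEID (1..4) must decode when the image
   is split into 4 horizontal strips of whole MCU rows. */
static void pe_mcu_range(int width, int height, int peid,
                         int *first_mcu, int *last_mcu)
{
    int mcus_per_row = (width  + 7) / 8;
    int mcu_rows     = (height + 7) / 8;
    int rows_per_pe  = (mcu_rows + 3) / 4;        /* ceil(mcu_rows / 4) */
    int first_row    = (peid - 1) * rows_per_pe;
    int last_row     = first_row + rows_per_pe;
    if (last_row > mcu_rows) last_row = mcu_rows; /* last core takes the rest */
    *first_mcu = first_row * mcus_per_row;
    *last_mcu  = last_row  * mcus_per_row;        /* exclusive bound */
}

Each core then entropy-decodes from the start of the stream, merely counting MCUs until first_mcu, and decodes fully from there up to last_mcu.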
The decoding flow of each processor core is shown in Figure 4.
The cost of this method is that each core spends extra time locating the data block it must decode; actual measurement showed, however, that the locating operation consumes only about 5% of the time, because on the FR1000 platform most of the decoding time goes to the IDCT and the YUV-to-RGB color space conversion. The method greatly reduces communication time: decoding one JPEG image requires only two inter-core communications. Another advantage is that each PE can write the resulting bitmap data in write-back mode, needing only a cache-flush operation at the junctions between image blocks to guarantee a correct result. The discussion of optimization below shows that this plays a very important role in improving decoding speed.
4 Optimization
Generally speaking, the running time of a program on the multi-core processor divided by its running time on one of the single cores is called the multi-core parallelism (MP); for example, a program taking 100 Mcycles on one core and 30 Mcycles on four cores has an MP of 30%. On the 4-core FR1000, the limiting value of MP (given the necessary communication overhead) is 25%. But with the decoding flow of Figure 4, the measured MP was only about 43%. Further analysis showed that, because the cores all decode along the same flow, there are many concurrent reads and writes to memory at the same time, and this contention makes memory access the system bottleneck: reading one cache line takes 16 to 20 cycles on a single core, but 30 to 40 cycles when several cores run simultaneously.
The optimization is mainly carried out from two aspects:
① Minimize memory reads and writes. A typical JPEG decoding program saves the intermediate results of entropy decoding in units of lines, that is, it uses storage for one full row of MCUs as a temporary buffer. Such a buffer grows with the image width, and once the image is wide enough the buffer can no longer reside in the cache; cache misses multiply, producing a large number of memory accesses and cache replacements. After optimization, each MCU is taken through inverse quantization, IDCT, and color space transformation immediately after its entropy decoding, straight through to the bitmap write (see the first sketch after item ②). Then only a temporary buffer the size of one MCU is needed, and it always stays in the cache, avoiding a large number of memory reads and writes. This method does require careful treatment of boundary conditions: as mentioned above, since the image dimensions are not necessarily integer multiples of the MCU size, the bottom rows and rightmost columns contain padding data that must be discarded during decoding.
② Choose the memory write mode appropriately. The decoding program performs a large number of memory writes when the final bitmap is produced. In write-through mode, every write updates both the cache and memory, inevitably causing heavy memory traffic; therefore, write-back mode is used for the bitmap area, so memory is written only when a cache line is replaced, greatly reducing memory accesses. Note, however, that on a multi-core processor the consistency between the memory area and the cached data on each core must be guaranteed: the memory areas each core reads and writes must be properly divided, and the corresponding cache lines must be flushed with explicit instructions when the boundaries of each area are read or written (see the second sketch below).
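A sketch of the per-MCU pipeline of optimization ① in C; the stage functions are hypothetical stand-ins for the decoder's real routines:

#include <stdint.h>

typedef struct bitstream bitstream_t;  /* opaque input stream (hypothetical) */
void entropy_decode_mcu(bitstream_t *bs, int coef[64]);
void dequantize_mcu(int coef[64]);
void idct_mcu(int coef[64]);
void store_mcu_as_rgb(const int coef[64], int mcu_index, uint8_t *bitmap);

/* Each MCU is fully decoded and written out before the next one is read,
   so only one MCU-sized buffer is live and it stays resident in cache. */
void decode_strip(bitstream_t *bs, int first_mcu, int last_mcu,
                  uint8_t *bitmap)
{
    int coef[64];
    for (int m = first_mcu; m < last_mcu; m++) {
        entropy_decode_mcu(bs, coef);      /* Huffman + RLE + DC prediction  */
        dequantize_mcu(coef);              /* multiply by quantization table */
        idct_mcu(coef);                    /* 8x8 two-dimensional IDCT       */
        store_mcu_as_rgb(coef, m, bitmap); /* YUV->RGB, discard padding      */
    }
}

And a sketch of the boundary handling of optimization ②; flush_dcache_range is a hypothetical stand-in for whatever cache write-back instruction or RTOS call the FR1000 actually provides:

/* Hypothetical cache-maintenance primitive (platform-specific). */
void flush_dcache_range(void *addr, unsigned len);

/* After a core finishes its strip, write back the cache lines covering
   the rows adjacent to the neighboring core's strip, so that memory
   holds a consistent bitmap before the strips are spliced together. */
void flush_strip_boundary(uint8_t *strip_end, unsigned row_bytes)
{
    flush_dcache_range(strip_end - row_bytes, row_bytes);
}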
It is worth noting that in a multi-core architecture, because multiple cores access memory in parallel, memory easily becomes the bottleneck; this is especially pronounced in image processing programs, which involve a large number of memory operations. Program optimization should therefore focus on memory reads and writes.
5 Experimental Results
Three JPEG images of 256×256, 1024×1024, and 4096×4096 pixels were decoded; the numbers of cycles consumed are listed in Table 1.
It can be seen that for larger images the MP comes closer to the limiting value of 25%, because the communication overhead is then relatively smaller; at the same time, as the image grows, the cost of flushing the cache lines at the boundaries of the image blocks handled by each core also becomes relatively smaller. The average MP is about 28%.
6 Conclusion
Based on the characteristics of the multi-core processor architecture, a high-speed JPEG decoding algorithm was implemented on it, and its multi-core parallelism (MP) comes close to the limiting value of 25%. Although the implementation above targets the FR1000 multi-core processor, it is also suitable for other processors with multi-core architectures. In addition, the optimization methods for the multi-core architecture offer useful reference for other applications running on multi-core processors.