Research on JPEG2000 Core Algorithm and Its DSP Implementation
1 Introduction
JPEG2000 is the new-generation international standard for still image compression. It offers excellent compression performance and high image quality. It not only overcomes the blocking artifacts that the traditional JPEG still-image standard produces at high compression ratios, but also provides features such as progressive image transmission, scalable image quality, and region-of-interest coding. It can be applied to digital cameras, medical imaging, network transmission, and similar fields.
2 Basic principles of JPEG2000 standard
2.1 JPEG2000 codec framework
The JPEG2000 encoder mainly consists of preprocessing, wavelet transform, quantization, and entropy coding. Compared with encoding, the decoding process is relatively simple [1]. The block diagrams of the JPEG2000 encoder and decoder are shown in Figure 1 and Figure 2.
Figure 1 JPEG2000 encoder block diagram
Figure 2 JPEG2000 decoder block diagram
2.2 The core algorithm of JPEG2000 encoding
1) DWT transformation
Through multi-level discrete wavelet decomposition, the wavelet coefficients represent both the high-frequency detail of local regions of the image tile and its low-frequency content. In this way, more image detail can be preserved even at low bit rates. In addition, each further level of decomposition produces coefficients that represent the image at only half the horizontal and vertical resolution of the previous level, so by decoding different numbers of levels, images with different spatial resolutions can be obtained.
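As a simple quantitative illustration (the notation here is ours, not from the standard text): for an N×M tile, the low-frequency subband left after l levels of decomposition contains roughly

\[ \left\lceil \frac{N}{2^{l}} \right\rceil \times \left\lceil \frac{M}{2^{l}} \right\rceil \]

coefficients, so decoding only the first l levels reconstructs the image at 1/2^l of the full resolution in each direction.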
2) EBCOT algorithm
The basic idea of the EBCOT algorithm is as follows. The subbands produced by the wavelet transform are divided into code blocks of fixed size, and the code-block coefficients are quantized. Starting from the most significant bit plane, each bit plane of the coefficients is then scanned in three coding passes (the significance propagation pass, the magnitude refinement pass, and the cleanup pass); this bit-plane coding generates pairs of contexts and binary decisions, which are arithmetically coded with context-based arithmetic coding to form the code-block bit stream. This completes the first stage, block coding. Finally, according to parameters such as bit rate and distortion and following the rate-distortion optimality principle, an appropriate portion of each independent code-block stream is truncated and assembled into the final compressed image code stream, completing the second stage, code-stream assembly [2].
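The pass structure described above can be summarized in the following C sketch. The type and function names (code_block_t, significance_propagation_pass, and so on) are placeholders of ours, not the JPEG2000 reference software; the sketch only shows the order in which the three passes visit each bit plane.

typedef struct {
    int width, height;   /* code-block dimensions */
    int msb_plane;       /* index of the most significant non-zero bit plane */
} code_block_t;

/* The three coding passes; each emits (context, decision) pairs to the arithmetic coder. */
void significance_propagation_pass(code_block_t *cb, int bp);  /* pass 1 */
void magnitude_refinement_pass(code_block_t *cb, int bp);      /* pass 2 */
void cleanup_pass(code_block_t *cb, int bp);                   /* pass 3 */

void tier1_encode(code_block_t *cb)
{
    int bp;
    for (bp = cb->msb_plane; bp >= 0; bp--) {
        if (bp == cb->msb_plane) {
            /* the first non-zero plane is coded by the cleanup pass alone */
            cleanup_pass(cb, bp);
        } else {
            significance_propagation_pass(cb, bp);
            magnitude_refinement_pass(cb, bp);
            cleanup_pass(cb, bp);
        }
    }
}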
2.3 Improvement and implementation of block coding algorithm in EBCOT algorithm
In the JPEG2000 codec, the EBCOT algorithm is a key component. Its first stage, block coding, is the core of the whole algorithm and consumes most of the coding time: whether the compression is lossless or lossy, bit-plane coding accounts for more than 50% of the total encoding time [3][4]. Because of the heavy computation and slow speed of this first stage, optimizing and improving block coding has been a concern ever since EBCOT was proposed.
Figure 3 shows how the number of coefficients coded in each of the three passes changes from bit plane to bit plane for the Barbara image (256×256). In the figure, pass 1 is the significance propagation pass, pass 2 the magnitude refinement pass, and pass 3 the cleanup pass. In the most significant bit plane (MSB), all coefficients are coded only in the cleanup pass. Moving to lower planes, the number of coefficients coded in the significance propagation pass first increases and then gradually decreases as more and more coefficients become significant. In the low bit planes (0, 1, and 2), most coefficients are coded in the magnitude refinement pass, only a small number in the significance propagation pass, and essentially none in the cleanup pass. Throughout the encoding process, every bit plane must be scanned three times to form the contexts and determine which pass each coefficient belongs to, which significantly increases the encoding time.
Figure 3 Schematic diagram of the change in the number of coefficients encoded in the three passes of bit-plane coding
Based on the data analysis of Figure 3, this paper proposes two improved methods for bit-plane coding [5].
(1) Omitting the cleanup pass on bit planes 0, 1, and 2. As Figure 3 shows, the number of coefficients actually coded in the cleanup pass of the lower bit planes (bit planes 0, 1, and 2) is very small, almost zero, so spending time scanning and coding the cleanup pass of these planes is essentially wasted. This paper therefore proposes omitting the cleanup pass on the lower bit planes. The coding module remains essentially the same as in the standard algorithm; only the coding condition is changed.
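Reusing the placeholder types and pass functions from the sketch in section 2.2, the changed condition can be illustrated as follows; the constant LOW_PLANE_LIMIT is ours, and only the guard on the cleanup pass differs from the standard pass loop.

#define LOW_PLANE_LIMIT 3    /* bit planes 0, 1 and 2 are the "low" planes */

void tier1_encode_improved1(code_block_t *cb)
{
    int bp;
    for (bp = cb->msb_plane; bp >= 0; bp--) {
        if (bp == cb->msb_plane) {
            cleanup_pass(cb, bp);
        } else {
            significance_propagation_pass(cb, bp);
            magnitude_refinement_pass(cb, bp);
            if (bp >= LOW_PLANE_LIMIT)      /* changed condition: skip the cleanup  */
                cleanup_pass(cb, bp);       /* pass on the three lowest bit planes  */
        }
    }
}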
(2) One-pass scanning of bit planes 6 and 7. As Figure 3 shows, the amount of data coded in the significance propagation and magnitude refinement passes of the high bit planes (bit planes 6 and 7) is very low, approaching zero, while the number of coefficients coded in the cleanup pass is very high. Under the standard scanning mechanism, even high-frequency subband coefficients that have little effect on image quality are coded, and each plane from the highest to the lowest must be scanned three times. This paper proposes a one-pass scan for the higher bit planes: all coefficients of the highest and second-highest bit planes are coded in a single scan. For each coefficient, the context is formed to determine which pass it belongs to, and the coefficient is then immediately coded in that pass. In this way two scans per plane are eliminated, reducing the encoding time and improving encoding efficiency. The images processed in this paper are mainly 8-bit grayscale images, and lossy compression uses the 9/7 wavelet transform.
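The one-pass scan of the two highest bit planes can be sketched as follows, again with illustrative names: classify_pass() stands for the standard context-formation step that decides which pass a coefficient belongs to, and encode_decision() stands for the immediate arithmetic coding of that coefficient.

int  classify_pass(code_block_t *cb, int row, int col, int bp, int *ctx);
void encode_decision(code_block_t *cb, int row, int col, int bp, int pass, int ctx);

void single_scan_high_plane(code_block_t *cb, int bp)
{
    int row, col, ctx, pass;
    for (row = 0; row < cb->height; row++) {
        for (col = 0; col < cb->width; col++) {
            pass = classify_pass(cb, row, col, bp, &ctx);   /* form the context once      */
            encode_decision(cb, row, col, bp, pass, ctx);   /* and code immediately, with */
        }                                                   /* no second or third scan    */
    }
}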
A study of the compression performance shows that at low compression ratios the improved algorithm is about 0.4 dB lower (in PSNR) than the standard algorithm, while at high compression ratios the two perform identically, retaining the excellent compression performance of JPEG2000. In terms of execution time, the lossy-compression encoding time of the improved algorithm is 8% to 12% shorter than that of the standard algorithm and the decoding time is 2% to 5% shorter, so the improvement achieves its goal of higher coding efficiency.
3 DSP implementation of the improved algorithm in the JPEG2000 standard
3.1 DSP hardware development platform
The evaluation board used in this paper is the TDS642 from Beijing Wenting Company. The DSP on the board is a TMS320DM642 in a 548-pin BGA package, with an internal clock of 600 MHz, an external bus clock of 100 MHz, and a peak performance of 4800 million instructions per second.
The platform provides a rich set of peripheral interfaces. The board has two composite video (PAL/NTSC/SECAM) inputs and one composite video output; a stereo audio input/output or mono microphone input port; two UARTs, an Ethernet interface, a daughter-board interface, a PC104 interface, and a JTAG interface [6][7]. The board also provides 4 MB of Flash memory, mapped into the CE1 address space of the DM642 with an 8-bit data width. The FPGA extends three address lines, dividing the Flash into eight pages. The first half of page 0 stores the user's boot program and the second half stores the FPGA configuration; the end of page 1 is reserved as user data space; pages 2 through 7 store user programs.
3.2 DSP Implementation of the Core Algorithm
(1) Overall framework of the algorithm. When implemented on the DM642 EVM, the algorithm in this paper is divided into two main modules (as shown in Figure 4). The first is the DWT module, which transforms the input image data into a series of wavelet coefficients; the second is the EBCOT module, which encodes the quantized wavelet coefficients into a compressed bit stream. The structure of the hardware development platform is shown in Figure 5.
Figure 4 Algorithm framework diagram
Figure 5 Algorithm hardware development platform structure diagram
(2) Memory allocation. Image processing often involves a large number of complex address calculations, and these calculations can consume more CPU cycles than the actual data operations. To speed up the CPU's access to data, the memory itself must be fast, and a sensible data layout is also needed to simplify address calculation. In addition, the DM642's data-access mechanisms, such as the cache, EDMA, and wide-word direct reads and writes, all rely on the continuity of storage addresses. Based on these considerations, this paper follows several principles when allocating memory: first, use the shortest data types that still meet the accuracy requirements; second, store large data blocks, such as the original and reconstructed images, in off-chip SDRAM; third, store key and small data blocks, such as intermediate coefficients, the system stack, and the data and context-flag areas accessed frequently during the three-pass scanning, in on-chip memory; fourth, configure enough of the L2 level as cache so that the CPU can read and write data quickly; fifth, arrange data that are used together contiguously in memory. When data blocks have to be moved between on-chip and off-chip memory, the DM642's EDMA unit can perform the transfer; it works in parallel with the CPU and does not consume CPU cycles [8].
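As an illustration of how these placement rules can be expressed in code (the section names .sdram_data and .isram_data and the buffer names are ours and must match the sections defined in the board's linker command file), the TI C compiler's DATA_SECTION pragma can pin each buffer to the intended memory:

/* Large blocks go to off-chip SDRAM, small frequently used blocks to on-chip L2 RAM.
 * Section names are illustrative and must match the project's linker command file. */
#pragma DATA_SECTION(g_src_image, ".sdram_data")   /* original image, off-chip SDRAM      */
#pragma DATA_SECTION(g_rec_image, ".sdram_data")   /* reconstructed image, off-chip SDRAM */
#pragma DATA_SECTION(g_tile_buf,  ".isram_data")   /* tile coefficients, on-chip L2 RAM   */
#pragma DATA_SECTION(g_ctx_flags, ".isram_data")   /* context-flag area, on-chip L2 RAM   */

unsigned char g_src_image[512 * 512];
unsigned char g_rec_image[512 * 512];
short         g_tile_buf[128 * 128];    /* shorter data type where precision allows */
unsigned char g_ctx_flags[128 * 128];

Transfers of tiles between SDRAM and these on-chip buffers would then be handed to the EDMA, as described above.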
(3) Reading and writing of image data. Since this paper is concerned with image compression and does not involve image acquisition, the input and output of image data are handled in a simplified way. Because the CCS simulator fully supports C/C++, the original image data are supplied as a C header file, while the wavelet transform module and the EBCOT module exchange data through data files stored on the PC. Concretely, header files and binary data files are used: all pixel data of the image (excluding the file header) are written into a .h file with statements such as fprintf(fp, "%3d, ", image_in[i][j]).
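A minimal sketch of the PC-side conversion step described above might look as follows; the file names, array names, and image size are assumptions, and the image is taken to be a 512×512 8-bit grayscale file with its header already removed.

#include <stdio.h>

#define W 512
#define H 512

int main(void)
{
    static unsigned char image_in[H][W];
    FILE *fin = fopen("lena.raw", "rb");     /* headerless 8-bit grayscale data */
    FILE *fp  = fopen("image_data.h", "w");  /* header file included by the CCS project */
    int i, j;

    if (!fin || !fp) return 1;
    fread(image_in, 1, (size_t)W * H, fin);

    fprintf(fp, "unsigned char image_in[%d][%d] = {\n", H, W);
    for (i = 0; i < H; i++) {
        for (j = 0; j < W; j++)
            fprintf(fp, "%3d, ", image_in[i][j]);
        fprintf(fp, "\n");
    }
    fprintf(fp, "};\n");

    fclose(fin);
    fclose(fp);
    return 0;
}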
(4) Implementation of the DWT. Since the DM642 is a fixed-point processor and is not well suited to floating-point arithmetic, this paper uses the LeGall (5,3) integer filter to perform the wavelet transform in JPEG2000. Two buffers of the same size as the image tile are defined: an input buffer Buf holding the tile data, and a result buffer TempBuf used to hold the intermediate results of the transform. For each decomposition level, the tile data are filtered twice with the integer (5,3) low-pass and high-pass filters: the first filtering pass writes its results into TempBuf, and the second (5,3) filtering of the data in TempBuf produces the LL, LH, HL, and HH subband coefficients, which are finally written back into the input buffer Buf. If a further level of decomposition is required, only the LL subband in Buf needs to be processed in the same way.
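A minimal sketch of the one-dimensional forward LeGall (5,3) integer lifting step that would be applied to each row and column is shown below; the array names are ours, the length n is assumed even, and boundary samples use simple symmetric extension.

/* Forward LeGall (5,3) integer lifting of one row or column of length n (n even).
 * Low-pass (approximation) samples go to low[], high-pass (detail) samples to high[]. */
void dwt53_forward_1d(const int *x, int n, int *low, int *high)
{
    int i;
    int nh = n / 2;                                   /* number of detail samples */

    /* predict step: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2) */
    for (i = 0; i < nh; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];  /* symmetric extension */
        high[i] = x[2 * i + 1] - ((x[2 * i] + right) >> 1);
    }

    /* update step: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4) */
    for (i = 0; i < nh; i++) {
        int left = (i > 0) ? high[i - 1] : high[i];             /* symmetric extension */
        low[i] = x[2 * i] + ((left + high[i] + 2) >> 2);
    }
}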
(5) Implementation of the EBCOT algorithm. The EBCOT algorithm is the most time-consuming part of the JPEG2000 coding system, so optimizing it is very important for the performance of the whole system. On a PC, each coding pass in EBCOT is processed sequentially and independently. When implementing it on the DM642, this paper therefore uses parallel techniques to optimize the code and speed up execution; for example, fetching bit-plane data can proceed in parallel with constructing the context model. This increases the utilization of the DM642's functional units and exploits its parallel computing capability. The parallelism cannot be applied naively, however: by the time the context model for pass 2 is formed, the neighborhood data may already have been modified during pass 1, so such dependencies must be respected.
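As an example of the kind of loop the C6000 compiler can software-pipeline across the DM642's functional units, the following sketch extracts one bit plane of the coefficient magnitudes into a byte array; the restrict qualifiers and the MUST_ITERATE pragma are standard TI compiler hints, while the function and buffer names are ours.

#include <stdlib.h>   /* abs() */

/* Extract bit plane 'bp' of the magnitudes of n wavelet coefficients.
 * The iterations are independent, so the compiler can schedule them in parallel. */
void extract_bitplane(const short *restrict coeff, unsigned char *restrict bits,
                      int n, int bp)
{
    int i;
    #pragma MUST_ITERATE(16, , 4)   /* trip count: at least 16 and a multiple of 4 */
    for (i = 0; i < n; i++)
        bits[i] = (unsigned char)((abs(coeff[i]) >> bp) & 1);
}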
3.3 Experimental results
The experiments in this paper were run under Windows XP on an Intel Pentium 4 at 2.4 GHz with 512 MB of memory, using the CCS development environment. The program is downloaded to the DM642 EVM development board through a USB emulator. The LeGall (5,3) wavelet is used to process the 512×512 Lena and Barbara images. The measured encoding times at a compression ratio of 16:1 are shown in Table 1.
The experiment also gives the reconstructed Lena images at three compression ratios, 8:1, 16:1, and 32:1, together with their peak signal-to-noise ratios (PSNR) relative to the original image, as shown in Figure 6.
Figure 6 Barbara image encoding diagram
The data in Table 1 show that the DSP-based encoder takes longer than the PC-based one, because when the code runs on the DM642 EVM hardware platform it must continuously exchange data with the PC through the USB emulator, which adds overhead. The PSNR values in Figure 6 show that the reconstructed Lena image retains high quality even at the higher compression ratios. In terms of subjective quality, the reconstructions at compression ratios of 8:1 and 16:1 differ only slightly from the original and look good visually, while the reconstruction at 32:1 shows slight distortion. The experimental results show that the JPEG2000 encoding algorithm ported to the DSP retains good compression performance.
4 Conclusion
To achieve efficient image compression, the DWT and the EBCOT algorithm are used here, and two improvements are presented. After porting the improved algorithm to the DSP development board, the reconstructed images still show high quality at high compression ratios. The results show that the JPEG2000 encoding algorithm ported to the DSP retains good compression performance and is well suited to image compression applications.