In the new image compression standard JPEG2000, 9/7 and 5/3 lifting wavelet transforms are used as coding algorithms. The 5/3 wavelet transform is a reversible integer transform that can achieve lossless or lossy image compression. The algorithm is implemented on a general-purpose DSP chip with good scalability, upgradeability and maintainability. This method is highly flexible and can fully meet various processing requirements.
1 Lifting Algorithm
The lifting algorithm [1] was proposed by Sweldens et al. based on the Mallat algorithm and is also called the second-generation wavelet transform. Compared with the Mallat algorithm, the lifting algorithm does not rely on Fourier transform, reduces the amount of calculation and complexity, and improves the operating efficiency accordingly. Due to the characteristics of integer transform and less storage unit consumption, the lifting algorithm is very suitable for implementation on fixed-point DSP. The
basic idea of the wavelet lifting algorithm is to gradually construct a new wavelet with better properties through the basic wavelet. Its implementation steps are decomposition (split), prediction (predict) and update (update).
First, the new x(n) is obtained by symmetric extension of the original signal.
Decomposition is to divide the data into two parts: the even sequence x(2n) and the odd sequence x(2n+1);
prediction is to use the decomposed even sequence to predict the odd sequence, and the obtained prediction error is the transformed high-frequency component: H(n)=x(2n+1)-{[x(2n)+x(2n+2)]>>1};
updating is to update the even sequence with the prediction error to obtain the transformed low-frequency component: L(n)=x(2n)+{[H(n)+H(n-1)+2]>>2}.
The calculation process is shown in Figure 1.
2 Optimization strategy based on DM642
2.1 Two-level cache structure of DM642
DM642 is a processor specially designed for multimedia processing applications and is a good platform for building multimedia communication systems. It uses the C64x DSP core and the on-chip RAM adopts a two-level cache structure [4][5], which is divided into L1P, L1D and L2. L1 can only be accessed by the CPU as a cache, which is 16KB. The access cycle is consistent with the CPU cycle. L1P is directly mapped and L1D is two-way group-associated. L2 can be configured as a cache and SRAM by the program.
2.2 Improved algorithm structure
Traditional wavelet transform transforms the entire image, first transforming each row and then each column. This method is inefficient when implementing the algorithm on the DSP. Because the DSP's L1D is very small, only 16KB, it cannot cache the entire image. Therefore, the original image data is usually stored in a low-speed external memory. In this way, when the CPU reads a row of data from the L1D, it will inevitably generate a miss. A large number of misses will seriously block the CPU operation and prolong the program execution time. In order to reduce the occurrence of missing data, the traditional transformation must be improved. The original transformation of the entire image is changed to a block-by-block transformation, that is, each time a block is taken from the image, and the row and column transformations are completed successively, and then the blocks are saved in the coefficient cache according to certain rules, as shown in Figure 2.
In this method, a data block in SDRAM is first transferred to L2, then fetched to L1D for horizontal promotion, and then the block is promoted vertically. In this way, since the data required for vertical promotion is all in L1D, the occurrence of data cache misses here is avoided, and the total number of misses is greatly reduced.
2.3 Data transfer
(1) Data transfer between SDRAM and L2
Since EDMA[6][7] data transfer is independent of CPU operation, two caches are opened in L2: EDMA transfers the next block of data to InBuffB while the CPU is processing InBuffA, which solves the delay caused by the CPU reading the low-speed device SDRAM, as shown in Figure 3.
(2) Data transfer between L2 and L1D
The CPU first accesses the programs and data in the first-level CACHE. If there is no hit, it accesses the second-level CACHE (if part of L2 is configured as CACHE). If there is still no hit, it accesses the external storage space. In this process, the CPU is always in a blocked state until the read data is valid. Therefore, when the data block in L2 is promoted horizontally, the CPU will generate a miss for each line read. In response to this situation, the TMS320C64x series DSP provides a pipeline processing mechanism for L1D to handle cache misses. If there are multiple consecutive misses, the CPU waiting time will overlap, which generally reduces the CPU blocking time caused by the average miss.
Therefore, before the CPU promotes the data horizontally, the miss pipeline technology is used to read all the current data blocks into L1D, and then the data block is promoted horizontally, so that no miss will occur again and the operation speed can be improved.
2.4 L1P and L1D performance optimization
L1D is two-way group-related, each group is 8KB, and the total capacity is 16KB. The data processed by the CPU at one time should not exceed 8KB, and all the original data are stored continuously in the same CACHE group; the intermediate process data of the program is retained in another pre-allocated CACHE group.
After the data is read into L1D, it is first expanded from 8 bits to 16 bits, and then the data is horizontally promoted. As long as the data can be retained in L1D, the subsequent vertical promotion can completely avoid missing. Therefore, the size of the data block is determined by the intermediate process data. The sum of all intermediate process data cannot exceed 8KB, and the selected data block is 32×32.
When multiple functions are mapped to the same CACHE line of L1P, conflict missing will occur, so these functions must be placed reasonably. Since the total sum of all functions that implement promotion does not exceed 16KB, if these functions can be arranged in a continuous storage space, L1P missing caused by conflicts can be completely avoided. You can add a GROUP to the SECTIONS of the cmd[8] file, and then put the frequently called functions into the GROUP:
SECTIONS
{
GROUP > ISRAM
{
.text:_horz
.text:_vert
.text:_IMG_pix_pand
…
}…}
2.5 Program Optimization
From the previous analysis, it can be seen that when performing lifting wavelet transform on an image, its four boundaries need to be extended. The extension method uses the symmetrical extension shown in Figure 1, where one more point needs to be extended on the left and top. When performing lifting transform on a block in an image, the extension should be the boundary data of the four blocks adjacent to the block, as shown in Figure 4.
Boundary extension is mainly used to calculate high-frequency coefficients. Analysis shows that when lifting horizontally, the last high-frequency coefficient of each row of the current data block is the same as the first high-frequency coefficient of the next block in the same row. Therefore, as long as these coefficients of the current block are saved, the first high-frequency coefficient does not need to be calculated again when lifting horizontally on the next block, so there is no need to extend its left boundary. The same is true for lifting in the vertical direction. Two arrays are added to the program to store the last high-frequency coefficient of each row and column of the current block respectively. This method can reduce the complexity of the program, improve execution efficiency, and reduce the occurrence of missing.
The pixel expansion function pix_pand[9] uses TI's IMGLIB algorithm library. The horizontal lifting and vertical lifting functions are both written by the author in linear assembly language, making full use of the half-word processing instructions of the 64x series DSP and using half-word packing technology to maximize the execution efficiency of the program.
When lifting horizontally, the data of each row is reordered to become the structure shown in Figure 5.
Using the C64x's ADD2, SHR2, and SUB2 halfword processing instructions, the following two operations are executed in parallel:
H(1)=B-[(A+C)>>1]
H(2)=D-[(C+E)>>1]
When lifting vertically, multiple columns of calculations can be arranged to be executed in parallel, as shown in Figure 6.
H1(1)=B1-[(A1+C1)>>1]
H2(1)=B2-[(A2+C2)>>1]
3 Simulation results
Table 1 lists the number of misses generated when the CPU reads L1D. Among them, misses in the horizontal direction are inevitable. Since the right and bottom of the data block need to be extended, the number of misses in the horizontal direction is slightly higher than that of the traditional method; in the vertical direction, the algorithm completely avoids the occurrence of misses.
Table 2 lists the computational performance of several methods. Since this paper uses a variety of optimization techniques, the computational speed is increased by 4 to 10 times. Experiments have shown that these methods are very effective.
Previous article:Research on the Application of CS4235 in DSP Embedded System
Next article:Analysis on PHS Network Optimization of DSP Dual-mode Mobile Phone
- Popular Resources
- Popular amplifiers
- USB Type-C® and USB Power Delivery: Designed for extended power range and battery-powered systems
- ROHM develops the second generation of MUS-IC™ series audio DAC chips suitable for high-resolution audio playback
- ADALM2000 Experiment: Transformer-Coupled Amplifier
- High signal-to-noise ratio MEMS microphone drives artificial intelligence interaction
- Advantages of using a differential-to-single-ended RF amplifier in a transmit signal chain design
- ON Semiconductor CEO Appears at Munich Electronica Show and Launches Treo Platform
- ON Semiconductor Launches Industry-Leading Analog and Mixed-Signal Platform
- Analog Devices ADAQ7767-1 μModule DAQ Solution for Rapid Development of Precision Data Acquisition Systems Now Available at Mouser
- Domestic high-precision, high-speed ADC chips are on the rise
- Intel promotes AI with multi-dimensional efforts in technology, application, and ecology
- ChinaJoy Qualcomm Snapdragon Theme Pavilion takes you to experience the new changes in digital entertainment in the 5G era
- Infineon's latest generation IGBT technology platform enables precise control of speed and position
- Two test methods for LED lighting life
- Don't Let Lightning Induced Surges Scare You
- Application of brushless motor controller ML4425/4426
- Easy identification of LED power supply quality
- World's first integrated photovoltaic solar system completed in Israel
- Sliding window mean filter for avr microcontroller AD conversion
- What does call mean in the detailed explanation of ABB robot programming instructions?
- STMicroelectronics discloses its 2027-2028 financial model and path to achieve its 2030 goals
- 2024 China Automotive Charging and Battery Swapping Ecosystem Conference held in Taiyuan
- State-owned enterprises team up to invest in solid-state battery giant
- The evolution of electronic and electrical architecture is accelerating
- The first! National Automotive Chip Quality Inspection Center established
- BYD releases self-developed automotive chip using 4nm process, with a running score of up to 1.15 million
- GEODNET launches GEO-PULSE, a car GPS navigation device
- Should Chinese car companies develop their own high-computing chips?
- Infineon and Siemens combine embedded automotive software platform with microcontrollers to provide the necessary functions for next-generation SDVs
- Continental launches invisible biometric sensor display to monitor passengers' vital signs
- Bluetooth is down: BIAS attacks threaten all mainstream Bluetooth chips
- Will there be conflicts between Wi-Fi and Bluetooth in the same frequency band?
- Analog signal isolation GP9303+GP8101 solution
- A heating unit power supply solution with constant and adjustable output power based on TPS61022
- Off-road vehicle modification 2
- Who remembers a video about the principles of network communication?
- There is a new breakthrough in small lithium-ion secondary batteries!
- [Sipeed LicheeRV 86 Panel Review] 7 - lvgl Solution to the problem of incorrect image color display
- A Qualcomm "contract engineer" who was fired after seven years committed suicide by jumping off a building.
- Wireless communication technology-NB-IoT