Optimization of 5/3 lifting wavelet on DM642

Publisher: baiyuguoji | Last updated: 2013-09-25

In the new image compression standard JPEG2000, the 9/7 and 5/3 lifting wavelet transforms serve as the core transform algorithms. The 5/3 wavelet transform is a reversible integer transform that supports both lossless and lossy image compression. Implementing it on a general-purpose DSP chip offers good scalability, upgradability, and maintainability, and the approach is flexible enough to meet a wide range of processing requirements.
1 Lifting Algorithm
The lifting algorithm [1] was proposed by Sweldens et al. on the basis of the Mallat algorithm and is also called the second-generation wavelet transform. Compared with the Mallat algorithm, the lifting scheme does not rely on the Fourier transform, reduces the amount of computation and the complexity, and accordingly improves efficiency. Because it is an integer transform and consumes few storage units, the lifting algorithm is well suited to implementation on a fixed-point DSP.
The basic idea of the lifting scheme is to gradually construct a new wavelet with better properties from a basic wavelet. Its implementation consists of three steps: split, predict, and update.
First, a new sequence x(n) is obtained by symmetric extension of the original signal.
Split: divide the data into two parts, the even sequence x(2n) and the odd sequence x(2n+1).
Predict: use the even sequence to predict the odd sequence; the prediction error is the high-frequency component of the transform: H(n) = x(2n+1) - {[x(2n) + x(2n+2)] >> 1}.
Update: update the even sequence with the prediction errors to obtain the low-frequency component: L(n) = x(2n) + {[H(n) + H(n-1) + 2] >> 2}.
The calculation process is shown in Figure 1.
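The three steps above can be sketched in portable C (a minimal host-side illustration of the formulas, not the author's linear-assembly implementation; the function name is made up for the example):

```c
/* Forward 5/3 lifting of one row of n samples (n even).
 * Predict: H(k) = x(2k+1) - ((x(2k) + x(2k+2)) >> 1)
 * Update:  L(k) = x(2k)   + ((H(k) + H(k-1) + 2) >> 2)
 * Symmetric extension supplies x(n) -> x(n-2) and H(-1) -> H(0). */
static void lift53_row(const short *x, short *low, short *high, int n)
{
    int k, half = n / 2;

    /* Predict step: high-frequency (detail) coefficients */
    for (k = 0; k < half; k++) {
        int right = (2 * k + 2 < n) ? x[2 * k + 2] : x[n - 2];
        high[k] = (short)(x[2 * k + 1] - ((x[2 * k] + right) >> 1));
    }

    /* Update step: low-frequency (approximation) coefficients */
    for (k = 0; k < half; k++) {
        int left = (k > 0) ? high[k - 1] : high[0];
        low[k] = (short)(x[2 * k] + ((high[k] + left + 2) >> 2));
    }
}
```

Because every operation is an integer add, subtract, or shift, the transform is exactly reversible: running the two steps with inverted signs, in the opposite order, recovers x(n) bit for bit.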


2 Optimization strategy based on DM642
2.1 Two-level cache structure of DM642
The DM642 is a processor designed specifically for multimedia processing and is a good platform for building multimedia communication systems. It uses the C64x DSP core, and its on-chip RAM adopts a two-level cache structure [4][5], divided into L1P, L1D and L2. The L1 caches can be accessed only by the CPU; L1P and L1D are 16 KB each, and their access time matches the CPU cycle. L1P is direct-mapped and L1D is two-way set-associative. L2 can be configured by the program as a mix of cache and SRAM.
2.2 Improved algorithm structure
The traditional wavelet transform processes the entire image, first transforming every row and then every column. This approach is inefficient when the algorithm is implemented on a DSP: L1D is only 16 KB and cannot hold the whole image, so the original image data usually resides in low-speed external memory. Whenever the CPU reads a row that is not in L1D, a miss occurs, and a large number of misses seriously stalls the CPU and prolongs program execution. To reduce misses, the traditional transform must be modified: instead of transforming the whole image, it is transformed block by block. Each time, one block is taken from the image, its row and column transforms are completed in turn, and the results are then stored in the coefficient buffer according to fixed rules, as shown in Figure 2.


In this method, a data block in SDRAM is first transferred to L2, then fetched into L1D for horizontal lifting, after which the block is lifted vertically. Since all the data required for vertical lifting is already in L1D, data cache misses are avoided in this phase, and the total number of misses is greatly reduced.
2.3 Data transfer
(1) Data transfer between SDRAM and L2
Since EDMA [6][7] transfers are independent of CPU operation, two buffers are allocated in L2: while the CPU is processing InBuffA, EDMA transfers the next block of data into InBuffB (ping-pong buffering). This hides the delay of reading the low-speed SDRAM, as shown in Figure 3.
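The ping-pong arrangement can be sketched as follows (a plain-C host-side simulation: `start_edma_copy` stands in for programming and synchronizing a real EDMA channel, the buffer names InBuffA/InBuffB follow the text, and the block size and processing routine are placeholders):

```c
#include <string.h>

#define BLOCK_BYTES 1024  /* one 32x32 block of 8-bit pixels */

/* Stand-in for submitting an EDMA transfer; on the DM642 this would
 * program an EDMA channel and return immediately while the CPU works. */
static void start_edma_copy(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, BLOCK_BYTES); /* host-side simulation only */
}

/* Placeholder for the per-block work (e.g. the lifting transform);
 * here it just sums the pixels so the loop has a checkable result. */
static long process_block(const unsigned char *blk)
{
    long sum = 0;
    int i;
    for (i = 0; i < BLOCK_BYTES; i++)
        sum += blk[i];
    return sum;
}

/* Ping-pong loop: while the CPU processes one buffer, EDMA fills the
 * other, so the SDRAM transfer overlaps with computation. */
static long transform_image(const unsigned char *sdram, int nblocks)
{
    static unsigned char InBuffA[BLOCK_BYTES], InBuffB[BLOCK_BYTES];
    unsigned char *work = InBuffA, *fill = InBuffB;
    long total = 0;
    int b;

    start_edma_copy(work, sdram);          /* prime the first buffer */
    for (b = 0; b < nblocks; b++) {
        if (b + 1 < nblocks)               /* start fetching next block */
            start_edma_copy(fill, sdram + (b + 1) * BLOCK_BYTES);
        total += process_block(work);      /* overlaps with the transfer */
        /* swap ping and pong (on real hardware, after the EDMA
         * completion interrupt or a poll of the channel status) */
        unsigned char *t = work; work = fill; fill = t;
    }
    return total;
}
```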


(2) Data transfer between L2 and L1D
The CPU first accesses program and data in the first-level cache. On a miss it accesses the second-level cache (if part of L2 is configured as cache), and on a further miss it accesses external memory. Throughout this process the CPU stalls until the read data becomes valid, so when the data block in L2 is lifted horizontally, the CPU takes a miss on every line read. To address this, the TMS320C64x series provides a miss-pipelining mechanism for L1D: when several misses occur back to back, their wait times overlap, which reduces the average CPU stall per miss.
Therefore, before the CPU lifts the data horizontally, miss pipelining is exploited to read the entire current data block into L1D; the block is then lifted horizontally without further misses, improving execution speed.
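A minimal "touch loop" illustrating the idea (portable C; the 64-byte line size matches the C64x L1D, and on the DSP the volatile read of one byte per line is what triggers the overlapped, pipelined misses — the function name and its return value are for illustration only):

```c
#define L1D_LINE_SIZE 64 /* C64x L1D cache line size in bytes */

/* Touch one byte per cache line before the horizontal lifting pass,
 * so the misses overlap in the L1D miss pipeline instead of stalling
 * the CPU once per line during the actual computation. */
static int touch(const unsigned char *buf, int bytes)
{
    volatile const unsigned char *p = buf;
    int i, lines = 0;

    for (i = 0; i < bytes; i += L1D_LINE_SIZE) {
        (void)p[i];   /* the read forces this line into L1D */
        lines++;
    }
    return lines;     /* number of cache lines fetched */
}
```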
2.4 L1P and L1D performance optimization
L1D is two-way set-associative, with two 8 KB ways and a total capacity of 16 KB. The data processed by the CPU at one time should therefore not exceed 8 KB, with all the original data stored contiguously so that it maps into one cache way, while the program's intermediate data is kept in the other, pre-allocated way.
After the data is read into L1D, each sample is first expanded from 8 bits to 16 bits, and the data is then lifted horizontally. As long as the data stays resident in L1D, the subsequent vertical lifting avoids misses entirely. The block size is therefore bounded by the intermediate data: its total must not exceed 8 KB, and a block size of 32×32 is chosen.
When several functions map to the same L1P cache line, conflict misses occur, so these functions must be placed carefully. Since the combined size of all the lifting functions does not exceed 16 KB, arranging them in a contiguous storage region completely avoids conflict-induced L1P misses. This is done by adding a GROUP to the SECTIONS directive of the linker command (.cmd) file [8] and placing the frequently called functions inside it:
SECTIONS
{
    GROUP > ISRAM
    {
        .text:_horz
        .text:_vert
        .text:_IMG_pix_pand
    }
    …
}
2.5 Program Optimization
As the preceding analysis shows, performing the lifting wavelet transform on an image requires extending its four boundaries. The extension uses the symmetric scheme shown in Figure 1, with one extra point needed on the left and at the top. When lifting a block within an image, the extension data should come from the boundary data of the four blocks adjacent to it, as shown in Figure 4.


Boundary extension is needed mainly for computing the high-frequency coefficients. Analysis shows that during horizontal lifting, the last high-frequency coefficient of each row of the current data block is identical to the first high-frequency coefficient of the next block in the same row. Hence, if these coefficients of the current block are saved, the first high-frequency coefficient of the next block need not be recomputed during its horizontal lifting, and its left boundary need not be extended. The same holds for lifting in the vertical direction. Two arrays are added to the program to store the last high-frequency coefficient of each row and of each column of the current block. This reduces program complexity, improves execution efficiency, and reduces misses.
The pixel expansion function pix_pand [9] comes from TI's IMGLIB algorithm library. The horizontal and vertical lifting functions were both written by the author in linear assembly, making full use of the C64x half-word instructions and half-word packing to maximize execution efficiency.
When lifting horizontally, the data of each row is reordered to become the structure shown in Figure 5.


Using the C64x's ADD2, SHR2, and SUB2 halfword processing instructions, the following two operations are executed in parallel:
 H(1)=B-[(A+C)>>1]
 H(2)=D-[(C+E)>>1]
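The packed half-word trick can be emulated in portable C as below (a sketch: on the DM642 itself, the TI compiler intrinsics _add2, _sub2 and _shr2 would perform each operation on both 16-bit halves of a 32-bit register in one instruction; the helper names here are made up):

```c
#include <stdint.h>

/* Portable stand-ins for the C64x ADD2 / SUB2 / SHR2 instructions:
 * each treats a 32-bit word as two independent signed 16-bit halves. */
static uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
static uint32_t add2(uint32_t a, uint32_t b)
{
    return pack2((int16_t)(uint16_t)(a >> 16) + (int16_t)(uint16_t)(b >> 16),
                 (int16_t)(uint16_t)a + (int16_t)(uint16_t)b);
}
static uint32_t sub2(uint32_t a, uint32_t b)
{
    return pack2((int16_t)(uint16_t)(a >> 16) - (int16_t)(uint16_t)(b >> 16),
                 (int16_t)(uint16_t)a - (int16_t)(uint16_t)b);
}
static uint32_t shr2(uint32_t a, int s) /* arithmetic shift of each half */
{
    return pack2((int16_t)((int16_t)(uint16_t)(a >> 16) >> s),
                 (int16_t)((int16_t)(uint16_t)a >> s));
}

/* Two high-pass coefficients computed by one packed sequence:
 * H(1) = B - ((A + C) >> 1),  H(2) = D - ((C + E) >> 1) */
static uint32_t predict2(int16_t A, int16_t B, int16_t C,
                         int16_t D, int16_t E)
{
    uint32_t odd   = pack2(D, B);  /* the two x(2n+1) samples */
    uint32_t left  = pack2(C, A);  /* the two x(2n)   samples */
    uint32_t right = pack2(E, C);  /* the two x(2n+2) samples */
    return sub2(odd, shr2(add2(left, right), 1)); /* H(2):H(1) packed */
}
```

The same three packed operations thus produce two high-frequency coefficients per iteration, which is what halves the inner-loop count on the C64x.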
When lifting vertically, multiple columns of calculations can be arranged to be executed in parallel, as shown in Figure 6.
 H1(1)=B1-[(A1+C1)>>1]
 H2(1)=B2-[(A2+C2)>>1]
 

 

3 Simulation results
Table 1 lists the number of misses generated when the CPU reads L1D. Misses in the horizontal direction are unavoidable, and because the right and bottom of each data block must be extended, the horizontal miss count is slightly higher than in the traditional method; in the vertical direction, however, the algorithm eliminates misses entirely.


Table 2 lists the computational performance of several methods. By combining the optimization techniques described above, this paper increases the computation speed by a factor of 4 to 10. The experiments show that these methods are very effective.
