Optimized Design of MP3 Decoding Based on ARM946E Processor

Publisher: 电子创新者  Latest update time: 2010-02-22  Source: 现代电子技术

0 Introduction

MP3 (MPEG-1 Audio Layer 3) is an audio file format based on the Moving Picture Experts Group (MPEG) compression standard. Its compression ratio varies with the sampling frequency, the bit rate and the channel mode, and can reach about 12:1. One minute of CD music encoded as MP3 occupies about 1 megabyte, while its timbre and sound quality remain largely intact. With the growing popularity of digital music, MP3 is no longer limited to MPEG video applications; it is used as an independent digital audio compression technology on computers, networks and all kinds of electronic devices. Most MP3 players on the market are built around a DSP or a dedicated chip: decoding is done in hardware or with dedicated algorithms and has good real-time performance. Consumer electronics, however, are moving toward more functions at lower cost. As ARM9 processors become more capable, it is now feasible to decode MP3 in software on the system's own processor, and a software implementation also makes product upgrades and maintenance easier. Embedded software MP3 decoders can therefore be expected to find increasingly wide use. Based on an analysis of the MPEG-1 Audio Layer 3 decoding algorithm, this article proposes a software optimization method for implementing the decoder on the ARM946E processor.

1 MPEG Audio Layer3 decoding process

The MP3 decoding algorithm flow is shown in Figure 1.

Figure 1. MP3 decoding algorithm flow

The main processes include: data stream decoding, Huffman decompression, inverse quantization and reordering, stereo decoding, IMDCT and sub-band synthesis operations, etc. Among them, Huffman decoding and inverse quantization, IMDCT and sub-band synthesis occupy the most CPU and memory resources in the MP3 decoding process, and are the key to implementing software decoding in embedded systems.

2 ARM946E processor

The ARM946E processor is a synthesizable version of the ARM9 core with E extensions, executing v5TE architecture instructions. It uses a 5-stage pipeline, and the memory system is redesigned according to the Harvard architecture, with independent data and instruction buses. It has a memory subsystem to improve system performance and support large operating systems.

As shown in Figure 2, the memory subsystem includes a memory protection unit (MPU), a cache, and a write buffer; the CPU is connected to the system memory through this subsystem.

Figure 2. ARM9E internal memory subsystem

Compared with ARM7, the performance gains of ARM9E come mainly from a higher operating frequency, improved hardware features and more efficient instruction execution. In addition, ARM9E integrates lightweight DSP processing capabilities, which deliver very practical DSP performance at very low cost (although extending the CPU's functions does require additional hardware). Making full use of these chip resources is the key to optimizing MP3 decoding.

3 Algorithm Optimization

The computationally heavy stages of MP3 decoding, namely Huffman decoding and inverse quantization, IMDCT, and subband synthesis, are each optimized at the algorithm level.

3.1 Fixed-length redundant table lookup Huffman decoding algorithm

A conventional Huffman decoder examines the bit stream one bit at a time from beginning to end and decodes by table lookup and comparison. That is, it must separate Huffman codewords of different lengths from the one-dimensional bit stream and then perform a costly matching step.

Because the codeword lengths differ across the Huffman code tables in Layer III, the search time for codewords increases. The fixed-length redundant table lookup method expands the Huffman lookup table and reads a fixed-length N-bit segment of the code stream as the search index at each step. Each entry of the lookup table holds either a jump pointer or a code value. If the indexed entry is a jump pointer, it gives the number of bits to read next and the node to jump to; that many bits are then taken from the code stream and the search continues from the jump node, repeating until the Huffman code value is found. The lookup table is implemented with a union data structure, which reduces the space occupied by the Huffman table. For a Huffman code of length l, the traditional algorithm requires l shift operations and l comparisons, while the fixed-length lookup method requires only ⌈l/N⌉ lookups and ⌈l/N⌉ comparisons.
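The lookup scheme described above can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the four-symbol code (A = 0, B = 10, C = 110, D = 111), the root index width of 2 bits, and the names HuffNode, BitReader and huff_decode are all assumptions made for the example, and the table is not one of the Layer III tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table entry is either a decoded value or a jump to a sub-table.
 * Both variants start with the same flag, so it may be read through either. */
typedef union {
    struct { uint8_t is_jump; uint8_t len;  uint16_t value;  } leaf; /* len: bits to consume */
    struct { uint8_t is_jump; uint8_t bits; uint16_t offset; } jump; /* bits: next index width */
} HuffNode;

#define ROOT_BITS 2u  /* hypothetical fixed index width N */

/* Toy code: A=0, B=10, C=110, D=111. Short codes are stored redundantly. */
static const HuffNode table[] = {
    { .leaf = { 0, 1, 'A' } },  /* index 00: code 0, second bit unused */
    { .leaf = { 0, 1, 'A' } },  /* index 01: code 0, second bit unused */
    { .leaf = { 0, 2, 'B' } },  /* index 10 */
    { .jump = { 1, 1, 4 } },    /* index 11: read 1 more bit, sub-table at 4 */
    { .leaf = { 0, 1, 'C' } },  /* 11 + 0 */
    { .leaf = { 0, 1, 'D' } },  /* 11 + 1 */
};

typedef struct {
    const uint8_t *data;  /* bit stream, most significant bit first */
    size_t pos;           /* current bit position */
    size_t nbits;         /* total number of valid bits */
} BitReader;

/* Return the next n bits without consuming them; bits past the end read as 0. */
static unsigned peek_bits(const BitReader *br, unsigned n)
{
    unsigned v = 0;
    assert(n <= 16);
    for (unsigned i = 0; i < n; i++) {
        size_t p = br->pos + i;
        unsigned bit = 0;
        if (p < br->nbits)
            bit = (br->data[p >> 3] >> (7 - (p & 7))) & 1u;
        v = (v << 1) | bit;
    }
    return v;
}

/* Decode one symbol with one table lookup per level instead of one per bit. */
int huff_decode(BitReader *br)
{
    unsigned base = 0, bits = ROOT_BITS;
    for (;;) {
        const HuffNode *n = &table[base + peek_bits(br, bits)];
        if (!n->leaf.is_jump) {
            br->pos += n->leaf.len;   /* consume only the bits of the code */
            return n->leaf.value;
        }
        br->pos += bits;              /* the whole index was part of the code */
        base = n->jump.offset;
        bits = n->jump.bits;
    }
}
```

In this toy table the 3-bit codes need two lookups (⌈3/2⌉) rather than three bit-by-bit steps; a larger N trades table size for fewer lookups.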

Table 1 and Table 2 are examples of Huffman decoding:

Table 1. Huffman decoding table

Table 2. Huffman augmented lookup table

[equation]

The amount of calculation can be reduced by half.

The subband synthesis filter includes 32-point to 64-point IMDCT processing in the decoding process, as shown in equation (3):

[equation (3)]

Since N(i)(k) has symmetric characteristics, it can be concluded that:

[equation]

It is sufficient to calculate the V(i) value in the range of 0≤i
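For reference, the 32-to-64-point matrixing step of the synthesis filter, as defined in the MPEG-1 audio standard (ISO/IEC 11172-3), and the symmetries that follow from it can be written as below. The notation S(k) for the 32 subband samples is taken from the standard and is an assumption here; the authors' own equations may be written differently.

```latex
% Matrixing step of the synthesis filter (ISO/IEC 11172-3 notation)
V(i) = \sum_{k=0}^{31} N(i)(k)\, S(k), \qquad
N(i)(k) = \cos\!\left[\frac{(16+i)(2k+1)\pi}{64}\right], \qquad 0 \le i \le 63

% The angles for i and 32-i sum to (2k+1)\pi, an odd multiple of \pi:
N(32-i)(k) = -N(i)(k) \;\Rightarrow\; V(32-i) = -V(i), \qquad 0 \le i \le 32

% The angles for i and 96-i sum to 2(2k+1)\pi, a multiple of 2\pi:
N(96-i)(k) = N(i)(k) \;\Rightarrow\; V(96-i) = V(i), \qquad 33 \le i \le 63
```

Under these identities V(16) = 0 and only 32 of the 64 outputs (for example i = 17, ..., 48) need to be computed directly; the rest follow by copying or sign change.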

4 Code Optimization

Based on the hardware characteristics of the ARM946E processor, the key routines with strict real-time requirements are optimized at both the C and ARM assembly levels.

4.1 Down-counting loop

There are multiple loop operations in the two parts with the largest computational workload, IMDCT and subband synthesis filter bank. In order to improve execution efficiency, it is recommended to use a count-down loop.

As shown in Table 3, for a fixed number of loops, the down-count loop is faster than the up-count loop. This is because each up-count loop has 3 instructions outside the body, while the down-count loop has only 2 instructions outside the body. The termination condition of the down-count loop is when the count is down to zero, not when the count is increased to a certain limit value. Since the down-count result is stored in the instruction condition flag, the instruction to compare with zero is omitted.
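The two loop forms can be sketched in C as follows. The function names sum_up and sum_down are illustrative only, and the instruction sequences mentioned in the comments describe what an ARM compiler can typically generate, not a guaranteed result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Up-counting loop: per iteration the loop control typically needs an
 * increment, a compare against n, and a conditional branch. */
int32_t sum_up(const int32_t *x, size_t n)
{
    int32_t acc = 0;
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Down-counting loop: the decrement itself can set the condition flags
 * (SUBS on ARM), so a conditional branch can follow with no separate compare. */
int32_t sum_down(const int32_t *x, size_t n)
{
    int32_t acc = 0;
    assert(x != NULL || n == 0);
    for (size_t i = n; i != 0; i--)
        acc += *x++;
    return acc;
}
```

Both functions visit the elements in the same order and return the same result; only the loop counter runs in the opposite direction.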

Table 3. Comparison of up-counting and down-counting loop bodies

4.2 Inline functions and inline assembly

The fixed-point multiplication in the MP3 decoding algorithm is implemented as a function call. Each call takes 23 to 28 clock cycles, of which more than 15 are spent saving and restoring the PC and registers on the stack. With an inline function (declared with the keyword __inline) or a macro, the code is expanded in place at compile time and this overhead disappears. In addition, the armcc compiler allows inline assembly in C source files, at the cost of portability. Inline functions that contain assembly give access to ARM instructions and optimizations that the C compiler does not normally use effectively, such as the ARM v5E extension instructions. Implementing the shift-and-multiply operation as an inline function with inline assembly reduces its average cost to 6 to 8 clock cycles.
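A portable sketch of such an inlined fixed-point multiply is shown below. It uses standard C `static inline` rather than armcc's `__inline` and plain C rather than inline assembly, so it illustrates the idea only; the names FRAC_BITS, FIX_ONE and fix_mul are assumptions, and the choice of 28 fractional bits follows the 28-bit fixed-point format mentioned in this article.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 28                          /* assumed fraction width */
#define FIX_ONE   ((int32_t)1 << FRAC_BITS)   /* 1.0 in this format */

/* Right-shifting a negative value is implementation-defined in C;
 * this sketch assumes an arithmetic shift and checks it at compile time. */
static_assert((-1 >> 1) == -1, "arithmetic right shift assumed");

/* Multiply two fixed-point values: form the 64-bit product, then shift
 * back to FRAC_BITS fractional bits. Being inline, the body is expanded
 * at the call site, so no call/return or stack traffic is needed. */
static inline int32_t fix_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}
```

For example, fix_mul(FIX_ONE / 2, FIX_ONE / 2) yields FIX_ONE / 4, that is 0.5 × 0.5 = 0.25. On an ARM target, a compiler would typically implement the 64-bit product with a single SMULL followed by shifts.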

4.3 Application of ARM DSP extension instructions

The ARM946E processor supports the ARM DSP extension instructions, which fall into three main groups:

(1) Single-cycle 16×16 and 32×16 MAC operations;

(2) Saturating versions of the existing arithmetic instructions;

(3) The Count Leading Zeros (CLZ) instruction, which improves the performance of normalization, floating-point emulation, and division.
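The behaviour of the second and third groups can be modelled in portable C as below. These are reference models for illustration, not the instructions themselves: clz32 mimics CLZ (which returns 32 for a zero input), qadd32 mimics the saturating add QADD, and normalize32 is a hypothetical helper showing how a leading-zero count is used for normalization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of CLZ: number of zero bits above the most significant set bit. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Model of QADD: add with the result clamped to the 32-bit signed range
 * instead of wrapping around on overflow. */
int32_t qadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Normalize x so that its top bit is set; the shift used is returned
 * through *shift. A zero input is returned unchanged with shift 0. */
uint32_t normalize32(uint32_t x, unsigned *shift)
{
    unsigned s = (x == 0) ? 0 : clz32(x);
    assert(shift != NULL);
    *shift = s;
    return x << s;
}
```

On the processor each of clz32 and qadd32 maps to a single instruction, whereas the loop and the range checks above cost several instructions in plain C.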

The ARM946E core has no built-in floating-point unit, so decoding uses fixed-point arithmetic. After testing and analysis, a fixed-point precision of 28 bits was selected; it gives good decoded sound quality, without audible popping caused by truncation error.

To perform a comparable multiplication, ARM's SMULL (32×32) instruction takes 3 cycles, whereas the ARM DSP extension instruction SMULWT (32×16) takes only 1 cycle. Because one multiplier operand has only 16 bits of precision, the result differs slightly. However, MP3 decoding operates on 28-bit fixed-point values throughout, and the usual operation multiplies an intermediate result by an entry of a fixed-point table. If the high 16 bits of the table entry are used as the 16-bit operand, the error of the result is within 1 bit.
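The arithmetic performed by SMULWT can be modelled in portable C as follows. This is a reference model under the architectural definition of the instruction (a 32-bit operand times the top halfword of the other operand, keeping the upper 32 bits of the 48-bit product); the function names are illustrative, and the sketch does not attempt to reproduce the error figure quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Right-shifting a negative value is implementation-defined in C;
 * this sketch assumes an arithmetic shift and checks it at compile time. */
static_assert((-1 >> 1) == -1, "arithmetic right shift assumed");

/* Model of SMULWT: multiply a by the signed top 16 bits of b and keep
 * the upper 32 bits of the 48-bit product. */
int32_t smulwt_model(int32_t a, int32_t b)
{
    int16_t b_hi = (int16_t)(b >> 16);
    return (int32_t)(((int64_t)a * b_hi) >> 16);
}

/* Full-precision reference: upper 32 bits of the 64-bit product a * b,
 * as obtained from the high word of SMULL. */
int32_t mul32_hi(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

When the low 16 bits of the table value b are zero, the two functions return identical results; otherwise smulwt_model ignores the contribution of those low bits, which is the source of the small difference discussed above.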

To verify the effect of using the ARM DSP extension instructions, a decoding test was run at a system frequency of 120 MHz on files with a bit rate of 128 kb/s. The test files are listed in Table 4.

Table 4. Test files

The bit rate of all three MP3 test files is 128 kb/s. The decoding analysis results for these three MP3 songs are shown in Figure 3.

Figure 3. Decoding analysis results of three MP3 songs

Experiments show that decoding with the ARM DSP extension instructions is on average 17.5% faster than decoding with general ARM instructions, with no audible difference in sound quality in subjective listening.

5 Conclusion

This work makes full use of the DSP extension instructions of the ARM946E processor to improve the execution efficiency of the program code, and it optimizes and simplifies the algorithms of three key modules (Huffman decoding, the IMDCT, and synthesis subband filtering), reducing the amount of computation in each. The code is also optimized at the C and ARM assembly levels, which yields better real-time MP3 decoding performance.
