0 Introduction
MP3 (MPEG-1 Audio Layer III) is an audio file format based on the Moving Picture Experts Group (MPEG) compression standard. Its compression ratio depends on the sampling frequency, bit rate, and channel mode, and can reach 12:1. One minute of CD music encoded as MP3 occupies about 1 MB, while timbre and sound quality remain largely intact. With the growing popularity of digital music, MP3 is no longer confined to MPEG video applications; it has become an independent digital audio compression technology on computers, networks, and electronic devices of all kinds.
Most MP3 players on the market are built on DSPs or dedicated chips. Decoding is done in hardware or by dedicated algorithms and has good real-time performance. Consumer electronics, however, are moving toward more functions at lower cost. As ARM9 processors grow more capable, it becomes feasible to decode MP3 in software on the system's own processor, and a software implementation is also easier to upgrade and maintain. Embedded MP3 software decoders can therefore be expected to see increasingly wide use. Based on an analysis of the MPEG-1 Audio Layer III decoding algorithm, this article proposes a software optimization method for implementing the decoder on the ARM946E processor.
1 MPEG Audio Layer III decoding process
The MP3 decoding algorithm flow is shown in Figure 1.
The main stages are bitstream decoding, Huffman decoding, inverse quantization and reordering, stereo decoding, IMDCT, and subband synthesis. Of these, Huffman decoding, inverse quantization, IMDCT, and subband synthesis consume the most CPU and memory resources, and are therefore the key to software decoding on an embedded system.
2 ARM946E processor
The ARM946E processor is a synthesizable version of the ARM9 core with the E extensions, and executes ARMv5TE architecture instructions. It uses a 5-stage pipeline and a memory system redesigned along Harvard lines, with independent data and instruction buses. A memory subsystem improves system performance and supports large operating systems.
As shown in Figure 2, the memory subsystem comprises a memory protection unit (MPU), caches, and a write buffer; the CPU connects to system memory through this subsystem.
ARM9E improves on ARM7 mainly through a higher operating frequency, enhanced hardware features, and more efficient instruction execution. In addition, ARM9E integrates lightweight DSP capabilities, which deliver practical DSP performance at very low cost (the cost being the extra hardware needed for the added CPU functions). Making full use of these chip resources is the key to optimizing MP3 decoding.
3 Algorithm Optimization
Huffman decoding, inverse quantization, IMDCT, and subband synthesis are the most computation-heavy parts of MP3 decoding. An optimization is proposed for each of them.
3.1 Fixed-length redundant table lookup Huffman decoding algorithm
A conventional Huffman decoder examines the bitstream one bit at a time from start to end and decodes by table lookup and comparison. In other words, it must separate Huffman codewords of different lengths from the one-dimensional bitstream and then perform complex matching.
Because the Huffman code tables in Layer III contain codewords of different lengths, the search time per codeword increases. The fixed-length redundant table lookup method expands the Huffman lookup table and takes a fixed-length, N-bit slice of the bitstream as the search index each time. Each table entry is either a jump pointer or a code value. If the indexed entry is a jump pointer, it gives the number of subsequent bits belonging to this Huffman code and the node to jump to; that many bits are then taken from the bitstream and the search continues from the jump node. This repeats until the code value for the codeword is found. The lookup table is implemented with a union data structure, which reduces the space occupied by the Huffman table. For a Huffman code of length l, the traditional algorithm requires l shift operations and l comparisons, whereas the fixed-length method requires only ⌈l/N⌉ lookups and ⌈l/N⌉ comparisons.
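The lookup procedure described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the article's actual tables: the four-symbol code (A=0, B=10, C=110, D=111), the names `HuffEntry`, `BitReader`, `toy_table`, and `huff_decode_symbol`, and the choice N = 2 are all hypothetical. Each entry is a union that is read either as a jump pointer (how many more bits to fetch, and where the sub-table starts) or as a code value (the symbol and how many of the fetched bits it actually uses).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table entry: a jump pointer or a decoded value. Both views start
 * with the same is_ptr flag, so the flag can be read through either. */
typedef union {
    struct { unsigned is_ptr : 1; unsigned bits : 4; unsigned offset : 11; } ptr;
    struct { unsigned is_ptr : 1; unsigned len  : 4; unsigned symbol : 11; } val;
} HuffEntry;

#define PTR(nbits, off) { .ptr = { 1u, (nbits), (off) } }
#define VAL(used, sym)  { .val = { 0u, (used), (sym) } }

enum { ROOT_BITS = 2 };  /* N: bits taken from the stream per root lookup */

/* Hypothetical toy code, not an MP3 table: A=0, B=10, C=110, D=111. */
static const HuffEntry toy_table[] = {
    /* root table, indexed by the next 2 bits */
    VAL(1, 'A'),  /* 00: "0" matched, the second bit is redundant */
    VAL(1, 'A'),  /* 01: same codeword, duplicated entry */
    VAL(2, 'B'),  /* 10 */
    PTR(1, 4),    /* 11: fetch 1 more bit, continue at index 4 */
    /* sub-table at index 4, indexed by 1 bit */
    VAL(1, 'C'),  /* 11 + 0 */
    VAL(1, 'D'),  /* 11 + 1 */
};

typedef struct {
    const uint8_t *data;  /* bitstream, most significant bit first */
    size_t nbits;         /* number of valid bits in data */
    size_t pos;           /* current bit position */
} BitReader;

/* Return the next n bits without consuming them; bits past the end read as 0. */
static unsigned peek_bits(const BitReader *br, unsigned n)
{
    unsigned v = 0;
    assert(n <= 16);
    for (unsigned i = 0; i < n; i++) {
        size_t p = br->pos + i;
        unsigned bit = 0;
        if (p < br->nbits)
            bit = (br->data[p >> 3] >> (7 - (p & 7))) & 1u;
        v = (v << 1) | bit;
    }
    return v;
}

/* Decode one symbol: one lookup per table level instead of one per bit. */
int huff_decode_symbol(BitReader *br)
{
    unsigned base = 0, width = ROOT_BITS;
    for (;;) {
        HuffEntry e = toy_table[base + peek_bits(br, width)];
        if (e.ptr.is_ptr) {
            br->pos += width;       /* all fetched bits belong to the code */
            base = e.ptr.offset;    /* jump to the sub-table */
            width = e.ptr.bits;     /* number of bits to fetch next */
        } else {
            br->pos += e.val.len;   /* consume only the bits actually used */
            return (int)e.val.symbol;
        }
    }
}
```

The duplicated entries for `00` and `01` are the "redundant" part of the table: short codewords occupy several slots so that a fixed-width index always lands on a valid entry. The table grows, but the 3-bit codewords here are resolved in two lookups rather than three bit-by-bit steps.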
Table 1 and Table 2 are examples of Huffman decoding:
The amount of calculation can be reduced by half.
The subband synthesis filter includes 32-point to 64-point IMDCT processing in the decoding process, as shown in equation (3):
Since N(i)(k) has symmetric characteristics, it can be concluded that:
It is sufficient to calculate the V(i) value in the range of 0≤i
4 Code Optimization
Based on the hardware characteristics of the ARM946E processor, the key routines with tight real-time requirements are optimized at both the C and ARM assembly levels.
4.1 Down-counting loop
The two most computation-heavy parts, the IMDCT and the subband synthesis filter bank, contain many loops. To improve execution efficiency, count-down loops are recommended.
As shown in Table 3, for a fixed number of iterations a count-down loop is faster than a count-up loop. Each count-up loop needs 3 overhead instructions outside the loop body, while a count-down loop needs only 2. A count-down loop terminates when the counter reaches zero rather than when it rises to some limit value. Because the decrement itself sets the condition flags, the separate instruction that compares against the limit is omitted.
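The two loop forms can be sketched in C as below. This is a generic illustration, not the article's IMDCT or filter-bank code; the function names are hypothetical, and the exact instructions generated depend on the compiler and optimization level.

```c
#include <stdint.h>

/* Count-up form: the loop test compares the index against the limit n,
 * which typically costs an add, a compare, and a branch per iteration. */
int32_t sum_count_up(const int32_t *data, unsigned n)
{
    int32_t sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Count-down form: the loop ends when the counter reaches zero, so a
 * flag-setting subtract and a branch are enough for the loop overhead. */
int32_t sum_count_down(const int32_t *data, unsigned n)
{
    int32_t sum = 0;
    for (unsigned i = n; i != 0; i--)
        sum += *data++;
    return sum;
}
```

Both functions return the same result; only the loop control differs. Writing the condition as `i != 0` on an unsigned counter gives the compiler the simplest possible termination test.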
4.2 Inline functions and inline assembly
The fixed-point multiplication in the MP3 decoding algorithm is implemented as a function call. Each call takes 23 to 28 clock cycles, of which more than 15 are spent saving the PC and registers on the stack. With inline functions (declared with the keyword __inline) or macros, the code is expanded in place at compile time. The armcc compiler also allows inline assembly in C source files, at the cost of poor portability. Inline functions that contain assembly give access to ARM instructions and optimizations that the C compiler does not normally exploit, such as the ARMv5E extension instructions. Implementing the shift-and-multiply operation as an inline function combined with inline assembly reduces the average cost to 6 to 8 clock cycles.
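A portable sketch of such an inlined fixed-point multiply is shown below. It assumes values with 28 fractional bits and uses standard C99 `static inline` with a 64-bit intermediate instead of armcc inline assembly, so it does not reproduce the cycle counts quoted above. The names `FRAC_BITS`, `FIX28`, and `fixmul28` are hypothetical.

```c
#include <stdint.h>

#define FRAC_BITS 28

/* Convert a real constant to fixed point with 28 fractional bits.
 * Representable range is roughly [-8, 8). */
#define FIX28(x) ((int32_t)((x) * (double)(1 << FRAC_BITS)))

/* Inlined fixed-point multiply: 32x32 -> 64-bit product, then drop the
 * extra 28 fractional bits. Being inline, it avoids the call and
 * register save/restore overhead of an ordinary function.
 * Assumes >> on a negative value is an arithmetic shift, as it is on
 * common compilers. */
static inline int32_t fixmul28(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FRAC_BITS);
}
```

For example, `fixmul28(FIX28(0.5), FIX28(0.5))` yields `FIX28(0.25)`. On an ARM target the 64-bit product would typically map to a long multiply instruction, and an armcc build could replace the body with inline assembly.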
4.3 Application of ARM DSP extension instructions
The ARM946E processor supports the ARM DSP extension instructions, which fall into three main groups:
(1) Single-cycle 16×16 and 32×16 MAC operations;
(2) Saturating extensions to the existing arithmetic instructions;
(3) The Count Leading Zeros (CLZ) instruction, which improves the performance of normalization, floating-point emulation, and division.
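To make groups (2) and (3) concrete, the sketch below gives portable C stand-ins for what a saturating add (QADD) and CLZ compute, plus a normalization helper built on CLZ. These are software emulations for illustration only; the function names are hypothetical, and on the real processor each of the first two is a single instruction.

```c
#include <stdint.h>

/* Stand-in for CLZ: number of leading zero bits, 32 when x is 0. */
unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Stand-in for QADD: signed add that clamps to the int32 range
 * instead of wrapping around on overflow. */
int32_t qadd32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX)
        return INT32_MAX;
    if (s < INT32_MIN)
        return INT32_MIN;
    return (int32_t)s;
}

/* Normalization: shift x left until its top bit is set and report the
 * shift amount. With a hardware CLZ this needs no loop at all. */
uint32_t normalize32(uint32_t x, unsigned *shift)
{
    unsigned s = clz32(x);
    *shift = s;
    return (s < 32) ? (x << s) : 0;
}
```

Saturation matters in audio code because a wrapped overflow turns a loud positive sample into a loud negative one, whereas a clamped value only clips.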
The processor has no hardware floating-point support, so decoding uses fixed-point arithmetic. After testing and analysis, a fixed-point precision of 28 bits was selected; it gives good decoded sound quality and avoids the audible popping that would otherwise degrade playback.
For a comparable multiplication, ARM's SMULL (32×32) instruction takes 3 cycles, whereas the DSP extension instruction SMULWT (32×16) takes only 1 cycle. Because one multiplier operand is only 16 bits wide, the result differs slightly. However, MP3 decoding works on 28-bit fixed-point values throughout, and the typical operation multiplies an intermediate result by an entry in a fixed-point table. If the high 16 bits of the table entry are used as the operand, the error in the result is within 1 bit.
To verify the effect of the ARM DSP extension instructions, a decoding test was run on files with a bit rate of 128 kb/s at a system frequency of 120 MHz. The test files are listed in Table 4.
All three MP3 test files have a bit rate of 128 kb/s. The decoding analysis results for these three songs are shown in Figure 3.
The experiments show that decoding with the ARM DSP extension instructions is on average 17.5% faster than decoding with general ARM instructions, with no subjectively audible difference in sound quality.
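The difference between the two multiply forms can be sketched in portable C. The code below emulates what SMULWT computes (a 32-bit value times the signed top half of the second operand, keeping bits 47 to 16 of the product) and compares it with a full 32×32 multiply for values with 28 fractional bits. The function names are hypothetical, and the emulation says nothing about cycle counts or about the error bound stated above; it only shows which bits each form uses.

```c
#include <stdint.h>

/* Full-precision reference: 32x32 -> 64-bit product, as SMULL produces,
 * scaled back to 28 fractional bits. */
int32_t mul_q28_full(int32_t a, int32_t coef)
{
    return (int32_t)(((int64_t)a * coef) >> 28);
}

/* Emulation of SMULWT: multiply a by the signed top 16 bits of b and
 * keep bits [47:16] of the 48-bit product. The low 16 bits of b are
 * ignored. Assumes >> on a negative value is an arithmetic shift. */
int32_t smulwt_emul(int32_t a, int32_t b)
{
    int16_t b_top = (int16_t)(b >> 16);
    return (int32_t)(((int64_t)a * b_top) >> 16);
}

/* 28-bit fixed-point multiply built on the 32x16 form. The SMULWT
 * result corresponds to (a * coef) >> 32, so it is scaled up by 2^4
 * to match a shift of 28. Multiplication is used instead of << because
 * left-shifting a negative value is undefined in C. */
int32_t mul_q28_fast(int32_t a, int32_t coef)
{
    return smulwt_emul(a, coef) * 16;
}
```

When the low 16 bits of the coefficient are zero, the fast and full versions agree exactly for inputs whose scaled result fits in 32 bits. Otherwise the fast version behaves as if the coefficient had been truncated to its top 16 bits, which is why the table entries are read by their high halves.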
5 Conclusion
This work makes full use of the DSP extension instructions of the ARM946E processor to improve the execution efficiency of the program code. It optimizes and simplifies the algorithms of three key modules (Huffman decoding, the IMDCT, and the subband synthesis filter), reducing the amount of calculation in each. The code is also optimized at the C and ARM assembly levels, which together yields better real-time MP3 decoding.