A DSP Processor Design Based on Power Consumption Management


Li Zhaohui
(School of Software and Microelectronics, Northwestern Polytechnical University, Xi'an 710065, China)

　　Abstract: This paper presents the design of a DSP processor with power management features. The processor combines a 4-stage pipeline, an enhanced Harvard parallel architecture and a complete clock management module in an integrated design.

　　Keywords: DSP processor; pipeline; Harvard architecture; low power consumption

　　Today, as information becomes an increasingly important resource, strong market demand and advances in microelectronics have driven the rapid development of portable electronic systems. These devices place high demands on speed and area and also impose strict limits on average system power consumption, so power consumption has increasingly become a bottleneck restricting their development. Obtaining a high-performance, low-power solution is essentially a matter of balancing the processing speed, chip area and power consumption requirements of the digital signal processing system.
　　This paper introduces the system design of a low-power digital signal processor (DSP) based on a parallel pipeline. The design addresses the shortcomings of general-purpose processors and can be applied to a variety of portable systems.
　　In this design, the pipelined structure reduces system power consumption at the behavioral level; the clock management scheme lets the system run at different operating frequencies in different working modes, minimizing the power consumed by each individual task; and the enhanced Harvard memory organization greatly increases the parallelism of the system and improves its efficiency.
1 Pipeline structure
　　The pipeline structure is one of the main methods of reducing power consumption at the behavioral level of a chip. Its principle is briefly analyzed below. In the traditional analysis, the power consumption of a CMOS circuit can be estimated by the following equation:
　　$$P_{seq} = C_{charge} V_{dd}^{2} f \qquad (1)$$

　　Here f = 1/Ts, where Ts is the clock period of the original sequential system, Ccharge is the capacitance charged and discharged in one clock period, and Vdd is the supply voltage. In an M-stage pipelined system the critical path is shortened to 1/M of its original length, so the capacitance charged and discharged along it in one clock period falls to Ccharge/M (note that the total capacitance is unchanged). If the clock frequency is kept the same, only Ccharge/M has to be charged and discharged in the time originally available for Ccharge, which means that the supply voltage can be lowered to βVdd, where β is a positive constant less than 1. The power consumption of the pipelined system is then:
　　$$P_{pip} = C_{charge} (\beta V_{dd})^{2} f = \beta^{2} C_{charge} V_{dd}^{2} f \qquad (2)$$

　　Compared with the original system, the power consumption of the pipelined system falls to β² times its original value.
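　　As a quick numerical check of this relation, the sketch below evaluates the standard dynamic-power estimate (capacitance times supply voltage squared times clock frequency) for an unpipelined and a pipelined design. The capacitance, voltage, frequency and β = 0.6 are illustrative assumptions, not figures from this design, and the function names are hypothetical.

```python
def dynamic_power(c_charge, vdd, f_clk):
    """Estimate CMOS dynamic power in watts as C * Vdd^2 * f."""
    return c_charge * vdd ** 2 * f_clk


def pipelined_power(c_charge, vdd, f_clk, beta):
    """Power when pipelining lets the supply drop to beta * Vdd at the same clock."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    return dynamic_power(c_charge, beta * vdd, f_clk)


# Illustrative values only (assumptions, not taken from the paper).
C_CHARGE = 20e-12   # farads switched per clock period
VDD = 3.3           # volts
F_CLK = 50e6        # hertz
BETA = 0.6          # supply scaling factor allowed by pipelining

p_orig = dynamic_power(C_CHARGE, VDD, F_CLK)
p_pipe = pipelined_power(C_CHARGE, VDD, F_CLK, BETA)
ratio = p_pipe / p_orig

print(f"original:  {p_orig * 1e3:.2f} mW")
print(f"pipelined: {p_pipe * 1e3:.2f} mW")
print(f"ratio:     {ratio:.2f} (beta squared = {BETA ** 2:.2f})")
```

　　The ratio depends only on β, so the saving is the same whatever capacitance and frequency are assumed.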
　　The DSP processor adopts the 4-stage pipeline structure shown in Figure 1. The function of each stage is as follows:
　　FI: Instruction fetch (addressing) stage. The program address generation module generates the instruction memory address and the instruction is fetched.
　　DI: Decoding stage. Instruction decoding generates the corresponding micro-control signals, which are sent to the corresponding control registers.
　　FO: Operand fetch stage. The required data is read from the register file or external memory and sent over the data bus to the operation unit or register file.
　　EXE/WB: Execution and write-back stage. The calculation or operation is performed and the result is placed on the write bus (EB).

Figure 1 4-stage pipeline
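　　The overlap of the four stages can be illustrated with a small occupancy table. The sketch below assumes an ideal pipeline with no stalls, hazards or jumps, which is a simplification; the function name is hypothetical. Under that assumption, N instructions complete in N + 3 clock cycles rather than 4N.

```python
STAGES = ["FI", "DI", "FO", "EXE/WB"]


def pipeline_schedule(num_instructions):
    """Return one dict per clock cycle mapping stage name to instruction index.

    A stage that holds no instruction in a given cycle maps to None.
    Assumes an ideal pipeline: every instruction advances one stage per cycle.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth
            row[stage] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(5), start=1):
    cells = []
    for stage in STAGES:
        instr = row[stage]
        label = f"I{instr}" if instr is not None else "--"
        cells.append(f"{stage}={label}")
    print(f"cycle {cycle}: " + "  ".join(cells))
```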
2 Peripheral interface section
　　The peripheral interface section provides the connections inside and outside the system that support its various modes of information transfer. This design divides these interfaces into two groups: (1) MCU-type interfaces, such as low-speed serial ports (serial peripheral interface (SPI) and universal asynchronous receiver/transmitter (UART)), programmable communication interface (PCI), universal serial bus (USB) and some peripheral devices; (2) high-speed interfaces suited to sending and receiving media data, such as asynchronous serial ports and parallel peripheral interfaces.
3 Data transmission design
　　Digital signal processing involves large amounts of data, so moving data efficiently is a key bottleneck for system performance. A DSP processor must therefore have comprehensive DMA capabilities for transferring data on and off the chip. Because it is not realistic to integrate enough memory inside the DSP chip, DMA must be used to manage streaming data and to separate data transfer from system control. This both raises the data transfer rate and reduces the load on the processor core, improving the operating efficiency of the system.
　　In this design, DMA uses descriptor-based transfers. Starting a DMA transfer sequence requires a set of parameters stored in memory. This type of transfer allows multiple DMA sequences to be linked together: a DMA channel can be programmed to set up and start another DMA transfer once the current sequence has completed.
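　　The chaining idea can be sketched in software as a linked list of descriptors. The descriptor fields below (source address, destination address, length, link to the next descriptor) are a hypothetical layout chosen for illustration; the paper does not specify the actual parameter set, and a real DMA engine would perform the copies in hardware without processor involvement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DmaDescriptor:
    """Hypothetical descriptor: one block transfer plus a link to the next one."""
    src: int
    dst: int
    length: int
    next: Optional["DmaDescriptor"] = None


def run_dma_chain(memory, first):
    """Walk the descriptor chain, copying each block; return the transfer count.

    Assumes the chain is finite (no descriptor links back to an earlier one).
    """
    count = 0
    desc = first
    while desc is not None:
        block = memory[desc.src:desc.src + desc.length]
        memory[desc.dst:desc.dst + desc.length] = block
        count += 1
        desc = desc.next  # the channel loads the next descriptor when one completes
    return count


# 16 bytes of source data followed by 16 bytes of zeroed destination space.
memory = bytearray(range(16)) + bytearray(16)
second = DmaDescriptor(src=8, dst=24, length=4)
first = DmaDescriptor(src=0, dst=16, length=4, next=second)

transfers = run_dma_chain(memory, first)
print(transfers, list(memory[16:20]), list(memory[24:28]))
```

　　Only the first descriptor has to be handed to the channel; the second transfer starts without further intervention, which is what frees the processor core from the data movement.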
4 Design of multipliers and logic units
　　High-speed arithmetic is the outstanding feature of digital signal processing applications, so the architecture must include a dedicated multiplier to achieve the required performance. The block diagram of the multiplier and logic unit is shown in Figure 2.

Figure 2 CALU and multiplier structure diagram
　　When the multiplier is working, TR is loaded by an LT (Load TR) instruction and supplies one operand of the multiplication. The multiply instruction supplies the other operand, which can come either from the data bus or as an immediate value from the program bus. In either case, a stable product can be obtained in every cycle.
　　The three shifters are barrel shifters that provide shift operations on 16-bit or 32-bit operands, which greatly speeds up accumulation after multiplication.
5 Address processing module
　　The address processing module calculates instruction-fetch and data addresses for the bus components, and also handles repeat and jump instructions. Based on the characteristics of the instruction set, the address processing unit designed in this paper is shown in Figure 3.

Figure 3. Block diagram of address processing module
　　The derived address may come from S_BUS, from the previous address plus 1, or from the bus input data register DataIn; the value of the instruction pointer IC may come from S_BUS or from its own increment; the prefetch pointer PreIC may come from IC or from its own increment. The final output address is one of four values: the derived address register AddrTemp, the instruction pointer IC, the bus input data register DataIn, or the prefetch pointer PreIC.
　　When the executing instruction needs to calculate an effective address, the output address is the derived address register AddrTemp; when the program jumps, the output address is the instruction pointer IC; when the addressing mode is indirect, the output address is DataIn; when an instruction is being prefetched, the output address is the prefetch pointer PreIC.
　　Because AddrTemp and IC never need to be incremented at the same time, the design provides only one incrementer, which the two share.
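　　The selection rules above can be summarized as a four-way multiplexer plus a shared incrementer. The following behavioral sketch is only a software model: the 16-bit address width, the condition names and the class itself are assumptions made for illustration, and PreIC's own increment path is omitted.

```python
ADDR_MASK = 0xFFFF  # assumed 16-bit address width (not stated in the paper)


class AddressUnit:
    """Behavioral model of the address output multiplexer and shared incrementer."""

    def __init__(self):
        self.addr_temp = 0  # derived address register AddrTemp
        self.ic = 0         # instruction pointer IC
        self.data_in = 0    # bus input data register DataIn
        self.pre_ic = 0     # prefetch pointer PreIC

    def increment(self, register):
        """Use the single shared incrementer on AddrTemp or IC (one per call)."""
        if register == "AddrTemp":
            self.addr_temp = (self.addr_temp + 1) & ADDR_MASK
        elif register == "IC":
            self.ic = (self.ic + 1) & ADDR_MASK
        else:
            raise ValueError("the shared incrementer serves only AddrTemp or IC")

    def output_address(self, condition):
        """Select the output address according to the current operation."""
        if condition == "effective_address":
            return self.addr_temp
        if condition == "jump":
            return self.ic
        if condition == "indirect":
            return self.data_in
        if condition == "prefetch":
            return self.pre_ic
        raise ValueError(f"unknown condition: {condition}")


unit = AddressUnit()
unit.ic = 0x0100
unit.pre_ic = unit.ic      # PreIC loaded from IC
unit.increment("IC")       # IC advances through the shared incrementer
unit.data_in = 0x2000      # address read from memory for indirect addressing

for cond in ("effective_address", "jump", "indirect", "prefetch"):
    print(cond, hex(unit.output_address(cond)))
```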
6 Organization and management of memory
　　In digital signal processing systems, data throughput directly affects system performance. The traditional von Neumann structure stores instructions and data in the same memory under a single address space, relying on the address supplied by the instruction counter to distinguish instructions from data. Instruction fetches and data fetches both access the same memory, so data throughput is low. The Harvard structure is a parallel architecture that differs from the traditional von Neumann structure. Its main feature is that programs and data are stored in different storage spaces: program memory and data memory are two independent memories, each addressed and accessed independently. The system has separate program and data buses, which doubles the data throughput.
　　This design adopts the enhanced Harvard structure shown in Figure 4, which includes one program code memory and two data memories: the program code memory stores only instructions, the program data memory stores program data, and the data memory stores general data. These memories are accessed independently of one another, so the system can supply two operands while fetching an instruction, greatly improving execution efficiency.
　　To provide a larger virtual address space, the memory is managed by paging. Several different pages can occupy the same address range, and the paging register of each memory indicates which page is currently being accessed.

Figure 4 Enhanced Harvard structure
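　　The paging scheme can be modelled as a set of pages that share one address window, with a page register choosing the active page. The page count, page size and method names below are assumptions for illustration only; the paper does not give the actual page organization.

```python
class PagedMemory:
    """One memory whose pages share the same address range.

    The page register selects which page a given address refers to.
    """

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.pages = [[0] * page_size for _ in range(num_pages)]
        self.page_register = 0

    def select_page(self, page):
        if not 0 <= page < len(self.pages):
            raise ValueError("page out of range")
        self.page_register = page

    def _check(self, addr):
        if not 0 <= addr < self.page_size:
            raise IndexError("address outside the page window")

    def read(self, addr):
        self._check(addr)
        return self.pages[self.page_register][addr]

    def write(self, addr, value):
        self._check(addr)
        self.pages[self.page_register][addr] = value


# Assumed sizes: 4 pages of 256 words each.
data_memory = PagedMemory(num_pages=4, page_size=256)

data_memory.select_page(0)
data_memory.write(0x10, 111)
data_memory.select_page(1)
data_memory.write(0x10, 222)   # same address, different page

data_memory.select_page(0)
value_page0 = data_memory.read(0x10)
data_memory.select_page(1)
value_page1 = data_memory.read(0x10)
print(value_page0, value_page1)
```

　　In the enhanced Harvard structure each of the three memories would have its own instance and its own page register, so changing the page of one memory does not affect the others.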
7 Clock Management Solution
　　As can be seen from formula (1), the power consumption of the system is proportional to the clock frequency, so lowering the system clock effectively reduces power consumption. The clock management scheme supplies the system with the appropriate operating frequency for each working mode. Its structure is shown in Figure 5. As the figure shows, the external input clock CLKI is connected to the delay-locked loop (DLL) through the global input buffer IBUFG, and the original-phase clock of the DLL is output through the global buffer BUFG, giving a stable original clock on the chip. In low-power mode, the system divides the original clock according to the value of the user-configured clock division counter to generate a divided clock. To stop the system clock, a constant low level is used directly as the clock output.

Figure 5 Clock management solution structure diagram
　　The three clocks above (the original clock, the divided clock and the stopped clock) are selected by a multiplexer, so the internally generated clock is no longer a stable clock. The selected clock is therefore routed off-chip and fed back in on the dedicated on-chip clock line: it passes through a global input buffer to a delay-locked loop (DLL), whose original-phase clock is output through a global buffer to produce a stable system master clock. At the same time, the DLL's divide-by-two clock is output through a global buffer as the system status clock, which takes part in system control. In addition, the DLL's clock lock flag LOCKED is brought out so that the stability of the internal clock can be observed during system debugging.
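　　The effect of the three clock choices on power can be sketched with a simple model of the output multiplexer. The mode names, the 40 MHz source frequency and the assumption that the division ratio equals the configured counter value are all illustrative; the paper does not specify them. Dynamic power is taken to scale linearly with the selected clock frequency.

```python
def output_clock_hz(mode, source_hz, divide_count=1):
    """Return the clock frequency selected by the modelled output multiplexer.

    Assumes the division ratio equals the configured counter value.
    """
    if mode == "normal":
        return float(source_hz)
    if mode == "low_power":
        if divide_count < 1:
            raise ValueError("divide_count must be at least 1")
        return source_hz / divide_count
    if mode == "stopped":
        return 0.0  # clock output held at a constant low level
    raise ValueError(f"unknown mode: {mode}")


def relative_dynamic_power(mode, source_hz, divide_count=1):
    """Dynamic power relative to full-speed operation (power proportional to f)."""
    return output_clock_hz(mode, source_hz, divide_count) / source_hz


SOURCE_HZ = 40e6  # assumed original clock frequency

for mode, div in (("normal", 1), ("low_power", 8), ("stopped", 1)):
    f_out = output_clock_hz(mode, SOURCE_HZ, div)
    rel = relative_dynamic_power(mode, SOURCE_HZ, div)
    print(f"{mode:9s} f = {f_out / 1e6:5.1f} MHz, relative dynamic power = {rel:.3f}")
```

　　This model covers only frequency-dependent dynamic power; static leakage, which remains when the clock is stopped, is not represented.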
　　Compared with other processor solutions, the low-power DSP processor design introduced in this article offers low cost, low complexity and a short time to market, and can deliver the digital signal processing performance needed by a range of portable applications at a lower price. This design method can serve as a reference for similar designs.