Design Concept and Application of Real-time Bus Module in SoC System

Publisher: RadiantSmile · Last updated: 2014-02-14 · Source: eccn
In chip design, the internal bus architecture largely determines a chip's performance, power consumption, and the design complexity of each module. Bus design is usually guided by two sets of requirements: the inherent constraints of the chip design process, and the application's demands on switching bandwidth, latency, efficiency, and flexibility.

To satisfy the inherent constraints of the design process, an efficient bus architecture usually follows these basic principles: fully synchronous design; synthesizable logic; no three-state signals; low latency with a single register stage of delay; support for multiple masters and bus arbitration (enabling DMA and multiple CPU cores); independence from any particular clock frequency; burst support for high efficiency; and low gate count. Following these principles helps avoid many design risks while improving bus efficiency and IP reuse. Some of them, such as the ban on three-state buses, can and should be relaxed in specific applications, but chip designers should not break these rules lightly. Nanshan Bridge Microelectronics, for example, uses three-state bus techniques in high-end chips to solve the routing congestion and timing-matching problems of ultra-wide buses.

Application requirements often determine the form of the bus, as with the embedded CPU bus structures commonly used in SoC chips. Conversely, when choosing a CPU, beyond cost, performance, power consumption, fast and accurate timing simulation models, the compiler toolchain, and available IP, another key consideration is whether its bus design is simple, efficient, and conducive to the efficiency of the other design modules.

The popular ARM processors adopt the AMBA bus standard. The AHB bus commonly used in high-speed chips has the following characteristics: pipelined transfers; no three-state signals; support for multiple masters with bus arbitration and centralized address decoding; a response-based (non-real-time) handshake; and burst support.

In short, the AHB bus suits the CPU's execution model and conforms to the principles of efficient bus design. It does have limitations, such as the bus width (tied mainly to the instruction width) and the complexity of supporting the optional SPLIT transfer; more than half of the designs I have participated in omit SPLIT support to reduce design and verification overhead. Space does not permit a full discussion here. The central problem is that the CPU bus in an SoC is response-based, i.e. non-real-time, and its data processing relies on interrupts for efficiency; it offers no fixed latency or stable throughput for real-time data. A dedicated module is therefore needed to smooth the transition between real-time data and the non-real-time bus. The author takes such a module as an example to explain the design concepts and several practical techniques for switching real-time data over a non-real-time bus. In the example, real-time data is carried on a TDM (Time-Division Multiplexed) bus, and we call this block the TDM module.

TDM module design

One side of the TDM module carries the input and output of multiple audio signals; the other side connects to the AHB bus. Audio data input/output usually adopts a frame-structured TDM format (see Figure 1), where sp_io_xclk is the audio sampling clock, sp_io_xfs is the frame-sync header, and the following two lines are the output and input data respectively. This is a multi-channel, time-division, framed real-time data format. Introductory material on the AMBA bus is widely available and will not be repeated here.

The design of this module followed these principles: smooth matching of data-transfer rates; low latency and low resource usage (logic and storage); efficient use of AHB bus bandwidth; improved CPU processing efficiency; reliability and error handling; and controllability and observability. The basic idea is to use a FIFO (first-in, first-out) queue to buffer the transfer, caching as little data as possible to keep latency and resource usage low; to use AHB burst mode to improve bandwidth utilization; and to expose registers, accessed in AHB slave mode, that control transfer parameters and store status. The preliminary structure is shown in Figure 2.
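The buffering idea above can be sketched in C as a small ring-buffer FIFO that accepts one TDM sample per clock on the producer side and drains a fixed-length burst on the consumer side when the bus grant arrives. The depth and burst length here are hypothetical placeholders, not values from the article; real sizing follows the worst-case formula derived later.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH    64   /* power of two; hypothetical, sized from the worst-case formula */
#define AHB_BURST_LEN 8    /* INCR8-style burst: drain 8 words per bus grant */

typedef struct {
    uint32_t buf[FIFO_DEPTH];
    uint32_t head, tail;   /* free-running counters; head = next write, tail = next read */
} fifo_t;

static inline uint32_t fifo_count(const fifo_t *f) { return f->head - f->tail; }

/* Producer side: one TDM sample per sp_io_xclk tick. Returns false on overrun. */
bool fifo_push(fifo_t *f, uint32_t sample) {
    if (fifo_count(f) == FIFO_DEPTH) return false;   /* queue full: overrun */
    f->buf[f->head++ % FIFO_DEPTH] = sample;
    return true;
}

/* Consumer side: drain a whole burst when the AHB grant arrives.
 * Returns the number of words transferred (0 if a full burst is not yet queued). */
uint32_t fifo_drain_burst(fifo_t *f, uint32_t *dst) {
    if (fifo_count(f) < AHB_BURST_LEN) return 0;     /* wait until a full burst is buffered */
    for (uint32_t i = 0; i < AHB_BURST_LEN; i++)
        dst[i] = f->buf[f->tail++ % FIFO_DEPTH];
    return AHB_BURST_LEN;
}
```

Draining only in whole bursts is what lets the module use AHB burst transfers efficiently instead of issuing single-word accesses.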

When to use DMA technology

In this preliminary design, the required queue length depends mainly on the rate and frequency of AHB bursts. To buffer less data, AHB transfers must happen more often, which means interrupting the CPU more often and reducing its processing efficiency.

This looks like an unsolvable contradiction, but DMA (Direct Memory Access) resolves it. SoC chips generally have external DDR/SDRAM as the final data and program store, and the TDM module can move real-time data directly to DRAM without frequently interrupting the CPU. In essence, this shifts the demand for on-chip buffering off-chip (assuming sufficient bus bandwidth), reducing both the queue length and the interrupt rate, and thus resolving the contradiction.

DMA essentially means the module takes the initiative on the bus, which requires AHB master mode; the final structure is shown in Figure 3.

The contradiction between delay and DMA application

Careful readers will notice that DMA increases the processing delay. Does this not contradict our principles? The answer lies in how the embedded CPU's audio algorithms work. Most are audio compression algorithms, which need a certain audio segment length to maintain the compression ratio and to amortize the scheduling overhead of the RTOS. Other audio processing, such as an echo-cancellation DSP algorithm, typically uses a 64-tap finite impulse response filter to handle an echo tail longer than 16 ms, and highly compressed codecs (such as those based on excitation-parameter models) need even longer segments. From the algorithm's perspective, therefore, the theoretical lower bound on the SoC system's audio processing delay is the largest processing unit among the algorithms. As long as the DMA transfer delay stays below this bound, the system's minimum delay is fully exploited, and we have a basis for sizing the DMA segment.
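To make this bound concrete, here is a small C sketch of the sizing argument. The sample rate and the per-algorithm processing units (a codec frame, an echo tail, a parametric-codec segment) are hypothetical illustration values, not figures from the article.

```c
#include <stdint.h>

/* Largest per-algorithm processing unit, in milliseconds: this is the
 * theoretical delay floor of the SoC system argued for in the text. */
uint32_t delay_floor_ms(const uint32_t unit_ms[], uint32_t n) {
    uint32_t max = 0;
    for (uint32_t i = 0; i < n; i++)
        if (unit_ms[i] > max) max = unit_ms[i];
    return max;
}

/* Longest DMA segment (in samples) whose buffering delay stays at or below
 * that floor, so DMA adds no delay the algorithms do not already require. */
uint32_t max_dma_segment(uint32_t floor_ms, uint32_t sample_rate_hz) {
    return floor_ms * sample_rate_hz / 1000;
}
```

For example, with hypothetical units of 10 ms (codec frame), 16 ms (echo tail), and 20 ms (parametric segment) at 8 kHz sampling, the floor is 20 ms and the DMA segment may be up to 160 samples per channel.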

Returning to the queue-length calculation: we now only need to consider the worst-case interval between AHB bus grants to the TDM module, together with the rate of the incoming TDM data:

Queue depth = longest AHB bus acquisition interval × TDM input rate

The AHB bus polling interval depends on the number of master-mode modules on the bus and the arbitration priority policy. It is generally recommended that real-time modules be given higher priority, which of course also requires that they not request the bus too frequently; balancing that trade-off is beyond the scope of this article, but readers can start from a "fixed weight plus priority contention" arbitration scheme when designing the AHB arbiter.

Dynamic switching timing and the use of shadow registers

In practice, many time slots in the frame carry no audio data, so a time-slot mask is needed to disable those channels and keep invalid data from consuming bandwidth. The set of active slots changes dynamically, however, so the mask parameters must also be reallocated on the fly. How is the switch made safely? Here the "shadow register" technique applies: there are two sets of registers, one applied to the current frame and the other to the next, and the swap happens within a single clock cycle at the frame-sync header. The CPU in the SoC sees only one set of register addresses, and since the configuration write itself is not time-critical, the real-time switch is completed entirely by the TDM module. The scheme is shown in Figure 4.
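A behavioral C model of the shadow-register swap may help clarify the idea (in silicon this would of course be RTL; the struct layout and function names here are illustrative only). The CPU always writes into the inactive copy, and the hardware commits the swap atomically at the frame-sync pulse:

```c
#include <stdint.h>
#include <stdbool.h>

/* Two physical copies of the time-slot mask behind one CPU-visible address. */
typedef struct {
    uint32_t mask[2];   /* mask[active] is in effect; the other copy is the shadow */
    uint8_t  active;    /* index of the copy applied to the current frame */
    bool     pending;   /* shadow holds a new value awaiting frame sync */
} shadow_reg_t;

/* CPU-side write: always lands in the shadow copy, never mid-frame. */
void shadow_write(shadow_reg_t *r, uint32_t mask) {
    r->mask[r->active ^ 1] = mask;
    r->pending = true;
}

/* CPU-side read: sees the value currently in effect. */
uint32_t shadow_read(const shadow_reg_t *r) { return r->mask[r->active]; }

/* Hardware-side: called once per sp_io_xfs pulse to commit the swap. */
void shadow_frame_sync(shadow_reg_t *r) {
    if (r->pending) { r->active ^= 1; r->pending = false; }
}
```

Because the swap is a single index flip on the frame boundary, a frame is always processed with one consistent mask, no matter when the CPU wrote the new value.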

Error handling: the life-saving straw

As everyone knows, there is no second chance in chip design, so error handling becomes a "life-saving straw". Suppose the TDM module goes too long without being granted the bus, causing an underrun (data rate too low) or overrun (data rate too high). The queue needs "high-watermark" and "low-watermark" thresholds to raise early warnings before it approaches full or empty. Such warnings usually reflect system-level design problems, or transient conditions at that moment such as voltage fluctuations, interference, and local hot spots. The warning signal is normally raised as a highest- or second-highest-priority interrupt; the ARM CPU itself supports high-priority interrupts. The queue length must now be recalculated to include this high-priority response period. For the exact response time in clock cycles, readers should consult the corresponding CPU manual; this figure is also a useful metric when evaluating embedded CPUs and real-time operating systems (RTOS).

Queue depth = longest AHB bus acquisition interval × TDM input rate + ARM longest interrupt response time × TDM input rate
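The extended formula and the watermark check can be sketched together in C. The one-eighth watermark margin and the microsecond/samples-per-second units are illustrative assumptions, not values from the article:

```c
#include <stdint.h>

/* Extended sizing: add the CPU's worst-case interrupt response time to the
 * worst-case bus-grant interval before multiplying by the TDM input rate. */
uint32_t queue_depth_ext(uint32_t grant_us, uint32_t irq_us, uint32_t rate_sps) {
    uint64_t scaled = (uint64_t)(grant_us + irq_us) * rate_sps;
    return (uint32_t)((scaled + 999999) / 1000000);   /* ceil to whole samples */
}

/* Watermark check: warn before the queue actually overruns or underruns. */
typedef enum { WM_OK, WM_HIGH, WM_LOW } wm_t;

wm_t watermark(uint32_t count, uint32_t depth) {
    if (count >= depth - depth / 8) return WM_HIGH;   /* near full: overrun risk  */
    if (count <= depth / 8)         return WM_LOW;    /* near empty: underrun risk */
    return WM_OK;
}
```

With a hypothetical 1 ms grant interval, 250 µs worst-case interrupt latency, and 8000 samples/s input, the depth grows from 8 to 10 samples; the watermark interrupt then fires while there is still slack to recover.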

Summary

This brief TDM module design has illustrated the process and thinking behind combining basic techniques: from buffer queue to DMA, shadow registers, dynamic allocation, and watermarks, while exploiting the characteristics of the DSP algorithms, the AHB bus, frame synchronization, and the RTOS. Together these resolve the contradictions between non-real-time and real-time switching, between CPU efficiency and resource usage, and between latency, DMA configuration, and dynamic switching, in pursuit of an optimal solution.

This article does not give the queue formula for the initial (DMA-free) design, because it involves too many factors; that in itself shows it is not the optimal solution. A good design should simplify and modularize complex requirements. The real design is, of course, more complicated than this simplified one; for example, the clock asynchrony between the two ends of the real-time link must also be handled. But once readers master the basic ideas and techniques, and understand the characteristics of the application, the CPU, the RTOS, and the algorithms, they can generalize from this example and produce optimal designs.