Design of Double-Layer AMBA Bus and Its Application in SoC Chip Design

Generally speaking, an SoC chip consists of on-chip processor cores, user-designed IP cores, and a bus that integrates the two. The choice of processor core determines the on-chip bus and the overall chip architecture. ARM embedded microprocessors hold a major share of the market thanks to their high performance and low power consumption, and the ARM7TDMI is widely used in SoC design because of its relatively low price. At the same time, AMBA (Advanced Microcontroller Bus Architecture), the on-chip bus architecture developed by ARM, has become a popular on-chip bus structure owing to its high performance and the wide adoption of ARM cores. In addition to the processor core and the on-chip bus, an SoC also integrates various IP blocks designed by the user or provided by suppliers. Figure 1 is a block diagram of an ARM7TDMI-based SoC chip for the consumer electronics field.

As shown in Figure 1, the ARM7TDMI accesses every slave through the bus; the DMA controller accesses peripherals through the bus to exchange data; the LCD controller must continuously read the video memory through the bus to achieve real-time display; and the other masters in the system also occupy the bus while working.

The LCD controller module deserves special attention. A color display requires a large amount of data. For example, a 320×240, 16 bpp TFT color screen requires 320×240×16/8 = 153,600 bytes (153.6 kByte) per frame. On-chip memory cannot hold this much data, so it must be fetched from external memory through the memory interface. Because the LCD controller needs a large amount of data and must display it in real time, it occupies a large share of the on-chip bus bandwidth and can even affect the normal operation of the entire system. In today's consumer electronics, support for color displays is almost indispensable.
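The frame-size arithmetic above can be checked with a short sketch. The function names are illustrative, and the 60 Hz refresh rate is an assumption chosen only to show how the per-frame figure turns into a sustained read rate; the article does not state a refresh rate.

```python
def frame_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed to store one full frame of video memory."""
    return width * height * bits_per_pixel // 8


def refresh_bandwidth(width: int, height: int, bits_per_pixel: int,
                      refresh_hz: int) -> int:
    """Bytes per second the LCD controller must read to refresh the panel."""
    return frame_bytes(width, height, bits_per_pixel) * refresh_hz


# The panel discussed in the text: 320x240 at 16 bits per pixel.
print(frame_bytes(320, 240, 16))            # 153600 bytes per frame
# Assumed 60 Hz refresh rate (not given in the article).
print(refresh_bandwidth(320, 240, 16, 60))  # 9216000 bytes per second
```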

This problem can be addressed by optimizing the bus switching algorithm, adding on-chip cache, or improving the bus architecture. Of these, optimizing the bus switching algorithm brings only a limited performance improvement, while the design complexity and high license cost of a cache make it unsuitable in many cases. AMBA with a dual-bus architecture is therefore a good choice.

Dual-bus architecture AMBA and its implementation

With a single-layer bus, all masters and slaves are attached to the same AHB bus. Any Master that wants to access a Slave must first request the bus. Once it has been granted ownership of the bus, it exchanges address, data and control signals through the MUXs in the bus interconnect, while all other Masters must wait.

Double-layer AMBA bus structure

The double-layer AMBA bus architecture uses a more complex internal interconnect that allows two Master/Slave pairs to communicate through AMBA at the same time, which greatly improves the bandwidth of the bus. Any Master can access a Slave on either layer. In addition, the double-layer AMBA bus is transparent to the AHB Masters and AHB Slaves, so they require no modification.

Figure 2 shows the internal structure of the double-layer AMBA bus designed in this paper. The bus is configured to support 16 Masters and 16 Slaves, with 8 Masters and 8 Slaves on each layer.

The double-layer AMBA bus itself consists of three parts: the bus decoder, pre-arbiter and multiplexers (MUX) of Layer1; the bus decoder, pre-arbiter and multiplexers of Layer2; and the core arbiter of the entire bus. The first two parts are essentially identical, and the core arbiter is the heart of the double-layer bus architecture. The principle is as follows: the eight masters of each layer are first decoded and arbitrated within their own layer, the results are sent to the core arbiter, and the core arbiter then determines the state transitions and how each MUX selects the data and control flows.

Design of the internal components

With reference to Figure 2 and the AMBA protocol, the following introduces the components of this double-layer AMBA bus. Since the components of Layer2 are similar in design and function to those of Layer1, only Layer1 is described.

* Layer1 decoder

This decoder uses a centralized address decoding scheme, which helps make peripheral devices more portable. The decoder receives the address driven by the Master currently occupying the bus, generates a chip select signal for each Slave, and sends it to the core arbiter. Each chip select signal is generated by comparing the address with the base address of the corresponding slave.

It is worth noting that since each master can access any one of Slave0~Slave15, the decoder must be able to generate at least 16 chip select signals.

In addition, the decoder of each layer should provide a default chip select signal for the default slave. The default slave responds in one of two ways: to an IDLE or BUSY transfer it gives an OKAY response; to a NONSEQUENTIAL or SEQUENTIAL transfer it gives an ERROR response.
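The decoding rules above can be sketched as a small behavioral model. The address map here (16 slaves, each with a 1 MB region on a 256 MB boundary) is purely hypothetical and is not the memory map of the design in this article; the HTRANS encodings follow the AMBA AHB specification.

```python
from typing import List

NUM_SLAVES = 16
REGION_SIZE = 0x0010_0000  # hypothetical 1 MB region per slave
# Hypothetical base addresses: Slave i starts at i * 0x1000_0000.
SLAVE_BASES = [i << 28 for i in range(NUM_SLAVES)]

# HTRANS transfer-type encodings from the AMBA AHB specification.
IDLE, BUSY, NONSEQ, SEQ = 0, 1, 2, 3


def decode(haddr: int) -> List[bool]:
    """Return 17 select lines: indices 0..15 for Slave0..Slave15,
    index 16 for the default slave (asserted when nothing else matches)."""
    selects = [base <= haddr < base + REGION_SIZE for base in SLAVE_BASES]
    selects.append(not any(selects))
    return selects


def default_slave_response(htrans: int) -> str:
    """OKAY for IDLE/BUSY transfers, ERROR for NONSEQ/SEQ transfers."""
    return "OKAY" if htrans in (IDLE, BUSY) else "ERROR"


print(decode(0x2000_0004).index(True))  # 2: address falls in Slave2's region
print(decode(0x2010_0000).index(True))  # 16: unmapped, default slave selected
print(default_slave_response(NONSEQ))   # ERROR
```

Exactly one select line is asserted for any address, which is what lets an access to an unmapped address still complete with a well-defined response.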

* Layer1 pre-arbiter

The pre-arbiter of Layer1 receives the bus request signal (HBusReq) issued by each master together with the signal indicating that a bus switch is required, applies a bus arbitration algorithm to determine which master may occupy the bus, and generates the control signal for M to S MUX1. Unlike in a single-layer AMBA, the HMaster_layer1 and BusHgrant_layer1 signals it generates are sent to the core arbiter rather than directly to each master. Likewise, the current slave response it receives comes from the core arbiter.

The pre-arbiter can use one of two bus switching algorithms: fixed priority or round-robin priority. The AMBA specification leaves the choice of algorithm open, so it can be selected according to actual needs. This component uses the fixed priority algorithm, in which Master0 has the lowest priority and Master7 the highest.
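A minimal behavioral sketch of the fixed priority scheme described above, with Master7 highest and Master0 lowest. The function name and the list-of-booleans representation of the request lines are illustrative assumptions, not the RTL of the design.

```python
from typing import List, Optional


def fixed_priority_grant(hbusreq: List[bool]) -> Optional[int]:
    """Return the index of the granted master, or None if nobody requests.

    The highest-numbered requesting master wins, so with 8 request lines
    Master7 has the highest priority and Master0 the lowest.
    """
    for master in range(len(hbusreq) - 1, -1, -1):
        if hbusreq[master]:
            return master
    return None


# Master0 and Master3 both request the bus: Master3 wins.
requests = [True, False, False, True, False, False, False, False]
print(fixed_priority_grant(requests))  # 3
```

In a real AHB arbiter the no-request case would normally grant a default master rather than nobody; returning None here simply keeps the sketch short.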

* Layer1 multiplexers

Layer1 has four MUXs in total: M to S MUX1, M to S MUX2, S to M MUX1 and S to M MUX2. M to S MUX1 uses the output of the Layer1 pre-arbiter as its select signal, selects one of the 8 groups of master bus signals, and outputs it to the core arbiter, to Layer1's M to S MUX2 and to Layer2's M to S MUX2. M to S MUX2 takes its control signal from the core arbiter; it selects one of the two groups of bus signals and sends it to the corresponding Slave in Layer1. S to M MUX1 receives the select signal output by the core arbiter and selects one of the 8 groups of bus response signals (Hready, Hresp, Hrdata) of Layer1 to send to the core arbiter, to S to M MUX2 of Layer1 and to S to M MUX2 of Layer2. S to M MUX2 outputs one group of bus response signals to all the masters of Layer1.

* Core arbiter

The main function of the core arbiter is to get the initial state from the chip select signals output by the decoders of the two layers; then decide when to switch the state based on the response signal and transmission status of the slave; at the same time, according to its own state, output the corresponding signal to the relevant MUX as a control signal, output Hmaster and BusHgrant signals to the masters of each layer, and output the corresponding slave response signal to the pre-arbitrators of the two layers.

Because masters on different layers may try to access slaves on the same layer at the same time, the core arbiter also needs a bus switching algorithm. Since at most two masters compete for the bus at the core arbiter, a simple round-robin priority algorithm is sufficient.
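With only two contenders, round-robin reduces to alternating between the layers whenever both request the same target. The sketch below illustrates that idea; the class name, the choice to let Layer1 win the first tie, and updating the history on uncontested grants are all assumptions, not details taken from the design.

```python
from typing import Optional


class TwoWayRoundRobin:
    """Round-robin choice between the Layer1 and Layer2 masters."""

    def __init__(self) -> None:
        # Assumption: start as if Layer2 won last, so Layer1 wins the first tie.
        self.last_winner = 2

    def arbitrate(self, layer1_req: bool, layer2_req: bool) -> Optional[int]:
        """Return 1 or 2 for the winning layer, or None if neither requests."""
        if layer1_req and layer2_req:
            # Contention: the layer that did not win last time goes first.
            winner = 2 if self.last_winner == 1 else 1
        elif layer1_req:
            winner = 1
        elif layer2_req:
            winner = 2
        else:
            return None
        self.last_winner = winner
        return winner


arbiter = TwoWayRoundRobin()
print(arbiter.arbitrate(True, True))   # 1
print(arbiter.arbitrate(True, True))   # 2
print(arbiter.arbitrate(True, True))   # 1
```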

The main part of the core arbiter is a state machine, which consists of seven states:

IDLE: Enter this state after the system is reset to complete the initial assignment of some data;
M1S1M2S2: Layer1 Master communicates with Layer1 Slave, Layer2 Master communicates with Layer2 Slave, that is, the two-layer bus runs in parallel;
M1S2M2S1: Layer1 Master communicates with Layer2 Slave, Layer2 Master communicates with Layer1 Slave;
M1S1M2S1: Layer1 Master communicates with Layer1 Slave, and Layer2 Master is waiting for communication with Layer1 Slave;
M1S2M2S2: Layer1 Master communicates with Layer2 Slave, and Layer2 Master is waiting for communication with Layer2 Slave;
M2S1M1S1: Layer2 Master communicates with Layer1 Slave, and Layer1 Master is waiting for communication with Layer1 Slave;
M2S2M1S2: Layer2 Master communicates with Layer2 Slave, and Layer1 Master is waiting for communication with Layer2 Slave.

Transitions between these seven states are determined by the chip select signals from the decoders of the two layers, the control signals driven by the Master currently occupying the bus, and the response signals of the Slave communicating with that Master. For state transitions involving the ARM Master, its three-stage pipeline must be taken into account and appropriate wait cycles inserted.
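The relationship between the targets of the two layers and the six active states can be sketched as follows. This models only the choice of state when both layers have an active master; it does not model IDLE handling, slave responses, or the wait cycles mentioned above. The function name and the layer1_wins flag (standing in for the outcome of the core arbiter's own arbitration) are illustrative assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    M1S1M2S2 = "M1S1M2S2"  # both layers run in parallel, each on its own slaves
    M1S2M2S1 = "M1S2M2S1"  # both run in parallel, each on the other layer's slaves
    M1S1M2S1 = "M1S1M2S1"  # Layer1 master on a Layer1 slave, Layer2 master waits
    M1S2M2S2 = "M1S2M2S2"  # Layer1 master on a Layer2 slave, Layer2 master waits
    M2S1M1S1 = "M2S1M1S1"  # Layer2 master on a Layer1 slave, Layer1 master waits
    M2S2M1S2 = "M2S2M1S2"  # Layer2 master on a Layer2 slave, Layer1 master waits


def select_state(m1_target_layer: int, m2_target_layer: int,
                 layer1_wins: bool) -> State:
    """Pick the state from the slave layer (1 or 2) each master addresses.

    layer1_wins is only consulted when both masters target the same layer.
    """
    if m1_target_layer not in (1, 2) or m2_target_layer not in (1, 2):
        raise ValueError("target layer must be 1 or 2")
    if m1_target_layer != m2_target_layer:
        # No conflict: both transfers proceed in parallel.
        return State.M1S1M2S2 if m1_target_layer == 1 else State.M1S2M2S1
    if m1_target_layer == 1:
        return State.M1S1M2S1 if layer1_wins else State.M2S1M1S1
    return State.M1S2M2S2 if layer1_wins else State.M2S2M1S2


print(select_state(1, 2, True).name)   # M1S1M2S2
print(select_state(2, 2, False).name)  # M2S2M1S2
```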

In addition, the core arbiter contains one level of input latches, which hold the address and control signals driven by the waiting Master.

Design results and establishment of test platform

The design described above was written in Verilog at the RTL level, and Synopsys's VCS tool was used for functional simulation. To verify the design, the single-layer AMBA in the architecture of Figure 1 was replaced with a double-layer AMBA, and the LCDC Master and LCDC Slave were moved to the second layer. A simple MC Slave was also added to the second layer, with SRAM and SDRAM memory models attached to it externally. The SDRAM stores the display memory data for the LCDC Master, and the rest of the structure is unchanged (as shown in Figure 3). A set of test programs written in ARM assembly language was prepared to configure the system. Once this test program is running, three Masters (the ARM Master, the DMA Master and the LCDC Master) access the bus continuously.

The results show that the design is correct: the ARM Master can configure the Slaves of Layer2; while the LCDC Master on Layer2 reads data from the MC Slave on the same layer, a Master on Layer1 can access a Slave on Layer1; and the other Masters of Layer1 can also request the Layer2 bus to access the external memory attached to Layer2.

In addition, to measure how much of the bus the LCD controller occupies, an Hmaster Monitor submodule is attached to the AHB to count the number of clock cycles for which each Master holds the bus.
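The counting performed by such a monitor can be sketched in a few lines: sample the Hmaster value once per clock cycle and tally the cycles per master. The trace and the master numbering below are invented for illustration and are not measurements from the design.

```python
from collections import Counter
from typing import Dict, Iterable


def count_bus_cycles(hmaster_trace: Iterable[int]) -> Counter:
    """Count clock cycles per master, given one Hmaster sample per cycle."""
    return Counter(hmaster_trace)


def occupancy(hmaster_trace: Iterable[int]) -> Dict[int, float]:
    """Fraction of the sampled cycles for which each master held the bus."""
    trace = list(hmaster_trace)
    if not trace:
        return {}
    counts = Counter(trace)
    return {master: cycles / len(trace) for master, cycles in counts.items()}


# Hypothetical 10-cycle trace: 0 = ARM, 1 = DMA, 2 = LCDC (numbering assumed).
trace = [0, 0, 2, 0, 1, 2, 0, 0, 2, 0]
print(count_bus_cycles(trace)[2])  # 3 cycles held by master 2
print(occupancy(trace)[2])         # 0.3
```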

Comparison of the two bus modes

The single-layer and double-layer AMBA bus designs are compared in two respects.

First, consider the bus occupancy of the LCD controller. As Table 1 shows, with a single-layer AMBA bus the LCD controller occupies a relatively large share of the bus bandwidth: for a typical 320×240, 16 bpp TFT color screen, it occupies 16.3% of the bus bandwidth. With a double-layer AMBA bus, apart from the bus cycles the ARM Master uses to configure the two Slaves, the LCD controller occupies bandwidth only on Layer2.

Second, the synthesis results show that the double-layer AMBA occupies a larger area. With the APB module included, the single-layer AMBA synthesizes to 17,000 gates, while the double-layer AMBA synthesizes to 18,500 gates, an increase of about 9%. Both support 16 Masters and 16 Slaves. The gate-level netlists were synthesized with Synopsys's Design Compiler using a TSMC 0.25 µm standard cell library.

In a practical application of the double-layer AMBA bus, the MC Slave of Layer1 can be connected to non-volatile memory and the MC Slave of Layer2 to volatile memory. The instruction area can then be placed on Layer1 and the data area on Layer2, so that the ARM Master fetches instructions on Layer1 while the LCD controller reads video memory on Layer2. Both operations consume a large amount of bus bandwidth; placing them on separate layers greatly reduces the time each Master spends waiting for the bus and improves the overall bus bandwidth.

Conclusion

The ARM7TDMI is widely used in SoC design, but because it has no cache of its own it must access external memory frequently. If other modules that require a large data bandwidth are integrated on the same chip, system performance drops considerably. The double-layer AMBA bus greatly improves the bus bandwidth and provides a more flexible system architecture at the cost of a slight increase in area. This is of real significance and practical value for SoC chips based on the ARM7TDMI and for other SoC chips with similar architectures.
