Scatter/Gather DMA for Embedded Systems with PCIe

Publisher: 清新微笑 · Last updated: 2012-04-03 · Keywords: PCIe
The demands of each new generation of interconnect continue to put pressure on hardware and software design. Rising quality-of-service (QoS) requirements, data-channel isolation, graceful recovery, and data integrity are all metrics worth considering. PCIe is an interconnect technology that can meet these requirements.

When implementing a PCIe interface in an FPGA, designers must consider how data is transferred in order to meet targets for system efficiency, jitter, data-clock overhead, and total end-to-end bandwidth. Combining a scatter/gather DMA (SGDMA) with a PCIe interface can greatly help hardware and software designers meet their design requirements by offloading some of the data-transfer burden from the local processor and amortizing hardware delays across multiple channels. This article discusses some of the advantages of using an FPGA-based SGDMA in combination with PCIe.

Most new DMA controllers support scatter/gather functionality, where the hardware is responsible for transferring data that resides in non-contiguous physical memory, eliminating the need for the host processor to handle the data transfer between modules. This is achieved by "chaining" DMA buffer descriptors together, each of which contains all the information required for SGDMA to automatically perform the data transfer. SGDMA is most commonly used when dealing with operating systems that support virtual memory.
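The descriptor chaining described above can be sketched as a linked list of transfer records that the engine walks without CPU involvement. The field names and layout below are illustrative assumptions, not tied to any particular controller:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical SGDMA buffer descriptor: each entry holds everything the
 * engine needs for one transfer, plus a link to the next descriptor. */
struct sgdma_desc {
    uint32_t src_addr;        /* physical source address              */
    uint32_t dst_addr;        /* physical destination address         */
    uint32_t length;          /* bytes to transfer                    */
    uint32_t control;         /* e.g. IRQ-on-done, byte-swap flags    */
    struct sgdma_desc *next;  /* next descriptor in chain, or NULL    */
};

/* Total payload the engine will move once started: the hardware walks
 * the chain on its own, one descriptor per non-contiguous region. */
static size_t sgdma_chain_bytes(const struct sgdma_desc *d)
{
    size_t total = 0;
    for (; d != NULL; d = d->next)
        total += d->length;
    return total;
}
```

Once the driver writes the address of the first descriptor to the engine and starts it, the host is free until a completion interrupt fires.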

Direct Memory Access Implementation

There are several ways to measure the benefit of a DMA controller: how many clock cycles does it actually reclaim? How much does it reduce the complexity of the associated device driver? How much does the DMA engine increase overall system throughput by "hiding" the overhead of data movement? Features supported by an enhanced DMA engine include:

* In-line packet buffering

* Simultaneous transfers on the local and front-side buses

* Deferred (split) transaction processing

* Terminal transaction processing interception

* Provides independent arbitration for each bus connected to the DMA controller

In-line data buffers are useful when the front-side and local buses are heavily loaded and the DMA controller must contend for and win bus ownership before it can transfer data.

Figure 1: The SGDMA controller in a virtual memory environment.

Using block memory in an FPGA always involves tradeoffs: how much temporary buffering does bus occupancy demand? Do other functions in the FPGA need that memory? What is the latency cost of intermediate data storage? Where possible, an in-line packet buffer can reduce system latency by letting one bus read data while the "store" bus is not transferring.

For example, when the PCIe receive and transmit virtual channel (VC) buffers are small, letting one side of the DMA controller move data into the VC buffers, or from the VC buffers into local packet memory, frees flow-control credits more quickly, reducing the interdependence between PCIe link utilization and local memory utilization. In addition, while the I/O-bus side of the DMA controller is communicating with the PCIe core, the front-side bus interface can simultaneously move data that the PCIe core will transmit next into the packet buffer. This adds some latency, but it beats stalling the DMA controller on bus contention, and the concurrency improves the overall latency of data transfer.

In PCIe, read operations are implemented as split transactions. When a read request is issued, the data to be transferred on the PCIe link is not immediately available. In this case, a DMA controller that supports deferred processing automatically gives up bus ownership and allows any other active DMA channel to compete for the bus.
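The deferred-processing behavior above amounts to a simple arbitration rule: skip any channel stalled on an outstanding read completion and grant the bus to the next ready one. A minimal round-robin sketch (the function and its inputs are illustrative, not from any specific controller):

```c
/* Pick the next DMA channel with work it can do now, skipping channels
 * stalled on an outstanding (deferred) PCIe read completion.
 * `stalled[ch]` is nonzero while channel ch waits; `last` is the
 * channel granted on the previous cycle. Returns -1 if all are stalled. */
static int next_ready_channel(const int stalled[], int n, int last)
{
    for (int i = 1; i <= n; i++) {
        int ch = (last + i) % n;
        if (!stalled[ch])
            return ch;
    }
    return -1; /* every channel is waiting on a completion */
}
```

A real arbiter would also weight channels by traffic class, but the principle is the same: a deferred read never idles the bus while another channel has data ready.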

Figure 2: PCIe with DMA high-level architecture.

Some of the most demanding applications of the PCIe serial protocol are those requiring real-time or near-real-time data transfer. Such systems, voice and video processing for example, rely on compute engines that must meet strict per-block processing deadlines. These hard constraints burden not only the software on the processing chip, which must handle data with low latency, but also the stream-processing hardware. One way to achieve lower per-packet latency and higher system throughput is to break data blocks into smaller packets before they enter the system backplane for transmission. This allows smaller receive buffers and ensures that no single data engine is overloaded.

When using smaller data packets, the following issues must be addressed:

1. PCIe has a relatively fixed overhead for all transaction layer packets (TLPs); more packets require more processing overhead

2. Smaller packets take less time to process, which increases the rate of concurrent interrupts raised by the data engine and the PCIe interface.

3. To maintain proper load balancing, smaller packets increase the load on the local processor

4. The local host processor must spend more time to generate the data transfer TLP used by the PCIe protocol

The above points mean the local host processor loses more clock cycles that could otherwise serve other functions. Smaller packets therefore reduce processing latency at the physical interface, but at the cost of loading the end system, which can lower overall performance. Although PCIe TLP processing overhead cannot be eliminated entirely, a multi-channel scatter/gather DMA engine that splits block requests into smaller, variably sized packets under a flexible arbitration scheme, with split-transaction support designed into the DMA controller itself, can amortize the latency across the traffic classes (TCs) on each channel. In addition, an IP core that generates and terminates PCIe TLPs in hardware can improve software efficiency for small-TLP traffic.
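The fixed-overhead tradeoff in point 1 can be made concrete with a rough link-efficiency model. The overhead figure below (about 20 bytes per TLP for framing, sequence number, a 3-DW header, and LCRC) is a typical value assumed for illustration, not taken from this article:

```c
/* Rough PCIe link-efficiency model: each TLP carries a roughly fixed
 * per-packet overhead regardless of payload size, so small payloads
 * waste a larger fraction of raw link bandwidth. */
static double tlp_efficiency(unsigned payload_bytes)
{
    const unsigned overhead = 20;  /* assumed typical per-TLP bytes */
    return (double)payload_bytes / (double)(payload_bytes + overhead);
}
```

With a 20-byte overhead, a 128-byte payload uses the link at 128/148, roughly 86%, while a 32-byte payload drops to about 62%, which is why packet size is a genuine tuning knob rather than "smaller is always better".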

Figure 3: PCIe read/write transactions with DMA.

In PCIe, memory reads (MRd) are non-posted and executed as split transactions, while memory writes (MWr) are posted. For a read, the requester first sends an MRd TLP asking the completer to return a block of data (the maximum read request is typically 512 bytes) and then waits for the completion data. A PCIe MWr TLP carries its full payload (typically up to 128 bytes). An MRd TLP therefore consumes bandwidth in the transmit direction just as an MWr TLP does. By allocating more resources to the MWr channel, the transmit (Tx) pipeline is kept full while the receive (Rx) pipeline fills with completion TLPs returned for MRd requests; see Figure 2.

Benefits in software execution time

A feature-rich scatter/gather DMA controller can also reduce software development effort and CPU execution time by implementing functions that would otherwise require complex algorithms and/or large numbers of interrupts:

* All state-of-the-art processors and operating systems, including the best real-time operating systems (RTOS), use MMUs and virtual memory; even the kernel uses virtual addresses. This means the DMA cannot access a buffer in system memory linearly: although the buffer appears contiguous to the process, it is actually scattered across physical memory in PAGE_SIZE blocks. A scatter/gather DMA helps the processor and the software driver by letting each buffer descriptor map to one physical page of memory. Without a scatter/gather list in the buffer descriptors, the driver could move only one page at a time before restarting the DMA for the next page, which severely hurts system performance.
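The page-splitting step a driver performs before programming the engine can be sketched as follows. `virt_to_phys` here is a stand-in for the OS address-translation service, and the `sg_entry` layout is an illustrative assumption:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* One scatter/gather entry per physical page (hypothetical layout). */
struct sg_entry {
    uintptr_t phys_addr;
    uint32_t  length;
};

/* Split `len` bytes starting at virtual address `va` into page-bounded
 * entries, translating each piece to its physical address -- the list a
 * driver would hand to the SGDMA engine. Returns the entry count. */
static size_t build_sg_list(uintptr_t va, size_t len,
                            uintptr_t (*virt_to_phys)(uintptr_t),
                            struct sg_entry *out, size_t max_entries)
{
    size_t n = 0;
    while (len > 0 && n < max_entries) {
        uint32_t in_page = PAGE_SIZE - (uint32_t)(va % PAGE_SIZE);
        uint32_t chunk = len < in_page ? (uint32_t)len : in_page;
        out[n].phys_addr = virt_to_phys(va);
        out[n].length = chunk;
        va  += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

Note that a buffer only slightly larger than a page can still need three entries when it starts mid-page, which is exactly the case a one-page-at-a-time driver handles worst.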

* A system typically consists of multiple execution threads, all of which may need to transfer data. If the DMA has multiple channels and each thread is assigned its own channel, system performance can improve through the added parallelism.

* If the CPU operates in little-endian mode and transfers TCP/IP packets to a MAC, software routines are usually needed to swap bytes into network order (big-endian). A DMA that performs this conversion in hardware on the fly reduces software complexity and shortens system design time.

* For efficiency, the PCIe bus interface should be as wide as possible (64 bits), but many peripherals have narrower data paths (16 or 32 bits). If the DMA handles the bus-width adaptation, PCIe interface performance is unaffected: the DMA can perform double or quadruple accesses to the narrower peripheral before assembling the full-width transfer presented to the PCIe interface logic.
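The width adaptation in that last point amounts to gathering several narrow accesses into one wide bus word. A sketch, with `read_fn` standing in for the 16-bit peripheral access (an illustrative name, not an API from the article):

```c
#include <stdint.h>

/* Gather four 16-bit peripheral reads into one 64-bit word, least
 * significant half-word first -- the kind of width adaptation the DMA
 * performs before presenting data to the 64-bit PCIe interface logic. */
static uint64_t pack4x16(uint16_t (*read_fn)(void))
{
    uint64_t word = 0;
    for (int i = 0; i < 4; i++)
        word |= (uint64_t)read_fn() << (16 * i);
    return word;
}
```

Doing this in the DMA keeps the wide bus fully utilized instead of letting each narrow peripheral access occupy a full-width cycle.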

* It provides an adaptation layer that converts packet-based TLP data streams into parallel-bus accesses to linear memory. This is a major benefit for designers reusing IP modules that already have memory interfaces (address bus, data bus, control lines): they can simply attach the IP module to the bus served by the DMA.

Conclusion

By leveraging advanced data-engine controllers with payload storage, such as a scatter/gather DMA controller, FPGA system designers can address the throughput and latency shortfalls commonly found in both the hardware and software of PCIe-based system designs.
