When implementing a PCIe interface in an FPGA, designers must consider how data is moved so that requirements for system efficiency, jitter, clocking overhead, and total end-to-end bandwidth are met. Combining a scatter/gather DMA (SGDMA) with the PCIe interface can go a long way toward helping hardware and software designers meet those requirements by offloading part of the data-transfer burden from the local processor and amortizing hardware delays across multiple channels. This article discusses some of the advantages of using an FPGA-based SGDMA in combination with PCIe.
Most new DMA controllers support scatter/gather functionality, in which the hardware transfers data that resides in non-contiguous physical memory without the host processor having to shepherd each piece of the transfer. This is achieved by "chaining" DMA buffer descriptors together, each of which contains all the information the SGDMA needs to perform its portion of the transfer automatically. SGDMA is most commonly used with operating systems that support virtual memory.
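As a rough illustration of what such a chain looks like to software, the C sketch below uses an invented descriptor layout (the field names are not taken from any particular controller); a real engine would typically store the physical address of the next descriptor rather than a CPU pointer.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical SGDMA buffer descriptor: each entry describes one
 * physically contiguous chunk and links to the next descriptor,
 * so the engine can walk the chain without CPU intervention.
 * (A real controller would hold the next descriptor's physical
 * address, not a virtual pointer.) */
struct sgdma_descriptor {
    uint64_t src_phys;               /* physical source address        */
    uint64_t dst_phys;               /* physical destination address   */
    uint32_t length;                 /* bytes to move for this chunk   */
    uint32_t control;                /* e.g. IRQ-on-complete, last bit */
    struct sgdma_descriptor *next;   /* next descriptor, NULL = end    */
};

/* Link an array of descriptors into a chain the engine can follow. */
static void sgdma_chain(struct sgdma_descriptor *desc, size_t count)
{
    if (count == 0)
        return;
    for (size_t i = 0; i + 1 < count; i++)
        desc[i].next = &desc[i + 1];
    desc[count - 1].next = NULL;     /* terminate the chain */
}
```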
Direct Memory Access Implementation
There are several ways to measure the benefit of a DMA controller: how many CPU clock cycles are actually reclaimed, how much the complexity of the associated device driver is reduced, and how much the DMA engine raises overall system throughput by "hiding" the overhead of moving data around. Features supported by an enhanced DMA engine include:
* In-line packet buffering
* Simultaneous transfers on the local and front-side buses
* Deferred (delayed) transaction handling
* Endpoint transaction interception
* Independent arbitration for each bus connected to the DMA controller
In-line data buffers are useful when the front-side and local buses are heavily loaded and the DMA controller must contend for and win bus ownership before it can transfer data.
Figure 1: The SGDMA controller in a virtual memory environment.
Using block memory in an FPGA is always a trade-off: how much temporary buffering is needed because of bus occupancy, whether other functions in the FPGA also need that memory, and how much extra latency the intermediate data storage adds. Where possible, an in-line packet buffer can claw back some system latency by letting one bus read data while the "store" bus is not transferring.
For example, when the PCIe receive and transmit virtual channel (VC) buffers are small, letting one side of the DMA controller move data into the VC buffers, or from the VC buffers into local packet memory, can free up flow-control credits sooner and thereby decouple PCIe link utilization from local memory utilization. In addition, while the I/O-bus side of the DMA controller is talking to the PCIe core, the front-side bus interface can simultaneously stage data in the packet buffer for the PCIe core to transmit next. This does add some latency, but it is better than stalling the DMA controller on bus contention, and the concurrency inside the DMA controller reduces the overall latency of the data transfer.
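The concurrency described above can be pictured in software terms as double buffering: one bus fills a packet buffer while the other drains the previously filled one. The sketch below is purely illustrative; the buffer size, function names, and sequential calls are all assumptions standing in for hardware that runs both sides at once.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PKT_BUF_SIZE 2048  /* assumed size of the in-line packet buffer */

static uint8_t pkt_buf[2][PKT_BUF_SIZE];  /* ping-pong packet buffers */

/* Stand-in for the front-side bus filling a packet buffer. */
static size_t frontside_fill(uint8_t *buf, size_t max)
{
    memset(buf, 0xA5, max);   /* pretend we fetched 'max' bytes */
    return max;
}

/* Stand-in for the PCIe side draining a previously filled buffer. */
static void pcie_drain(const uint8_t *buf, size_t len)
{
    printf("drained %zu bytes from buffer %p\n", len, (const void *)buf);
}

int main(void)
{
    size_t len[2] = { 0, 0 };
    int fill = 0;

    len[fill] = frontside_fill(pkt_buf[fill], PKT_BUF_SIZE);
    for (int i = 0; i < 4; i++) {          /* a few iterations for illustration */
        int drain = fill;
        fill ^= 1;
        /* In the DMA hardware these two steps proceed concurrently on
         * separate buses; calling them back-to-back here only shows the
         * data flow, not the parallelism itself. */
        len[fill] = frontside_fill(pkt_buf[fill], PKT_BUF_SIZE);
        pcie_drain(pkt_buf[drain], len[drain]);
    }
    pcie_drain(pkt_buf[fill], len[fill]);  /* drain the final buffer */
    return 0;
}
```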
In PCIe, read operations are handled as split transactions. When a read request is issued, the data to be transferred on the PCIe link is not immediately available. In this situation, a DMA controller that supports deferred transactions automatically relinquishes bus ownership and lets any other active DMA channel compete for the bus.
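One way to picture this behaviour is a round-robin arbiter that simply skips any channel still waiting for its read completion. The following sketch is a hypothetical model, not any specific controller's arbitration logic.

```c
#include <stdbool.h>

#define NUM_CHANNELS 4  /* assumed number of DMA channels */

struct dma_channel {
    bool active;              /* channel has work queued             */
    bool waiting_completion;  /* MRd issued, data not yet returned   */
};

/* Pick the next channel allowed to use the bus: round-robin over
 * active channels, skipping those stalled on a split-read response. */
int arbitrate(const struct dma_channel ch[NUM_CHANNELS], int last)
{
    for (int i = 1; i <= NUM_CHANNELS; i++) {
        int c = (last + i) % NUM_CHANNELS;
        if (ch[c].active && !ch[c].waiting_completion)
            return c;
    }
    return -1;  /* nothing ready; the bus stays idle */
}
```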
Figure 2: PCIe with DMA high-level architecture.
Some of the most demanding applications of the PCIe serial protocol are those requiring real-time or near-real-time data transfer. Such systems, for example voice and video processing, rely on computation engines that must meet strict per-block processing deadlines. These hard constraints place an additional burden not only on the software running on those processing devices, which must move data while keeping latency down, but also on the streaming hardware. One way to achieve lower per-packet latency and higher system throughput is to break data blocks into smaller packets before they enter the system backplane for transmission. This permits smaller receive buffers and ensures that no single data engine is overloaded.
When using smaller data packets, the following issues must be addressed:
1. PCIe has a relatively fixed overhead for every transaction layer packet (TLP); more packets mean more processing overhead.
2. Smaller packets take less time to process, which increases the rate of concurrent interrupts raised by the data engines and the PCIe interface.
3. Maintaining proper load balancing with smaller packets increases the load on the local processor.
4. The local host processor must spend more time generating the TLPs used by the PCIe protocol for each transfer.
All of these points mean the local host processor gives up more clock cycles that could otherwise be spent on other work. Smaller packets therefore reduce the processing delay at the physical interface, but at the cost of loading the end systems, which can lower overall system performance. The PCIe TLP processing overhead cannot be eliminated entirely, but a multi-channel scatter/gather DMA engine that splits block requests into variable-sized packets under a flexible arbitration scheme, with split-transaction support designed into the DMA controller itself, amortizes that delay across each traffic class (TC) on each channel. In addition, an IP core that generates and terminates PCIe TLPs in hardware improves software efficiency by relieving the processor of that work.
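To see why packet size matters, assume roughly 24 bytes of per-TLP overhead on a Gen1/Gen2 link (1-byte STP, 2-byte sequence number, a 4DW header, 4-byte LCRC, 1-byte END; a 3DW header saves 4 bytes). A quick back-of-the-envelope calculation of raw link efficiency versus payload size:

```c
#include <stdio.h>

/* Per-TLP overhead assumed here: 1 B STP + 2 B sequence + 16 B 4DW header
 * + 4 B LCRC + 1 B END = 24 bytes (Gen1/Gen2 framing). */
#define TLP_OVERHEAD_BYTES 24

static double link_efficiency(unsigned payload_bytes)
{
    return (double)payload_bytes / (payload_bytes + TLP_OVERHEAD_BYTES);
}

int main(void)
{
    unsigned sizes[] = { 32, 64, 128, 256, 512 };
    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("payload %4u B -> efficiency %.1f%%\n",
               sizes[i], 100.0 * link_efficiency(sizes[i]));
    return 0;
}
```

With these assumptions, a 128-byte payload yields roughly 84% raw TLP efficiency, while a 512-byte payload reaches about 95%; the smaller the packets, the more of the link is spent on headers and the more TLPs the host must generate or service.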
Figure 3: PCIe read/write transactions with DMA.
In PCIe, memory reads (MRd) are non-posted and execute as split transactions, while memory writes (MWr) are posted. For a read, the requester first sends an MRd TLP asking the completer to return a block of data (the largest read request is typically 512 bytes) and then waits for the completions carrying that data. A PCIe MWr TLP carries its full payload (typically up to 128 bytes) with it. The MRd TLP therefore consumes bandwidth in the transmit direction just as the MWr TLP does. By allocating more resources to the MWr channel, the transmit (Tx) pipeline is kept full while the receive (Rx) pipeline fills with the completion TLPs returned in response to MRd requests; see Figure 2.
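As a sketch of the read side under the typical settings quoted above (512-byte Max_Read_Request_Size, 128-byte Max_Payload_Size), a large block read is split into several MRd TLPs, each answered by multiple completion TLPs on the receive side. The issue_mrd stub is a placeholder for the real request logic.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_READ_REQ 512u   /* typical Max_Read_Request_Size            */
#define MAX_PAYLOAD  128u   /* typical Max_Payload_Size for completions */

/* Stub standing in for the logic that queues one MRd TLP. */
static void issue_mrd(uint64_t addr, uint32_t len)
{
    /* Each request comes back as roughly len / MAX_PAYLOAD completion TLPs. */
    printf("MRd addr=0x%llx len=%u (expect %u completion TLPs)\n",
           (unsigned long long)addr, len,
           (len + MAX_PAYLOAD - 1) / MAX_PAYLOAD);
}

/* Split a large block read into MRd requests no larger than the
 * Max_Read_Request_Size (real hardware must also avoid crossing
 * 4 KB address boundaries, omitted here for brevity). */
static void read_block(uint64_t addr, uint32_t total)
{
    while (total != 0) {
        uint32_t chunk = (total > MAX_READ_REQ) ? MAX_READ_REQ : total;
        issue_mrd(addr, chunk);
        addr  += chunk;
        total -= chunk;
    }
}

int main(void)
{
    read_block(0x80000000ull, 4096);  /* a 4 KB read becomes 8 MRd TLPs */
    return 0;
}
```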
Benefits in software execution time
A feature-rich scatter/gather DMA controller can also reduce software development effort and CPU execution time by implementing functions that would otherwise require complex algorithms and/or large numbers of interrupts:
* All modern processors and operating systems, including the better real-time operating systems (RTOSes), use MMUs and virtual memory; even the kernel works with virtual addresses. This means the DMA cannot simply walk a system-memory buffer linearly: although the buffer appears contiguous to the process, it is actually scattered across physical memory in PAGE_SIZE blocks. A scatter/gather DMA helps the processor and the software driver by letting each buffer descriptor map to one physical page of memory (see the sketch after this list). Without a scatter/gather list in the local buffer descriptors, the driver can move only one page at a time before restarting the DMA for the next page, which severely hurts system performance.
* A system typically consists of multiple execution threads, each of which may need to transfer data. If the DMA has multiple channels and a thread can be assigned to each channel, the added parallelism improves system performance.
* If the CPU runs in little-endian mode and transfers TCP/IP packets to a MAC, software routines are usually needed to swap bytes into network order (big-endian). A DMA that can perform this conversion in hardware on the fly reduces software complexity and shortens system design time.
* For efficiency, the PCIe bus interface should be as wide as possible (64 bits), but many peripherals have only narrow data paths (16 or 32 bits). If the DMA handles the bus-width adaptation, PCIe interface performance is unaffected: the DMA can perform two or four accesses to the narrower peripheral before assembling a full-width transfer to the PCIe interface logic.
* The DMA provides an adaptation layer that converts packet-based TLP data streams into parallel-bus accesses to linear memory. This is a big win for designers reusing IP blocks that already have memory interfaces (address bus, data bus, control lines): they can simply attach the IP block to the bus served by the DMA.
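Tying the first bullet back to the descriptor format sketched earlier, a driver might break a virtually contiguous buffer into page-sized pieces, one descriptor per physical page. The virt_to_phys helper below is a hypothetical placeholder for whatever address-translation or page-pinning service the operating system actually provides.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical placeholder for the OS-specific virtual-to-physical
 * translation (and page pinning) service. */
extern uint64_t virt_to_phys(const void *vaddr);

/* Hypothetical descriptor, as in the earlier sketch. */
struct sgdma_descriptor {
    uint64_t src_phys;
    uint64_t dst_phys;
    uint32_t length;
    uint32_t control;
    struct sgdma_descriptor *next;
};

/* Build one descriptor per physical page of a virtually contiguous
 * buffer; 'dst' is assumed to be a flat device address. Returns the
 * number of descriptors used. */
size_t build_sg_list(struct sgdma_descriptor *desc, size_t max_desc,
                     const uint8_t *buf, size_t len, uint64_t dst)
{
    size_t n = 0;
    while (len != 0 && n < max_desc) {
        /* Bytes remaining in the current page. */
        size_t in_page = PAGE_SIZE - ((uintptr_t)buf & (PAGE_SIZE - 1));
        size_t chunk = (len < in_page) ? len : in_page;

        desc[n].src_phys = virt_to_phys(buf);
        desc[n].dst_phys = dst;
        desc[n].length   = (uint32_t)chunk;
        desc[n].control  = 0;
        desc[n].next     = (n + 1 < max_desc) ? &desc[n + 1] : NULL;

        buf += chunk;
        dst += chunk;
        len -= chunk;
        n++;
    }
    if (n > 0)
        desc[n - 1].next = NULL;  /* terminate the chain */
    return n;
}
```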
Conclusion
By leveraging an advanced payload-moving data engine such as a scatter/gather DMA controller, FPGA system designers can overcome the throughput and latency shortcomings, in both hardware and software, that are common in PCIe-based system designs.