Understanding system bus activity can help development engineers significantly improve the performance of embedded applications. In the past, monitoring system bus activity was difficult because embedded processors lacked sophisticated hardware and software support for it. Understanding application behavior at the system level is critical to using system resources effectively, including external memory, DMA controllers, arbitration, and system bus interconnects.
The Blackfin BF54x series processors provide performance counters (index registers) that help application developers understand the behavior of applications at the system level. With that understanding, developers can apply system optimization techniques to improve performance and reduce power consumption.
This article introduces the various configurations of the performance counter registers and provides hardware and software interface examples for using them on Blackfin processors. It also presents methods for improving performance in some typical application scenarios.
Definition of the performance counter registers
In a typical real-world application, multiple resources, such as the processor core, peripheral DMA, and MDMA (memory-to-memory DMA), can access external memory and several system buses simultaneously. The performance counter registers provide a way to capture the number of external memory bank accesses, page misses, bus traffic, and bus turnarounds. Effective use of the data obtained from these registers can significantly improve system resource utilization.
Table 1 lists the performance counter registers provided by the Blackfin BF54x series processors, with a brief description of each.
We can use the bank read/write count registers, bank activation count registers, and bus turnaround registers to improve the layout of the application's code and data in external memory. The grant count registers (EBIU_DDRGCx) help to define the system arbitration strategy properly and achieve high system throughput.
We can reduce external memory latency by exploiting the temporal and spatial locality of code and data items mapped to external memory. Typically, capturing the spatial and temporal locality of an application requires recording traces of code and data objects during program execution. For some simple applications, however, key data from the performance counter registers is enough to reveal poorly mapped code and data items in external memory.
Below we explore some application scenarios and some simple optimization techniques that use the information obtained from these registers.
Example Usage
This section shows how to analyze and interpret the information obtained from the performance counter registers and, on that basis, how to use simple optimization techniques to improve application performance.
1 Example 1
In this example, multiple data buffers are mapped to external memory, and memory DMA channels are used to copy the contents of one set of buffers to another set. There are four buffers in this experiment, each 32KB in size. All buffers are mapped to Bank0 of DDR and placed consecutively starting from address 0x0. Figure 1 shows the default layout of the four buffers in external memory. Two memory DMA channels run in auto-buffer mode and continuously transfer the contents of two of the buffers to the other two. The three-step process below uses the information obtained from the performance counter registers to choose suitable system optimization techniques, which can improve performance to 1.5 times that of the original system.
Figure 1 Performance counter register data without optimization
Step 1 Basic System Performance
We use the average throughput of the system to quantify the performance of the system. The average throughput is calculated as follows:
Average throughput = "Total number of data bytes read and written to DDR memory" / second
The measurement interval for system bus activity is set using the core timer, which is programmed to generate an interrupt when the interval chosen for the experiment elapses. The timer is started just before the memory DMA channels are enabled, and the channels are disabled in the core timer ISR. The amount of data transferred is measured with a counter in the interrupt service routine of each DMA channel. An interrupt is generated for each buffer transfer, and the counter is incremented by 1 each time the DMA ISR is called. Since all memory DMA channels run in auto-buffer mode, channel interrupt latency does not need to be included in the final throughput calculation. The timer interrupt latency is also left out of this measurement because it is small.
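The measurement described above can be turned into a throughput figure with a few lines of host-side arithmetic. The following is a minimal sketch, not code from the original experiment: the function name, its parameters, and the sample numbers are hypothetical. It assumes that every copied byte is counted twice, once as a DDR read and once as a DDR write, because both the source and destination buffers live in DDR.

```python
def average_throughput(isr_count, buffer_bytes, interval_s, ddr_passes=2):
    """Return DDR bytes read plus written per second.

    isr_count    -- number of DMA ISR calls (one per completed buffer transfer)
    buffer_bytes -- size of each transferred buffer in bytes
    interval_s   -- measurement interval set by the timer, in seconds
    ddr_passes   -- assumption: 2, since each byte is read once and written once
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    total_bytes = isr_count * buffer_bytes * ddr_passes
    return total_bytes / interval_s


# Hypothetical numbers for illustration only, not measurements from the article.
rate = average_throughput(isr_count=1000, buffer_bytes=32 * 1024, interval_s=0.5)
print(f"{rate / 1e6:.1f} MB/s")
```

Comparing this figure with the theoretical DDR bandwidth of the board gives the utilization discussed in the baseline step.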
Table 2 shows the baseline performance of this system. From this table, we can see that even with such a simple system, we are using only a small portion of the total available bandwidth. The performance counter registers let us see the activity of the system bus and help us understand why the performance is low. Based on this information, we can apply suitable optimization techniques to improve performance.
Step 2 Using the Performance Counter Registers
In cases like this, external memory latency is often the cause of poor throughput. We will first look at the total number of DDR read/write accesses and the total number of off-page DDR accesses.
As can be seen in Figure 1, the read and write access count registers show that all accesses go to a single bank (bank 0), and the number of page activations is 25% of the total number of accesses. This means that the DMA accesses within the bank have poor spatial locality, because the buffers are mapped to different pages of bank 0. Since the source and destination buffers are on different pages, there is an off-page access for each DMA access.
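The two observations in this step, which banks are active and what fraction of accesses trigger a page activation, can be derived from one snapshot of the counters. The sketch below assumes the snapshot has already been read into plain Python lists; the function names and the sample values are hypothetical and are chosen only to reproduce the 25% ratio mentioned above.

```python
def activation_ratio(bank_reads, bank_writes, activations):
    """Fraction of all DDR accesses that required a page activation."""
    total = sum(bank_reads) + sum(bank_writes)
    if total == 0:
        return 0.0
    return activations / total


def active_banks(bank_reads, bank_writes):
    """Indices of the banks that saw at least one read or write."""
    return [i for i, (r, w) in enumerate(zip(bank_reads, bank_writes)) if r + w > 0]


# Hypothetical snapshot: all traffic lands in bank 0.
reads = [4000, 0, 0, 0, 0, 0, 0, 0]
writes = [4000, 0, 0, 0, 0, 0, 0, 0]
print(active_banks(reads, writes), activation_ratio(reads, writes, 2000))
```

A single active bank combined with a high activation ratio is the pattern that points to buffers sharing a bank but not a page.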
Step 3 Improve performance
Placing the buffers in different DDR banks reduces off-page accesses. With each buffer in its own bank, an off-page access occurs only when a channel crosses a page boundary. The DDR controller of the Blackfin BF54x allows up to 8 internal DDR banks to be open at the same time, so the four buffers can be mapped to different banks.
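The effect of this remapping can be illustrated with a toy open-page model. This is a simplified sketch under stated assumptions, not a model of the actual BF54x DDR controller: the bank size, page size, burst size, and the linear address-to-bank mapping are all hypothetical, and each bank is assumed to keep exactly one page open.

```python
def copy_accesses(src, dst, length, burst):
    """Yield the addresses touched by one copy: read src, write dst, burst by burst."""
    for offset in range(0, length, burst):
        yield src + offset
        yield dst + offset


def count_activations(addresses, bank_size, page_size):
    """Count page activations when each bank keeps a single page open."""
    open_page = {}  # bank index -> currently open page in that bank
    activations = 0
    for addr in addresses:
        bank, offset = divmod(addr, bank_size)
        page = offset // page_size
        if open_page.get(bank) != page:
            open_page[bank] = page
            activations += 1
    return activations


# Assumed geometry for illustration only.
BANK_SIZE = 8 * 1024 * 1024
PAGE_SIZE = 1024
BURST = 32
LENGTH = 32 * 1024

# Source and destination back to back in one bank, as in the default layout.
same_bank = count_activations(copy_accesses(0x0, LENGTH, LENGTH, BURST), BANK_SIZE, PAGE_SIZE)
# Destination moved to the start of the next bank.
split_banks = count_activations(copy_accesses(0x0, BANK_SIZE, LENGTH, BURST), BANK_SIZE, PAGE_SIZE)
print(same_bank, split_banks)
```

In this model the shared-bank layout reopens a page on every access, while the split layout activates a page only when a stream crosses a page boundary, which is the behavior the remapping aims for.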
2 Example 2
In Example 1 above, only a few resources (two MDMA channels) access a single DDR memory bank, and the system behavior does not change over time. A single snapshot of the performance counter registers is therefore enough to understand the activity of the system bus and capture spatial locality. In a more realistic system, multiple resources (the core and several DMA channels) may access multiple DDR memory banks and the system bus, so the DDR access pattern changes rapidly over short time intervals. In these cases, it is difficult to capture the spatial locality and system behavior with just one snapshot of the registers. The bus activity must instead be captured at multiple points during application execution.
To illustrate this, consider a case where the bus activity over a time interval T shows balanced accesses to all banks but a high proportion of off-page accesses, while the bus activity recorded over smaller time intervals (T1 and T2, where T1 + T2 = T) shows that the accesses to the banks are uneven (see Figure 2). If the buffer layout can be optimized for time intervals T1 and T2 separately, system performance can improve significantly.
Figure 2 System bus activity during time intervals T, T1, and T2 (T = T1 + T2)
The difficulty lies in finding time intervals where system resources are accessed in a consistent manner, so that the same set of optimization techniques can be used. This may require multiple iterations of analyzing the application.
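The difference between a balanced aggregate and uneven sub-intervals is easy to see with a small calculation. The sketch below uses made-up access counts for two banks; the function name and the numbers are hypothetical.

```python
def bank_shares(counts):
    """Return each bank's share of the total accesses in one interval."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical per-bank access counts for two sub-intervals.
t1 = [900, 100]
t2 = [100, 900]
t = [a + b for a, b in zip(t1, t2)]  # whole interval T = T1 + T2

print("T :", bank_shares(t))
print("T1:", bank_shares(t1))
print("T2:", bank_shares(t2))
```

Over T the two banks look evenly loaded, yet each sub-interval is dominated by a single bank, so a layout decision based only on the aggregate would miss the imbalance.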
Experimental setup for periodically capturing performance counter register data
This section introduces the experimental setup for periodically recording performance counter register data. As shown in Figure 3, a PC acts as the host and collects data from the Blackfin through the background telemetry channel (BTC), which communicates over the JTAG interface. The data logging program runs on the PC and periodically sends BTC requests to the Blackfin processor. In response, the Blackfin processor sends a snapshot of the performance counter registers to the host.
Figure 3 Experimental setup for periodically capturing performance counter register data
The Blackfin processor uses a general-purpose timer to generate interrupts periodically. On each timer interrupt, the contents of the performance counter registers are read out and stored in memory. When the host issues a request, the stored register data is sent to the PC through the BTC channel. The BTC channel supports data transfer rates of up to 3 Mbps.
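Once the periodic snapshots reach the PC, the activity in each interval is the difference between consecutive snapshots. The following host-side sketch assumes the counters are free-running 32-bit values that wrap around, and that each snapshot arrives as a dictionary keyed by a counter name; both the 32-bit width and the key names are assumptions made for illustration.

```python
COUNTER_MODULUS = 1 << 32  # assumption: free-running 32-bit counters


def interval_deltas(snapshots, modulus=COUNTER_MODULUS):
    """Turn cumulative counter snapshots into per-interval activity.

    The modulo handles a counter that wrapped between two snapshots,
    provided it wrapped at most once.
    """
    deltas = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        deltas.append({name: (curr[name] - prev[name]) % modulus for name in curr})
    return deltas


# Hypothetical snapshots; the "bank0" counter wraps in the second interval.
snapshots = [
    {"bank0": 4294967000, "page_miss": 10},
    {"bank0": 4294967290, "page_miss": 25},
    {"bank0": 4, "page_miss": 60},
]
print(interval_deltas(snapshots))
```

If the registers are cleared after each read instead, the snapshots are already per-interval values and this step is unnecessary.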
Now consider an example program in which multiple buffers are mapped to DDR memory and memory DMA channels are used to transfer data between these buffers.
Figure 4 Example of multiple data transfers in external DDR memory
In this example, MDMA0 transfers 4KB of data from srcBuffer0 to dstBuffer0 and MDMA1 transfers 4KB of data from srcBuffer1 to dstBuffer1. Initially, only MDMA0 is enabled. When the MDMA0 transfer completes, MDMA1 is enabled, and vice versa. As a result, the number of accesses to each memory bank varies from one time interval to the next. A single snapshot of the performance counter registers for this example is shown in Figure 5. From this snapshot alone, it is not possible to tell which memory bank caused the page misses or which DMA channel is responsible for them. Observing the registers periodically helps us find the cause of the low bandwidth utilization.
Figure 5 A snapshot of the performance counter register data in Example 2
We use the experimental setup described above to record the performance counter register data. The data collected on the PC can be used to plot the correlation between page misses and memory bank accesses and can be analyzed with a mathematical tool such as MATLAB. From the graph, it can be seen that most page misses are caused by accesses to memory bank 0.
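The correlation between page misses and bank accesses can also be computed directly from the logged intervals. The sketch below implements the Pearson correlation coefficient in plain Python as one possible measure; the article does not state which measure was used, and the per-interval numbers are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-interval data: page misses and accesses to two banks.
page_misses = [120, 15, 130, 10, 125]
bank_accesses = {
    "bank0": [1000, 100, 1100, 90, 1050],
    "bank1": [100, 1000, 90, 1100, 120],
}
for name, series in bank_accesses.items():
    print(name, round(pearson(page_misses, series), 3))
```

A coefficient close to 1 for one bank suggests that the buffers mapped to that bank are the first candidates for remapping.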
Figure 6 Correlation between Page Miss and DDR Bankx Access
Figure 7 Memory bank access and page miss
Figure 8 Unoptimized layout of Example 2
Figure 9 Buffer layout optimization
Using the Linker Description File (.ldf) or the Blackfin processor memory window, you can determine which buffers are mapped to these banks and remap them to other banks individually, thereby reducing page misses.
Bus Grant Count Registers
The Bus Grant Count Registers (EBIU_DDRGCx) help us understand the resource utilization of the individual system buses (EAB and DEBx buses). In practice, this will help determine the bus arbitration strategy and ensure efficient DMA and external memory resource sharing.
The Blackfin BF54x processor family provides programmable priority settings for external buses. In addition, the processors map several peripheral DMAs and memory DMAs to multiple DMA controllers, providing additional flexibility for efficient resource management.
Consider an example in which video data is captured from a camera. The compression algorithm runs on the Blackfin, and the compressed video data is sent from the Blackfin to a PC over USB. Observation shows that the USB throughput is quite low and cannot carry the compressed video data in real time. One possible reason is that USB transfers are stalled by other, higher-priority activity in the system. The grant count registers offer a quick way to verify this. As before, we observe the register data over a period of time. The data from several time intervals reveals that the DEB2 bus (the USB bus) is competing with the EAB bus (the core bus), which limits USB access to the DDR memory.
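A check of this kind can be automated on the logged grant counts. The sketch below flags the intervals in which one bus receives less than a chosen share of all grants. It assumes the per-interval grant counts are already available as dictionaries keyed by bus name; the function name, the threshold, and the sample data are hypothetical.

```python
def starved_intervals(grant_deltas, bus, min_share):
    """Return the indices of intervals where `bus` got less than `min_share` of all grants."""
    flagged = []
    for index, grants in enumerate(grant_deltas):
        total = sum(grants.values())
        if total and grants.get(bus, 0) / total < min_share:
            flagged.append(index)
    return flagged


# Hypothetical per-interval grant counts for the core bus and the USB bus.
grant_deltas = [
    {"EAB": 500, "DEB2": 500},
    {"EAB": 950, "DEB2": 50},
    {"EAB": 0, "DEB2": 0},  # idle interval, ignored
    {"EAB": 900, "DEB2": 100},
]
print(starved_intervals(grant_deltas, "DEB2", min_share=0.2))
```

Intervals flagged for DEB2 while EAB dominates would support the suspicion that core accesses are crowding out USB traffic.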
By default, the core has a higher priority than the USB interface for external memory access. In this application, however, the real-time requirements of the USB traffic matter more than those of the core. We must therefore use one of the bus arbitration registers to raise the priority of USB relative to the core.
The bus grant count register can also be used in conjunction with the memory bank access register to understand which bus is most active in a given time interval and find correlations between page misses and bus activity in a given time interval. Information such as the memory bank access count, the bus causing the page miss, and which resources are utilizing the bus can reveal inefficient code or data memory layouts.