Measuring and Improving Embedded System Performance Using the System Bus

Publisher: Xiangtan. Latest update time: 2011-07-21

Understanding system bus activity can help development engineers significantly improve the performance of embedded applications. In the past, monitoring the system bus was difficult because embedded processors lacked the necessary hardware and software features. Understanding how an application behaves at the system level is critical to using system resources effectively, including external memory, DMA controllers, arbitration, and the system bus interconnects.


The Blackfin BF54x series processors provide performance counters, implemented as a set of counter registers, that help application developers understand how an application behaves at the system level. With that understanding, developers can apply system optimization techniques to improve performance and reduce power consumption.


This article introduces the various configurations of the performance counters and gives hardware and software interface examples for using them on Blackfin processors. It also presents methods for improving performance in some typical application scenarios.

Definition of the performance counters
In a typical real-world application, multiple resources, such as the core, peripheral DMA, and MDMA (memory-to-memory DMA), can access external memory and several system buses simultaneously. The performance counters provide a way to capture the number of external memory bank accesses, page misses, bus grants, and bus turnarounds. Making effective use of the data from these registers can significantly improve system resource utilization.


Table 1 shows the performance counter registers provided by the Blackfin BF54x series processors, with a brief description of each.


The bank read/write count registers, the bank activation count registers, and the bus turnaround count registers can be used to improve the layout of the application's code and data in external memory. The grant count registers (EBIU_DDRGCx) help to define a suitable system arbitration strategy and achieve high system throughput.


External memory latency can be reduced by exploiting the temporal and spatial locality of the code and data items mapped to external memory. Capturing an application's spatial and temporal locality normally requires recording traces of code and data objects during program execution. For some simple applications, however, the key data from the performance counters is enough to reveal poorly mapped code and data items in external memory.


Below we explore some application scenarios and some simple optimization techniques that use the information obtained from the performance counters.

Example Usage
This section shows how to analyze and interpret the information obtained from the performance counters and, on that basis, how to use simple optimization techniques to improve application performance.


1 Example 1
In this example, multiple data buffers are mapped to external memory, and memory DMA channels copy the contents of one set of buffers to another. There are four buffers in this experiment, each 32KB in size. All buffers are mapped to Bank0 of the DDR and placed consecutively starting from address 0x0. Figure 1 shows the default layout of the four buffers in external memory. Two memory DMA channels use auto-buffer mode to continuously transfer the contents of two buffers to the other two. The three-step process below uses the information obtained from the performance counters to choose suitable system optimizations, raising performance to 1.5 times that of the original system.

Figure 1 Performance counter data without optimization


Step 1 Baseline System Performance
We use the average throughput of the system to quantify its performance. The average throughput is calculated as follows:
Average throughput = (total number of data bytes read from and written to DDR memory) / (measurement interval in seconds)
The measurement interval for system bus activity is set using the core timer, which is programmed to generate an interrupt when the interval chosen for the experiment expires. The timer is started just before the memory DMA channels are enabled, and the channels are disabled in the core timer ISR. The amount of data transferred is measured with a counter in the DMA channel's interrupt service routine: an interrupt is generated for each completed buffer transfer, and the counter is incremented by 1 each time the DMA ISR is called. Because all memory DMA channels run in auto-buffer mode, channel interrupt latency does not need to be included in the final throughput calculation. The timer interrupt latency is also excluded from this measurement because it is small.
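The throughput formula can be evaluated on the host from the DMA ISR counter value. The following is a minimal sketch rather than the article's own code. It assumes that each completed buffer transfer reads one 32KB buffer from DDR and writes one 32KB buffer back, so it counts twice toward DDR traffic. The function name and the sample counter reading are hypothetical.

```python
BUFFER_SIZE_BYTES = 32 * 1024  # each buffer in Example 1 is 32KB


def average_throughput(buffers_transferred, interval_seconds,
                       buffer_size_bytes=BUFFER_SIZE_BYTES):
    """Return DDR throughput in bytes per second.

    Assumption: every completed buffer transfer reads buffer_size_bytes
    from DDR and writes the same amount back, so it is counted twice.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    total_bytes = 2 * buffer_size_bytes * buffers_transferred
    return total_bytes / interval_seconds


# Hypothetical reading: 1000 DMA interrupts counted in a 0.5 s window.
throughput = average_throughput(1000, 0.5)
print(f"{throughput / 1e6:.1f} MB/s")
```

Comparing this figure with the theoretical DDR bandwidth of the target board gives the utilization discussed next.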


Table 2 shows the baseline performance of this system. It shows that even such a simple system uses only a small portion of the total available bandwidth. The performance counters let us see the activity on the system bus and help us understand why performance is low. Based on this information, we can apply specific optimization techniques to improve performance.


Step 2 Using the Performance Counters
In cases like this, external memory latency is often the cause of poor throughput. We first look at the total number of DDR read/write accesses and the total number of off-page DDR accesses.


As can be seen in Figure 1, the read and write count registers show that all accesses go to a single bank (bank 0), and the number of page activations is 25% of the total number of accesses. This indicates that DMA accesses within the bank have poor spatial locality, because the buffers are mapped to different pages of bank 0. Since the source and destination buffers are on different pages, each DMA access results in an off-page access.
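The two observations above, which banks are active and what fraction of accesses needed a page activation, can be derived from a single snapshot as sketched below. The counter values are hypothetical and were chosen only to mirror the situation described (all traffic in bank 0, activations at 25% of accesses). The list layout with one entry per bank is likewise an assumption, not a register map.

```python
def page_activation_ratio(bank_reads, bank_writes, activations):
    """Fraction of DDR accesses that required a page activation."""
    total_accesses = sum(bank_reads) + sum(bank_writes)
    if total_accesses == 0:
        return 0.0
    return activations / total_accesses


def active_banks(bank_reads, bank_writes):
    """Indices of banks that saw at least one read or write."""
    return [bank for bank, (reads, writes)
            in enumerate(zip(bank_reads, bank_writes))
            if reads + writes > 0]


# Hypothetical per-bank counter values for eight internal banks.
reads = [40000, 0, 0, 0, 0, 0, 0, 0]
writes = [40000, 0, 0, 0, 0, 0, 0, 0]
activations = 20000

print(active_banks(reads, writes))
print(page_activation_ratio(reads, writes, activations))
```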


Step 3 Improving Performance
Placing the buffers in different DDR banks reduces off-page accesses. When the buffers are in different banks, an off-page access occurs only when a channel crosses a page boundary. The DDR controller of the Blackfin BF54x allows up to 8 internal DDR banks to be open at the same time, so the four buffers can each be mapped to a different bank.
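The effect of this remapping can be illustrated with a toy open-page model. This is a sketch under stated assumptions, not a model of the actual BF54x controller. The page size, bank size, access width, and burst length are hypothetical. Banks are assumed to occupy contiguous address ranges, and each bank keeps one page open at a time. Two channels take turns, each issuing a read burst from its source buffer followed by a write burst to its destination buffer.

```python
PAGE_SIZE = 1024               # bytes per DDR page (assumed)
BANK_SIZE = 8 * 1024 * 1024    # bytes per internal bank (assumed)
BUFFER_SIZE = 32 * 1024        # 32KB buffers, as in Example 1
WORD = 4                       # bytes per access (assumed)
BURST = 4                      # accesses per DMA burst (assumed)


def bank_and_row(addr):
    """Map an address to (bank, row) under the assumed contiguous layout."""
    return addr // BANK_SIZE, (addr % BANK_SIZE) // PAGE_SIZE


def count_activations(copies):
    """Simulate interleaved MDMA copies.

    copies: list of (src_base, dst_base) pairs, one per channel.
    Returns (total_accesses, page_activations).
    """
    open_row = {}  # bank -> row currently open in that bank
    accesses = activations = 0
    for offset in range(0, BUFFER_SIZE, WORD * BURST):
        for src, dst in copies:          # channels take turns
            for base in (src, dst):      # read burst, then write burst
                for i in range(BURST):
                    bank, row = bank_and_row(base + offset + i * WORD)
                    if open_row.get(bank) != row:
                        open_row[bank] = row   # off-page access
                        activations += 1
                    accesses += 1
    return accesses, activations


KB = 1024
# Default layout: four buffers placed consecutively in bank 0.
same_bank = [(0 * KB, 64 * KB), (32 * KB, 96 * KB)]
# Optimized layout: each buffer at the start of its own bank.
split_banks = [(0 * BANK_SIZE, 2 * BANK_SIZE), (1 * BANK_SIZE, 3 * BANK_SIZE)]

for name, layout in (("same bank", same_bank), ("split banks", split_banks)):
    total, acts = count_activations(layout)
    print(f"{name}: {acts} activations in {total} accesses")
```

In this model the single-bank layout reopens a page on every burst, while the split layout activates a page only when a buffer crosses a page boundary.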


2 Example 2
In Example 1 above, only a few resources (two MDMA channels) access a single DDR memory bank, and the system behavior does not change over time. A single snapshot of the performance counters is therefore enough to understand the system bus activity and capture spatial locality. In a more realistic system, multiple resources (the core and multiple DMA channels) may access multiple DDR memory banks and the system bus, so the DDR data access pattern changes rapidly over short time intervals. In such cases it is difficult to capture the spatial locality and system behavior with just one snapshot of the performance counters. Bus activity must instead be captured at multiple points during application execution.


To illustrate this, consider a case where the bus activity over time interval T shows balanced accesses to all banks but a high proportion of off-page accesses, while the bus activity recorded over smaller time intervals (T1 and T2, where T1+T2=T) shows that accesses to the banks are uneven (see Figure 2). If the buffer layout can be optimized separately for time intervals T1 and T2, system performance can be improved significantly.
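A small numeric illustration shows how a balanced picture over T can hide uneven access in T1 and T2. The per-bank counts and the imbalance measure (busiest bank divided by the average) are hypothetical choices made for this sketch.

```python
def imbalance(bank_accesses):
    """Busiest bank's count divided by the average count.

    1.0 means perfectly balanced; larger values mean more skew.
    """
    total = sum(bank_accesses)
    if total == 0:
        return 1.0
    return max(bank_accesses) / (total / len(bank_accesses))


# Hypothetical per-bank access counts for four banks.
t1 = [900, 100, 900, 100]              # interval T1
t2 = [100, 900, 100, 900]              # interval T2
t = [a + b for a, b in zip(t1, t2)]    # whole interval T = T1 + T2

for name, counts in (("T", t), ("T1", t1), ("T2", t2)):
    print(name, counts, round(imbalance(counts), 2))
```

Over T every bank shows the same count, yet within T1 and T2 two banks carry most of the traffic, which is the information needed to choose a layout for each interval.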

Figure 2 System bus activity during time intervals T, T1, and T2 (T = T1 + T2)


The difficulty lies in finding time intervals where system resources are accessed in a consistent manner, so that the same set of optimization techniques can be used. This may require multiple iterations of analyzing the application.

Experimental setup for periodically capturing performance counter data
This section describes the experimental setup for periodically recording performance counter data. As shown in Figure 3, a PC acts as the host and collects data from the Blackfin through the background telemetry channel (BTC), which communicates over the JTAG interface. A data logging program runs on the PC and periodically sends BTC commands to the Blackfin processor. In response, the Blackfin processor sends a snapshot of the performance counters to the host.

Figure 3 Experimental setup for periodically capturing performance counter data


The Blackfin processor uses a general-purpose timer to generate periodic interrupts. On each timer interrupt, the contents of the performance counters are read out and stored in memory. When the host issues a request, the stored counter data is sent to the PC through the BTC channel. The BTC channel supports data transfer rates of up to 3Mbps.
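On the host, the periodic snapshots have to be turned into per-interval counts before they can be analyzed. The sketch below assumes the counters are free-running and 32 bits wide, so that a masked subtraction handles a single wraparound. The counter width, the snapshot format (a dict per snapshot), and the counter names are assumptions for illustration. If the counters are cleared after every read, each snapshot is already a per-interval count and this step is unnecessary.

```python
COUNTER_BITS = 32                      # assumed counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def interval_deltas(snapshots):
    """Convert cumulative snapshots into per-interval counts.

    snapshots: list of dicts mapping a counter name to its cumulative value,
    ordered by capture time. Returns one dict per interval between
    consecutive snapshots. The masked subtraction tolerates one wraparound.
    """
    deltas = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        deltas.append({name: (curr[name] - prev[name]) & COUNTER_MASK
                       for name in curr})
    return deltas


# Hypothetical snapshots; bank0_reads wraps between the first two captures.
captured = [
    {"bank0_reads": 0xFFFFFFF0, "page_activations": 100},
    {"bank0_reads": 0x00000010, "page_activations": 140},
    {"bank0_reads": 0x00000050, "page_activations": 150},
]

for delta in interval_deltas(captured):
    print(delta)
```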


Now consider an example program in which multiple buffers are mapped to DDR memory and memory DMA channels are used to transfer data between these buffers.

Figure 4 Example of multiple data transfers in external DDR memory


In this example, MDMA0 transfers 4KB of data from srcBuffer0 to dstBuffer0, and MDMA1 transfers 4KB of data from srcBuffer1 to dstBuffer1. Initially, only MDMA0 is enabled. When the MDMA0 transfer completes, MDMA1 is enabled, and vice versa. As a result, the number of accesses to each memory bank varies from one time interval to the next. A single snapshot of the performance counters gives the values shown in Figure 5. From these values alone, it is not possible to tell which memory bank or which DMA channel is responsible for the page misses. Sampling the performance counters periodically can help us find the cause of the low bandwidth utilization.

Figure 5 A snapshot of the performance counter data in Example 2

We use the experimental setup described above to record the performance counter data. The data collected on the PC can be used to plot the correlation between page misses and memory bank accesses, and it can be analyzed with a numerical tool such as MATLAB. The plot shows that most page misses are caused by accesses to memory bank 0.
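The correlation itself is straightforward to compute once per-interval counts are available. The sketch below uses a plain Pearson correlation coefficient as one possible measure. The article does not state which measure was used, and the sample series are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-interval counts from six sampling intervals.
page_misses = [50, 10, 48, 12, 52, 9]
bank_accesses = {
    0: [500, 100, 480, 120, 520, 90],
    1: [100, 500, 120, 480, 90, 520],
}

for bank, series in bank_accesses.items():
    print(f"bank {bank}: r = {pearson(page_misses, series):.2f}")
```

A coefficient close to 1 for a bank suggests that the intervals with many page misses are the intervals in which that bank is busy, which makes the bank a candidate for remapping.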

Figure 6 Correlation between Page Miss and DDR Bankx Access

Figure 7 Memory bank access and page miss

Figure 8 Unoptimized layout of Example 2

Figure 9 Buffer layout optimization


Using the linker description file (LDF) or the Blackfin processor memory window, you can determine which buffers are mapped to these banks and remap them to other banks, thereby reducing page misses.


Bus Grant Count Registers
The Bus Grant Count Registers (EBIU_DDRGCx) help us understand the resource utilization of the individual system buses (EAB and DEBx buses). In practice, this will help determine the bus arbitration strategy and ensure efficient DMA and external memory resource sharing.


The Blackfin BF54x processor family provides programmable priority settings for external buses. In addition, the processors map several peripheral DMAs and memory DMAs to multiple DMA controllers, providing additional flexibility for efficient resource management.


Consider an example in which video data is captured from a camera. The compression algorithm runs on the Blackfin, and the compressed video data is sent from the Blackfin to a PC over USB. Observation shows that USB throughput is quite low and the compressed video cannot be transmitted in real time. One possible reason is that USB transfers are being held up by other, higher-priority activity in the system. The grant count registers offer a quick way to check this. As before, we sample the performance counters over a period of time. The data from several time intervals reveals that the DEB2 bus (USB bus) is competing with the EAB bus (core bus), which limits USB access to DDR memory.
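A check of this kind can be sketched as follows: for each sampling interval, compute each bus's share of the grants and flag the intervals in which the bus of interest receives very little. The grant counts, the 5% threshold, and the use of DEB0 and DEB1 as labels for the other DEBx buses are hypothetical. Only the EAB and DEB2 roles come from the description above.

```python
def grant_shares(grants):
    """Return each bus's fraction of all grants in one interval."""
    total = sum(grants.values())
    if total == 0:
        return {bus: 0.0 for bus in grants}
    return {bus: count / total for bus, count in grants.items()}


def starved_intervals(intervals, bus, threshold):
    """Indices of intervals in which `bus` got less than `threshold` share."""
    return [index for index, grants in enumerate(intervals)
            if grant_shares(grants)[bus] < threshold]


# Hypothetical grant counts per interval, keyed by bus name.
intervals = [
    {"EAB": 9000, "DEB0": 500, "DEB1": 300, "DEB2": 200},
    {"EAB": 2000, "DEB0": 500, "DEB1": 500, "DEB2": 2000},
    {"EAB": 9500, "DEB0": 200, "DEB1": 200, "DEB2": 100},
]

print(starved_intervals(intervals, "DEB2", threshold=0.05))
```

Intervals flagged this way, combined with a high EAB share in the same intervals, would support the explanation that core traffic is crowding out the USB bus.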


By default, the core has a higher priority than the USB interface for external memory access. In this application, however, the real-time requirements of the USB transfers take precedence over those of the core. We must therefore use one of the bus arbitration registers to raise the priority of the USB relative to the core.


The bus grant count register can also be used in conjunction with the memory bank access register to understand which bus is most active in a given time interval and find correlations between page misses and bus activity in a given time interval. Information such as the memory bank access count, the bus causing the page miss, and which resources are utilizing the bus can reveal inefficient code or data memory layouts.
