Embedded multi-core processors are now widely used in embedded devices, but embedded software development largely remains in the traditional single-core mode and does not fully exploit the performance of multi-core processors. Parallel program optimization is used to some extent on PC platforms, but it is still rare on embedded platforms. Moreover, embedded multi-core processors differ considerably from PC multi-core processors, so PC-oriented parallel optimization methods cannot be applied to embedded platforms directly. This paper studies parallel optimization from two aspects, task parallelism and cache optimization, and explores methods for parallelizing programs on embedded multi-core processors.
1 Embedded Multi-core Processor Architecture
Embedded multi-core processors come in symmetric and asymmetric structures. In a symmetric structure the internal cores are identical; this structure is widely used in PC multi-core processors. In an asymmetric structure the internal cores differ; this structure is common in the embedded field, the most typical combination being a general-purpose embedded processor plus a DSP core. The embedded multi-core processor explored in this article adopts a symmetric structure, which allows the same code to execute in parallel on different processors.
Figure 1 ARM SMP processor structure
In the current embedded field the most widely used processors are ARM processors, so the ARM dual-core processor OMAP4430 is used as the research object. The ARM symmetric multi-processing (SMP) structure is shown in Figure 1. Following the principle of program locality, each processor has a private local memory, most commonly a first-level cache (L1 Cache). Multiple processors, however, must communicate with one another, so common ARM processors use a shared second-level cache (L2 Cache) for this purpose. In the symmetric multi-processor structure, all processors (usually a multiple of 2) have the same hardware structure and equal access to system resources. More importantly, since all processors can access the same memory space, any process or thread can run on any processor in the shared memory area, which is what makes parallelization of a program possible.
To perform parallel optimization on an embedded multi-core platform, the following issues need to be considered:
① The performance of a parallel program is limited by its serial portion (Amdahl's law), so performance does not keep improving as the number of parallel threads increases (see the formula after this list);
② Compared with PC processors, embedded multi-core processors have slower buses and smaller caches, which causes large amounts of data to be copied back and forth between memory and cache. Cache friendliness should therefore be considered during parallel optimization;
③ The number of threads executing in parallel should be less than or equal to the number of physical processors. Too many threads compete for processor resources, degrading parallel performance.
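For reference, point ① is captured by Amdahl's law (a standard result, not stated explicitly in the original): if a fraction p of a program can be parallelized across n processors, the speedup is

$$ S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}} $$

As n grows the speedup approaches 1/(1 - p), so the serial portion sets a hard ceiling on parallel performance.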
2 OpenMP Parallel Optimization
2.1 Introduction to the OpenMP working principle
OpenMP is a cross-platform multi-threaded parallel programming interface based on the shared-memory model. The main thread spawns a series of child threads and maps tasks onto them for execution. These child threads execute in parallel, and the runtime environment assigns them to different physical processors. By default, each thread executes the code of the parallel region independently. Work-sharing constructs can be used to divide the work so that each thread executes only its assigned part of the code. In this way, OpenMP supports both task parallelism and data parallelism.
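As a minimal sketch (not from the original article), the following C fragment shows a parallel region, in which every thread runs the same code, and a work-sharing construct, which splits the loop iterations among the threads:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel        /* every thread executes this region */
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        #pragma omp for         /* work-sharing: iterations divided among threads */
        for (int i = 0; i < 8; i++)
            printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}
```

Compiled with gcc's -fopenmp option, the runtime creates one thread per core by default unless told otherwise.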
Figure 2 Task parallel model
The task parallel model creates a series of independent threads, each running one task, with the threads independent of one another, as shown in Figure 2. OpenMP implements task allocation with the sections directive and the task directive. Each thread can independently run a different code region, and task nesting and recursion are supported. Once a task is created, it may be executed by an idle thread in the thread pool (whose size equals the number of physical processor cores).
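A brief sketch of the task directive (illustrative only; the work() function is a hypothetical placeholder): one thread creates the tasks, and idle threads in the pool pick them up:

```c
#include <stdio.h>
#include <omp.h>

void work(int id)   /* hypothetical task body */
{
    printf("task %d ran on thread %d\n", id, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel    /* create the thread team (pool) */
    #pragma omp single      /* a single thread creates the tasks */
    {
        for (int i = 0; i < 4; i++) {
            #pragma omp task firstprivate(i)  /* each task may run on any idle thread */
            work(i);
        }
        #pragma omp taskwait                  /* wait for all the tasks to finish */
    }
    return 0;
}
```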
Data parallelism, also called data-level parallelism, divides the data processed by a task into blocks and processes the blocks in parallel, as shown in Figure 3. In C, the for loop is the construct best suited to data parallelism (see the sketch after Figure 3).
Figure 3 Data parallel model
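A minimal data-parallel sketch (illustrative; the scale() function is an assumption, not from the paper): the loop iterations are independent, so OpenMP divides the index range among the threads:

```c
#include <omp.h>

/* Multiply every element of a[] by k; each thread handles a block of indices. */
void scale(int a[], int n, int k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= k;
}
```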
2.2 Principle of Quick Sort Algorithm
Quick sort is a recursive divide-and-conquer algorithm. The key step is determining the pivot: elements of the sequence smaller than the pivot are placed to its left, and elements larger than the pivot are placed to its right. Once the scan is complete, quick sort is called recursively on the parts to the left and right of the pivot.
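A minimal serial sketch of the algorithm just described (the Lomuto partition scheme is used here as one common choice; the paper does not show its own partitioning code):

```c
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Partition a[lo..hi] around the pivot; return the pivot's final index. */
int partition(int a[], int lo, int hi)
{
    int pivot = a[hi];            /* last element as pivot (one common choice) */
    int i = lo - 1;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot)         /* smaller elements move to the left side */
            swap(&a[++i], &a[j]);
    swap(&a[i + 1], &a[hi]);      /* place the pivot in its final position */
    return i + 1;
}

void quicksort(int a[], int lo, int hi)
{
    if (lo < hi) {
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);  /* recurse on the left sub-interval */
        quicksort(a, p + 1, hi);  /* recurse on the right sub-interval */
    }
}
```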
The recursive calls in quick sort generate a large number of mutually independent tasks, which makes the algorithm well suited to OpenMP's task parallel model. In addition, the pivot element determines the sizes of the left and right sub-intervals. Given the small cache of the embedded platform, pivot selection needs to be tuned so that the divided left and right sub-intervals are as balanced as possible, in order to meet load-balancing requirements.
2.3 Task Parallel Optimization
Analysis of quick sort shows that it is a recursive algorithm: during execution it generates a large number of recursive function calls, and these calls are independent of one another. In each pass, the algorithm first determines the pivot, rearranges the data sequence once, and then calls itself recursively on the intervals to the left and right of the pivot.
As shown below, task parallelization treats the processing of each sub-interval produced by a partitioning pass as a task, and the tasks run in parallel via OpenMP's #pragma omp task primitive, yielding a task-parallel quick sort.
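The original listing is not reproduced here; the following is a sketch of the scheme under stated assumptions (the CUTOFF threshold and function names are illustrative; partition() and quicksort() are the serial routines sketched in Section 2.2):

```c
#include <omp.h>

#define CUTOFF 1000   /* assumed threshold: below this, recurse serially */

int  partition(int a[], int lo, int hi);   /* serial routines from Section 2.2 */
void quicksort(int a[], int lo, int hi);

static void quicksort_task(int a[], int lo, int hi)
{
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    if (hi - lo < CUTOFF) {            /* small interval: serial, to limit task overhead */
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    } else {
        #pragma omp task               /* left sub-interval as an independent task */
        quicksort_task(a, lo, p - 1);
        #pragma omp task               /* right sub-interval as an independent task */
        quicksort_task(a, p + 1, hi);
        #pragma omp taskwait           /* wait for both children before returning */
    }
}

void parallel_quicksort(int a[], int n)
{
    #pragma omp parallel
    #pragma omp single                 /* one thread starts the recursion */
    quicksort_task(a, 0, n - 1);
}
```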
2.4 Cache Optimization
The goal of cache friendliness is to reduce the copying of data between memory and cache. For 2^20 integers the data size is 4 MB, while the L2 cache of the test platform (OMAP4430) is 1 MB, so the data needs to be divided into 4 parts.
As shown below, the algorithm turns the 4 parts of the data into 4 quick sort tasks that execute in parallel. When they finish, each part of the data sequence is sorted, and the 4 parts must then be merged to form the complete sorted sequence. Therefore, a merge step follows the parallel tasks.
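A sketch of this scheme under stated assumptions (NBLOCKS, blocked_sort(), and the pairwise merge order are illustrative; quicksort() is the serial routine from Section 2.2; n is assumed divisible by 4):

```c
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define NBLOCKS 4                     /* 4 MB of data / 1 MB of L2 cache */

void quicksort(int a[], int lo, int hi);   /* serial routine from Section 2.2 */

/* Merge two adjacent sorted runs a[lo..mid-1] and a[mid..hi-1] via a buffer. */
static void merge(int a[], int lo, int mid, int hi, int buf[])
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        buf[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) buf[k++] = a[i++];
    while (j < hi)  buf[k++] = a[j++];
    memcpy(a + lo, buf + lo, (size_t)(hi - lo) * sizeof(int));
}

void blocked_sort(int a[], int n)
{
    int step = n / NBLOCKS;

    #pragma omp parallel
    #pragma omp single
    for (int b = 0; b < NBLOCKS; b++) {
        #pragma omp task firstprivate(b)   /* one cache-sized block per task */
        quicksort(a, b * step, (b + 1) * step - 1);
    }   /* implicit barrier: all tasks are done when the region ends */

    int *buf = malloc((size_t)n * sizeof(int));
    merge(a, 0,        step,     2 * step, buf);   /* blocks 0 and 1 */
    merge(a, 2 * step, 3 * step, 4 * step, buf);   /* blocks 2 and 3 */
    merge(a, 0,        2 * step, 4 * step, buf);   /* final merge of the halves */
    free(buf);
}
```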
3 Parallelization Performance Analysis
3.1 Introduction to the experimental environment
This paper uses the OMAP4430 embedded development platform from Texas Instruments. The OMAP4430 is an embedded multi-core processor with a symmetric multi-processing dual-core ARM processor (dual-core ARM Cortex-A9, 32 KB of L1 cache, 1 MB of L2 cache). The embedded operating system uses the Ubuntu 12.04 kernel, the compiler is arm-linux-gnueabihf-gcc, and GNU gprof is used to obtain algorithm execution times.
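For reference, a plausible build command under these tools (the source file name is an assumption) is arm-linux-gnueabihf-gcc -O2 -fopenmp -pg qsort.c -o qsort, where -fopenmp enables OpenMP and -pg inserts the profiling hooks; running the program then produces a gmon.out file that gprof reads to report execution times.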
3.2 Performance Testing
As shown in the following formula, parallel optimization performance is analyzed by computing the speedup ratio: the larger the speedup, the higher the degree of parallelism of the algorithm, with a value of 1 meaning no improvement over the serial version. The performance test uses four algorithm versions, a serial version, a parallel 2-thread version, a parallel 4-thread version, and a cache-optimized version, to analyze performance from different angles.
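The formula itself is presumably the standard definition of speedup, where T_serial and T_parallel are the measured execution times of the serial and parallel versions:

$$ S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} $$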
As shown in Figure 4, the line graph shows that all three parallel optimization schemes improve considerably on the serial version; as listed in Table 1, their speedup ratios are 1.30, 1.29 and 1.21 respectively. For the task parallel scheme, 2-thread and 4-thread versions were tested, and the speedup analysis shows that the 2-thread version is slightly better than the 4-thread version. In theory, more parallel threads should mean better performance; however, the OMAP4430 used in this paper has only two symmetric multi-processing cores, so even with 4 parallel threads only 2 actually execute at any moment. At the same time, the 4 threads compete for the 2 physical processors, which makes performance lower than that of the 2-thread version.
Figure 4 Algorithm execution time
To evaluate the merits of parallel algorithms, load balancing must also be considered. As shown in Tables 1 and 2, the standard deviation of the cache optimization scheme is much smaller than that of the task parallel scheme. The reason is that in the task parallel scheme, the test data and the partitioning algorithm strongly affect how the intervals are divided, so task execution times vary over a wide range. The cache optimization scheme, by contrast, is essentially data parallelism: tasks are divided according to the cache size, so each task processes roughly the same amount of data and its execution time is more predictable. However, the data must be merged after the parallel tasks finish, which costs some performance.
Conclusion
This paper analyzes the hardware structure of embedded multi-core processors and, from the perspective of symmetric multi-processing, parallelizes the serial quick sort algorithm with good results.
Using the ARM dual-core processor (OMAP4430) as the test platform, parallel optimization is carried out through task parallelism and cache optimization. The performance tests show that task parallelism achieves a good speedup but poor load balancing. The number of parallel threads should not exceed the number of physical processor cores, since excess threads compete for processor resources and degrade performance. Cache optimization achieves good load balancing but requires a subsequent merge step, which costs some performance.
In short, when performing parallel optimization on embedded multi-core processors, one must fully exploit the parallel capability of the processors to improve program parallelism, while also considering the load balancing of the algorithm so that program performance stays consistent across different application environments.