Parallelization Method for Embedded ARM Multi-core Processors

Publisher: SereneWanderer | Last updated: 2014-11-23 | Source: Internet | Keywords: Embedded

  Embedded multi-core processors are now widely used in embedded devices, but embedded software development largely remains stuck in the traditional single-core model and does not fully exploit multi-core performance. Parallel program optimization is used to some extent on PC platforms but is still rare on embedded platforms. Moreover, embedded multi-core processors differ substantially from PC multi-core processors, so PC parallel-optimization methods cannot be applied directly to embedded platforms. This paper studies parallel optimization from two angles, task parallelism and cache optimization, and explores methods for parallelizing programs on embedded multi-core processors.

  1 Embedded Multi-core Processor Architecture

  Embedded multi-core processors come in symmetric and asymmetric structures. In a symmetric design the internal cores are identical; this structure is widely used in PC multi-core processors. In an asymmetric design the internal cores differ; this structure is common in the embedded field, most typically a general-purpose embedded processor plus a DSP core. The embedded multi-core processor explored in this article uses a symmetric structure, so that the same code can execute in parallel on different processors.

  Figure 1 ARM SMP processor structure

  ARM processors are the most widely used in the embedded field today, so the dual-core ARM processor OMAP4430 is used as the research object. The ARM symmetric multiprocessing (SMP) structure is shown in Figure 1. Following the principle of program locality, each processor has private local memory, most commonly a first-level cache (L1 Cache). Because multiple processors must communicate with one another, common ARM processors use a shared second-level cache (L2 Cache) for this purpose. In a symmetric multiprocessor structure, all processors (usually a multiple of 2) are identical in hardware and have equal access to system resources. More importantly, since every processor can access the same memory space, any process or thread can run on any processor in the shared memory region, which makes program parallelization possible.

  To perform parallel optimization on an embedded multi-core platform, the following issues need to be considered:

  ① The performance of a parallel program is limited by its serial portion; performance will not keep improving as the number of parallel threads increases;

  ② Compared with PC processors, embedded multi-core processors have slower buses and smaller caches, so large amounts of data may be copied back and forth between memory and cache. Cache friendliness should therefore be considered during parallel optimization;

  ③ The number of threads executing in parallel should be less than or equal to the number of physical processors. Too many threads compete for processor resources, degrading parallel performance.
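  Point ① is essentially Amdahl's law. As a reference (this formula is a standard result, not taken from the source), the ideal speedup with parallel fraction p on n processors is:

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
```

  As n grows, S(n) approaches 1/(1 - p), so the serial fraction (1 - p) bounds the achievable speedup no matter how many threads are added.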

  2 OpenMP Parallel Optimization

  2.1 Introduction to the OpenMP Working Principle

  OpenMP is a cross-platform, multi-threaded parallel programming interface based on the shared-memory model. A main thread spawns a series of child threads and maps tasks onto them; the child threads execute in parallel, and the runtime environment assigns them to different physical processors. By default, each thread executes the code of the parallel region independently. Work-sharing constructs can be used to divide a task so that each thread executes its assigned portion of the code. In this way, OpenMP supports both task parallelism and data parallelism.

  Figure 2 Task parallel model

  The task-parallel model creates a series of independent threads, each running one task, with the threads independent of one another, as shown in Figure 2. OpenMP implements task allocation with the sections and task compiler directives. Each thread can run a different code region independently, and task nesting and recursion are supported. Once created, a task may be executed by an idle thread in the thread pool (whose size equals the number of physical threads).

  Data parallelism, also called data-level parallelism, divides the data processed by a task into blocks that are executed in parallel, as shown in Figure 3. The C for loop is the most natural fit for data parallelism.

  Figure 3 Data parallel model

  2.2 Principle of Quick Sort Algorithm

  Quick sort is a recursive divide-and-conquer algorithm. The key step is choosing the pivot element: during a scan of the sequence, data smaller than the pivot is placed to its left and data larger than the pivot to its right. After the scan completes, quick sort is called recursively on the parts to the left and right of the pivot.

  The recursive calls in quick sort generate a large number of mutually independent tasks, which suits OpenMP's task-parallel model very well. In addition, the pivot (sentinel) element determines the amount of data in the left and right sub-intervals. Given the small cache of an embedded platform, the pivot selection algorithm should be tuned so that the resulting sub-intervals are as balanced as possible, meeting load-balancing requirements.

  2.3 Task Parallel Optimization

  Analysis shows that quick sort is a recursive algorithm: its execution generates a large number of repeated function calls, and these calls are independent of one another. In one scan, the algorithm first selects the pivot (sentinel) element, rearranges the data sequence once, and then calls itself recursively on the intervals to the left and right of the pivot.

  For task-parallel optimization, the operation on each of the left and right sub-intervals produced by a scan is abstracted as a task, and the tasks are executed in parallel through OpenMP's task primitive #pragma omp task, achieving task-parallel optimization of quick sort.


  The amount of data in each task depends on the pivot element, so the chosen partition algorithm should split the data sequence as evenly as possible. This paper tests both a simple partition algorithm and the median-of-three method.

  2.4 Cache Optimization

  The goal of cache friendliness is to reduce the copying of data between memory and cache. For 2^20 integers the data size is 4 MB, while the L2 cache of this paper's test platform (OMAP4430) is 1 MB, so the data needs to be divided into 4 parts.

  The algorithm assigns the 4 parts of the data to 4 quick-sort tasks that execute in parallel. When they finish, each part of the sequence is sorted; the 4 parts must then be merged to form a complete sorted sequence. Therefore, after the parallel tasks complete, a merge step is required.


  3 Parallelization Performance Analysis

  3.1 Introduction to the experimental environment

  This paper uses the OMAP4430 embedded development platform from Texas Instruments. The OMAP4430 is an embedded multi-core processor with a symmetric dual-core ARM processor (dual-core ARM Cortex-A9) with 32 KB of L1 cache and 1 MB of L2 cache. The embedded operating system is Ubuntu 12.04, the compiler is arm-linux-gnueabihf-gcc, and GNU gprof is used to measure algorithm execution time.

  3.2 Performance Testing

  As shown in the following formula, parallel optimization performance is analyzed by computing the speedup ratio. The larger the speedup, the higher the algorithm's degree of parallelism; the minimum value is 1. The performance test uses four versions of the algorithm (serial, parallel 2-thread, parallel 4-thread, and cache-optimized) to analyze performance from different angles.

  Speedup = T_serial / T_parallel

  where T_serial is the execution time of the serial version and T_parallel is the execution time of the parallel version.

  As the line graph in Figure 4 shows, all three parallel optimizations greatly improve performance over the serial version. As listed in Table 1, their speedup ratios are 1.30, 1.29, and 1.21 respectively. For the task-parallel scheme, both 2-thread and 4-thread versions were tested; the speedup analysis shows the 2-thread version slightly outperforming the 4-thread version. In principle, more parallel threads should yield better performance, but the OMAP4430 used here has only two symmetric cores: even with 4 parallel threads, only 2 actually execute at a time, and the 4 threads compete for the 2 physical processors, leaving performance below that of the 2-thread version.

  Figure 4 Algorithm execution time

  Evaluating a parallel algorithm also requires considering load balancing. As shown in Tables 1 and 2, the standard deviation of the cache-optimized scheme is much smaller than that of the task-parallel scheme. For the task-parallel scheme, the test data and the partition algorithm strongly affect how the intervals are divided, so task execution times vary over a wide range. The cache-optimized scheme is essentially data parallelism: tasks are divided according to cache size, so each task processes roughly the same amount of data and execution times are more predictable. However, the data must be merged after the parallel tasks finish, which costs some performance.


  Conclusion

  This paper analyzes the hardware structure of embedded multi-core processors and parallelizes the serial quick sort algorithm from the perspective of symmetric multiprocessing, achieving good results.

  Using the dual-core ARM processor (OMAP4430) as the test platform, parallel optimization was carried out via task parallelism and cache optimization. The performance tests show that task parallelism achieves a good speedup but poor load balancing. The number of parallel threads should not exceed the number of physical processor cores; excess threads compete for processor resources and degrade performance. Cache optimization achieves good load balancing but requires a subsequent merge step, which costs some performance.

  In short, when performing parallel optimization on embedded multi-core processors, one must fully exploit the processors' parallel capability and increase program parallelism, while also considering the algorithm's load balancing so that performance remains consistent across different application environments.
