Multi-core Programming Framework for Embedded Multimedia Applications

Publisher: 跳跃龙珠 | Latest update: 2014-11-07 | Source: 21ic

Embedded processors built on single-core architectures are increasingly unable to meet the growing processing demands of embedded multimedia applications. Multi-core embedded architectures have become an effective way to solve this problem, but they also raise the challenge of how to fully exploit a multi-core structure in application software. Current compilation technology and development tools are not yet mature enough to guarantee success with multi-core architectures. Most parallel software is still produced by manually converting sequential programs into parallel ones, and the lack of multi-core-aware development tools makes it difficult to evaluate software performance. Without effective and reliable engineering planning up front, the result is inefficient application software and delayed time to market.

A software framework provides a better starting point for developing multi-core application software and can help shorten development time. This article details a design framework for embedded multimedia application software; the data flow model described here can also be extended to many other applications. The framework exploits the inherent data parallelism of multimedia applications and explains how to manage data flow effectively by using the underlying architecture.

There are two major challenges in designing parallel software: developing effective parallel algorithms, and making effective use of shared resources such as memory, DMA (direct memory access) channels, and the interconnect network, so that the performance of the application scales with the number of available processor cores.

There are often multiple approaches to parallelizing applications. Some applications exhibit inherent parallelism, while others have extremely complex and irregular data access patterns. In general, however, scientific computing applications and multimedia applications are usually easier to parallelize because their data access patterns are more predictable than those of control applications. This article focuses on parallelization techniques for multimedia algorithms, which require high processing power and are more suitable for embedded systems applications.

Multimedia applications exhibit data parallelism at several levels: the granularity of parallelism differs greatly between a group of data frames and a single macroblock within a frame. Generally speaking, the smaller the granularity, the tighter the synchronization required among shared units such as processor cores and DMA channels; smaller granularity also yields a higher degree of parallelism and less network traffic, while larger granularity relaxes the synchronization requirements but increases network traffic. Based on the type of application and the system requirements, the software framework therefore defines different levels of parallelism.

It is important to note that the development of scalable parallel software also relies on the effective use of the interconnect network, memory hierarchy, and peripheral/DMA resources. The stringent low power and low cost requirements of the system will impose constraints on all of these elements. Effective use of these resources requires innovation when programming in a multicore environment. This article presents some ideas for effectively managing resources on the ADI Blackfin ADSP-BF561 dual-core processor.

Multimedia data stream analysis

To achieve data parallelism, you need to find a block or set of blocks in the data flow that can be processed independently and "fed" to a processing element. Independent blocks of data can reduce synchronization overhead and simplify parallel algorithms. To find such data, you must understand the data flow model of the application, or the "data access pattern."

For most multimedia applications, the data access pattern can be viewed as a 2D (spatial-domain) or 3D (temporal-domain) mode of operation. In 2D mode, independent data blocks are confined to a single data frame, while in 3D mode independent data blocks can span multiple frames. In the spatial domain, a frame can be divided into slices of N consecutive lines or into macroblocks; in the temporal domain, the data stream can be subdivided at the frame level or at the group of pictures (GOP) level.

Algorithms that use a slice or macroblock data access pattern have higher synchronization requirements but less network traffic because the hierarchical memory system only needs to store a portion of the image data. For a frame or group of pictures type data access pattern, the hierarchical memory system needs to store a large amount of data, but the synchronization requirements are much lower because the system has a larger parallel granularity. Figure 1 illustrates the levels of parallelism in multimedia applications, showing the relative synchronization requirements and network traffic for the four levels.

 
Figure 1 Multimedia applications exhibit different levels of data parallelism, which correspond to different synchronization requirements and network traffic.

Multi-core structural analysis

Figure 2 shows the architecture of the ADSP-BF561: each of the two processor cores has its own dedicated instruction and data memories, and the cores share on-chip L2 memory and external memory. Using a configurable arbitration scheme, the user can connect any peripheral or DMA resource to either processor core. The processor has two DMA controllers, each containing two sets of MDMA (memory DMA) channels. The L2 memory is connected to each processor core by an independent bus, while the external memory is connected to the two cores by a shared bus.

Figure 2 The architecture of the ADSP-BF561: each core has dedicated instruction and data memories, along with shared L2 memory and external memory.

All of the frameworks use DMA to stream data through the memory hierarchy. The alternative is to rely on cache, which requires no explicit data management. However, if the data access pattern of the target application is well understood, a DMA engine can manage the data more effectively. Relying on a cache means tolerating non-deterministic access times, the cost of cache misses, and a need for high external memory bandwidth. With a DMA engine, data can be delivered into L1 memory before the processor core requests it, and the transfer takes place in the background so the core never stalls waiting for a data item.
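As a rough illustration of the background transfers this enables, the sketch below shows a double-buffering loop in which the DMA engine fills one L1 line buffer while the core processes the other. It is a minimal sketch: the wrapper functions dma_queue_transfer(), dma_wait_complete(), and process_line() are hypothetical stand-ins for whatever DMA driver and per-line processing the project actually uses, and none of these names come from the ADI libraries.

```c
/* Double-buffering sketch: the DMA engine fills one L1 buffer while the
 * core works on the other, so the core never stalls on external memory.
 * dma_queue_transfer(), dma_wait_complete() and process_line() are
 * hypothetical placeholders for the platform's DMA driver and the
 * application's per-line processing.                                     */
#define LINE_BYTES 1440                       /* one video line, assumed  */

static unsigned char buf[2][LINE_BYTES];      /* placed in L1 data SRAM   */

void stream_lines(const unsigned char *ext_src, int num_lines)
{
    int fill = 0;                             /* buffer being filled      */
    dma_queue_transfer(buf[fill], ext_src, LINE_BYTES);

    for (int line = 0; line < num_lines; line++) {
        dma_wait_complete();                  /* buffer 'fill' is ready   */
        int work = fill;
        fill ^= 1;
        if (line + 1 < num_lines)             /* prefetch the next line   */
            dma_queue_transfer(buf[fill],
                               ext_src + (line + 1) * LINE_BYTES,
                               LINE_BYTES);
        process_line(buf[work]);              /* compute overlaps the DMA */
    }
}
```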

Since each DMA controller has two sets of MDMA channels, the system can evenly distribute the MDMA channels on the processor cores, thereby enabling symmetrical parallel processing.

Applications with a fine-grained data access pattern can easily exploit fast access to the L1 and L2 memories. Independent data blocks can also be transferred directly from the peripheral interface into L1 or L2 memory without touching the slower external memory, which saves valuable external memory bandwidth and MDMA resources and shortens data transfer time.

For applications with a coarse-grained data access pattern, memory may become a bottleneck because the smaller L1 and L2 memory levels cannot hold a large number of data frames. However, although many data frames are correlated with one another, the correlation usually involves only small data blocks within those frames. If all of the related frames are kept in a larger storage space (external memory), the independent data blocks in each frame can be fed one after another to whichever processor core is idle. If these independent blocks are much smaller than a data frame and fit within L1 or L2 memory, memory access latency is reduced and the data can be processed efficiently.

Although the L2 and external memories have independent bus connections, the two processor cores still share these memory interface buses. It is therefore necessary to keep the two cores from accessing the same level of memory at the same time, which would stall them on bus conflicts. To reduce such conflicts, the framework should map code and data objects so that one processor core mainly accesses L2 memory while the other mainly accesses external memory. In this arrangement, even though one core sees a longer latency on its mostly external memory accesses, the total access latency is still lower than the cost of bus conflicts.
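One way to express this mapping in source code is through explicit section placement, so that core A's working buffers resolve to L2 SRAM and core B's to external SDRAM. This is only a sketch: it assumes a GCC-style section attribute, and the section names .l2_sram and .sdram0 are hypothetical; the real names are defined by the project's linker description file.

```c
/* Hypothetical section names; the actual names come from the project's
 * linker description file. The intent is that core A's hot data lives in
 * L2 SRAM and core B's lives in external SDRAM, so the two cores rarely
 * contend for the same memory bus.                                       */
unsigned char coreA_work_buf[64 * 1024]  __attribute__((section(".l2_sram")));
unsigned char coreB_work_buf[256 * 1024] __attribute__((section(".sdram0")));
```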

The framework assigns all input peripheral interfaces to one processor core and all output peripheral interfaces to another processor core. The framework uses video input/output interfaces, such as PPI (Parallel Peripheral Interface), to input and output video frames. The BF561 architecture has two PPI interfaces.

If the interrupt processing time is shorter than the data flow processing time, all peripheral interfaces can be assigned to one processor core for ease of programming. The shorter interrupt processing time will not affect the load balance of the two processor cores.

Proposed Model for Software Framework

Based on the granularity of the data access pattern, four software frameworks can be defined: row processing (spatial domain), macroblock processing (spatial domain), frame processing (temporal domain), and GOP processing (temporal domain). If the data access pattern of an application fits any of these four models, the corresponding framework can be used. If a data stream has two or more processing algorithms, multiple frameworks can be combined to achieve asymmetric parallel processing.

In row-processing mode, correlation exists only within a row, that is, only between adjacent pixels of the same row. Each row of data therefore forms a block that either processor core can process independently.

Figure 3 shows the data flow model of the row-processing framework. Processor core A handles the video input and processor core B handles the video output. Data passing between cores A and B is managed by independent MDMA channel groups. The L1 memory uses multiple buffers to avoid bus conflicts between the processor cores and peripheral DMA accesses. Synchronization of each row of data between the two cores is achieved with counting semaphores. In this framework it is also advantageous to stream data directly into a single core's L1 memory, which saves external memory bandwidth and DMA resources. Application examples of this framework include color conversion, histogram equalization, filtering, and sampling.

Figure 3 Data flow model of the row processing framework. Processor core A processes the video input, and processor core B processes the video output.
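A minimal sketch of the hand-off in Figure 3 is shown below. The counting semaphore records how many rows core A has finished; core B blocks until at least one row is available before pushing it toward the video output. All of the helper names (ppi_input_get_row(), mdma_send_to_coreB(), sem_post(), and so on) are hypothetical placeholders for the project's actual driver and RTOS primitives, and the per-row work shown is only an example.

```c
/* Row-level framework sketch (Figure 3). Core A owns the video input,
 * core B owns the video output; a counting semaphore tells core B how
 * many processed rows are waiting. All helper names are hypothetical.   */
#define ROW_BYTES 1440

/* Core A: take a row from the input PPI, run its share of the algorithm,
 * push the row to core B over an MDMA channel, and signal completion.   */
void coreA_row_loop(void)
{
    for (;;) {
        unsigned char *row = ppi_input_get_row();   /* L1 buffer          */
        process_row(row, ROW_BYTES);                /* e.g. color convert */
        mdma_send_to_coreB(row, ROW_BYTES);
        sem_post(&rows_ready);                      /* one more row ready */
    }
}

/* Core B: wait until at least one row is ready, finish any remaining
 * processing, then hand the row to the output PPI.                      */
void coreB_row_loop(void)
{
    for (;;) {
        sem_wait(&rows_ready);                      /* block on semaphore */
        unsigned char *row = mdma_receive_from_coreA();
        process_row(row, ROW_BYTES);                /* core B's stage     */
        ppi_output_queue_row(row);
    }
}
```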

Figure 4 shows the data flow model of the macroblock data access pattern, in which macroblocks are handed alternately to the two processor cores. The L2 memory holds multiple fragment buffers, and independent MDMA channels transfer macroblocks from L2 memory into each processor core's L1 memory. The L1 memory also uses multiple buffers to avoid bus conflicts between DMA and processor-core accesses. As in the row-processing framework, processor core A controls the input video interface, processor core B controls the output interface, and counting semaphores synchronize the two cores. Application examples of this framework include edge detection, JPEG/MPEG encoding/decoding algorithms, and convolutional coding.

Figure 4 In dual-core macroblock data access mode, the L2 memory has multiple fragment buffers and independent MDMA channels transfer macroblocks from the L2 memory of each processor core to the L1 memory.
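The sketch below illustrates how macroblocks might be dealt out alternately to the two cores, with each core pulling its block from an L2 fragment buffer into L1 over its own MDMA channel. It is a sketch under assumed names only: l2_fragment_addr(), mdma_copy(), and edge_detect_block() are hypothetical, and the even/odd split is just one possible distribution policy.

```c
/* Macroblock-level sketch (Figure 4). Macroblocks are handed out
 * alternately; each core copies its block from the shared L2 fragment
 * buffer into its private L1 memory, processes it, and writes it back.
 * All helper names are hypothetical.                                     */
#define MB_W 16
#define MB_H 16

static unsigned char l1_mb[MB_W * MB_H];      /* per-core L1 working block */

void process_macroblocks(int core_id, int mb_count)
{
    /* Core 0 takes even-numbered macroblocks, core 1 takes odd ones.     */
    for (int mb = core_id; mb < mb_count; mb += 2) {
        unsigned char *src = l2_fragment_addr(mb);  /* block in L2         */
        mdma_copy(l1_mb, src, sizeof l1_mb);        /* L2 -> L1            */
        edge_detect_block(l1_mb, MB_W, MB_H);       /* example algorithm   */
        mdma_copy(src, l1_mb, sizeof l1_mb);        /* L1 -> L2            */
    }
}
```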

In frame-level processing mode, the external memory stores the associated frames. Depending on the granularity of the correlation between frames (macroblocks or lines), the system transfers sub-blocks of the frames into L1 or L2 memory. Figure 5 shows the data flow model of the frame-level processing framework; here the correlation across frames is assumed to exist at the macroblock level, so the system transfers the frames' macroblocks into L1 memory. As in the other frameworks, processor core A controls the input video interface, processor core B controls the output interface, and synchronization between the two cores is achieved with counting semaphores. An application example of this framework is motion-detection algorithms.

Figure 5 In the frame-level processing flow, external memory stores the associated frames
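As an example of how small the per-iteration working set can be, the sketch below compares the co-located macroblocks of the current and previous frames, as a simple motion-detection stage might. The frames themselves stay in external memory; only two 16x16 blocks are brought into L1. The layout helper macroblock_addr() and the DMA wrapper mdma_copy() are assumptions, not actual library calls.

```c
#include <stdlib.h>   /* abs() */

/* Frame-level sketch (Figure 5). Whole frames remain in external memory;
 * only the two related macroblocks are copied into L1 for comparison.
 * macroblock_addr() and mdma_copy() are hypothetical helpers.            */
#define MB_BYTES (16 * 16)

static unsigned char cur_mb[MB_BYTES];        /* L1 copies of the blocks  */
static unsigned char ref_mb[MB_BYTES];

int block_motion_metric(const unsigned char *cur_frame,
                        const unsigned char *ref_frame,
                        int mb_index)
{
    mdma_copy(cur_mb, macroblock_addr(cur_frame, mb_index), MB_BYTES);
    mdma_copy(ref_mb, macroblock_addr(ref_frame, mb_index), MB_BYTES);

    int sad = 0;                              /* sum of absolute differences */
    for (int i = 0; i < MB_BYTES; i++)
        sad += abs(cur_mb[i] - ref_mb[i]);
    return sad;                               /* caller applies a threshold  */
}
```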

In GOP-level processing mode, each processor core processes multiple consecutive data frames. The difference between the frame-level and GOP-level frameworks is that the former partitions spatially within a frame, while the latter achieves parallelism through temporal partitioning (frame sequences). In the GOP data access pattern, correlation exists within a group of frames and there is no correlation between the data of two different groups, so each processor core can process its groups of frames independently. Figure 6 shows the data flow of this framework. As in the frame-level framework, the system can transfer frame data blocks into a processor core's L1 memory. To make effective use of the interleaved bank structure of the external memory, the system distributes the banks evenly between the processor cores; each external bank of the ADSP-BF561 supports SDRAM with up to four internal banks. Application examples of this framework include encoding/decoding algorithms such as MPEG-2/4.

Figure 6 In the GOP-level data access mode, correlation exists within a group of data frames, and there is no correlation between the data of two groups of frames.
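The sketch below shows one way the groups of pictures might be dealt out to the two cores, with each core's frame buffers kept in its own external SDRAM banks so the cores do not interleave accesses within a bank. The GOP length, the bank-selection helper sdram_bank_for_core(), and decode_frame() are illustrative assumptions.

```c
/* GOP-level sketch (Figure 6). Each core decodes whole groups of pictures
 * independently; its frame buffers sit in its own external SDRAM banks.
 * GOP_LEN and the helper functions are hypothetical.                     */
#define GOP_LEN 15                            /* frames per group, assumed */

void decode_gops(int core_id, int total_gops)
{
    /* Core 0 decodes even-numbered GOPs, core 1 decodes odd ones.        */
    for (int g = core_id; g < total_gops; g += 2) {
        unsigned char *frames = sdram_bank_for_core(core_id);  /* own banks */
        for (int f = 0; f < GOP_LEN; f++)
            decode_frame(frames, g * GOP_LEN + f);  /* independent work    */
    }
}
```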

In practice, a system may apply several algorithms to a data stream, and each algorithm may use a different data access pattern. In that case, several frameworks can be combined for the specific application. To exploit the multi-core structure, pipelined processing can be used to run the stages in parallel, but this parallelism is asymmetric because different computations may run on different processor cores. The system can, however, assign other tasks to a core's idle cycles to balance the workload across the cores while maintaining flexibility. Figure 7 shows the data flow model of a framework that combines row-level and macroblock-level processing.

Figure 7 Data flow model of the framework combining row-level processing and macroblock processing

In some other applications, there are data dependencies across multiple blocks of data. The access pattern is still predictable, but it extends beyond the granularity of a single macroblock or row; a motion search window, for example, may span several adjacent macroblocks, so the algorithm must access multiple blocks across its iterations. In such cases the software framework can be modified to preserve effective parallel operation. For example, if dependencies span multiple rows, the row-processing framework can be adjusted to transfer frame segments of N consecutive rows into each processor core's L1 memory (as sketched below); in a similar way, the macroblock framework can be extended to transfer several macroblocks from L2 memory into internal L1 memory.
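A minimal sketch of the N-row variant mentioned above: the only change relative to the single-row case is that the DMA transfer is widened to a segment of N consecutive rows before the processing stage runs. The value of N, the buffer placement, and the helper names are all assumptions.

```c
/* N-row extension of the row framework: fetch a segment of N consecutive
 * rows so a vertically dependent algorithm (e.g. a vertical filter) has
 * all of its neighbours in L1. N_ROWS and the helpers are hypothetical.  */
#define ROW_BYTES 1440
#define N_ROWS    5                           /* filter support, assumed  */

static unsigned char segment[N_ROWS * ROW_BYTES];   /* L1 segment buffer  */

void fetch_and_filter_segment(const unsigned char *ext_frame, int top_row)
{
    mdma_copy(segment, ext_frame + top_row * ROW_BYTES,
              sizeof segment);                /* N rows in one transfer   */
    vertical_filter(segment, N_ROWS, ROW_BYTES);
}
```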

Software framework analysis

To evaluate the software framework for dual-core processing, ADI first developed a single-core version of the application software using the data flow model and then compared it with the dual-core solution. Blackfin's system optimization techniques were also used to make effective use of the available bandwidth. To simplify the analysis, ADI compared only the processing speed of the basic frameworks, without considering combinations of several frameworks.

"Cycles" refers to the number of processor-core compute cycles available to process the data stream while meeting the real-time constraints of NTSC (National Television System Committee) video input. For a processor core running at 600 MHz, the budget for processing each pixel in real time is 44 cycles per pixel. Any processor-core access to the data stream takes only a single core cycle, because all data accesses are to L1 memory. The cycle counts shown do not include interrupt latency.
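The 44 cycles/pixel budget follows directly from the pixel rate, assuming the standard ITU-R BT.601 sampling rate of 13.5 million samples per second for NTSC video: 600 MHz ÷ 13.5 MHz ≈ 44 core cycles per pixel.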

As shown in Table 1, the dual-core approach effectively doubles the processing speed for all of the frameworks. The table also lists each processor core's L1 memory usage and the shared memory space required by the various frameworks. The frameworks use Analog Devices' DD/SSL (Device Driver/System Service Library) to manage peripherals and data.

Table 1: Framework description
