Embedded media processors that handle both MCU and DSP tasks
An embedded media processor can now handle both MCU and DSP tasks, bringing C programmers familiar with MCU-based application development into a new realm where intelligent management of code and data flows can significantly improve system performance. It is tempting for these programmers to stick with already-learned methods and simply rely on instruction and data caches to manage those flows. However, the high-performance direct memory access (DMA) capabilities of media processors deserve careful consideration: understanding the trade-offs between cache and DMA in these applications leads to programs that make far better use of the system.
Memory structure - requirements for memory management
Today's media processors have a hierarchical memory structure that balances several memories of different sizes and performance levels. Typically, the memory closest to the core processor (called "level 1" or "L1" memory) operates at the full clock rate and usually supports single-cycle instruction execution. To use the bandwidth of the memory bus efficiently, L1 memory is generally split into instruction and data segments, and it can usually be configured as SRAM or as cache. For the most performance-critical applications, on-chip SRAM can be accessed in a single clock cycle. For systems with larger code, additional on-chip and off-chip memory is available, at the cost of increased latency.
By itself, this hierarchy is of limited use: most applications would end up running out of slow external memory, throttling today's high-speed processors down to a fraction of their rated performance. Programmers can manually move key code into and out of internal SRAM. Beyond that, adding instruction and data caches to the structure removes the need to manage external memory by hand: the cache automatically brings instruction and data streams into the processor core, so the programmer no longer has to orchestrate those movements, which greatly simplifies the programming model.
Instruction Memory Management - Cache or DMA?
A quick survey of the embedded media processor market shows core clock speeds of 600 MHz and above. While this performance opens up many new applications, the top speed is only achieved when code executes from internal L1 memory. Of course, an ideal embedded processor would have an unlimited amount of L1 memory, but that is not practical. Therefore, when optimizing memory and data flows for a real system, programmers must weigh several options for making the best use of the L1 memory built into the processor. Let's consider some of these options.
The first option is also the most direct: the target application code fits entirely in L1 instruction memory. In this case, the programmer simply maps the application code directly into this memory space, with no special handling required. Media processors that combine MCU and DSP functionality have a distinct advantage here, because the code density such an architecture supports makes this direct mapping attainable for more applications.
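As a minimal sketch of this direct-mapping option: with a GCC-style toolchain, a time-critical routine can be pinned into L1 instruction SRAM with a section attribute. The section name ".l1.text" used here is an assumption; the real name would come from the processor's linker description file.

    /* Hypothetical example: map a hot inner loop into L1 instruction SRAM.
     * The section name ".l1.text" is an assumption; the real name comes
     * from the toolchain's linker description file. */
    #define L1_CODE __attribute__((section(".l1.text")))

    L1_CODE void fir_filter(const short *x, const short *h,
                            short *y, int taps, int n)
    {
        for (int i = 0; i < n; i++) {        /* runs at full core clock */
            long acc = 0;
            for (int t = 0; t < taps; t++)
                acc += (long)x[i + t] * h[t];
            y[i] = (short)(acc >> 15);
        }
    }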
The second approach uses a cache mechanism to give the programmer access to larger, lower-cost external memory. The cache automatically moves code into L1 instruction memory as it is needed. The main advantage of this approach is that the programmer does not have to manage moving code into and out of the cache. It works best on linear code; with highly nonlinear code, cache lines may be replaced so frequently that real-time performance suffers.
The instruction cache actually serves two functions. First, it pre-fetches instructions from external memory more efficiently, a cache-line fill at a time. Second, because the cache typically uses a least-recently-used replacement algorithm, the most frequently used instructions tend to stay resident. This is very beneficial, because instructions held in the L1 cache execute in a single clock cycle, just like instructions in L1 SRAM. In other words, if code has been fetched and has not been evicted, it is ready to execute on the next instruction cycle.
Hard real-time programmers are often skeptical that this kind of cache can deliver the best system performance. Their reasoning is that if a series of instructions is not resident in the cache when needed, performance suffers. A cache locking mechanism can remedy this: once the critical instructions are loaded, their cache lines are locked so they cannot be overwritten. Programmers can thus pin the instructions they need in the cache and let the cache mechanism manage the less critical ones.
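What locking looks like in practice is entirely processor-specific. The sketch below assumes two hypothetical intrinsics, icache_prefetch() and icache_lock_way(), purely to show the sequence of events: pull the critical code into the cache, then lock the cache way that holds it.

    /* Hypothetical sketch of instruction-cache locking. Both intrinsics
     * are assumptions standing in for processor-specific cache controls. */
    extern void isr_critical(void);            /* time-critical routine   */
    void icache_prefetch(const void *addr, int bytes); /* assumed intrinsic */
    void icache_lock_way(int way);                     /* assumed intrinsic */

    void lock_critical_code(void)
    {
        /* Touch the routine so the cache pulls its lines into one way... */
        icache_prefetch((const void *)isr_critical, 512);
        /* ...then lock that way so less important code cannot evict it.  */
        icache_lock_way(0);
    }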
The last approach is to move code into and out of L1 memory over a DMA channel that operates independently of the processor core. While the core executes code in one block of memory, the DMA fills the next block to be executed. This arrangement is commonly called an overlay technique.
Although bringing overlay code into L1 instruction memory via DMA guarantees that critical instructions are resident in a way the cache approach cannot, it comes at the cost of extra programmer effort: the programmer must plan the overlay layout ahead of time and configure the DMA channels accordingly. With careful planning, the performance improvement easily justifies the additional management routines.
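A minimal overlay loop might look like the following sketch, which assumes a memory-to-memory DMA channel wrapped by two hypothetical calls, dma_start_memcpy() and dma_wait(): while the core executes one L1 buffer, the DMA fills the other.

    /* Hypothetical overlay manager: while the core executes the code in
     * one L1 buffer, a memory-to-memory DMA channel fills the other one.
     * dma_start_memcpy()/dma_wait() are assumed driver wrappers. */
    extern char l1_buf[2][0x1000];            /* two L1 code buffers     */
    extern const char overlays[][0x1000];     /* overlay images in SDRAM */

    void dma_start_memcpy(void *dst, const void *src, unsigned bytes);
    void dma_wait(void);

    void run_overlays(int count)
    {
        int cur = 0;
        dma_start_memcpy(l1_buf[cur], overlays[0], sizeof l1_buf[0]);
        for (int i = 0; i < count; i++) {
            dma_wait();                             /* overlay i resident  */
            if (i + 1 < count)                      /* prefetch overlay i+1 */
                dma_start_memcpy(l1_buf[cur ^ 1], overlays[i + 1],
                                 sizeof l1_buf[0]);
            /* jump into the resident overlay (implementation-defined cast) */
            ((void (*)(void))l1_buf[cur])();
            cur ^= 1;
        }
    }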
Data Memory Management
The data memory structure of an embedded media processor is as important to overall system performance as the core clock speed. Because multimedia applications usually have several data transfers in flight at any one time, the bus structure must support both core and DMA access to all internal and external memory blocks. Conflicts between the DMA controller and the core must be arbitrated automatically, otherwise performance drops sharply: an arbitration priority must be established between core and DMA accesses, and the core should respond to an interrupt once the data to be processed is ready.
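In practice, "responding once the data is ready" usually reduces to a DMA-completion interrupt that hands a finished buffer to the core. The sketch below is only an outline under assumptions: the handler name, its wiring to the DMA interrupt, and the acknowledge step are all processor-specific.

    /* Hypothetical DMA-completion handler: the DMA controller raises an
     * interrupt when a buffer is full, and the core only touches the
     * data after the flag is set. Names are placeholders. */
    volatile int frame_ready = 0;

    void dma_rx_isr(void)           /* assumed to be wired to the DMA IRQ */
    {
        /* acknowledge the DMA controller here (processor-specific) */
        frame_ready = 1;
    }

    void main_loop(void)
    {
        for (;;) {
            while (!frame_ready)
                ;                   /* or sleep until the interrupt fires */
            frame_ready = 0;
            /* process the completed buffer while DMA fills the next one */
        }
    }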
The processor typically performs data reads as one of its basic functions. Although this is usually the least efficient mechanism for moving data, it is the simplest to program. Small, fast working buffers can sometimes live in L1 data memory, but for larger, off-chip buffers the access time becomes unbearable if the core has to read everything from external memory: not only does each access cost multiple clock cycles, but the core is kept busy doing nothing but reading. In multimedia and other data-intensive applications, core reads alone cannot keep up with the constant movement of large data buffers into and out of SDRAM. Core reads will always have their place, but to maintain performance, bulk data should move by DMA or through the cache.
Using DMA to manage data
To use DMA effectively in a multimedia system, there should be enough DMA channels to fully support the processor's peripherals, with more than one pair of memory DMA streams available at the same time. This matters because, while blocks of data are being shuttled between external memory and L1 memory for core processing, raw multimedia data is simultaneously streaming into external memory through high-speed peripherals. Furthermore, the DMA engine can move data directly between a peripheral and external memory, without a "stopover" in L1 memory, saving extra external bus transfers in data-intensive algorithms. A minimal sketch of such a peripheral-to-memory channel setup follows.
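This sketch assumes a simple driver-style interface (struct dma_cfg, dma_channel_start()) and a placeholder parallel-port FIFO; none of these names come from a real part, they only illustrate a peripheral-to-SDRAM transfer that never touches L1.

    /* Hypothetical channel setup: raw video flows from a parallel-port
     * peripheral straight to SDRAM with no stopover in L1. The dma_cfg
     * structure and dma_channel_start() are assumed driver constructs. */
    struct dma_cfg {
        volatile void *src;       /* peripheral FIFO or memory address */
        void          *dst;       /* destination buffer                */
        unsigned       count;     /* number of elements                */
        unsigned       elem_size; /* bytes per element                 */
    };

    extern volatile unsigned PPI_FIFO;      /* placeholder peripheral FIFO */
    static unsigned short frame[480][640];  /* frame buffer in SDRAM       */

    void dma_channel_start(int channel, const struct dma_cfg *cfg);

    void start_capture(void)
    {
        struct dma_cfg cfg = {
            .src = &PPI_FIFO, .dst = frame,
            .count = 640 * 480, .elem_size = sizeof(unsigned short),
        };
        dma_channel_start(0, &cfg);  /* runs without loading the core */
    }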
As data rates and performance requirements rise, it becomes critical that designers have "system performance tuning" controls at their disposal. For example, a DMA controller can be optimized to transfer one data word per clock cycle. When multiple streams move in the same direction (for example, all from internal memory to external memory), this is usually the most efficient way to run the controller, because it keeps the DMA bus from sitting idle.
However, when multiple bidirectional video and audio streams are in flight, "traffic control" is needed to keep any one stream from monopolizing the bus. For example, if the DMA controller always grants the bus to whichever peripheral is ready, one data word at a time (toward SDRAM, say), overall throughput suffers: the transfer direction can reverse almost every clock cycle, and the round-trip turnaround latency on the SDRAM bus cuts throughput dramatically. A DMA controller whose transfer burst length is programmable per channel therefore has a significant advantage over one with a fixed transfer length. And because each DMA channel can connect a peripheral to either external or internal memory, the ability to automatically service a peripheral that raises an urgent bus request is important as well.
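As a hedged illustration of per-channel transfer length, the sketch below assumes a hypothetical dma_set_burst() call: the high-rate video streams get long bursts to minimize SDRAM turnarounds, while low-rate audio keeps a short burst so its latency stays bounded.

    /* Hypothetical traffic-control setup: grouping several transfers in
     * the same direction before the bus turns around reduces SDRAM
     * turnaround penalties. dma_set_burst() is an assumed driver call. */
    void dma_set_burst(int channel, unsigned words_per_grant);

    void tune_streams(void)
    {
        dma_set_burst(0, 16);  /* video in: long bursts, few turnarounds */
        dma_set_burst(1, 16);  /* video out                              */
        dma_set_burst(2, 4);   /* low-rate audio: keep latency bounded   */
    }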
Another feature, 2D DMA capability, offers several system-level advantages. First, it lets data be placed into memory in a more natural processing order. For example, luminance and chrominance data arriving interleaved from an image sensor can be steered automatically into separate memory buffers. This interleaving and deinterleaving in the DMA saves redundant memory bus passes before video and image data are processed. In addition, 2D DMA can reduce system data bandwidth by selectively transferring only the required region of an input image instead of the whole frame, as the sketch below illustrates.
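The following sketch shows the region-of-interest case with an assumed 2D descriptor layout; the stepping semantics are spelled out in the comments because real controllers differ in how the end-of-row modify value is applied.

    /* Hypothetical 2D DMA descriptor: pull a 64x64 region of interest
     * out of a 640x480 source image without transferring the whole frame.
     * Assumed semantics: after each element the source address advances
     * by x_mod bytes; at the end of each row it advances by y_mod instead. */
    struct dma2d_cfg {
        const void *src;
        void       *dst;
        unsigned    x_count, y_count;  /* elements per row, rows        */
        int         x_mod,   y_mod;    /* per-element / end-of-row step */
    };

    void dma2d_start(int channel, const struct dma2d_cfg *cfg);

    void fetch_roi(const unsigned char *image, unsigned char *block)
    {
        struct dma2d_cfg cfg = {
            .src = image + 100 * 640 + 200,  /* top-left corner of ROI  */
            .dst = block,
            .x_count = 64, .x_mod = 1,       /* 64 contiguous bytes/row */
            .y_count = 64, .y_mod = 640 - 64 + 1, /* jump to next row   */
        };
        dma2d_start(1, &cfg);
    }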
Other important DMA features include the ability to prioritize DMA channels to match the demands of the attached peripherals, and to configure the corresponding DMA interrupts to match those priorities. These features help ensure that data buffers do not overflow while the DMA is busy serving other peripherals, and they give the programmer extra freedom to optimize overall system performance stream by stream, per DMA channel.
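A minimal sketch, again with assumed driver calls: the fast video port outranks the slower peripherals, and the interrupt priorities are set to match the channel ranking.

    /* Hypothetical priority setup: the video port must never overflow,
     * so its channel and interrupt outrank the slower peripherals. */
    void dma_set_priority(int channel, int priority);  /* assumed calls */
    void irq_set_priority(int irq, int priority);

    void configure_priorities(void)
    {
        dma_set_priority(/* video */ 0, 0);          /* 0 = highest     */
        dma_set_priority(/* audio */ 2, 1);
        dma_set_priority(/* uart  */ 3, 2);
        irq_set_priority(/* video DMA irq */ 10, 0); /* match channels  */
    }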
Because internal memory is usually organized in several sub-banks, the DMA controller and the core can access different sub-banks in the same clock cycle. For example, the core can operate on data in one sub-bank while the DMA fills another with new data; under certain conditions, even the same sub-bank can be accessed simultaneously. When accessing external memory, however, there is typically only one physical bus, shared between synchronous and asynchronous memory devices.
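One way to exploit the sub-bank structure, assuming GCC-style section attributes and linker sections named for the two banks, is to place ping-pong buffers so the core and the DMA never contend:

    /* Hypothetical placement of ping-pong buffers in different L1 data
     * sub-banks so the core and the DMA controller can access them in
     * the same cycle. The section names are assumptions tied to the
     * linker description file. */
    #define L1_BANK_A __attribute__((section(".l1.data_a")))
    #define L1_BANK_B __attribute__((section(".l1.data_b")))

    L1_BANK_A short ping[1024];   /* DMA fills this block...             */
    L1_BANK_B short pong[1024];   /* ...while the core works on this one */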
About Data Cache
The flexibility of today's DMA controllers is a double-edged sword. When porting a large C/C++ application from another processor, a programmer is sometimes reluctant to retrofit DMA into code that already works. This is where a data cache comes in handy: data moves from external memory into L1 through the cache for the fastest processing, and the cache is attractive precisely because it behaves like a small DMA engine while demanding almost no effort from the programmer.
Because of how cache lines are filled, data caches shine when the processor operates on consecutive blocks of data held in external memory. The cache does not just hold the data currently being processed; each line fill also pre-fetches the data adjacent to it. In other words, the cache mechanism bets that the data word being processed now belongs to a block whose neighbors are about to be processed, which is a sound assumption for multimedia image, audio, and video streams.
Since data buffers are usually filled by peripheral circuits via DMA, using the data cache is not always as straightforward as using the instruction cache. On caches without hardware snooping, coherency must be maintained manually: the cached data buffers must be invalidated before any new data is read through the cache.
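A typical coherency sequence, sketched here with an assumed dcache_invalidate_range() intrinsic, invalidates the buffer's address range before the core reads a freshly DMA'd frame:

    /* Hypothetical coherency sequence for a cache-enabled DMA buffer:
     * invalidate the address range before reading, so stale cached
     * copies of the previous frame are not returned.
     * dcache_invalidate_range() is an assumed intrinsic. */
    void dcache_invalidate_range(void *addr, unsigned bytes);

    extern volatile int frame_ready;       /* set by the DMA interrupt */
    static unsigned char rx_buf[640 * 480];

    void process_new_frame(void)
    {
        while (!frame_ready)
            ;
        frame_ready = 0;
        dcache_invalidate_range(rx_buf, sizeof rx_buf); /* drop stale lines */
        /* ...reads of rx_buf now come from the freshly DMA'd data... */
    }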
In short, there is no single answer to whether cache or DMA is the better mechanism for moving instructions and data in a given multimedia system. But once development engineers understand the trade-offs, they can strike the right balance between the two and optimize system performance.