How to achieve high-performance DSP processing

Publisher: 和谐共融 · Last updated: 2011-04-26

Application development usually starts with C prototype code written on a personal computer or workstation; the code is then ported to an embedded processor and optimized. This series of articles extends that optimization to the system level, covering three techniques: memory management, DMA management, and system interrupt management. These system-level measures are as important as optimizing the program code itself.

Most systems must move large volumes of data at high rates, so in practice you end up using a mix of all the memory available to the processor, both internal and external.

Software architecture selection

Before design begins, you must decide what type of software "architecture" to use, that is, the underlying software structure that moves program code and data through an embedded system. The architecture affects system performance because it determines how much memory and how many other system resources are consumed, and it reflects trade-offs among performance, ease of use, and other application requirements. Software architectures fall into three categories: high-speed real-time processing; ease of programming taking precedence over performance; and performance as the first consideration.

The first architecture, high-speed real-time processing, is ideal for safety-critical applications or systems without external memory. In these cases, either the latency of buffering data is unacceptable or the resources needed to buffer it are simply not available. With no external memory, all work must be done on-chip: data is read in and processed, a decision is made, and the data is then discarded. The one guarantee required is that the buffer holding the current frame is not overwritten before all processing of that frame completes.

For example, a lane-departure system is a safety-critical application. Such a system usually cannot wait the full 33 milliseconds for a complete frame of data before making a decision; it is better to process part of the frame. For instance, lane detection can start at the end of the frame, so only the data at the end of the frame needs to be read in.

The second architecture applies when ease of programming is the overriding concern. It is ideal for applications that need fast time to market, or where rapid prototyping and ease of programming outweigh performance requirements, and it also reduces development effort.

When the system must deliver the best possible performance, the third architecture is the appropriate choice. Because the focus is on performance, factors such as processor selection, data flow, bandwidth efficiency, and optimization techniques all require careful attention. The drawback of this architecture is reduced reusability and scalability.

Planning instruction and data flow upfront is important during the development cycle, including making important decisions about whether external memory or cache is needed. This allows developers to focus on exploiting the processor's architectural features and tuning performance without having to revisit the initial design.

Cache Overview

A cache holds instructions and data in the processor's on-chip memory, where access times are very fast (usually a single cycle). Caches exist to reduce the amount of single-cycle memory a system must provide. A cache-based processor initially places instructions and data in low-cost, slower external memory; when they are needed, the cache automatically brings them into on-chip memory.

The instruction and data caches provide the highest-bandwidth transfer path to the Blackfin processor core. The weakness of a cache is that it cannot predict which data and instructions the program will need next, so caches expose controls that let the user influence their operation. On the Blackfin processor, critical instruction segments can be locked into the instruction cache so they are always available when needed.

It is worth noting that when the cache determines which instructions to keep, it automatically keeps the most recently used instruction segments. Since DSP software spends most of its time in loops, DSP programs tend to access the same instructions repeatedly. Therefore, without any user intervention, the instruction cache can greatly improve system performance.

Beyond the features of the instruction cache, the data cache also offers "write-through" and "write-back" modes. In write-through mode, every modification of cached data is immediately propagated to external memory; in write-back mode, a modified line is written out only when it is evicted. In short, it is best to start with write-back mode, which improves efficiency by 10-15% and outperforms write-through for most algorithms. Write-through mode is useful when data is shared among multiple resources, because the consistency of the data must be maintained. On the ADSP-BF561 processor, for example, write-through mode is very useful for sharing data between the two processor cores; on a single-core processor, it also helps when the DMA controller and the cache access the same data.


Using DMA to improve performance

DMA is another effective tool for improving system performance. Because DMA transfers proceed independently of the processor core, the core can concentrate on processing data. In an ideal configuration, the core only sets up the DMA controller and responds to an interrupt when each transfer completes.

Typically, high-speed peripherals, and most other peripherals, support DMA transfers. Some DMA controllers can also move data between external and internal memory, and within memory spaces. With careful system design this yields a large performance gain, because data moved by the DMA controller never has to be touched by the processor core.

The Blackfin processor supports two-dimensional DMA transfers, as shown in Figure 1. The left side shows the input buffer, with red, green, and blue primary-color data interleaved. A one-dimensional to two-dimensional DMA transfer converts the interleaved stream into separate red, green, and blue planes. The lower-left corner of Figure 1 shows pseudocode for reading the data. Without a DMA controller, the processor core would have to perform these transfers itself. With DMA, the controller handles the transfer and interrupts the core on completion, freeing the core for other tasks, such as processing the data.

Figure 1: Two-dimensional DMA memory access pattern.

DMA can also be used in conjunction with cache. Typically, DMA transfers first read data from a high-speed peripheral into the processor's external memory, and the data cache reads the data from the external memory into the processor. This operation usually requires the use of a "ping-pong" buffer, one buffer for data transfer and the other for data processing. Figure 2 illustrates this operation. When the DMA controller transfers data to buffer0, the processor core accesses buffer1, and vice versa.


Figure 2: Maintaining data consistency when DMA and cache are used together.

When DMA and cache are used together, the data read in by the DMA controller must be kept consistent with the data in the cache. Figure 2 shows how this is done. When the peripheral produces new data, the DMA controller places it in a new buffer and raises an interrupt to tell the core the data is ready. Before the core processes the buffer, the cache lines covering it are invalidated, forcing the cache to refetch the data from main memory and guaranteeing consistency. The main disadvantage of this approach is that it cannot match the performance of a DMA-only model, in which the controller reads buffer data directly into internal memory.

Instruction Partitioning

Instruction partitioning is usually simple. If the program code fits in internal memory, simply disable the instruction cache and map the code directly into internal memory for maximum performance. Most application code does not fit, however, so the instruction cache must be enabled.

The cache is usually much smaller than external memory, but this is rarely a problem: for most embedded software, roughly 20% of the code accounts for 80% of the execution time. In most cases the most time-consuming code is small enough to fit in the cache, so the cache can be fully effective.

To improve performance further, the instruction line-locking mechanism can pin the most critical code into the cache. For still more performance, the instruction cache can be disabled in favor of a "memory overlay" scheme, which uses DMA to transfer program code into one memory block while the core executes from another.

Data partitioning

Data partitioning is usually less straightforward than instruction partitioning. As with program code, if the data buffers fit in internal memory there is nothing more to do. If not, the first task is to separate static data (such as lookup tables) from dynamic data. A data cache serves static data well, while DMA usually performs better for dynamic data.

Even when a data cache is used, a peripheral DMA channel is usually still needed to move data from the peripheral into external memory. With a data cache, invalidating the cached buffer before accessing the data brings it into internal memory; with DMA, a memory DMA transfer can be set up to read the data from external memory into internal memory.
