How to achieve high-performance DSP processing
Application development usually starts with C prototype code written on a personal computer or workstation; the code is then ported to an embedded processor and optimized. This series of articles extends that optimization to the system level, covering three techniques: memory management, DMA management, and system interrupt management. These optimization measures are as important as optimizing the program code itself.
In most systems, large amounts of data must be moved at high rates, so the design ends up using a mix of the processor's memories, both internal and external.
Software architecture selection
Before design begins, we must decide what type of software "architecture" to use: the underlying software structure that moves program code and data through an embedded system. The architecture affects system performance because it determines how much memory and how many other system resources are used, and it reflects trade-offs among performance, ease of use, and other application requirements. Software architectures fall into three categories: high-speed real-time processing; ease of programming taking precedence over performance; and performance first.
The first architecture, high-speed real-time processing, is ideal for safety-critical applications and for systems without external memory, where either the latency of buffering data is unacceptable or the resources to buffer it are simply not present. With no external memory, all work must be done on-chip: data is read in, processed, a decision is made, and the data is then discarded. Care must be taken that the frame buffer being processed is not overwritten before all processing of the current frame completes.
For example, a lane-departure warning system is a safety-critical application. Such a system usually cannot wait the 33 milliseconds it takes to capture a full frame of data before making a decision; it is better to process a partial frame. Because lane markings appear near the bottom of the image, lane detection can begin as soon as the final lines of the frame arrive, so only that portion of the frame needs to be read in.
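As a rough illustration of the partial-frame idea, the sketch below scans only the bottom rows of a grayscale frame buffer, the region where lane markings would appear, instead of waiting for the whole frame. The brightness-threshold count is only a stand-in for a real lane-detection algorithm; frame layout and names are assumptions for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy illustration: scan only the bottom `roi_rows` rows of a frame
 * (the region where lane markings appear) instead of processing the
 * whole frame. Returns the number of pixels brighter than `threshold`
 * in that region -- a stand-in for a real lane detector. */
size_t scan_bottom_rows(const uint8_t *frame, size_t width, size_t height,
                        size_t roi_rows, uint8_t threshold)
{
    size_t hits = 0;
    const uint8_t *roi = frame + (height - roi_rows) * width;
    for (size_t i = 0; i < roi_rows * width; i++) {
        if (roi[i] > threshold)
            hits++;
    }
    return hits;
}
```

Processing begins as soon as those final rows have arrived, rather than after the full 33 ms frame time.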
The second architecture is used when ease of programming is the primary concern. It is ideal for applications that must reach market quickly, or where rapid prototyping and ease of programming outweigh raw performance; it also reduces development effort.
When the system must deliver the best possible performance, the third architecture is the appropriate choice. Because the focus is on performance, factors such as processor selection, data flow, bandwidth efficiency, and optimization techniques need careful consideration. The drawback of this architecture is reduced reusability and scalability.
Planning instruction and data flow upfront is important during the development cycle, including making important decisions about whether external memory or cache is needed. This allows developers to focus on exploiting the processor's architectural features and tuning performance without having to revisit the initial design.
Cache Overview
A cache holds instructions and data in the processor's on-chip memory, with very fast access time (usually a single cycle). Caches reduce the system's need for memory that can be accessed in a single cycle: a cache-based processor initially places instructions and data in low-cost, slower external memory, and the cache automatically moves them into on-chip memory as they are needed.
The instruction and data caches provide the highest-bandwidth path to the Blackfin processor core. The limitation of a cache is that it cannot predict which data and instructions the program will need next, so the cache exposes controls that let the user influence its behavior. On the Blackfin processor, critical instruction segments can be locked into the instruction cache so they are always available when needed.
It is worth noting that when the cache decides which instructions to keep, it automatically retains the most recently used ones. Since DSP software spends most of its time in loops, DSP programs tend to execute the same instructions repeatedly, so even without any user intervention the instruction cache can greatly improve system performance.
Beyond the features of the instruction cache, the data cache also offers "write-through" and "write-back" modes. In write-through mode, every modification of cached data is also written out to external memory; in write-back mode, modified data is written to external memory only when the cache line is evicted. In general it is best to start programming in write-back mode, which can improve efficiency by 10-15% and is more efficient than write-through for most algorithms. Write-through mode is useful when data must be shared among multiple resources, because coherence has to be maintained. For example, on the ADSP-BF561 processor, write-through mode is very useful for sharing data between the two processor cores; on a single-core processor, this mode is also beneficial when the DMA controller and the cache access the same data.
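To see why write-back usually costs less external-memory traffic, consider a toy model of a single cache line under the two policies. This is not Blackfin-specific code, just the policy logic: write-through pays one external write per store, while write-back pays one write-back when the dirty line is evicted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one cache line under the two write policies.
 * Counts how many external-memory writes N consecutive stores to the
 * same cached line cost. Illustrative only, not a hardware model. */
typedef enum { WRITE_THROUGH, WRITE_BACK } policy_t;

size_t external_writes(policy_t p, size_t n_stores)
{
    size_t ext = 0;
    bool dirty = false;
    for (size_t i = 0; i < n_stores; i++) {
        if (p == WRITE_THROUGH)
            ext++;          /* every store goes out to external memory */
        else
            dirty = true;   /* write-back: only mark the line dirty */
    }
    if (p == WRITE_BACK && dirty)
        ext++;              /* one write-back when the line is evicted */
    return ext;
}
```

For ten stores to the same line, write-through performs ten external writes while write-back performs one, which is the source of the efficiency gain the article cites.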
Using DMA to improve performance
DMA is another effective tool for improving system performance. Because DMA transfers proceed independently of the processor core, the core can concentrate on processing data. In an ideal configuration, the core only sets up the DMA controller and responds to an interrupt when each data transfer completes.
Typically, high-speed peripherals, and most other peripherals as well, support DMA transfers. Some DMA controllers also move data between external and internal memory, and within memory spaces. With careful system design, this yields a large performance gain, because data moved by the DMA controller requires no attention from the processor core.
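The "set up the controller, then react to the completion interrupt" pattern can be sketched in software. The descriptor fields and function names below are made up for illustration and do not match any real controller's register map; the transfer itself is simulated with `memcpy`, and the completion callback stands in for the interrupt handler.

```c
#include <string.h>
#include <stddef.h>

/* Illustrative-only DMA descriptor; names are hypothetical and do not
 * correspond to a real Blackfin DMA register set. */
typedef struct {
    void  *src;
    void  *dst;
    size_t nbytes;
    void (*on_done)(void *ctx);  /* "interrupt" completion callback */
    void  *ctx;
} dma_desc_t;

/* Example completion handler: records that the transfer finished. */
static void set_flag(void *ctx)
{
    *(int *)ctx = 1;
}

/* Software stand-in for a DMA channel: copies the data, then fires the
 * completion callback as the real controller would raise an interrupt. */
void dma_run(dma_desc_t *d)
{
    memcpy(d->dst, d->src, d->nbytes);
    if (d->on_done)
        d->on_done(d->ctx);
}
```

In a real system the core would queue the descriptor and return to processing; here `dma_run` performs the copy inline purely so the control flow is visible.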
The Blackfin processor supports two-dimensional DMA transfers, as shown in Figure 1. The left side shows the input buffer, with red, green, and blue samples interleaved. A one-dimensional to two-dimensional DMA transfer converts the interleaved stream into separate red, green, and blue planes. The lower left corner of Figure 1 shows pseudocode for reading the data. Without a DMA controller, these transfers would have to be performed by the processor core; with one, the DMA controller handles the transfer and interrupts the core on completion, freeing the core for other tasks such as data processing.
Figure 1: Two-dimensional DMA memory access pattern.
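The core-driven loop that the 2D DMA channel replaces might look like the sketch below (a C rendering of the kind of pseudocode Figure 1 shows; the function name is ours). A 2D DMA channel configured with a stride of 3 performs the same de-interleaving with no core involvement.

```c
#include <stddef.h>
#include <stdint.h>

/* Core-driven equivalent of the 2D-DMA transfer in Figure 1: split an
 * interleaved R,G,B,R,G,B,... buffer into three planar buffers. */
void deinterleave_rgb(const uint8_t *in, size_t n_pixels,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < n_pixels; i++) {
        r[i] = in[3 * i + 0];
        g[i] = in[3 * i + 1];
        b[i] = in[3 * i + 2];
    }
}
```

Every iteration of this loop is a load and three stores the core must execute itself; handing the pattern to the DMA controller removes that work entirely.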
DMA can also be used together with the cache. Typically, a DMA transfer first moves data from a high-speed peripheral into the processor's external memory, and the data cache then brings the data from external memory into the processor. This usually calls for a "ping-pong" buffer scheme: one buffer receives the transfer while the other is being processed. Figure 2 illustrates this operation: while the DMA controller fills buffer0, the processor core works on buffer1, and vice versa.
Figure 2: Maintaining data consistency when DMA and cache are used together.
When DMA and cache are used together, the data written by the DMA controller must be kept coherent with the data held in the cache. Figure 2 shows how this is done. When the peripheral produces new data, the DMA controller places it in a new buffer and raises an interrupt to tell the processor core the data is ready. Before the core processes the buffer, the cache lines covering it are invalidated, forcing the cache to fetch fresh data from main memory and thus guaranteeing coherence. The main drawback of this approach is that it cannot match the performance of a DMA-only model, in which the DMA controller reads the buffer data directly into internal memory.
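The ping-pong indexing and the invalidate-before-processing step can be sketched as follows. `cache_invalidate_range` is a placeholder: a real Blackfin port would invalidate the affected cache lines with the processor's cache-control instructions, and the buffer sizes and names here are assumptions for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES 1024

static uint8_t buffers[2][FRAME_BYTES];  /* ping-pong pair in external memory */

/* Placeholder: a real port would invalidate the cache lines covering
 * [addr, addr + len) here, so the core re-reads fresh DMA data from
 * external memory instead of stale cached copies. */
static void cache_invalidate_range(void *addr, size_t len)
{
    (void)addr;
    (void)len;
}

/* Return the buffer the core should process for frame number `frame`,
 * after invalidating its cached copy. While the core processes
 * buffers[frame & 1], the DMA controller fills the other buffer. */
uint8_t *next_process_buffer(unsigned frame)
{
    uint8_t *buf = buffers[frame & 1u];
    cache_invalidate_range(buf, FRAME_BYTES);
    return buf;
}
```

The `frame & 1` index alternates between the two buffers each frame, which is exactly the swap shown in Figure 2.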
Instruction partitioning
Instruction partitioning is usually straightforward. If the program code fits in internal memory, simply disable the instruction cache and map the code directly into internal memory to obtain maximum performance. Most application code does not fit, however, so the instruction cache must be enabled.
The cache is usually much smaller than external memory, but this is not a problem, because for most embedded software roughly 20% of the code accounts for 80% of the execution time. In most cases, the most time-consuming code is small enough to fit in the cache, so the cache can be fully effective.
To improve performance further, the instruction line-locking mechanism can pin the most critical code in the cache. If still more performance is needed, the instruction cache can be disabled in favor of a "memory overlay" mechanism, which uses DMA to load code into one memory block while the core executes from another.
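A minimal sketch of the overlay idea, under the assumption of one shared internal-memory region and overlay images stored in external memory: the manager copies the requested image into the fast region before it runs. Names are illustrative, the copy is a `memcpy` stand-in for a memory-to-memory DMA, and a real overlay manager would also wait for DMA completion before jumping into the region.

```c
#include <string.h>
#include <stddef.h>

#define OVERLAY_BYTES 256

/* Internal-memory region that all overlays share; in a real system this
 * would be a linker section placed in fast on-chip SRAM. */
static unsigned char overlay_region[OVERLAY_BYTES];

/* Load an overlay's code image from external memory into the shared
 * internal region. Returns the region so the caller can execute from
 * it, or NULL if the image does not fit. */
void *load_overlay(const unsigned char *image, size_t len)
{
    if (len > OVERLAY_BYTES)
        return NULL;
    memcpy(overlay_region, image, len);  /* simulated DMA copy-in */
    return overlay_region;
}
```

The performance win comes from overlapping the DMA load of the next overlay with execution of the current one, so the copy cost is hidden behind useful work.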
Data partitioning
Data partitioning is usually less straightforward than instruction partitioning. As with program code, if all data buffers fit in internal memory, no extra work is needed. If not, the first task is to separate static data (such as lookup tables) from dynamic data. The data cache serves static data well, while DMA usually performs better for dynamic data.
Even when a data cache is used, a peripheral DMA channel is usually still needed to move data from the peripheral into external memory. With a data cache, the data is brought into internal memory by invalidating the cached buffer before accessing it; with DMA, a memory DMA transfer can be set up to move the data from external memory into internal memory.