TI C6000 Data Storage Processing and Performance Optimization

fish001


Memory is to the CPU what a warehouse is to a workshop. Raw materials, semi-finished goods, and finished products all move in and out of the warehouse during production. No matter how efficient the production line is, poor warehouse turnover will stall it. Just as a warehouse needs sensible planning and management, data storage needs appropriate handling techniques to get the best computing performance out of the CPU.

Based on the TI C6000 series of DSPs, this article introduces the memory concepts relevant to performance optimization, gives code optimization strategies for specific data storage problems, and discusses concepts that are easily confused.

Terminology

EMIF: External Memory Interface

PMC: Program Memory Controller

DMC: Data Memory Controller

SPC: Section Program Counter

Memory Bank Conflict vs. Memory Alias Ambiguity [1]

1. Memory bank conflict

The on-chip memory structure differs between C6000 devices, but most use an interleaved memory structure, as shown in Figure 1. The numbers in the boxes are byte addresses. Because each bank is single-ported, it can be accessed only once per cycle. For example, given two short values a and b, with a stored at addresses 0-1 and b at addresses 8-9, loads or stores of a and b cannot be scheduled in parallel. If they are, the pipeline stalls for one cycle and the second access is performed during the stall. This is a memory bank conflict.

Figure 1 4-bank interleaved memory

Take the 16-bit short dot product calculation as an example:

int dotp(short a[], short b[]);

An efficient software-pipelined kernel is as follows. In its last two instructions, LDW (load word) instructions load a[0], a[1], b[0], and b[1] in a single cycle.

To keep the software pipeline from stalling, the parallel loads from arrays a and b must not cause a bank conflict.

Conflict example: a's first address = 0; b's first address = 8N

Non-conflicting example: a's first address = 0; b's first address = 8N+4
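The two cases above can be checked numerically. The following is a minimal sketch, assuming the layout of Figure 1 is 4 banks that are each 2 bytes wide; the helper names bank_of and ldw_conflict are hypothetical and are not part of any TI library.

```c
#include <stdint.h>

#define NUM_BANKS  4u  /* assumption: 4 interleaved banks, as in Figure 1 */
#define BANK_BYTES 2u  /* assumption: each bank is 2 bytes wide */
#define WORD_BYTES 4u  /* an LDW loads one 4-byte word */

/* Index of the bank that holds the byte at byte_addr. */
static unsigned bank_of(uint32_t byte_addr)
{
    return (byte_addr / BANK_BYTES) % NUM_BANKS;
}

/* Returns 1 if two aligned word loads issued in the same cycle would
 * touch a common bank, 0 otherwise. */
static int ldw_conflict(uint32_t addr_a, uint32_t addr_b)
{
    uint32_t i, j;

    for (i = 0; i < WORD_BYTES; i += BANK_BYTES)
        for (j = 0; j < WORD_BYTES; j += BANK_BYTES)
            if (bank_of(addr_a + i) == bank_of(addr_b + j))
                return 1;
    return 0;
}
```

Under these assumptions, ldw_conflict(0, 8) reports a conflict (both words use banks 0 and 1), while ldw_conflict(0, 12) does not (banks 0-1 versus banks 2-3).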

However, the start addresses of arrays and other objects cannot always be controlled, especially when a pointer is passed to a function as an argument, since that pointer may refer to different memory locations on different calls.

If we do not know the arrangement of arrays a and b in the memory bank, we can only be sure that a[0-1] and a[2-3] will not have a memory bank conflict, and the same is true for b[0-1] and b[2-3]. Therefore, we can arrange a[0-1] and a[2-3], b[0-1] and b[2-3] to be accessed simultaneously by loop unrolling, thus avoiding possible memory bank conflicts. The software pipeline kernel after loop unrolling is as follows:
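At the C source level, the same unrolling idea can be sketched roughly as below. This is only an illustration of the access pattern, not the assembly kernel itself; the function name dotp_unrolled and the length parameter n are hypothetical additions, and n is assumed to be a multiple of 4.

```c
/* Dot product of two 16-bit arrays, unrolled by 4.
 * Assumption: n is a multiple of 4. */
static int dotp_unrolled(const short a[], const short b[], int n)
{
    int sum0 = 0, sum1 = 0;
    int i;

    for (i = 0; i < n; i += 4) {
        /* a[i..i+1] and a[i+2..i+3] are adjacent words, so they never
         * share a bank; the same holds for the two words of b. */
        sum0 += a[i]     * b[i]     + a[i + 1] * b[i + 1];
        sum1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }
    return sum0 + sum1;
}
```

Keeping two partial sums lets the two halves of each iteration proceed independently, which is what gives the scheduler room to pair the loads from the same array.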

In addition, in linear assembly the .mptr directive can give the assembly optimizer information about how data is laid out in memory, so that it can determine automatically whether bank conflicts will occur and adjust the instruction schedule accordingly.

2. Memory alias ambiguity

Alias ambiguity arises when several different names may refer to the same memory location: instructions that operate on those names may then have memory dependencies. Dependencies between instructions constrain instruction scheduling, including software pipelining.

By default the assembly optimizer conservatively assumes that all memory references may alias, and leaves it to the programmer to state otherwise. Aliasing information can be supplied through a compiler option, the restrict keyword, or two linear assembly directives:

-mt compiler option: states that the code contains no memory aliasing. Use -mt with care: if the code does use aliasing and -mt is set, the results may be incorrect.

restrict keyword: in C, declaring an array or pointer with restrict tells the compiler that the memory it points to does not overlap with memory accessed through any other pointer.

.mdep directive: explicitly declares a memory dependency.

.no_mdep directive: tells the assembly optimizer that no memory dependencies occur in the function body.

*Do not confuse memory alias ambiguity (memory dependency) with memory bank conflicts. They mean different things and have different consequences. Mishandled aliasing affects program correctness (and can also affect performance), whereas bank conflicts affect only performance. Memory dependencies also constrain instruction scheduling more than bank conflicts do.
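As an illustration of the restrict keyword from the list above, here is a minimal sketch. The function vec_add is a hypothetical example; the promise made by restrict is the caller's responsibility, and the compiler does not verify it.

```c
/* restrict promises that out, x and y never overlap, so the compiler
 * may load x[i + 1] and y[i + 1] before the store to out[i] completes. */
static void vec_add(int *restrict out, const int *restrict x,
                    const int *restrict y, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}
```

Calling vec_add with overlapping buffers breaks the promise and gives undefined behavior, which is the same kind of risk as setting -mt on code that does alias.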

Memory Model vs. Endianness [1,2]

1. Memory model

The C6000 compiler supports two memory models: the small memory model and the large memory model.

Small memory model: the .bss section is limited to 32 KB, and the CPU can reach every object in .bss with direct (DP-relative) addressing, without changing the value of DP (B14).

Large memory model: the size of the .bss section is unlimited, but the CPU can reach data in .bss only through register-indirect addressing. The object's address must first be loaded into a register, which costs extra operations.

When global/static variables (stored in the .bss section) exceed 32 KB and you still want the fast access of the small memory model, there are two solutions:

Declare large arrays with the far keyword, so that the data is placed in the .far section instead of taking up space in .bss.

Use the -ml/-ml0 option, which makes the compiler automatically use far access for aggregate data types (such as structures and arrays).
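The first solution might look like the sketch below. The names FAR, big_table and frame_count are hypothetical. The far keyword exists only in the TI compiler, so the sketch hides it behind a macro keyed on _TMS320C6X, the macro the TI C6000 compiler predefines, which lets the same file build elsewhere.

```c
#ifdef _TMS320C6X
#define FAR far   /* TI C6000 compiler: object goes to .far, not .bss */
#else
#define FAR       /* other compilers have no far keyword */
#endif

/* 64 KB table: larger than the 32 KB limit of .bss, so declared far. */
FAR short big_table[32768];

/* Small, frequently used globals stay in .bss for fast access. */
int frame_count;
```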

2. Endianness

Endianness is the order in which the more and less significant bytes of multi-byte data are stored. C6000 supports two modes: little-endian and big-endian.

Little-Endian: the most significant byte of the data is stored at the highest byte address (high byte, high address).

Big-Endian: the most significant byte of the data is stored at the lowest byte address (high byte, low address).
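The two byte orders can be made concrete with a small sketch. The helper names store_le32, store_be32 and cpu_is_little_endian are hypothetical; the first two write a 32-bit value byte by byte in a fixed order, and the third probes the byte order of whatever processor runs the code.

```c
#include <stdint.h>

/* Little-endian: least significant byte at the lowest address. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Big-endian: most significant byte at the lowest address. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)(v);
}

/* Returns 1 if the processor running this code is little-endian. */
static int cpu_is_little_endian(void)
{
    const uint32_t probe = 1u;

    return *(const uint8_t *)&probe == 1u;
}
```

For the value 0x11223344, store_le32 places 0x44 at the lowest address, while store_be32 places 0x11 there.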

Memory boundary alignment [1,3]

The C67X DSP can load or store 16-bit (half-word), 32-bit (word), and 64-bit (double-word) data in a single access, provided the data is half-word-, word-, and double-word-aligned respectively. The C64X additionally supports single-access loads and stores of non-aligned 32-bit and 64-bit data.

Half-word alignment means that the lowest bit of the data address is 0; word alignment means that the lowest 2 bits are 0; double-word alignment means that the lowest 3 bits are 0.

On devices that do not support non-aligned single-access loads and stores, using a multi-byte load or store instruction on non-aligned data requires extra operations, and some processors cannot handle it at all and may raise an error. The following figure shows an example: because the data is misaligned, an access that could have completed in one operation takes five.
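These definitions translate directly into a bit test. A minimal sketch, with the hypothetical helper name is_aligned:

```c
#include <stdint.h>

/* Returns 1 if addr is a multiple of align.
 * Assumption: align is a power of 2 (2, 4 or 8 here). */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1u)) == 0u;
}
```

For example, address 0x1004 passes the test for 2 and 4 but fails it for 8, because its lowest 3 bits are 100.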

Figure 2 Multi-byte access to unaligned data

In C64X, arrays are aligned to 8 bytes (double words) by default; in C62X and C67X, arrays are aligned to 4 bytes/8 bytes.

The alignment of a structure is determined by its member with the largest data type. The size of a structure is always a multiple of the size of that largest member type, so it is not simply the sum of the member sizes. For example, the following two structures A and B occupy 8 and 12 bytes respectively.

struct A
{
short x;
short y;
int z;
};

struct B
{
short x;
int y;
short z;
};

Aligning an aggregate to a boundary does not mean that the address of every element in it is a multiple of the alignment. It guarantees only that the start address and <end address + 1> of the aggregate are multiples of the alignment.
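The sizes of structures A and B above can be confirmed with sizeof and offsetof. This sketch assumes a 2-byte short and a 4-byte int, as on C6000; the constants PAD_A and PAD_B are hypothetical names for the number of padding bytes.

```c
#include <stddef.h>

struct A { short x; short y; int z; };   /* same members as A above */
struct B { short x; int y; short z; };   /* same members as B above */

/* Padding = total size minus the sum of the member sizes. */
enum {
    PAD_A = sizeof(struct A) - (2 * sizeof(short) + sizeof(int)),
    PAD_B = sizeof(struct B) - (2 * sizeof(short) + sizeof(int))
};
```

In B, the int member y must start on a 4-byte boundary, so 2 padding bytes follow x, and 2 more follow z to round the total size up to a multiple of 4. Reordering the members as in A removes the padding.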

Aligned memory access

In C/C++ code, three pragmas instruct the compiler to align specific data in a specified way:

DATA_ALIGN: aligns data to a boundary that is a power of 2

DATA_MEM_BANK: aligns data to the specified bank

STRUCT_ALIGN (C only): specifies the alignment of structures and unions as a power of 2

The _nassert() intrinsic can be used to tell the compiler how a data item is aligned.

_nassert( ((int)sum & 0x3) == 0);

tells the compiler that sum is aligned to a word boundary. With this information, the compiler can safely schedule SIMD (single instruction, multiple data) instructions to operate on the data, but _nassert itself does not generate any operations.
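DATA_ALIGN and _nassert() are often used together: the pragma makes the alignment true, and the intrinsic states it inside a function that only sees a pointer. A minimal sketch follows; the names samples and sum_shorts are hypothetical. The TI-specific parts are guarded by _TMS320C6X, and on other compilers _nassert is replaced by an empty stand-in so the file still builds.

```c
#ifdef _TMS320C6X
#pragma DATA_ALIGN(samples, 8)    /* TI compiler: 8-byte alignment */
#else
#define _nassert(cond) ((void)0)  /* stand-in: the hint does nothing */
#endif

short samples[64];

/* Sums n 16-bit values.
 * Assumption: data is word-aligned, as the _nassert hint states. */
int sum_shorts(const short *data, int n)
{
    int i, sum = 0;

    _nassert(((int)data & 0x3) == 0);
    for (i = 0; i < n; i++)
        sum += data[i];
    return sum;
}
```

The hint is a promise, not a check: passing a pointer that is not word-aligned to sum_shorts on the target would make the promise false.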

You can use the _amemXX() and _amemXX_const() intrinsics to access aligned words and half-words. This kind of memory access is typically combined with data unpacking and packing intrinsics such as _hi(), _lo(), and _itod().

Unaligned memory access

C64X supports non-aligned word and double-word accesses. The following table compares aligned and non-aligned data accesses:

As the table shows, C64X can perform only one non-aligned memory access per clock cycle, so aligned memory accesses should be used whenever possible.

In C/C++ code, you can use the _memXX() and _memXX_const() intrinsics to access non-aligned words and half-words. This kind of memory access is typically combined with data unpacking and packing intrinsics such as _hi(), _lo(), and _itod(). The following is an example of use:
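A minimal sketch of a non-aligned word load, written for this rewrite rather than taken from the original post. The wrapper name load_u32_unaligned is hypothetical. It assumes that the TI compiler for C64X provides _mem4_const(); on any other compiler the sketch falls back to memcpy, which is the portable way to read from an address of unknown alignment.

```c
#include <stdint.h>
#include <string.h>

/* Loads a 32-bit word from an address with no alignment guarantee. */
static uint32_t load_u32_unaligned(const void *p)
{
#ifdef _TMS320C6X
    return _mem4_const(p);     /* TI intrinsic: non-aligned word load */
#else
    uint32_t v;

    memcpy(&v, p, sizeof v);   /* portable fallback */
    return v;
#endif
}
```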

C6000 Cache [4,5]

Why do we need Cache?

Large-capacity memory (such as DRAM) has limited access speed, which is generally much slower than the CPU clock speed; small-capacity memory (such as SRAM) can provide fast access speed. Therefore, many high-performance processors provide a hierarchical storage access architecture.

In Figure 3, the left side shows a flat memory architecture and the right side a hierarchical memory architecture with two levels of cache. In the architecture on the left, even if the CPU can run at 600MHz, the on-chip/off-chip memory can run only at 300MHz/100MHz, so the CPU has to insert wait cycles when accessing memory.

Figure 3 Flat and hierarchical memory architectures

Cache operating states

Cache hit: an access to program or data that is already cached. The instructions/data in the cache are sent to the CPU immediately, without waiting.

Cache miss: the required instructions/data are first read in through the EMIF and are stored in the cache as they are sent to the CPU. The CPU stalls while the program/data is being read.

Cache flush: clears the cached data.

Cache freeze: the cache contents no longer change. On a miss, the instruction packets read through the EMIF are not stored in the cache.

Cache bypass: the cache contents do not change, and all program/data accesses go to memory outside the cache.

C6000 storage architecture

The C6000 series DSP provides two levels of cache, L1 and L2, between the on-chip RAM and the CPU. L1 is split into an independent program cache (L1P) and data cache (L1D), while L2 is shared by program and data. L1 is fixed, and L2 can be remapped as ordinary on-chip RAM.

When accessing program or data, the CPU first looks in the L1 cache. On a hit, it accesses the data directly. On a miss, it goes on to look in the L2 cache. If that also misses, it fetches the data from on-chip RAM or off-chip RAM.

Figure 4 Program/data access flow of C6000 CPU

The principle of locality

As Figure 4 suggests, memory access is efficient only when the CPU mostly accesses the memory closest to it. Fortunately, the principle of locality makes this achievable: within a short time window, a program needs only a relatively small amount of data and code. There are two kinds of locality:

Spatial locality: when a data item is accessed, neighbouring data is likely to be accessed soon afterwards.

Temporal locality: when a memory location is accessed, it is likely to be accessed again in the near future.
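The effect of adjacent versus scattered accesses is easiest to see on a two-dimensional array. The sketch below is illustrative only: the matrix size and the function names sum_row_major and sum_col_major are hypothetical. Both functions return the same sum, but they visit memory in a different order.

```c
#define ROWS 64
#define COLS 64

/* Row by row: consecutive accesses touch adjacent addresses, so every
 * cache line that is fetched gets fully used. */
static long sum_row_major(const short *m)
{
    long sum = 0;
    int r, c;

    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            sum += m[r * COLS + c];
    return sum;
}

/* Column by column: consecutive accesses are COLS * sizeof(short) bytes
 * apart, so each access may land in a different cache line. */
static long sum_col_major(const short *m)
{
    long sum = 0;
    int r, c;

    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            sum += m[r * COLS + c];
    return sum;
}
```

How much slower the second version runs depends on the cache line size and capacity of the particular device, so it should be measured rather than assumed.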

Optimizing cache performance

Based on the principle of locality, some basic principles for optimizing cache performance can be summarized:

Let the function process the data as fully as possible to improve data reuse.

Organize data and code to improve cache hit rates.

Reasonable space division to balance program cache and data cache.

Group functions that operate on the same data in one storage area.
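The first principle, processing data as fully as possible while it is in the cache, can be sketched by fusing two passes over a buffer into one. The function names and the scale-then-offset operation are hypothetical examples.

```c
/* Two passes: for a buffer larger than the cache, every element is
 * brought into the cache twice. */
static void scale_offset_two_pass(short *buf, int n, short gain, short off)
{
    int i;

    for (i = 0; i < n; i++)
        buf[i] = (short)(buf[i] * gain);
    for (i = 0; i < n; i++)
        buf[i] = (short)(buf[i] + off);
}

/* One fused pass: each element is loaded once, fully processed, and
 * stored once. */
static void scale_offset_fused(short *buf, int n, short gain, short off)
{
    int i;

    for (i = 0; i < n; i++)
        buf[i] = (short)(buf[i] * gain + off);
}
```

The two versions give the same results as long as the intermediate product fits in a short; the fused version simply touches each element once instead of twice.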

Sections [1,6]

The smallest unit of an object file (.obj) is called a section, which is a block of code or data that occupies contiguous space. One of the linker's jobs is to relocate sections into the memory map of the target system. Every section can be relocated independently, and the user can place any section in any specified block of target memory.

A COFF file contains three default sections: .text, .data, and .bss. Users can also create, name, and link their own sections, and can further divide each section into subsections.

In C/C++ code, two pragmas can be used to place specific code or data in a specific section:

CODE_SECTION: places code in a named section.

DATA_SECTION: places data in a named section.
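A minimal sketch of both pragmas in C. The section names .fast_code and .fast_data and the symbols coeffs and fir_kernel are hypothetical; the sections would also have to be assigned to a memory block in the linker command file. The pragmas are guarded by _TMS320C6X so that other compilers skip them.

```c
#ifdef _TMS320C6X
#pragma CODE_SECTION(fir_kernel, ".fast_code")  /* hypothetical name */
#pragma DATA_SECTION(coeffs, ".fast_data")      /* hypothetical name */
#endif

short coeffs[4] = { 1, 2, 3, 4 };

/* 4-tap weighted sum, placed in its own code section on the target. */
int fir_kernel(const short *x)
{
    int i, acc = 0;

    for (i = 0; i < 4; i++)
        acc += x[i] * coeffs[i];
    return acc;
}
```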

Stack and Heap[1,6]

The stack (.stack) and heap (.heap) are two storage areas that provide support for the processor runtime.

The stack stores temporary data such as local variables and function parameters. Its space is allocated by the compiler when needed and released automatically when no longer needed.

The heap is used for dynamic memory allocation. In the classic program memory layout the heap sits between the bss area and the stack area. It is usually allocated and released by the programmer. If the programmer does not release it, it may be reclaimed by the OS when the program ends. For example, the malloc() function commonly used in C allocates an area in the heap to store data.
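A short sketch of the difference in practice: the loop counter below lives on the stack and disappears on return, while the buffer returned by malloc() lives on the heap until the caller frees it. The function name make_ramp is hypothetical.

```c
#include <stdlib.h>

/* Returns a heap buffer holding 0, 1, ..., n-1, or NULL if the
 * allocation fails. The caller must release it with free(). */
static short *make_ramp(int n)
{
    short *p = (short *)malloc((size_t)n * sizeof *p);
    int i;                      /* local variable: lives on the stack */

    if (p == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        p[i] = (short)i;
    return p;                   /* heap memory outlives this call */
}
```

On an embedded target the heap is usually small, so checking the result of malloc() for NULL matters more than it does on a desktop system.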