TI multi-core DSP C66XX development experience

Aguilera

1. The Profile tool under Tools in CCS 5.1 provides analysis of L2 and L1D.

2. For L1P, spru187t under \ccsv5\tools\compiler\c6000\doc contains an introduction to the cache layout tools, which can optimize L1P cache usage. You can also use the profile tools of the cycle-approximate simulator, which include L1P analysis.

MSMC is configured as L2 by default and can be configured as L3 according to user needs. Since configuring it as L3 only changes the address mapping, the physical access time should stay in the same order of magnitude, with little difference. The difference between L2 and L3 here should be that L2 can only be cached by L1D and L1P, while L3 can be cached by L2, L1D, and L1P. Generally the default L2 configuration is used, and users decide whether to configure it as L3 based on their own application. The most common scenario for L3 is when MSMC memory needs to be non-cacheable; in that case MSMC must be set up as L3 RAM.

The MSMC of the C6678 handles the access requests of all masters in the system (the 8 cores plus the SMS and SES interfaces) to MSMC SRAM and DDR3. The 4 MB MSMC SRAM has 4 banks, each an independent slave. If two masters access two different banks in the same clock cycle, both accesses complete at the same time. If multiple masters access the same bank in the same cycle, the arbitration logic in the MSMC orders them by priority. DDR3 has only one slave port, so when multiple masters access it in the same cycle, the MSMC arbitration logic again orders them by priority.
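To make the bank behavior described above concrete, here is a small toy model in C. It is only a sketch of "different banks proceed in parallel, same bank is decided by priority". The struct, the function name msmc_arbitrate, and the rule that a lower priority value wins are all assumptions for illustration; the real MSMC arbiter is more involved than this.

```c
#define MSMC_BANKS 4   /* the post states 4 SRAM banks */

/* One access request issued in a given clock cycle (hypothetical model). */
typedef struct {
    int master;     /* id of the requesting master */
    int bank;       /* target bank, 0 .. MSMC_BANKS-1 */
    int priority;   /* assumption: lower value = higher priority */
} msmc_req;

/* For each bank, store the index of the request that is served this cycle,
 * or -1 if the bank is idle. Returns how many requests complete in this
 * cycle (at most one per bank). */
static int msmc_arbitrate(const msmc_req *req, int n, int winner[MSMC_BANKS])
{
    int granted = 0;
    int b, i;

    for (b = 0; b < MSMC_BANKS; b++)
        winner[b] = -1;

    for (i = 0; i < n; i++) {
        b = req[i].bank;
        if (b < 0 || b >= MSMC_BANKS)
            continue;                       /* ignore malformed requests */
        if (winner[b] < 0) {
            winner[b] = i;                  /* first request for this bank */
            granted++;
        } else if (req[i].priority < req[winner[b]].priority) {
            winner[b] = i;                  /* higher priority takes the bank */
        }
    }
    return granted;
}
```

With three requests where one goes to bank 0 and two go to bank 1, two accesses complete in that cycle and the losing bank-1 request has to wait, which is the situation the text describes.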
In DSP development it is common to measure the cycles consumed by a function or a section of code. Profiling and clock() are the usual tools, but they are mostly used in simulation. For emulation on the board, the results of these methods are not very reliable, because where the data and the code under test are stored on the board, and how long they take to read, also come into play.

In fact, the C64x+ core has two counting registers, TSCL and TSCH. They run at the CPU frequency and together form a 64-bit count that increases by 1 on every CPU cycle, so they can accurately measure the cycles the CPU spends in a given stretch of code. Usually only TSCL is used: at 594 MHz, 32 bits cover about 7 s (2^32 / 594 MHz is roughly 7.2 s). TSCH holds the upper 32 bits and is generally not needed unless an entire project is being timed.

To use it:

First, use the linker command file to place the function under test in L1P and the data it uses in L1D. This removes data and instruction transfer time from the measurement (otherwise the measured time includes moving data and instructions from off-chip to on-chip memory).

Then, write to TSCL just before the function or code under test, for example by writing register A0 to TSCL. The write initializes the counter and starts it counting.

Finally, read TSCL at the end of the function or after the code segment under test. The value read is the number of CPU cycles consumed by the function or code segment.

Remember to restart the CPU before each test, because the counter cannot be stopped by programming. It stops only under two conditions: a. after exiting the reset state, that is, after a restart; b. when the CPU is completely powered down.

In general, because these two registers are inside the core and run at the CPU frequency, timing with them is very accurate; even the 1-cycle cost of reading a compressed fetch packet (fpread) is accounted for. This is especially effective when testing handwritten assembly, where you can see clearly how many cycles an instruction is delayed.

Usage:

Long-duration, wide-range measurement:

    unsigned long long t1, t2;
    t1 = _itoll(TSCH, TSCL);
    code_wait_test;              /* code under test */
    t2 = _itoll(TSCH, TSCL);
    printf("#cycle=%llu", t2 - t1);

Short-duration (up to about 7 seconds), narrow-range measurement:

    unsigned int t1, t2;
    t1 = TSCL;
    ... process code ...
    t2 = TSCL;
    printf("#cycle=%u", t2 - t1);

Method 2: you can also use the BIOS API:

    LgUns time1 = CLK_gethtime();
    ... process code ...
    LgUns time2 = CLK_gethtime();
    LgUns cpucycles = (LgUns)((time2 - time1) * CLK_cpuCyclesPerHtime());
    printf("#cycle=%u", (unsigned int)cpucycles);
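The TSCL method above can be wrapped in a few helpers. The following is a minimal sketch, not a definitive implementation. It assumes the TI compiler's c6x.h header (which declares TSCL) and the predefined macro _TMS320C6X. The names TSC_START, TSC_READ, tsc_delta32, tsc_wrap_seconds and measure_cycles are made up for this example, and the host stand-in counter exists only so the sketch also compiles off-target.

```c
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>                       /* TI header that declares TSCL/TSCH */
#define TSC_START()  (TSCL = 0)        /* a write to TSCL starts the counter */
#define TSC_READ()   ((uint32_t)TSCL)
#else
/* Hypothetical host stand-in: each read advances a software counter
 * by 10 "cycles". It does not model real hardware timing. */
static uint32_t host_tsc;
#define TSC_START()  (host_tsc = 0u)
#define TSC_READ()   (host_tsc += 10u)
#endif

/* Cycles between two TSCL samples. Unsigned subtraction stays correct
 * even if the 32-bit counter wrapped once between the samples. */
static uint32_t tsc_delta32(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Seconds until a 32-bit cycle counter wraps at the given CPU clock. */
static double tsc_wrap_seconds(double cpu_hz)
{
    return 4294967296.0 / cpu_hz;      /* 2^32 cycles */
}

/* Time one call of fn, subtracting the cost of an empty measurement
 * (two back-to-back reads) as an estimate of the read overhead. */
static uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t t0, t1, t2;

    TSC_START();
    t0 = TSC_READ();
    t1 = TSC_READ();                   /* t1 - t0 = read overhead */
    fn();
    t2 = TSC_READ();
    return tsc_delta32(t1, t2) - tsc_delta32(t0, t1);
}
```

tsc_wrap_seconds(594e6) comes out at a little over 7.2, which is where the "7 s" figure for TSCL at 594 MHz comes from. For anything that may run longer than that, use the 64-bit TSCH/TSCL form shown in the Usage section instead of the 32-bit difference.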