What is inter-core communication in a multi-core processor system
Inter-core communication is the main difficulty faced by multi-core processor systems. The 8-core DSP TMS320C6678, based on the KeyStone architecture, runs each C66x core at 1.25 GHz, and each core provides up to 40 GMAC of fixed-point operations and 20 GFLOP of floating-point operations per second. The eight cores together give an equivalent core frequency of 10 GHz, and the theoretical single-precision floating-point parallel computing capability reaches 160 GFLOP, which is 50 times that of the TS201S and 115.2 times that of the C67x+ [1]. The device is suitable for ultra-high-performance computing applications such as oil and gas exploration, radar signal processing, and molecular dynamics, which place high demands on fixed-point and floating-point computing capability and on real-time performance.
The quality of the communication mechanism directly affects the performance of a multi-core processor, and an efficient communication mechanism is an important guarantee of high performance. The TMS320C6678 uses TI's new KeyStone multi-core architecture, a single-chip multi-core architecture that differs from the common on-board multi-chip communication method [2]. However, research on communication in the KeyStone architecture has only just started. Because multi-core communication is complex, a suitable communication topology has to be built, and the choice of topology directly affects the communication cost and the efficiency of parallel computing [3].
The TMS320C6678 achieves multi-core communication with an interrupt controller based on the KeyStone architecture, inter-core communication registers, and a suitable communication topology. The interrupt system activates the target core and triggers an interrupt service routine that carries the communication function; the routine accesses the inter-core communication registers to perform the corresponding function, and the communication is completed over the chosen topology.
Based on the above analysis, this paper first analyzes the interrupt controller and the principle and implementation of inter-core interrupts on the TMS320C6678 multi-core processor. It then analyzes the principle of inter-core communication and gives the implementation of communication initiation and response. Finally, it introduces two multi-core communication topologies, master-slave and data flow, compares their communication costs experimentally, and draws out the advantages, disadvantages, and scope of application of the two structures. The results offer some guidance for the design of inter-core communication in multi-core processors.
1 TMS320C6678 interrupt controller
The TMS320C6678 uses the interrupt controller INTC (Interrupt Controller) [4] based on the KeyStone architecture. The controller activates the processor and triggers the corresponding interrupt service routine, which completes the first step of communication. The interrupt vector table must first be configured and the CPU interrupt function enabled. The CPU of the TMS320C6678 can receive 15 interrupts: 1 hardware exception (EXCEP), 1 non-maskable interrupt (NMI), 1 reset (RESET), and 12 maskable interrupts (INT4~INT15). Up to 128 interrupt sources are supported. Each core generates an event (Event) through the event controller, which triggers the inter-core interrupt (IPI) used to communicate with other cores. On the TMS320C6678, the inter-core interrupt (IPC_LOCAL) corresponds to event 91 by default. It is a maskable interrupt and can be mapped to any interrupt from INT4 to INT15 through the interrupt controller.
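The mapping of event 91 to one of INT4~INT15 can be illustrated with a small host-side sketch. It assumes the interrupt-mux layout as commonly documented for the C66x CorePac: three registers INTMUX1 to INTMUX3, each holding four 7-bit event selectors, one per byte, for INT4~INT7, INT8~INT11, and INT12~INT15. The helper names are hypothetical and are not part of any TI library; the code only computes register values and does not touch hardware.

```c
#include <stdint.h>

/* Default inter-core interrupt event on the TMS320C6678 (from the text). */
#define IPC_LOCAL_EVENT 91u

/* Assumed layout: INTMUX1..INTMUX3 each hold four 7-bit selectors,
   one per byte, for INT4-7, INT8-11 and INT12-15 respectively. */

/* Index (1..3) of the INTMUX register that holds the selector for INT<cpu_int>. */
static unsigned intmux_index(unsigned cpu_int)
{
    return cpu_int / 4u;            /* 4..7 -> 1, 8..11 -> 2, 12..15 -> 3 */
}

/* Bit offset of the selector for INT<cpu_int> inside its INTMUX register. */
static unsigned intmux_shift(unsigned cpu_int)
{
    return (cpu_int % 4u) * 8u;
}

/* Return the INTMUX value that routes `event` to INT<cpu_int>,
   leaving the other three selectors in `old_value` unchanged. */
static uint32_t intmux_route(uint32_t old_value, unsigned cpu_int, unsigned event)
{
    unsigned shift = intmux_shift(cpu_int);
    uint32_t mask = (uint32_t)0x7Fu << shift;
    return (old_value & ~mask) | (((uint32_t)event & 0x7Fu) << shift);
}
```

For example, intmux_route(0, 4, IPC_LOCAL_EVENT) gives the value to write to INTMUX1 so that event 91 drives INT4. Doing the update as a read-modify-write keeps any mappings already configured for the neighbouring interrupts.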
To implement inter-core interrupts, the following settings must be made:
(1) The global interrupt enable bit in the control status register (CSR) is set to 1, enabling global interrupts;
(2) The NMIE bit in the interrupt enable register (IER) is set to 1, enabling maskable interrupts;
(3) The bit in the interrupt enable register (IER) that corresponds to the maskable interrupt to be mapped is set to 1;
(4) Event 91 is selected as the interrupt source, and the event is mapped to the specified physical interrupt number.
After the interrupt occurs, the corresponding bit of the interrupt flag register (IFR) is set to 1, and the CPU jumps through the pre-configured interrupt vector table into the interrupt service routine (ISR) to complete the inter-core communication, as shown in Figure 1.
In the IPC interrupt generation register IPCGRx (0≤x≤7), bits 4 to 31 (SRCS0~SRCS27) provide the ability to identify up to 28 interrupt sources. In the IPC interrupt acknowledgement register IPCARx (0≤x≤7), bits 0 to 3 are reserved, and bits 4 to 31 (SRCC0~SRCC27) correspond to the 28 interrupt sources. When SRCSx is set to 1, the SRCCx bit of the corresponding acknowledgement register is also set to 1. When the interrupt is acknowledged, both SRCCx and the corresponding SRCSx bit are cleared to 0. When a core of the TMS320C6678 is ready to communicate with another core, event 91 is triggered according to the interrupt event mapping table of the TMS320C6678, a maskable inter-core interrupt is generated, and the interrupt service routine is called. The interrupt service routine IPC_ISR function is designed as follows:
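void IPC_ISR(void)
{
    /* Unlock the kicker registers so that the IPC registers can be written. */
    KICK0 = KICK0_UNLOCK;
    KICK1 = KICK1_UNLOCK;
    /* Store the 0x20 source information in the SRCS field of IPCGR2 (core_2). */
    *(volatile uint32_t *) IPCGR[2] = 0x20;
    /* Set IPCG (bit 0) to raise the inter-core interrupt on core_2. */
    *(volatile uint32_t *) IPCGR[2] |= 1;
    /* Lock the kicker registers again; any value other than the unlock key locks them. */
    KICK0 = 0;
    KICK1 = 0;
}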
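The register behaviour described before the listing (source bits in SRCS, the trigger in IPCG, and acknowledgement clearing both SRCC and SRCS) can be sketched as a host-side software model. This is an illustration only: the type ipc_model_t and the functions ipc_write_gr and ipc_acknowledge are hypothetical names, and the model does not access real hardware registers.

```c
#include <stdint.h>

#define IPCG_BIT 0x1u          /* bit 0: writing 1 raises the interrupt   */
#define SRC_MASK 0xFFFFFFF0u   /* bits 4..31: SRCS0..27 / SRCC0..27        */

/* Hypothetical software model of one core's IPC register pair. */
typedef struct {
    uint32_t ipcgr;    /* generation register (SRCS bits)       */
    uint32_t ipcar;    /* acknowledgement register (SRCC bits)  */
    int      pending;  /* 1 while an IPC interrupt is pending   */
} ipc_model_t;

/* Sender side: latch the source bits and, if IPCG is written as 1,
   raise the inter-core interrupt on the modelled core. */
static void ipc_write_gr(ipc_model_t *core, uint32_t value)
{
    uint32_t src = value & SRC_MASK;
    core->ipcgr |= src;
    core->ipcar |= src;            /* SRCCx mirrors SRCSx */
    if (value & IPCG_BIT) {
        core->pending = 1;
    }
}

/* Receiver side: read the pending source bits, then acknowledge them,
   which clears both the SRCC and the matching SRCS bits. */
static uint32_t ipc_acknowledge(ipc_model_t *core)
{
    uint32_t src = core->ipcar & SRC_MASK;
    core->ipcar &= ~src;
    core->ipcgr &= ~src;
    core->pending = 0;
    return src;
}
```

Calling ipc_write_gr with 0x20 | IPCG_BIT mirrors the two writes in IPC_ISR, and ipc_acknowledge then returns the source information and leaves the model ready for the next interrupt.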
Take sending the 0x20 information to core_2 in IPC_ISR as an example. The 0x20 information is stored in the SRCS bits to identify the interrupt source. At the same time, the lowest bit, IPCG, of the interrupt generation register IPCGR2 is set to 1 to trigger the IPC interrupt. When the interrupt reaches the target core, the core automatically jumps to the corresponding entry in the interrupt vector table, reads its own interrupt generation register IPCGRx (0≤x≤7), and obtains the inter-core information sent by the communication initiator from the SRCS bits of that register. It then writes the information to the corresponding interrupt acknowledgement register IPCARx, which clears the SRCC bits and the corresponding SRCS bits, so that the next inter-core interrupt can be received. KICK0 and KICK1 are the kicker (lock) control registers; they are unlocked before the IPC registers are written and are used to avoid communication conflicts.
3 Topology Design and Performance Test
The analysis above covers the basic inter-core communication mechanism of the TMS320C6678 and its implementation. To exploit the multi-core capability of the TMS320C6678, however, a good parallel computing scheme must be designed from the system perspective, and the key is a suitable parallel topology. Communication cost, bandwidth, and function are important indicators for evaluating communication. The following introduces two parallel methods for multi-core communication, analyzes their topologies, and compares them by test against these indicators.
3.1 Communication Topology
Two parallel methods are suitable for multi-core DSP communication: the master-slave topology (Master Slave) [5] and the data flow topology (Data Flow) [6]. The master-slave topology on the TMS320C6678 is shown in Figure 2.
The master core (control core) exchanges data with the external DDR memory through EDMA and then communicates with the slave cores through inter-core interrupts. The master core plays the control role, and all interrupts of the slave cores (computing cores) are handled by it. The slave cores are responsible only for computing tasks, and there is no inter-core communication between slave cores.
3.2 Performance Test Experiment
This paper designs an inter-core communication test program to test the two structures. When a core receives an interrupt from another core, it immediately acknowledges it and issues inter-core interrupts in sequence according to the topology, without any other time-consuming operations. The program is run on the TMDXEVM6678L evaluation board, which carries a TMS320C6678 chip. The processor frequency is set to 1 GHz, and the development environment is TI's CCSv5.0. The communication test results are shown in Table 1. With the master-slave structure, the total communication cost of running the test program is 171,352 clock cycles, of which core_0, as the master core, consumes 116,311 clock cycles and each of the 7 slave cores consumes 7,863 clock cycles. With the data flow structure, the total communication cost is 171,319 clock cycles, of which core_0 consumes 21,385 clock cycles, core_7 consumes 21,366 clock cycles, and each of the other six cores consumes 21,428 clock cycles.
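The reported totals can be checked from the per-core figures, and the same data shows how differently the two structures distribute the load. The sketch below is only an arithmetic check on the numbers quoted above; the array and function names are hypothetical.

```c
#include <stdint.h>

#define NUM_CORES 8

/* Per-core communication cost in clock cycles, as reported in the text. */
static const uint32_t master_slave[NUM_CORES] = {
    116311, 7863, 7863, 7863, 7863, 7863, 7863, 7863   /* core_0 is the master */
};
static const uint32_t data_flow[NUM_CORES] = {
    21385, 21428, 21428, 21428, 21428, 21428, 21428, 21366
};

/* Total communication cost: sum of the per-core clock cycles. */
static uint32_t total_cycles(const uint32_t cycles[NUM_CORES])
{
    uint32_t sum = 0;
    for (int i = 0; i < NUM_CORES; ++i) {
        sum += cycles[i];
    }
    return sum;
}

/* Cost of the most heavily loaded core. */
static uint32_t max_cycles(const uint32_t cycles[NUM_CORES])
{
    uint32_t max = cycles[0];
    for (int i = 1; i < NUM_CORES; ++i) {
        if (cycles[i] > max) {
            max = cycles[i];
        }
    }
    return max;
}
```

The two totals, 171,352 and 171,319 clock cycles, are almost identical, but the most heavily loaded core carries 116,311 cycles in the master-slave structure against 21,428 in the data flow structure, which is the kind of difference that matters when the communication load on a single core limits throughput.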