Overview
In modern electronic systems, failures caused by "soft" errors are more common than those caused by permanent hardware faults. A soft error is a recoverable fault: a bit held in a register, RAM or similar storage flips because of electromagnetic interference, alpha particles emitted by the packaging material, cosmic rays and so on. To address this problem, parity checking or ECC circuitry can be added to the hardware to detect such errors and, in the case of ECC, to correct them.
The Flash, SRAM and cache of the STM32H7 series MCUs all support ECC. This article focuses on the SRAM ECC function and the precautions to take when using it in an application.
RAMECC peripheral
The STM32H7 series MCUs include a peripheral called RAMECC, the RAM ECC Monitor. RAMECC gives the application an interface to check the current RAM ECC status and to run the corresponding recovery or error-reporting procedure when an ECC error occurs.
The RAM ECC of STM32H7 supports correcting single-bit errors and detecting double-bit errors. For AXI SRAM and TCM RAM, 8-bit ECC code is added for every 64-bit data; for other 32-bit bus SRAM, 7-bit ECC code is added for every 32-bit data.
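The 8-bit and 7-bit code widths above match what a Hamming-style single-error-correcting, double-error-detecting (SECDED) code needs for 64 and 32 data bits. The sketch below only checks that arithmetic. It assumes a Hamming-type code, which the text does not state, and the function name secded_check_bits is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum number of check bits for a Hamming-style SECDED code that
 * protects data_bits bits of data: the smallest r that satisfies
 * 2^r >= data_bits + r + 1 (single-error correction), plus one overall
 * parity bit (double-error detection).
 * Assumption: a Hamming-type code; the code actually used by the
 * hardware is not described in this article. */
static unsigned secded_check_bits(unsigned data_bits)
{
    assert(data_bits > 0u && data_bits <= 1024u);

    unsigned r = 1u;
    while ((1u << r) < data_bits + r + 1u) {
        r++;
    }
    return r + 1u;
}
```

Under this assumption, secded_check_bits(64) gives 8 and secded_check_bits(32) gives 7, the same widths as quoted above.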
When data is written to SRAM, the hardware automatically calculates and stores the ECC code. The code is checked automatically on every read and on every unaligned write, which the hardware carries out as a read-modify-write. When an error is found, the failing address and data can be read from registers.
The implementation of the RAM ECC function of STM32H7 can be divided into two parts: RAM ECC Controller and RAM ECC Monitor unit, as shown in the figure below.
About ECC Controller
The SRAM of STM32H7 is divided into AXI SRAM, SRAM1, SRAM2, SRAM3, SRAM4, data TCM RAM, instruction TCM RAM and backup SRAM. Each RAM block has its own ECC Controller. The ECC Controller is always enabled. It calculates, stores and compares the ECC code, and it performs single-bit error correction and double-bit error detection.
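To make "single-bit correction, double-bit detection" concrete, here is a toy software model of a SECDED code: an extended Hamming (8,4) code that protects 4 data bits with 4 check bits. It is only an illustration of the principle. The ECC Controller works in hardware on 32-bit or 64-bit words, its actual code is not described in this article, and every name in the sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_DOUBLE_ERROR } ecc_result_t;

/* Parity (XOR of all bits) of the low 8 bits of v. */
static unsigned parity8(unsigned v)
{
    v &= 0xFFu;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Codeword layout (bit index): 0 = overall parity, 1/2/4 = Hamming
 * parity bits, 3/5/6/7 = data bits d0..d3. */
static uint8_t secded_encode(uint8_t nibble)
{
    assert(nibble <= 0x0Fu);

    unsigned cw = 0u;
    cw |= ((nibble >> 0) & 1u) << 3;
    cw |= ((nibble >> 1) & 1u) << 5;
    cw |= ((nibble >> 2) & 1u) << 6;
    cw |= ((nibble >> 3) & 1u) << 7;
    cw |= parity8(cw & 0xAAu) << 1;  /* covers bits 1, 3, 5, 7 */
    cw |= parity8(cw & 0xCCu) << 2;  /* covers bits 2, 3, 6, 7 */
    cw |= parity8(cw & 0xF0u) << 4;  /* covers bits 4, 5, 6, 7 */
    cw |= parity8(cw);               /* overall parity in bit 0 */
    return (uint8_t)cw;
}

/* Decode a codeword. A single flipped bit is corrected; two flipped
 * bits are reported and *nibble is left untouched. */
static ecc_result_t secded_decode(uint8_t cw, uint8_t *nibble)
{
    assert(nibble != NULL);

    /* The syndrome is the index of the flipped bit (0 if none of the
     * Hamming parity groups is violated). */
    unsigned syndrome = parity8(cw & 0xAAu)
                      | (parity8(cw & 0xCCu) << 1)
                      | (parity8(cw & 0xF0u) << 2);
    unsigned overall = parity8(cw);
    ecc_result_t result = ECC_OK;

    if (overall != 0u) {
        /* Odd number of flips: treat as one error and repair it.
         * Syndrome 0 means the overall parity bit itself flipped. */
        cw ^= (uint8_t)(1u << syndrome);
        result = ECC_CORRECTED;
    } else if (syndrome != 0u) {
        /* Even number of flips with a non-zero syndrome: two errors. */
        return ECC_DOUBLE_ERROR;
    }

    *nibble = (uint8_t)(((cw >> 3) & 1u)
                      | (((cw >> 5) & 1u) << 1)
                      | (((cw >> 6) & 1u) << 2)
                      | (((cw >> 7) & 1u) << 3));
    return result;
}
```

Flipping any one bit of an encoded value still decodes to the original nibble with ECC_CORRECTED, while flipping any two bits yields ECC_DOUBLE_ERROR. This mirrors the behavior described for the ECC Controller, at a much smaller word size.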
About ECC Monitor
STM32H7 has three ECC Monitors, each responsible for one domain. The ECC Monitor receives diagnostic events from the ECC Controller and generates corresponding interrupt signals according to the register configuration.
The figure below shows the mapping between ECC Controllers and ECC Monitors. For example, the RAMECC Monitor unit of D1 has five channels, each of which corresponds to the ECC Controller of one SRAM block. Each channel has its own set of registers, and the address offset in the figure is the offset of that register set. For instance, to have an interrupt generated when an ECC error is detected in the AXI SRAM, you must configure the register set of the channel that corresponds to the AXI SRAM.
FAR and FDR Registers
RAMECC supports three interrupts: single-bit ECC error, double-bit ECC error, and ECC error caused by a non-aligned write (byte write). These interrupts are enabled through the IER and CR registers of RAMECC, and their status can be checked in the SR register. These registers are straightforward to use, so here I want to explain two others: the error address register FAR and the error data register FDR.
Once the ECCELEN bit in the CR register is set, the failing address and data are latched into the FAR and FDR registers whenever an ECC error (single-bit or double-bit) occurs.
The FAR register stores a relative address. The actual error address is calculated as follows:
Actual error address = SRAM start address + FAR value * N, where N = 8 for SRAM on a 64-bit bus and N = 4 for SRAM on a 32-bit bus.
There are two FDR registers. For SRAM on a 64-bit bus, FDRL holds the lower 4 bytes of the data and FDRH holds the upper 4 bytes. For SRAM on a 32-bit bus, the data is held in FDRL and FDRH reads as 0.
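The address formula and the FDRL/FDRH split can be sketched in C as follows. The helper names are hypothetical, and the arguments stand for values already read from the FAR, FDRL and FDRH registers of the relevant monitor channel.

```c
#include <assert.h>
#include <stdint.h>

/* Actual error address = SRAM start address + FAR * N, where N is the
 * ECC word size in bytes: 8 on a 64-bit bus, 4 on a 32-bit bus. */
static uint32_t ecc_failing_address(uint32_t ram_base,
                                    uint32_t far_value,
                                    uint32_t word_bytes)
{
    assert(word_bytes == 4u || word_bytes == 8u);
    return ram_base + far_value * word_bytes;
}

/* Combine FDRH:FDRL into one value. For SRAM on a 32-bit bus FDRH is 0,
 * so the result is simply the FDRL value. */
static uint64_t ecc_failing_data(uint32_t fdrl, uint32_t fdrh)
{
    return ((uint64_t)fdrh << 32) | fdrl;
}
```

For example, with SRAM1 at 0x30000000 and a FAR value of 3, ecc_failing_address(0x30000000u, 3u, 4u) gives 0x3000000C. For AXI SRAM, assuming a start address of 0x24000000 (not stated in this article), a FAR value of 2 gives 0x24000010.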
Let's look at the following two examples:
Example 1: Enable the Monitor function for AXI-SRAM (see the RAMECC_ErrorCount example in the STM32CubeH7 package for how to enable it). After power-on, if AXI-SRAM is read directly without being initialized first, an ECC error is triggered. We then check the values of FAR and FDR in the debugger, as shown in the figure below.
Here, because AXI-SRAM is a 64-bit bus interface, the value of N is 8 when calculating the actual error address.
Example 2: Enable the Monitor function for SRAM1 (0x30000000). After power-on, if SRAM1 is read directly without being initialized first, an ECC error is triggered. We then check the values of FAR and FDR in the debugger, as shown in the figure below.
Because SRAM1 is a 32-bit bus interface, the value of N is 4 at this time.
How to use RAM ECC correctly in applications
When using RAM that supports ECC, it is important to initialize the RAM before reading it; otherwise an ECC error may be reported. This is exactly how the ECC errors were provoked in the experiments in the previous section: the RAM was read without being initialized. The recommended initialization steps are given in AN5342.
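As a rough illustration of what such an initialization amounts to, the sketch below fills a region with full 64-bit writes so that every ECC word gets a valid code before anything reads it. This is not the AN5342 procedure, which is not reproduced here. The function name is hypothetical, the region bounds would normally come from the linker script, and whether the compiler and bus really issue one full-width write per ECC word has to be verified on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a RAM region using full 64-bit writes only, so that no partial
 * (read-modify-write) access touches a word that has never been written.
 * Assumptions: start is 8-byte aligned and size_bytes is a multiple
 * of 8. For SRAM on a 32-bit bus, full 32-bit writes would be the
 * equivalent. */
static void ecc_ram_fill64(volatile uint64_t *start,
                           size_t size_bytes,
                           uint64_t pattern)
{
    assert(((uintptr_t)start % sizeof(uint64_t)) == 0u);
    assert((size_bytes % sizeof(uint64_t)) == 0u);

    for (size_t i = 0; i < size_bytes / sizeof(uint64_t); i++) {
        start[i] = pattern;
    }
}
```

On a real device a routine like this would run early in startup, before the C runtime or the application reads the region.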
Single-bit ECC errors are corrected automatically during a read, but only the data returned by the read is corrected; the value stored in the SRAM still contains the error. To keep errors from accumulating, which could turn a single-bit error into a double-bit error, the correct value can be written back to the SRAM once the single-bit error has been detected. There are two ways to do this. If the value originally stored in the SRAM is backed up in Flash, the Flash copy can be written straight back to the SRAM. Otherwise, the FAR and FDR registers described above can be used to write the correct value back to the SRAM.
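A minimal sketch of the write-back step for SRAM on a 32-bit bus is shown below. The function name is hypothetical. far_value stands for the value read from FAR, and correct_value for the known-good data, taken either from the Flash copy or from the FDR register as described above. The sketch assumes the application has not stored a newer value at that location in the meantime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a known-good value back to the 32-bit word that FAR points to.
 * The full-word write makes the hardware store the data together with
 * a freshly calculated ECC code.
 * ram_base    : start of the SRAM block, viewed as 32-bit words
 * ram_words   : size of the block in words, used as a bounds check
 * far_value   : relative word index as reported by FAR (N = 4)
 * Returns the address that was rewritten. */
static volatile uint32_t *ecc_write_back32(volatile uint32_t *ram_base,
                                           size_t ram_words,
                                           uint32_t far_value,
                                           uint32_t correct_value)
{
    assert(ram_base != NULL);
    assert(far_value < ram_words);

    ram_base[far_value] = correct_value;
    return &ram_base[far_value];
}
```

In a real system this would typically be called from the single-bit error handler, after the error flag has been read and cleared.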
Proactively detecting SRAM faults through periodic ECC testing is another way to improve system reliability. An ECC test is performed simply by reading the SRAM, since the ECC code is checked on every read. The test does not have to be completed in one pass; the SRAM can be tested segment by segment while the system is idle. Please refer to AN5342 for more details.
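One way to organize such a segmented test is a small cursor that reads a bounded number of words per call and wraps around at the end of the region. The sketch below is an assumption about how this could be structured, not the method from AN5342, and all names are hypothetical. It only performs the reads; any error they provoke is reported through the RAMECC status flags or interrupts. It also assumes that the reads actually reach the SRAM rather than being served from a data cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Progress of a segmented read scan over one SRAM region. */
typedef struct {
    const volatile uint32_t *base;  /* start of the region        */
    size_t total_words;             /* region size in 32-bit words */
    size_t next_word;               /* where the next call resumes */
} ecc_scan_t;

/* Read up to max_words words, resuming where the previous call stopped
 * and wrapping to the start of the region at the end. At most one full
 * pass is made per call. Returns the number of words read. */
static size_t ecc_scan_step(ecc_scan_t *scan, size_t max_words)
{
    assert(scan != NULL && scan->base != NULL);
    assert(scan->total_words > 0u);
    assert(scan->next_word < scan->total_words);

    size_t n = 0;
    while (n < max_words && n < scan->total_words) {
        /* The volatile read forces a real access, which the ECC
         * hardware checks; the value itself is not needed. */
        uint32_t value = scan->base[scan->next_word];
        (void)value;
        scan->next_word = (scan->next_word + 1u) % scan->total_words;
        n++;
    }
    return n;
}
```

Calling ecc_scan_step(&scan, 64) from the idle loop, for example, would cover a 64 KB region of 16384 words in 256 idle slots.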