[Beineng cost-effective ATSAMD51 evaluation board] RAMECC detailed explanation and test

qinyunti · Published on 2022-11-30 19:23


Preface

In industrial and aerospace products used in harsh environments, RAM contents can suffer random bit flips with low probability because of environmental factors such as strikes from high-energy cosmic particles. Chips used in these scenarios are therefore generally required to provide an ECC function that can detect and correct RAM errors to improve system reliability. For aerospace products in particular, this is a must-have.

The SAM D5x MCU has a RAMECC module that supports single-bit error correction and double-bit error detection. This is the usual configuration: correcting or detecting more bits requires more check storage and more computation, while the probability of multi-bit errors is very low, so single-bit correction plus double-bit detection is the common trade-off.

We will analyze and test this function.

The principle of ECC is well documented online. Here is the basic relation: to locate a single-bit error in n bits of original data, how many check bits m must be added? Given n, what is the minimum value of m?

With n data bits and m check bits there are n+m stored bits in total (the check bits themselves can also be corrupted and must be covered). There are therefore n+m distinct single-bit error cases plus one error-free case, n+m+1 states in all. For m check bits to distinguish all of these states, m must satisfy

2^m ≥ m + n + 1

For example, for 8-bit raw data, m needs to be at least 4: 2^4 = 16 ≥ 8 + 4 + 1 = 13.
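The inequality is easy to evaluate for other data widths. The following is a minimal sketch; the function name `min_check_bits` is made up for illustration and is not part of any vendor library.

```c
#include <stdio.h>

/* Smallest m that satisfies 2^m >= m + n + 1, i.e. the minimum number of
 * check bits needed to locate a single-bit error among n data bits plus
 * the m check bits themselves. */
static unsigned min_check_bits(unsigned n)
{
    unsigned m = 1;
    while ((1u << m) < m + n + 1u) {
        m++;
    }
    return m;
}

/* Hypothetical helper: print the result for one data width. */
static void print_check_bits(unsigned n)
{
    printf("n=%u -> m=%u\n", n, min_check_bits(n));
}
```

Calling `print_check_bits(8)` reproduces the m = 4 from the example above. One more bit is needed on top of this when 2-bit errors must also be detected.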

The above covers locating, and therefore correcting, a 1-bit error. To also detect a 2-bit error, one more check bit (an additional parity check) has to be added. So 8 bits of data require 5 check bits, which is the common configuration.
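To make the 8-data-bit, 5-check-bit arrangement concrete, here is a software sketch of a generic extended Hamming code (4 Hamming check bits plus 1 overall parity bit). The source does not state which code the RAMECC hardware actually uses, so this illustrates the principle only; the bit layout and all names (`secded_encode`, `secded_decode`, `ECC_*`) are assumptions made for the example.

```c
#include <stdint.h>

enum { ECC_OK = 0, ECC_CORRECTED = 1, ECC_DOUBLE = 2 };

/* Codeword layout (13 bits): bit 0 = overall parity, bits 1..12 = Hamming
 * code with check bits at positions 1, 2, 4, 8 and data at the rest. */
static const uint8_t data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

/* Parity (XOR of all bits) of a 16-bit value. */
static unsigned parity16(uint16_t v)
{
    unsigned p = 0;
    while (v) { p ^= v & 1u; v >>= 1; }
    return p;
}

/* XOR of the positions (1..12) of all set bits; 0 for a valid codeword. */
static unsigned syndrome(uint16_t cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 12; pos++) {
        if (cw & (1u << pos)) s ^= pos;
    }
    return s;
}

uint16_t secded_encode(uint8_t data)
{
    uint16_t cw = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (data & (1u << i)) cw |= (uint16_t)(1u << data_pos[i]);
    }
    /* Set the check bits at positions 1, 2, 4, 8 so the syndrome becomes 0. */
    unsigned s = syndrome(cw);
    for (unsigned k = 0; k < 4; k++) {
        if (s & (1u << k)) cw |= (uint16_t)(1u << (1u << k));
    }
    /* Bit 0 makes the parity of the whole codeword even. */
    cw |= (uint16_t)parity16(cw);
    return cw;
}

/* Returns ECC_OK, ECC_CORRECTED (data holds the corrected value) or
 * ECC_DOUBLE (uncorrectable, *data is left untouched). */
int secded_decode(uint16_t cw, uint8_t *data)
{
    unsigned s = syndrome(cw);
    int status = ECC_OK;

    if (parity16(cw)) {
        /* Odd overall parity: assume a single flip at position s
         * (s == 0 means the overall parity bit itself flipped). */
        if (s > 12) return ECC_DOUBLE;   /* not a valid position */
        cw ^= (uint16_t)(1u << s);
        status = ECC_CORRECTED;
    } else if (s != 0) {
        /* Even parity but non-zero syndrome: two bits flipped. */
        return ECC_DOUBLE;
    }

    uint8_t d = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (cw & (1u << data_pos[i])) d |= (uint8_t)(1u << i);
    }
    *data = d;
    return status;
}
```

The overall parity bit is what separates the two cases: a single flip makes the total parity odd, while a double flip leaves it even but produces a non-zero syndrome.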

This matches the RAMECC module block diagram: each 32-bit word is divided into 4 byte-wide groups, which requires an additional 4x5 bits of check information.

ECC works as follows. On a write, the ECC value is calculated and stored together with the original value. On a read, the original value is read, its ECC is recalculated and compared with the stored ECC, and if there is a single-bit error the corrected value is returned to the CPU. Note that single-bit correction only means that the corrected value is returned to the CPU on the read; the value in RAM is not corrected automatically. The software therefore has to write the correct value back to RAM in the ECC interrupt, using the reported error address, to complete the correction.

An interrupt is triggered only when an address with a bad check value is read. If that address is never read, nothing is triggered even though an ECC error exists. The software therefore needs to read the entire RAM periodically to trigger the ECC check. The period cannot be too short, which would increase the software load, and it cannot be too long, which could let a single error accumulate into a multi-bit error that cannot be recovered. This requires a balance. Reading the RAM must also be mutually exclusive with normal program execution: the reads generally have to be done in a critical section, and the time needed to scan the entire RAM is not negligible. When a single-bit error is detected and corrected by software, critical-section handling must also be considered so that a higher-priority interrupt does not keep accessing the erroneous value.
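One way to keep the time spent in the critical section bounded is to scan the RAM in small chunks from a periodic task. The sketch below is a possible structure, not the author's code: `scrub_state_t`, `scrub_step` and the `SCRUB_*_CRITICAL` macros are hypothetical names, and the macros are no-ops here so that the code builds on a host. On the target they would presumably map to something like the CMSIS `__disable_irq()` and `__enable_irq()` intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: replace with real interrupt masking on the target. */
#ifndef SCRUB_ENTER_CRITICAL
#define SCRUB_ENTER_CRITICAL() ((void)0)
#define SCRUB_EXIT_CRITICAL()  ((void)0)
#endif

typedef struct {
    volatile const uint8_t *base;  /* start of the region to scan   */
    size_t size;                   /* region size in bytes           */
    size_t cursor;                 /* next offset to read            */
} scrub_state_t;

/* Read up to `chunk` bytes starting at the cursor, then advance the
 * cursor, wrapping to 0 at the end of the region. The reads exist only
 * to make the hardware check the ECC; the values are discarded.
 * Returns the number of bytes read in this step. */
size_t scrub_step(scrub_state_t *st, size_t chunk)
{
    size_t n = st->size - st->cursor;
    if (n > chunk) {
        n = chunk;
    }

    SCRUB_ENTER_CRITICAL();
    for (size_t i = 0; i < n; i++) {
        (void)st->base[st->cursor + i];   /* volatile read */
    }
    SCRUB_EXIT_CRITICAL();

    st->cursor += n;
    if (st->cursor >= st->size) {
        st->cursor = 0;
    }
    return n;
}
```

The chunk size and the call period together set the full-scan interval, which is the balance described above: smaller chunks shorten each critical section but lengthen the time until every address has been read once.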

RAMECC Introduction

See the manual section "17. RAMECC – RAM Error Correction Code (ECC)".

Structure

Every 8 bits of data has 5 bits of check information, which supports 1-bit error correction and 2-bit error detection. ECCDIS controls whether ECC is enabled; when it is enabled, the usable RAM is halved.

Clock

The clock comes from the AHB-APB B bridge.

MCLK.APBBMASK.RAMECC enables the clock; it is enabled by default after reset.

MCLK_REGS->MCLK_APBBMASK |= 1<<16;

Register write protection

This was covered in the previous article, so it is not repeated here. The line below clears write protection for RAMECC (peripheral 16 on bridge B).

PAC_REGS->PAC_WRCTRL = (1<<16) | (32*1+16);
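The constant in that line can be read as a key in the upper field and a peripheral ID in the lower field. The helper below rebuilds the same value; it is only a sketch, the names `pac_wrctrl_value` and `PAC_KEY_CLR` are made up for illustration, and the field layout (key starting at bit 16, peripheral ID = 32 x bridge index + bit number, bridge B = index 1) is inferred from the expression above and should be checked against the PAC chapter of the datasheet.

```c
#include <stdint.h>

/* Assumed meaning of key value 1: clear write protection. */
#define PAC_KEY_CLR 1u

/* Compose a WRCTRL value from a key, a bridge index (A=0, B=1, ...)
 * and the peripheral's bit number on that bridge. */
static uint32_t pac_wrctrl_value(uint32_t key, uint32_t bridge, uint32_t bit)
{
    return (key << 16) | (32u * bridge + bit);
}
```

With these assumptions, `pac_wrctrl_value(PAC_KEY_CLR, 1u, 16u)` gives 0x00010030, the same value as `(1<<16) | (32*1+16)`.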

Interrupt

Check the interrupt number in

ARM\PACK\Microchip\SAMD51_DFP\3.6.120\samd51a\include\samd51p20a.h

RAMECC_IRQn = 45, /* 45 RAM ECC (RAMECC) */

The interrupt service function is listed in the vector table in

\RTE\Device\ATSAMD51P20A\startup_samd51p20a.s

DCD RAMECC_Handler ; 45 RAM ECC Handler

Flag Clear

INTFLAG.SINGLEE and INTFLAG.DUALE bits are both cleared on ERRADDR read.

Registers

INTENCLR: clears the single-error and dual-error interrupt enables

INTENSET: sets the single-error and dual-error interrupt enables

INTFLAG: interrupt flags; reading ERRADDR clears the flags

STATUS: ECC enable status, read-only; loaded at startup from the NVM User Row configuration

ERRADDR: error address

DBGCTRL: behavior during debugging; controls whether ECC errors are recorded when the debugger reads RAM

Enable

Bit 39 of the NVM User Row (UROW) is RAM ECCDIS, which disables RAM ECC when set. The default value is 1, so ECC is disabled.

To enable ECC, RAM ECCDIS must be changed to 0.
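As a small illustration of the bit position, the helpers below operate on a 64-bit copy of the low User Row words held in a variable. They only show the bit arithmetic; actually programming the User Row requires the NVM controller's erase and write procedure, which is not shown here. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define UROW_RAM_ECCDIS_BIT 39u   /* RAM ECCDIS position, per the text above */

/* True if the User Row image has ECC enabled (RAM ECCDIS == 0). */
static bool urow_ecc_enabled(uint64_t urow_low64)
{
    return ((urow_low64 >> UROW_RAM_ECCDIS_BIT) & 1u) == 0u;
}

/* Return a copy of the image with RAM ECCDIS cleared, other bits kept. */
static uint64_t urow_enable_ecc(uint64_t urow_low64)
{
    return urow_low64 & ~(1ULL << UROW_RAM_ECCDIS_BIT);
}
```

Bit 39 of the 64-bit image is bit 7 of the second 32-bit word, so when the User Row is handled as 32-bit words the mask is 0x00000080 on word 1. All other User Row bits have to be preserved when the row is rewritten.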

RAMECC Test

Testing RAM ECC requires a way to trigger ECC errors. This is generally done in one of the following ways:

1. Use the chip's built-in error injection function, which this chip does not have.
2. Leave the RAM uninitialized: its contents will then be random or all 0, and the stored ECC will not match.
3. Use physical methods such as lasers to simulate high-energy particle strikes and cause bit flips.

Under our conditions only method 2 is available, so that is what we test.

First, after the program starts and before the RAM is accessed, the RAM must be initialized, that is, written, so that the stored ECC values are correct.

Here we intentionally leave the last 4 bytes uninitialized and then read the RAM in the main program to see whether an ECC error occurs when this uninitialized location is read.

ECC initialization code

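void ramecc_init(void)
{
    /* Enable the RAMECC bus clock (APBB mask bit 16) */
    MCLK_REGS->MCLK_APBBMASK |= 1<<16;
    /* Clear write protection for RAMECC */
    PAC_REGS->PAC_WRCTRL = (1<<16) | (32*1+16);
    /* Enable the single-error (bit 0) and dual-error (bit 1) interrupts */
    RAMECC_REGS->RAMECC_INTENSET = (1u<<0) | (1u<<1);
    /* Lowest priority */
    NVIC_SetPriority(RAMECC_IRQn, (1UL << __NVIC_PRIO_BITS) - 1UL);
    NVIC_EnableIRQ(RAMECC_IRQn);
}

void RAMECC_Handler ( void )
{
    volatile uint32_t erraddr;
    /* Reading ERRADDR also clears the interrupt flags */
    erraddr = RAMECC_REGS->RAMECC_ERRADDR;
    erraddr = erraddr; /* avoid an unused-variable warning */
}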

The RAM initialization code is as follows

#define RAM_BASE 0x20000000
#define RAM_SIZE (128*1024ul)

void ram_init(void);

void ram_init(void)
{
    uint32_t i;
    /* Write every byte except the last 4, which are left
       uninitialized on purpose to provoke an ECC error */
    for(i=0;i<RAM_SIZE-4;i++)
    {
        *(volatile uint8_t*)(RAM_BASE+i)=0x55;
    }
}

Call it from the startup assembly code, before SystemInit

Reset_Handler   PROC
                IMPORT  SystemInit
                IMPORT  __main
                IMPORT  ram_init
                LDR     R0, =ram_init       ; initialize RAM (and its ECC) first
                BLX     R0
                LDR     R0, =SystemInit
                BLX     R0
                LDR     R0, =__main
                BX      R0
                ENDP

Then read the RAM in the main function

void ram_read(void);

void ram_read(void)
{
    volatile uint32_t i;
    for(i=0;i<RAM_SIZE;i++)
    {
        if(*(volatile uint8_t*)(RAM_BASE+i) != 0x55)
        {
            /* Intentionally empty: the read itself triggers the ECC check */
        }
    }
}

int main(void)
{
    NVIC_SetPriorityGrouping(0);
    //PAC_REGS->PAC_WRCTRL = (1<<16) | (32*1+4);
    PORT_REGS->GROUP[2].PORT_DIRSET = 1<<18; //PC18 OUTPUT
    ramecc_init();
    while(1)
    {
        ram_read();
    }
    return 0;
}

Download the program and run it in the debugger; execution enters the interrupt handler.

The value of ERRADDR is exactly 0x0001FFFC, the offset of the last 4 bytes, and the dual-error flag is set.

Note that the error address here is an offset relative to the RAM start address.
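Because the reported address is an offset, it has to be added to the RAM base before it can be used as a pointer. A minimal sketch follows, assuming the RAM starts at 0x20000000 as in the test code of this article; the names `ECC_RAM_BASE` and `ecc_error_address` are illustrative only.

```c
#include <stdint.h>

/* RAM start address, same value as RAM_BASE in the test code. */
#define ECC_RAM_BASE 0x20000000u

/* Convert the offset read from ERRADDR into an absolute RAM address. */
static uint32_t ecc_error_address(uint32_t erraddr_offset)
{
    return ECC_RAM_BASE + erraddr_offset;
}
```

For example, an offset of 0x1FFFC, the last 4 bytes of a 128KB RAM, corresponds to the absolute address 0x2001FFFC.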

Continuing execution, the read of ERRADDR clears the status flags.

This test has to be rerun after a power cycle, so that the RAM holds its power-on reset values rather than values written by a previous run.

If ram_init is changed to initialize the full 128KB, no ECC error is triggered, which confirms that RAMECC behaves as expected.

Several things need attention when handling the RAMECC error interrupt.

For a single-bit error, the read returns the corrected value, but the software must write it back to correct the value in RAM. While doing so, no higher-priority interrupt should preempt the handler and operate on the same location.

The software cannot recover from multi-bit errors and can only follow a strategy chosen for the actual system, such as a software reset. Safety-related systems sometimes cannot simply be reset by software and may need recovery or failover to a redundant unit; this is not discussed here.

A simple interrupt handler is shown below. Many situations are not taken into account, such as masking higher-priority interrupts. It is for demonstration only, and the correction and recovery strategy must be decided based on the actual system.

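void RAMECC_Handler ( void )
{
    uint32_t flags;
    uint32_t erraddr;
    volatile uint8_t *p;

    /* Read INTFLAG first: reading ERRADDR clears both flags */
    flags = RAMECC_REGS->RAMECC_INTFLAG;
    erraddr = RAMECC_REGS->RAMECC_ERRADDR;

    if(flags & 0x02)
    {
        //dual err reset or other
    }
    else if(flags & 0x01)
    {
        //single err: ERRADDR is an offset from the RAM start address,
        //read back the corrected value and rewrite it
        p = (volatile uint8_t*)(RAM_BASE + erraddr);
        *p = *p;
    }
}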

Summary

This article introduced the principle of RAMECC and tested it. The chip supports single-bit error correction and 2-bit error detection in RAM, which can greatly improve reliability and is a standard requirement in safety-related applications. With the cost-reduction trend in commercial aerospace, industrial-grade chips are also being used in large quantities there. RAMECC is a basic selection criterion for such chips, so this chip meets a basic requirement for use in the commercial aerospace field.