Improvement strategies for FPGA-based TMR methods

SRAM-based FPGAs are very sensitive to space particle radiation and are prone to soft faults, so fault-tolerance measures are essential for FPGA-based electronic systems. Triple modular redundancy (TMR) is widely used to tolerate single-event upsets (SEUs) because it is simple to implement and reliable in effect. However, traditional TMR has problems such as high hardware resource usage and high power consumption. This paper summarizes the problems of the traditional TMR method, analyzes the advantages and disadvantages of some improved TMR methods that have emerged in recent years, points out improvement strategies for the existing problems, and looks ahead to the development trend of TMR technology.

Introduction

Soft faults are transient faults caused by the interaction between particles and PN junctions. They have a particularly serious impact on circuits implemented on SRAM-based FPGAs. Because triple modular redundancy (TMR) is simple and highly reliable, it is a widely used technique for tolerating single-event upsets (SEUs) on FPGAs. The literature shows that TMR greatly improves the reliability of FPGAs under SEUs. Although TMR effectively improves the reliability of a design, the additional modules and wiring consume a lot of hardware resources and power, and the operating speed is also affected. This limits the use of traditional TMR. With the development of electronic technology, especially partial reconfiguration, a variety of improved TMR techniques have emerged that target the problems of the traditional method and have advanced TMR technology. This paper first introduces the principle of traditional TMR, then summarizes its problems, then analyzes the advantages and disadvantages of the improved TMR techniques, and finally looks ahead to the development trend of TMR technology.

1 Conventional TMR method and its problems

The basic concept of TMR is to use three identical modules to implement the same function and to select the data at the output port through a majority voter, thereby achieving fault tolerance. TMR rests on the premise that, at any given moment, an error occurs in only one module. Because the probability of simultaneous errors in different modules is relatively low and the implementation is direct and simple, TMR is an effective and widely used fault-tolerance method. It is mainly used to protect the system against radiation-induced SEUs, and it greatly improves the reliability of FPGAs under SEUs. The basic structure of the conventional TMR method is shown in Figure 1.
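The voting principle can be illustrated with a minimal Python sketch. It is only a behavioural model, not FPGA code: the names `majority_vote`, `tmr_run` and the 8-bit `adder` module are hypothetical, and a fault is modelled as an XOR mask applied to one copy's output.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three equally wide words."""
    return (a & b) | (a & c) | (b & c)


def tmr_run(module, x, faults=(0, 0, 0)):
    """Evaluate three copies of `module` and vote on their outputs.

    `faults` holds one XOR mask per copy; a non-zero mask models an upset
    that corrupts that copy's output (an assumption of this sketch).
    """
    outputs = [module(x) ^ mask for mask in faults]
    return majority_vote(*outputs)


def adder(x):
    # Hypothetical 8-bit module under protection.
    return (x + 1) & 0xFF


print(tmr_run(adder, 41))                          # 42: no fault
print(tmr_run(adder, 41, faults=(0, 0x10, 0)))     # 42: single faulty copy is masked
print(tmr_run(adder, 41, faults=(0x10, 0x10, 0)))  # 58: two faulty copies outvote the good one
```

The last call shows why the single-error premise matters: once two copies are wrong in the same way, the voter selects the wrong value.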

Although TMR can effectively improve the reliability of the design, it also has many shortcomings. The main ones are as follows:

(1) It cannot repair faulty modules. When a module fails, the majority voter simply masks the error, but the faulty module remains faulty. In addition, general TMR cannot detect and locate the error so that the system can repair it. If the error is not repaired in time, TMR fails when a further error occurs in another module.

(2) Many studies only consider the impact of a single error and ignore the possibility of multiple SEUs occurring simultaneously. Although the probability is low, it is not zero. Experiments also show that TMR is very effective against a single SEU, but the accumulation of SEUs in the configuration memory reduces its effectiveness.

(3) Ordinary TMR has high resource overhead and low resource utilization. It triplicates the entire design or a large module, so the granularity is coarse and the resource overhead is 200% of the original circuit. If design constraints such as FPGA hardware resources and power consumption prevent TMR from being applied to the entire circuit or module, resources are wasted.

(4) Power consumption increases because the circuitry is replicated, and speed decreases because of the voter and the additional wiring.

(5) The voter itself may also fail, but the general TMR voter cannot detect its own errors and is not radiation-hardened.

(6) When a circuit with triple modular redundancy drives a circuit without redundancy, a voter is required to combine three signals into one. When a circuit without redundancy drives a circuit with triple modular redundancy, additional wiring is required to fan one signal out into three. Because both logic circuits and wiring resources are sensitive to SEUs, these transitions reduce system reliability.

2 Improved TMR method

2.1 Combining TMR and Scrubbing

TMR itself cannot repair faulty modules. If only one module has an error, the system function is not affected. However, if the faulty module is not repaired before another module has an error, the redundancy fails. Therefore, when an error occurs, the faulty module must be repaired in a timely manner.

With the development of dynamic reconfiguration technology, scrubbing, a method of rewriting the FPGA configuration, has emerged. The most serious threat to space electronic systems is soft faults such as SEUs, which can be removed by reconfiguration, so periodically refreshing the configuration memory repairs such errors.

Scrubbing and TMR can be used together to protect against SEUs. However, many studies only consider the impact of a single error and ignore the possibility of multiple SEUs occurring at the same time. In theory, a sufficiently fast refresh rate ensures that only one error exists at any time. In practice, errors occur randomly, which means that no refresh rate can guarantee that at most one error occurs within a refresh cycle. When this method is used in practice, the probability of SEU occurrence must be estimated through a complex experimental process. The rule of thumb for selecting the scrubbing refresh rate is to make it one order of magnitude higher than the estimated error rate. As FPGAs grow larger, loading the entire configuration bitstream takes hundreds of milliseconds, so the refresh rate cannot be guaranteed and the system power consumption increases.
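The point that no refresh rate rules out multiple errors in one cycle can be made concrete with a short calculation. The sketch below assumes that upsets arrive as a Poisson process with a constant rate, which is a modelling assumption of this example and not a claim from the cited studies; the function name `p_multiple_upsets` is hypothetical.

```python
import math


def p_multiple_upsets(upset_rate_hz: float, scrub_period_s: float) -> float:
    """Probability of two or more upsets within one scrub period.

    Assumes Poisson arrivals: P(N >= 2) = 1 - exp(-m) * (1 + m),
    where m = upset_rate_hz * scrub_period_s is the mean count per period.
    """
    mean = upset_rate_hz * scrub_period_s
    return 1.0 - math.exp(-mean) * (1.0 + mean)


# Rule of thumb from the text: scrub ten times faster than the upset rate,
# i.e. a mean of 0.1 upsets per scrub period.
print(p_multiple_upsets(upset_rate_hz=1e-3, scrub_period_s=100.0))  # about 0.0047

# Scrubbing faster lowers the probability but never brings it to zero.
print(p_multiple_upsets(upset_rate_hz=1e-3, scrub_period_s=10.0))
```

Under this assumption, roughly one scrub period in two hundred still contains more than one upset at the rule-of-thumb rate. Whether those upsets actually defeat TMR depends on where they land, which this sketch does not model.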

With partial TMR, a voter with error detection and location functions can be designed. When a module fails, the voter signal directly triggers reconfiguration, and only the faulty part of the circuit is dynamically reconfigured. This addresses the scrubbing time and power consumption problems and provides a way to prevent error accumulation.

To prevent voter errors, the voter can be implemented with radiation-insensitive devices instead of SRAM-based logic, which improves its robustness. The literature also proposes an improved voter. It no longer uses a majority voter on the outputs of the three redundant modules. Instead, the corresponding outputs of the three redundant modules are driven onto three FPGA output pins, each through a tri-state buffer controlled by a minority voter, and are finally combined by a wired-OR into one signal on the printed circuit board (PCB). Each minority voter determines whether the signal of its own redundant module is the minority value. If it is, the corresponding buffer outputs high impedance; if not, the signal is output normally.
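The minority-voter arrangement can be modelled behaviourally for a single output bit. This is a sketch under stated assumptions: `HIGH_Z`, `minority`, `pin_outputs` and `board_net` are hypothetical names, and the board-level combination is modelled simply as the OR of whichever pins are actively driven.

```python
HIGH_Z = None  # models a tri-state buffer that is not driving its pin


def minority(own: int, other1: int, other2: int) -> bool:
    """True when `own` disagrees with both other modules."""
    return own != other1 and own != other2


def pin_outputs(a: int, b: int, c: int):
    """Each module drives its pin unless its own value is the minority."""
    return [
        HIGH_Z if minority(a, b, c) else a,
        HIGH_Z if minority(b, a, c) else b,
        HIGH_Z if minority(c, a, b) else c,
    ]


def board_net(pins) -> int:
    """Combine the driven pins on the PCB (modelled as a logical OR)."""
    driven = [p for p in pins if p is not HIGH_Z]
    return int(any(driven))


print(pin_outputs(1, 0, 1))             # [1, None, 1]: module b is disconnected
print(board_net(pin_outputs(1, 0, 1)))  # 1
print(board_net(pin_outputs(0, 1, 0)))  # 0
```

Because the disagreeing module is disconnected, the remaining pins always drive the same value, so the combined net equals the majority value without a single on-chip majority voter that could itself be upset.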

Readback is an extension of scrubbing. The configuration data are read back and compared with the original configuration data, and the device is reconfigured when errors are found. This method is used in the literature, together with error-correcting codes that protect the configuration data. The data of each configuration frame are protected by a 12-bit SEC-DED (single error correction, double error detection) Hamming code, and each basic unit in the FPGA has a distinct identification code. After the configuration data are read back through the ICAP (Internal Configuration Access Port), the error-correcting code gives the location of the error.
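How a SEC-DED Hamming code locates an error can be shown on a scaled-down example. The sketch below protects 4 data bits with an (8,4) extended Hamming code; it is purely illustrative and does not reproduce the 12-bit frame code or bit layout of any real device. The names `encode`, `check` and `flip` are hypothetical.

```python
def encode(data4: int):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions, with check bits at positions 1, 2 and 4.
    """
    d = [(data4 >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def check(code):
    """Classify a read-back codeword.

    Returns ("ok", None), ("single", position) or ("double", None).
    """
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        return ("ok", None)
    if overall == 1:
        return ("single", syndrome)  # position 0 is the overall parity bit
    return ("double", None)


def flip(code, *positions):
    """Return a copy of `code` with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(check(word))              # ('ok', None)
print(check(flip(word, 5)))     # ('single', 5): the syndrome is the error position
print(check(flip(word, 2, 6)))  # ('double', None): detected but not locatable
```

The syndrome plays the role described in the text: for a single upset it directly gives the position of the flipped bit, so only that location needs to be rewritten.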

Scrubbing can repair functional errors caused by SEUs in LUTs, the routing matrix and CLBs without interrupting circuit operation. However, it cannot change the contents of the flip-flops, so it cannot restore the state of registers. When the value of a storage element is flipped, it can only be repaired by resetting the system. Resetting, however, interrupts the system function and seriously affects system performance.

2.2 Small-granularity TMR technology

With the emergence of partial dynamic reconfiguration, small-granularity TMR methods have appeared. TMR can be applied in smaller units, with suitable placement and routing, to stay within the allowed resource overhead while obtaining the maximum reliability. The literature has experimentally analyzed the fault-tolerance performance of TMR at different granularities in the presence of multiple errors. The results show that small-granularity TMR performs better than TMR applied to the entire system as one unit.

Where global TMR is not feasible (for example, because resources are limited), small-granularity TMR is a better choice, since it improves system reliability while using fewer resources. Because redundancy is not applied to all modules, the implementation must concentrate TMR on the modules that contribute most to system reliability. The number and location of voters must also be considered. Additional wiring is required before and after the triplicated modules, and both logic circuits and wiring resources are sensitive to SEUs, so this wiring reduces system reliability. In Figure 2, the shaded part is sensitive to SEUs. The sensitive part in (c) is larger than that in (b) because of the voter and the additional wiring. Therefore, the transitions between triplicated and non-triplicated circuits must be limited, so that concentrating the use of triple modular redundancy improves system reliability.

To select the modules that need triple modular redundancy and to place and route them sensibly, the errors that occur in the system are divided into persistent and non-persistent errors. Persistent errors are SEU-induced errors that change the internal state of the circuit and remain after reconfiguration; non-persistent errors are errors that are eliminated by reconfiguring the FPGA.

Based on the above analysis, the priority levels for applying partial TMR are as follows:

The first level is the part of the circuit where persistent errors occur.

The second level is the circuitry that feeds the part producing persistent errors, since an error there can cause that part to fail. Including it also reduces the number of transitions between TMR and non-TMR logic.

The third level is the circuitry driven by the part that produces persistent errors (its forward portion), again chosen to reduce the transitions between TMR and non-TMR logic.

The fourth level is the circuitry that is independent of the part producing persistent errors.
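The four priority levels can be sketched as a static analysis of a netlist graph. The sketch makes a simplifying assumption that is not stated in the source: the part that produces persistent errors is taken to be every node lying on a feedback loop (a cycle) of the graph. The function names and the example netlist are hypothetical.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def tmr_priority(graph):
    """Map each node to a priority level 1..4 for partial TMR."""
    nodes = set(graph) | {s for succ in graph.values() for s in succ}
    reach = {n: reachable(graph, n) for n in nodes}
    # Assumption of this sketch: persistent part = nodes on a feedback loop.
    persistent = {n for n in nodes if n in reach[n]}
    level = {}
    for n in nodes:
        if n in persistent:
            level[n] = 1                      # produces persistent errors
        elif reach[n] & persistent:
            level[n] = 2                      # feeds the persistent part
        elif any(n in reach[p] for p in persistent):
            level[n] = 3                      # driven by the persistent part
        else:
            level[n] = 4                      # independent of it
    return level


# Hypothetical netlist: edges point from driver to load.
netlist = {
    "in": ["dec"],
    "dec": ["fsm"],
    "fsm": ["fsm_next", "out"],
    "fsm_next": ["fsm"],
    "out": [],
    "dbg_in": ["dbg_out"],
    "dbg_out": [],
}
for name, lvl in sorted(tmr_priority(netlist).items(), key=lambda kv: kv[1]):
    print(lvl, name)
```

Triplication would then be applied level by level until the resource budget is exhausted, which keeps the triplicated logic contiguous around the state-holding part and limits TMR to non-TMR transitions.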

The circuit can be partitioned through static analysis. One problem is that in standard global TMR all inputs, outputs and clocks are triplicated, whereas with partial TMR redundancy for I/O and clocks may not be achievable. Like logic circuits without TMR, clocks and I/O without TMR can also produce undetectable errors.

From the experimental results, we can see that since this method mainly focuses on the circuit part that can produce persistent errors, when the redundant resources used increase, the probability of persistent errors decreases rapidly, and is eventually almost completely overcome. Therefore, the use of partial TMR can achieve a balance between resources and reliability, and maximize resource utilization while minimizing the impact on reliability.

Additionally, a flip may change the configuration bits in the configuration memory that control the wiring, causing a short circuit between two different redundant modules. Such a flip affects more than one module in the TMR, causing output errors. Since 90% of the configuration resources are used to control the wiring, this issue needs to be considered. The possibility of such errors also depends on the layout of the TMR, which is directly dependent on the number of majority voters. When the number of voters increases, additional connections are required between modules, so the modules must be close together, which increases the possibility that a flip will cause a short circuit between modules. In order to reduce the possibility of errors that change the wiring and affect the robustness of the TMR, the connections between modules must be minimized as much as possible. If the number of majority voters can be reduced, the connections between modules can be reduced.

The solution to this problem is to use TMR with larger granularity to reduce the connections between modules, with voters placed only at the circuit outputs. But this raises a new problem. To repair the state of storage elements, Xilinx proposed the XTMR method, which places voters wherever there are registers and adds feedback to correct the effect of an upset on the register value. If the internal voters are removed, an error in a storage element of the faulty module is not corrected, so after reconfiguration that module is out of sync with the working state of the other modules.

The method used in the literature to solve this problem is to read the stored working status from an external memory. However, this method requires the three modules to be offline before the status can be stored and reset to achieve synchronization. Obviously, this is not feasible, especially for circuits with high real-time requirements.

To address this problem, the literature proposes a synchronization technology that uses a status register replication mechanism in the TMR system and introduces a data path between the three redundant modules to transmit the status register data. When a redundant module fails and is partially reconstructed, it can complete synchronization and resume work in a timely manner by accepting the token of the main controller and copying the status register data from the normal module. This method shortens the time from repairing the faulty module to rejoining the system, thereby reducing the probability of accumulated faults and improving the reliability of the redundant system.

Another method is to predict the state that the other modules will reach soonest and to preset the reconfigured module to that state. In this case, only the module being reconfigured needs to be stopped, and the work of the other two modules is not affected. When the state of the working modules matches the preset state, the three modules work together again. Choosing the state is the difficulty here: it must be a state that is visited very frequently and is reached quickly. Using several candidate states improves efficiency but also increases the width of the preset signal.

If the state of the register is unpredictable, as in register chains and adders, this method cannot be used, so it needs to be improved to widen its applicability. When placing and routing the TMR modules, try to keep them a certain distance apart, which also reduces the chance that an error in one module affects the others.

2.3 TMR technology based on the improvement of the basic unit structure of FPGA

TMR consumes a lot of resources, and with general methods, the smaller the granularity, the more resources are consumed by wiring and voters. The literature proposes a relatively novel small-granularity method that modifies the structure of the SRAM-based FPGA (the LUT and CLB structure) to implement small-granularity TMR while reducing resource consumption.

The cited work uses Xilinx Virtex-5 series chips. Their basic element is a 6-input LUT built from two 5-input LUTs, as shown in Figure 3. If the lower five input bits are shared, two 5-input LUT functions can be implemented. In some cases only one 5-input LUT is needed and the other is unused. If this situation is arranged during placement, the remaining resources can be used to implement TMR.

Figure 3 Virtex-5 series LUT structure

One TMR unit requires two LUTs, while two TMR units require only three LUTs. Voters, error reporting circuits and other control lines need to be implemented inside the LUT, which increases the wiring and delay inside the LUT.

The advantages of this method are reduced granularity, increased reliability, lower resource consumption, and conditional reconfiguration through error detection and location, which reduces power consumption and configuration time. Experimental results show that this method requires only 76.5% additional resources, whereas traditional small-granularity TMR requires 242%.

2.4 TMR technology based on spatial search method

With partial and small-granularity TMR, choosing the granularity and the circuit modules to protect becomes a key issue under FPGA and design constraints. Often only the requirements for resources, power consumption and reliability are known, and finding an actual layout and implementation is difficult. The literature has proposed a method based on design-space search. It takes parameters such as resources, power consumption and reliability, and searches among the possible solutions for the best result.
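A minimal sketch of such a search is given below. It is not the algorithm from the cited work: it exhaustively enumerates which modules to triplicate, uses a single resource budget as the only constraint, and scores each choice by a hypothetical "sensitivity removed" figure. All module names and numbers are invented for illustration.

```python
from itertools import combinations


def best_partial_tmr(modules, budget):
    """Pick the set of modules to triplicate within a resource budget.

    modules: {name: (extra_cost, sensitivity_removed)}
    Returns (total_gain, total_cost, chosen_names). Ties on gain are
    broken in favour of the cheaper choice.
    """
    best = (0, 0, ())
    names = sorted(modules)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(modules[m][0] for m in subset)
            gain = sum(modules[m][1] for m in subset)
            if cost <= budget and (gain, -cost) > (best[0], -best[1]):
                best = (gain, cost, subset)
    return best


# Hypothetical design: extra LUTs needed to triplicate each module and the
# number of sensitive configuration bits that triplication would protect.
design = {
    "fsm": (40, 90),
    "datapath": (300, 120),
    "decoder": (60, 50),
    "debug": (30, 5),
}
print(best_partial_tmr(design, budget=150))
# (145, 130, ('debug', 'decoder', 'fsm'))
```

Exhaustive enumeration grows exponentially with the number of modules, so a practical tool would need a heuristic or an optimization solver, and would add power and timing as further constraints.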

2.5 Time-based TMR technology

The basic idea of time-based fault masking is to tolerate faults through repeated computation, that is, to perform the same calculation two or more times and compare the results to detect and overcome errors. When a result is obtained for a certain part of the circuit, it is stored temporarily; after a certain delay the calculation is performed again and its output is also stored. If the two stored results differ, an error has occurred. In that case the calculation is repeated once more after the same delay, and that output is taken as the correct result.

This method is very effective for detecting transient faults, but its fault-tolerance effect depends on the delay time. It trades execution time for hardware resources and is therefore less practical for systems with strict real-time requirements.
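The compute, recompute and compare sequence can be sketched as follows. The delay between runs is omitted, and a transient fault is simulated by corrupting the result of one chosen call; `time_redundant` and `make_glitchy` are hypothetical helper names.

```python
def time_redundant(compute, x):
    """Run `compute` twice and compare; on mismatch, run it a third time.

    Returns (result, error_detected). In hardware the runs would be
    separated by a delay so that a transient has time to die out.
    """
    first = compute(x)
    second = compute(x)
    if first == second:
        return second, False
    third = compute(x)  # taken as the correct result, as described in the text
    return third, True


def make_glitchy(fn, glitch_on_call):
    """Wrap `fn` so that its result is corrupted on one specific call."""
    calls = {"n": 0}

    def wrapped(x):
        calls["n"] += 1
        result = fn(x)
        return result ^ 1 if calls["n"] == glitch_on_call else result

    return wrapped


def square(x):
    return x * x


print(time_redundant(square, 7))                   # (49, False)
print(time_redundant(make_glitchy(square, 2), 7))  # (49, True): glitch detected and overcome
```

A fault that persists across all three runs would go unnoticed or uncorrected, which is why the approach suits transient faults and why its effect depends on the delay between runs.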

2.6 TMR technology based on software and hardware redundancy

The above methods fail when the hardware suffers irreparable damage. For this case, each module is given three different versions of its configuration file (one in use and two as backups) and 1/4 extra hardware as redundant resources. If a hardware failure occurs, the module is first reconfigured with one of the other versions. If this does not solve the problem, the layout is redone using the additional redundant resources to bypass the faulty part. However, because of the redundant resources and storage this requires, the method further increases resource consumption.

3 Prospects of TMR technology development

Based on the above analysis, the relationship between the problems of TMR technology and the improvement methods is shown in the block diagram in Figure 4. Driven by the problem of accumulated hardware faults and the demand for system reliability, a variety of new TMR-based solutions have been produced. However, each of these techniques targets only certain problems; it solves some and introduces new ones. Therefore, TMR-based fault-tolerance technology is still not mature enough.

However, the small-granularity TMR technology is a very flexible method. It can achieve better performance on the basis of saving resources when combined with other methods. The technology based on small-granularity TMR will be a major development direction of TMR technology, and the impact of the relatively increased wiring resources on system reliability needs to be further resolved. In addition, since the implementation of small-granularity TMR requires the selection and layout of various circuits in the system, the automation of TMR implementation is also a direction that needs to be studied.

4 Conclusion

This paper has summarized the outstanding problems of TMR technology, examined the improved methods, analyzed their advantages and problems, and pointed out corresponding solutions. The development of TMR technology should aim at efficient implementation and reliability. Based on a robust evaluation strategy and the parameter requirements to be met, the final TMR solution should be obtained by weighing different granularities and layouts in a highly automated manner.
