SRAM-based FPGAs are highly sensitive to particle radiation in space and are prone to soft faults, so fault-tolerant measures are essential for FPGA-based electronic systems. Triple modular redundancy (TMR) is widely used to tolerate single-event upsets (SEUs) because it is simple to implement and reliable in effect. However, traditional TMR consumes large amounts of hardware resources and power. This paper summarizes the problems of the traditional TMR method, analyzes the advantages and disadvantages of improved TMR methods that have emerged in recent years, points out improvement strategies for the remaining problems, and looks ahead to the development trend of TMR technology.
Introduction
Soft faults are transient faults caused by the interaction between particles and PN junctions, and they have a particularly serious impact on circuits implemented on SRAM-based FPGAs. Because Triple Modular Redundancy (TMR) is simple and highly reliable, it is a widely used fault-tolerance technique against single-event upsets (SEUs) on FPGAs. The literature shows that TMR greatly improves the reliability of FPGAs under the influence of SEUs. Although TMR effectively improves the reliability of a design, the additional modules and routing consume considerable hardware resources and power and also reduce operating speed, which limits the use of traditional TMR. With the development of electronic technology, especially partial reconfiguration, a variety of improved TMR techniques have emerged that target specific problems of the traditional method and have advanced TMR technology. This paper first introduces the principle of traditional TMR, then summarizes its problems, then analyzes the advantages and disadvantages of the improved TMR techniques, and finally looks ahead to the development trend of TMR technology.
1 Conventional TMR method and its problems
The basic concept of TMR is to implement the same function in three identical modules and pass their outputs through a majority voter, so that a fault in any one module is masked. TMR rests on the premise that, at any given moment, an error occurs in only one module. Because the probability of simultaneous errors in different modules is low and the implementation is simple and direct, TMR is an effective and widely used fault-tolerance method. It is used mainly to protect systems against radiation-induced SEUs, and it greatly improves the reliability of FPGAs exposed to them. The basic structure of the conventional TMR method is shown in Figure 1.
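The voting step described above can be illustrated with a minimal Python sketch. It assumes each module output is an integer word and votes bit by bit; the function name `majority_vote` and the sample values are illustrative and not taken from any cited design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical 8-bit module output; one bit of module B is flipped by an upset.
good = 0b10110010
upset = good ^ 0b00001000

# A single faulty module is masked by the voter.
assert majority_vote(good, upset, good) == good

# If two modules fail in the same bit, the voter passes the wrong value through,
# which is why TMR presumes at most one faulty module at a time.
assert majority_vote(upset, upset, good) == upset
```

The second assertion shows the limit of the scheme: the voter has no notion of which inputs are correct, only of which value is in the majority.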
Although TMR can effectively improve the reliability of the design, it also has many shortcomings. The main ones are as follows:
(1) It cannot repair faulty modules. When a module fails, the majority voter simply masks the error; the faulty module remains in place. In addition, conventional TMR cannot detect and locate the error so that the system can repair it. If the error is not repaired in time, TMR fails as soon as a second module errs.
(2) Many studies only consider the impact of a single error and ignore the possibility of multiple SEUs occurring simultaneously. Although the probability of this happening is low, it does exist. Experiments also show that TMR is very effective in reducing the impact of a single SEU, but the accumulation of SEUs in the configuration memory will reduce the effect.
(3) Ordinary TMR has high resource overhead and low resource utilization. It triplicates the entire design or a large module, so the granularity is coarse and the resource overhead is 200% relative to the original circuit. If design constraints such as FPGA hardware resources and power consumption prevent TMR from being applied to the entire circuit or module, resources are wasted.
(4) Power consumption increases because the circuit is triplicated, and speed decreases because of the voter and the additional routing.
(5) The voter itself may also fail, but a conventional TMR voter cannot detect its own errors and is not radiation-hardened.
(6) When a triplicated circuit drives a non-redundant circuit, a voter is required to merge the three signals into one. When a non-redundant circuit drives a triplicated circuit, additional routing is required to fan the single signal out into three. Because both logic and routing resources are sensitive to SEUs, these transitions reduce system reliability.
2 Improved TMR method
2.1 Combining TMR and Scrubbing
TMR itself cannot repair faulty modules. If only one module has an error, system function is unaffected; however, if the faulty module is not repaired before a second module fails, the redundancy fails. Therefore, a faulty module must be repaired promptly once an error occurs.
With the development of dynamic reconfiguration technology, Scrubbing, a method of rewriting the FPGA configuration, has emerged. The most serious threat to space electronic systems is soft faults such as SEUs, which reconfiguration can correct, so periodically refreshing the configuration memory repairs such errors.
Scrubbing and TMR can be used together to protect against SEUs. However, many studies consider only a single error and ignore the possibility of multiple SEUs occurring at the same time. In theory, a fast refresh rate keeps the number of errors present at any time to one. In practice, errors occur randomly, so no refresh rate can guarantee that at most one error occurs within a refresh cycle. To use this method, the SEU rate must first be estimated through a complex experimental process. The empirical rule for selecting the Scrubbing refresh rate is to make it one order of magnitude higher than the estimated error rate. As FPGAs grow larger, loading the entire configuration bitstream takes hundreds of milliseconds, so the refresh rate cannot be guaranteed and system power consumption increases.
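Why no refresh rate can rule out multiple upsets within one cycle can be made concrete with a small calculation. The sketch below assumes that upsets arrive as a Poisson process with a constant rate, which is a modeling assumption and not a claim from the literature discussed here; the function name and the example rate are hypothetical.

```python
import math


def p_multiple_upsets(upset_rate_hz: float, scrub_period_s: float) -> float:
    """Probability of two or more upsets within one scrub period (Poisson assumption)."""
    mean = upset_rate_hz * scrub_period_s  # expected number of upsets per scrub period
    # P(N >= 2) = 1 - P(0) - P(1) = (1 - e^-m) - m * e^-m
    return -math.expm1(-mean) - mean * math.exp(-mean)


# Rule of thumb from the text: scrub ten times faster than the estimated upset rate.
# With a hypothetical rate of one upset per second, that is a 0.1 s scrub period.
p = p_multiple_upsets(upset_rate_hz=1.0, scrub_period_s=0.1)
print(f"P(two or more upsets in one scrub period) = {p:.4f}")
```

Under these assumptions the probability is small but never zero, and it shrinks only as the scrub period shrinks. It is also an upper bound on TMR failure, since two upsets defeat the voter only if they hit sensitive bits of different modules.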
With partial TMR, a voter with error detection and location functions can be designed. When a module fails, the voter signal directly triggers reconfiguration of only the faulty part of the circuit. This addresses the scrubbing time and power consumption problems and offers a way to prevent error accumulation.
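A voter with error detection and location can be sketched in a few lines of Python. This is only a behavioral illustration: the names `vote_and_locate` and `reconfigure` are hypothetical, and `reconfigure` merely stands in for whatever partial reconfiguration request a real system would issue.

```python
from typing import List, Tuple


def vote_and_locate(outputs: Tuple[int, int, int]) -> Tuple[int, List[int]]:
    """Return the bitwise majority value and the indices of modules that disagree with it."""
    a, b, c = outputs
    voted = (a & b) | (a & c) | (b & c)
    faulty = [index for index, value in enumerate(outputs) if value != voted]
    return voted, faulty


def reconfigure(module_index: int) -> str:
    """Hypothetical stand-in for a partial reconfiguration request for one module."""
    return f"reconfigure module {module_index}"


# Module 2 delivers a corrupted word; the voter masks it and reports its index.
voted, faulty = vote_and_locate((0b1100, 0b1100, 0b0100))
requests = [reconfigure(index) for index in faulty]
assert voted == 0b1100
assert requests == ["reconfigure module 2"]
```

Only the module named in the request is rewritten, which is what lets this approach avoid reloading the full bitstream.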
To guard against voter errors, the voter can be implemented in radiation-insensitive devices rather than in SRAM-based logic, which improves its robustness. The literature also proposes an improved voter. Instead of using a majority voter on the outputs of the three redundant modules, each module's output is driven onto its own FPGA output pin through a tri-state buffer controlled by a minority voter, and the three pins are finally wired-OR into one signal on the printed circuit board (PCB). The minority voter determines whether the signal of its redundant module is the minority value. If it is, the corresponding buffer outputs high impedance; if not, the signal is output normally.
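The minority-voter output stage can be modeled behaviorally as follows. This is a sketch for a single bit, assuming that an undriven (high-impedance) pin contributes nothing to the wired-OR line; `None` represents high impedance, and all function names are illustrative.

```python
from typing import List, Optional, Tuple


def is_minority(own: int, other1: int, other2: int) -> bool:
    """True when this module's bit disagrees with both other modules."""
    return own != other1 and own != other2


def pin_outputs(bits: Tuple[int, int, int]) -> List[Optional[int]]:
    """Each tri-state buffer drives its module's bit, or goes high-impedance (None) if minority."""
    a, b, c = bits
    return [
        None if is_minority(a, b, c) else a,
        None if is_minority(b, a, c) else b,
        None if is_minority(c, a, b) else c,
    ]


def wired_or(pins: List[Optional[int]]) -> int:
    """Board-level line is 1 if any driven pin outputs 1 (undriven pins are ignored)."""
    return int(any(pin == 1 for pin in pins))


# Module B is upset and outputs 1 while the others output 0.
pins = pin_outputs((0, 1, 0))
assert pins == [0, None, 0]   # the faulty pin is disconnected
assert wired_or(pins) == 0    # the line carries the correct value

# Without the minority voters, the faulty 1 would dominate the wired-OR line.
assert wired_or([0, 1, 0]) == 1
```

The last assertion shows why the buffers are needed: a plain wired-OR of the three pins would let a single erroneous 1 override two correct 0s.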
Readback builds on Scrubbing. The configuration data read back is compared with the original configuration data, and the device is reconfigured once errors are found. The literature uses this method together with error correction codes that protect the configuration data. Each configuration frame is protected by a 12-bit SEC-DED (single-error-correcting, double-error-detecting) Hamming code, and each basic unit in the FPGA has a distinct identification code. After the configuration is read back through the ICAP (Internal Configuration Access Port), the error correction code gives the location of the error.
Scrubbing can repair functional errors caused by SEUs in LUTs, the routing matrix, and CLBs without interrupting circuit operation. However, it cannot change the contents of flip-flops, so it cannot restore register state. When the value of a storage element is flipped, it can only be repaired by resetting the system, which interrupts system function and seriously degrades performance.
2.2 Small-granularity TMR technology
With the emergence of partial dynamic reconfiguration, small-granularity TMR methods have appeared. TMR can be applied at a finer granularity, with careful placement and routing to stay within the allowed resource overhead while maximizing reliability. The literature reports an experimental analysis of the fault tolerance of TMR at different granularities in the presence of multiple errors. The results show that small-granularity TMR performs better than TMR applied to the entire system as a single unit.
Where global TMR is not feasible (for example, when resources are limited), small-granularity TMR is the better choice, because it improves system reliability while using fewer resources. Since redundancy is not applied to all modules, the implementation must focus TMR on the modules that contribute most to system reliability. The number and location of voters must also be considered. Additional routing is required before and after the triplicated modules, and both logic and routing resources are sensitive to SEUs, so these transitions reduce system reliability. In Figure 2, the shaded parts are sensitive to SEUs. The figure shows that the sensitive part in (c) is larger than that in (b), which is caused by the voter and the additional routing. Therefore, transitions between triplicated circuits and non-triplicated circuits should be limited, so that system reliability is improved by concentrating the use of triple modular redundancy.
To select the modules that need triple modular redundancy and to place and route them sensibly, the errors that occur in the system are divided into persistent and non-persistent errors. Persistent errors are SEU-induced errors that change the internal state of the circuit and therefore remain after reconfiguration; non-persistent errors can be eliminated by reconfiguring the FPGA.
Based on the above analysis, the priority levels for implementing partial TMR are as follows:
The first level is the part of the circuit where persistent errors occur.
The second level is the circuitry that drives the persistent part, whose failure would cause a persistent error; including it reduces the number of transitions between TMR and non-TMR logic.
The third level is the circuitry driven by the persistent part, included on the same principle of reducing transitions between TMR and non-TMR logic.
The fourth level is the circuitry that is independent of the part producing persistent errors.
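The four priority levels can be sketched as a graph classification. The sketch reads level two as logic that drives the persistent part and level three as logic driven by it, which is an interpretation of the list above. The netlist is modeled as a hypothetical fan-out dictionary, and the set of persistent nodes is taken as given rather than computed.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable(graph: Dict[str, List[str]], start: Iterable[str]) -> Set[str]:
    """Breadth-first search: all nodes reachable from the start set, including the start set."""
    seen = set(start)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def tmr_priority(fanout: Dict[str, List[str]], persistent: Set[str]) -> Dict[str, int]:
    """Assign each node a TMR priority level from 1 (highest) to 4 (lowest)."""
    nodes = set(fanout) | {dst for outs in fanout.values() for dst in outs}
    # Reverse the edges so that upstream logic can be found with the same search.
    fanin: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, outs in fanout.items():
        for dst in outs:
            fanin[dst].append(src)
    upstream = reachable(fanin, persistent) - persistent
    downstream = reachable(fanout, persistent) - persistent - upstream
    levels = {}
    for node in nodes:
        if node in persistent:
            levels[node] = 1
        elif node in upstream:
            levels[node] = 2
        elif node in downstream:
            levels[node] = 3
        else:
            levels[node] = 4
    return levels


# Hypothetical design: a state machine with feedback, plus an unrelated filter path.
design = {"in_reg": ["fsm"], "fsm": ["fsm", "decode"], "decode": ["out"], "filter": ["out2"]}
levels = tmr_priority(design, persistent={"fsm"})
assert levels == {"fsm": 1, "in_reg": 2, "decode": 3, "out": 3, "filter": 4, "out2": 4}
```

Redundancy would then be applied level by level until the resource budget is exhausted.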
The circuit can be partitioned through static analysis. One difficulty is that standard global TMR triplicates all inputs, outputs, and clocks, whereas partial TMR may not be able to provide redundancy for I/O and clocks. Like logic without TMR, clocks and I/O without TMR can produce undetectable errors.
The experimental results show that, because this method concentrates on the part of the circuit that can produce persistent errors, the probability of persistent errors falls rapidly as more redundant resources are used and is eventually almost eliminated. Partial TMR can therefore balance resources against reliability, maximizing resource utilization while minimizing the impact on reliability.
Additionally, an upset may change configuration bits that control routing, causing a short circuit between two different redundant modules. Such an upset affects more than one module in the TMR and causes output errors. Since 90% of the configuration resources are used to control routing, this issue needs to be considered. The likelihood of such errors also depends on the layout of the TMR, which depends directly on the number of majority voters. As the number of voters increases, additional connections are required between modules, so the modules must be placed close together, which increases the chance that an upset causes a short circuit between them. To reduce the likelihood of routing errors that undermine the robustness of the TMR, the connections between modules must be minimized. Reducing the number of majority voters reduces the connections between modules.
The solution to this problem is to use coarser-grained TMR to reduce the connections between modules, with voters only at the circuit outputs. This raises a new problem, however. To repair the state of storage elements, Xilinx proposed the XTMR method, which places voters wherever there is a register and adds feedback to correct the effect of an upset on the register value. If the internal voters are removed, storage-element errors in the faulty module are no longer corrected, so after reconfiguration the module is out of sync with the working state of the other modules.
The method used in the literature to solve this problem is to read the stored working state from an external memory. However, this method requires all three modules to be taken offline before the state can be stored and restored to achieve synchronization. This is clearly not feasible, especially for circuits with strict real-time requirements.
To address this problem, the literature proposes a synchronization technique that uses a state register replication mechanism in the TMR system and introduces a data path between the three redundant modules to transfer state register data. When a redundant module fails and is partially reconfigured, it can complete synchronization and resume work promptly by accepting the token of the main controller and copying the state register data from a healthy module. This method shortens the time from repairing the faulty module to rejoining the system, thereby reducing the probability of accumulated faults and improving the reliability of the redundant system.
Another method is to predict the state that the other modules will reach soonest and preset the reconfigured module to that state. In this case, only the module being reconfigured needs to be stopped, and the other two modules keep working. When the working modules reach the preset state, the three modules work together again. Choosing the state is the difficulty here: it must be reached very frequently and within a short time. Using multiple candidate states improves efficiency but increases the width of the preset signal.
If the state of the registers is unpredictable, as in register chains and adders, this method cannot be used, so it needs to be improved to widen its applicability. When placing and routing the TMR modules, they should be separated by a certain distance, which also reduces the chance that an error in one module affects the others.
2.3 TMR technology based on the improvement of the basic unit structure of FPGA
TMR consumes a lot of resources, and with general methods, the finer the granularity, the more resources are consumed by routing and voter implementation. The literature proposes a relatively novel small-granularity method that modifies the structure of the SRAM-based FPGA (the LUT and CLB structure) to implement fine-grained TMR with reduced resource consumption.
The literature uses Xilinx Virtex-5 series devices, whose basic logic element is a 6-input LUT built from two 5-input LUTs, as shown in Figure 3. If two functions share the same five inputs, the element can implement both as two 5-input LUTs. In some cases only one 5-input LUT is needed and the other is left unused. If this situation is arranged consistently during placement, the spare resources can be used to implement TMR.
One TMR requires two LUTs, while two TMRs require only three LUTs. Voters, error reporting circuits, and other control lines must be implemented inside the LUT, which increases the routing and delay inside the LUT.
The advantages of this method are finer granularity, higher reliability, lower resource consumption, and conditional reconfiguration driven by error detection and location, which reduces power consumption and configuration time. Experimental results show that this method requires only 76.5% additional resources, whereas traditional small-granularity TMR requires 242%.
2.4 TMR technology based on spatial search method
With the emergence of partial and small-granularity TMR, the choice of granularity and of the circuit modules to protect has become a key issue under FPGA and design constraints. Often only the requirements for resources, power consumption, and reliability are known, and finding an actual layout and implementation is difficult. The literature proposes a method based on spatial search. Given parameters such as resources, power consumption, and reliability, it searches among the possible solutions to obtain the best result.
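The idea of searching the solution space can be illustrated with a brute-force sketch. It is not the algorithm of the cited work: it simply enumerates every subset of modules to triplicate, discards those that exceed the budgets, and keeps the one with the highest score. The module names, costs, and the additive `gain` score are all hypothetical; a real flow would obtain reliability figures from fault injection or analysis.

```python
from itertools import combinations
from typing import Dict, FrozenSet, NamedTuple, Tuple


class Module(NamedTuple):
    luts: int      # extra LUTs needed to triplicate this module
    power_mw: int  # extra power drawn by the two added copies and the voter
    gain: int      # hypothetical reliability score gained by protecting the module


def best_tmr_selection(
    modules: Dict[str, Module], lut_budget: int, power_budget_mw: int
) -> Tuple[FrozenSet[str], int]:
    """Exhaustively search all subsets and return the feasible one with the highest total gain."""
    best: FrozenSet[str] = frozenset()
    best_gain = 0
    names = sorted(modules)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            luts = sum(modules[name].luts for name in subset)
            power = sum(modules[name].power_mw for name in subset)
            if luts > lut_budget or power > power_budget_mw:
                continue  # violates a design constraint
            gain = sum(modules[name].gain for name in subset)
            if gain > best_gain:
                best, best_gain = frozenset(subset), gain
    return best, best_gain


candidates = {
    "fsm": Module(luts=200, power_mw=30, gain=50),
    "alu": Module(luts=600, power_mw=80, gain=30),
    "fifo": Module(luts=300, power_mw=40, gain=15),
    "uart": Module(luts=100, power_mw=10, gain=5),
}
selection, score = best_tmr_selection(candidates, lut_budget=700, power_budget_mw=100)
assert selection == frozenset({"fsm", "fifo", "uart"})
assert score == 70
```

Exhaustive enumeration grows exponentially with the number of modules, so a practical tool would need a heuristic or pruned search; the sketch only shows the shape of the problem.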
2.5 Time-based TMR technology
The basic idea of time-based fault masking is to tolerate faults through repeated computation, that is, to perform the same calculation two or more times and compare the results to detect and overcome errors. When a part of the circuit produces a result, it is stored temporarily; the calculation is then repeated after a certain delay and that output is stored as well. If the two results disagree, an error has occurred. In that case the calculation is repeated once more after the same delay, and its output is taken as the correct result.
This method is very effective for detecting transient faults, but its fault tolerance depends on the delay time. It trades execution time for hardware resources and is therefore less practical for systems with strict real-time requirements.
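The compare-and-repeat procedure can be written as a short Python sketch. The function name `time_redundant` is illustrative, the delay between runs is omitted, and the faulty computation in the example is simulated.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def time_redundant(compute: Callable[[], T]) -> T:
    """Run the computation twice; on a mismatch, run it a third time and return that result."""
    first = compute()
    second = compute()
    if first == second:
        return first
    # The two stored results disagree, so a transient fault occurred in one of the runs.
    # Following the scheme in the text, a further run is taken as the correct output.
    return compute()


# Simulated computation whose first run is corrupted by a transient fault.
calls = {"count": 0}


def flaky_sum() -> int:
    calls["count"] += 1
    return 41 if calls["count"] == 1 else 42


assert time_redundant(flaky_sum) == 42
assert calls["count"] == 3  # the mismatch cost one extra run
```

A stricter variant would vote among all three results instead of trusting the third run, at the cost of storing one more value.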
2.6 TMR technology based on software and hardware redundancy
The methods above fail when the hardware suffers irreparable damage. In this case, each module is given three different versions of its configuration file (one in use and two as backups) and a 1/4 reserve of redundant hardware resources. If a hardware failure occurs, the module is first reconfigured with one of the other versions. If this does not solve the problem, the layout is redone using the redundant resources to bypass the faulty part. However, because of the extra redundant resources and storage required, this method further increases resource consumption.
3 Prospects of TMR technology development
Based on the above analysis, the relationship between the problems of TMR technology and the improvement methods is described by the block diagram shown in Figure 4. Driven by the problem of hardware fault accumulation and the demand for system reliability, a variety of new TMR-based solutions have been produced. However, each of these techniques targets only certain problems; it solves some of them and introduces new ones. TMR-based fault-tolerant technology is therefore still not mature.
However, the small-granularity TMR technology is a very flexible method. It can achieve better performance on the basis of saving resources when combined with other methods. The technology based on small-granularity TMR will be a major development direction of TMR technology, and the impact of the relatively increased wiring resources on system reliability needs to be further resolved. In addition, since the implementation of small-granularity TMR requires the selection and layout of various circuits in the system, the automation of TMR implementation is also a direction that needs to be studied.
4 Conclusion
This paper has summarized the outstanding problems of TMR technology, studied the new methods, analyzed their advantages and problems, and pointed out corresponding solutions. The development of TMR technology should aim at efficient implementation and reliability: based on a robust evaluation strategy and the parameter requirements to be met, the final TMR solution should be obtained by weighing different granularities and layouts in a highly automated manner.