Brief Analysis of Real-time Monitoring System of UPS Power Supply-EEWORLD

　　Among the indicators used to measure the safety performance of a UPS system, two are particularly important: reliability and availability. Because a UPS is the main equipment for improving power quality, its own reliability and availability are the most fundamental measures of its performance. This article analyzes the factors that affect UPS availability in detail and concludes that advanced, intelligent UPS management technology is an effective way to improve it. New UPS management technologies and products are therefore of great significance for improving the availability of UPS systems.

　　From the definition of system availability, there are two ways to improve the availability of a UPS system: improve its reliability, that is, extend the mean time between failures (MTBF), or reduce the mean time to repair (MTTR). The relationship between MTTR and availability shows that shortening MTTR has the more significant effect on availability.
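　　As an illustration, the sketch below assumes the common steady-state definition, availability = MTBF / (MTBF + MTTR), which the article refers to but does not spell out. The function name and all figures are hypothetical and chosen only to show the shape of the relationship.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    Assumes availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, not taken from any real UPS.
baseline = availability(mtbf_hours=100_000, mttr_hours=24)
double_mtbf = availability(mtbf_hours=200_000, mttr_hours=24)
halve_mttr = availability(mtbf_hours=100_000, mttr_hours=12)

for label, value in [("baseline", baseline),
                     ("MTBF doubled", double_mtbf),
                     ("MTTR halved", halve_mttr)]:
    print(f"{label:>12}: {value:.6%}")
```

　　Under this definition, halving MTTR raises availability by the same amount as doubling MTBF. This is one way to read the claim above: cutting repair time is often the more practical lever, although that depends on the equipment and the service organization.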

　　Here we analyze the composition of MTTR in detail through a specific case: an 80 kVA UPS system. When such a system fails, it usually has to be repaired by the manufacturer's professional technicians. Many manufacturers make service commitments for such systems, such as "4-hour response" and "24-hour repair", but these figures are not the real fault recovery time. The "4-hour response" usually refers only to the time between the manufacturer's engineers receiving the user's notification and drawing up an on-site repair plan, which is still far from an actual repair. The "24-hour repair" commitment comes with many conditions, such as whether engineers and spare parts are available where the failed equipment is located. In practice, the real repair time depends on every link in the repair process.

　　Breaking the repair time in this case down into its actual stages shows that the repair time for one failure consists of the following periods:

　　Fault alarm notification time (T1): the time from when a fault occurs to when the user discovers it.

　　Manufacturer response time (T2): the time from when the user reports the fault to the manufacturer's after-sales service department to when the manufacturer's service engineer has contacted the user and drawn up an on-site repair plan.

　　Initial fault judgment time (T3): the time the manufacturer's service engineer needs to talk with the user by telephone or other means, understand the symptoms and the sequence of events, and reach a basic judgment on the fault.

　　On-site service time (T4): the time from when the service engineer has reached that basic judgment to when the engineer arrives on site.

　　Troubleshooting time (T5): the time from when the service engineer arrives on site to when the fault is resolved.

　　1. First, let’s analyze the first period of time – fault alarm notification time T1

　　This period seems as though it should be very short, but in practice it is highly uncertain. Medium- and large-capacity UPS systems are generally installed in a dedicated power room, which is usually unattended for reasons of noise and safety. As a result, a UPS failure often goes unnoticed until it has serious consequences. In addition, a UPS is high-power electrical equipment, and its routine maintenance requires specially trained personnel. After a failure, such personnel must go to the site to assess the situation before any action can be taken, which further slows fault notification. For these reasons, compounded by physical distance and the need for specialist knowledge, T1 becomes very uncertain and an important factor in reducing system availability.

　　A real case illustrates this. A bank data center in Tianjin was powered by a 125 kVA UPS installed on the second basement level, where it normally ran unattended. At 10 o'clock one morning, the UPS output was interrupted for 10 seconds, bringing down the entire data center. The engineer found no hardware fault in the UPS, but it had been running in bypass mode when the failure occurred. The UPS operation history showed a 10-second mains outage at that moment. Because the UPS was in bypass mode, the mains was in effect supplying the load directly, so the mains outage hit the load directly. Further inspection showed that the UPS had already been in bypass mode for two days: a large load had overloaded it, and it had locked itself in bypass mode (an operating mode configured on the UPS). The UPS had sounded an audible alarm at the time, but the staff were too far away to hear it, so the problem went unnoticed until the serious consequences occurred. In this case the fault notification time T1, usually considered unimportant, was as long as two days. Because of this large uncertainty, T1 can have a great impact on MTTR and may be an important reason for reduced UPS availability.
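　　To make the weight of T1 concrete, the sketch below adds up the five stages for a single failure. Only the 48-hour T1 mirrors the two-day delay in the case above. The other four durations, and all class and field names, are hypothetical values invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepairTimeline:
    """The five repair stages T1..T5 for one failure, all in hours."""
    t1_alarm_notification: float
    t2_manufacturer_response: float
    t3_initial_judgment: float
    t4_on_site_service: float
    t5_troubleshooting: float

    def total(self) -> float:
        # Repair time for one failure is the sum of the five stages.
        return (self.t1_alarm_notification + self.t2_manufacturer_response
                + self.t3_initial_judgment + self.t4_on_site_service
                + self.t5_troubleshooting)

    def share(self, stage: str) -> float:
        # Fraction of the total repair time spent in the named stage.
        return getattr(self, stage) / self.total()


# T1 = 48 h mirrors a fault that went unnoticed for two days;
# T2..T5 are made-up durations, not measurements.
unattended = RepairTimeline(48.0, 4.0, 2.0, 6.0, 4.0)
print(f"total repair time: {unattended.total():.0f} h")
print(f"T1 share of total: {unattended.share('t1_alarm_notification'):.0%}")
```

　　With these assumed numbers, T1 accounts for three quarters of the total, even though it is the stage that service commitments such as "4-hour response" do not cover.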

　　2. Let’s look at the second period of time – manufacturer response time T2

　　Because maintaining medium- and large-capacity UPS systems requires professional knowledge and skills, it is usually carried out by the manufacturer's technical staff. The length of this period reflects how much importance the manufacturer attaches to after-sales service and how capable it is of providing it. Depending on the product, manufacturers offer after-sales response on either a 5×8 basis (5 days a week, 8 hours a day during statutory working hours) or a 7×24 basis (7 days a week, 24 hours a day).

　　3. Let’s look at the third period of time – initial fault judgment time T3

　　To speed up the repair, the manufacturer's after-sales service engineer usually contacts the user by telephone or other means before the on-site visit, in order to understand the symptoms and obtain the fault status and related information about the UPS system through the user. This work is very important, because the preliminary judgment guides the preparation for the on-site repair. The length of this period depends on many factors, including the user's maintenance skill level, the operating state of the system before the fault, the technical and communication abilities of the service engineer, and how convenient and user-friendly the product's intelligent management functions are. For example, the better the user understands the UPS system and the more skilled the user's operation and maintenance staff are, the shorter T3 will be. Besides technical ability, non-technical factors often determine the length of T3: differences in dialect, habits of expression, and even personality between the user and the service engineer, together with the engineer's communication skills, directly affect how effective the communication is.

　　4. Let’s look at the fourth period of time – on-site service time T4

　　The time a manufacturer's engineer needs to reach the site is affected by distance, weather, and traffic conditions, but it is relatively easy to control and can be treated as a relatively stable parameter in MTTR analysis.

　　5. Finally, let’s look at the fifth period of time – troubleshooting time T5

　　Besides depending on the technical level of the service engineer, this period is directly affected by the result of the preliminary fault judgment in the third stage (T3). If that judgment is wrong, the spare parts brought to the site may not match what the repair needs, and the fault cannot be fixed quickly. The structural design of the UPS system also has a large effect on troubleshooting time. For example, some manufacturers use a modular design, which greatly shortens the time needed to replace faulty parts. Some also combine modular design with "N+1" redundant configuration, which greatly shortens the troubleshooting time T5.

　　In summary, of all the stages that affect repair time, the manufacturer's service standards and the engineers' technical level clearly matter, but the fault alarm notification and preliminary fault judgment stages are subject to many uncertain factors and vary widely. Because they are also often overlooked, they frequently become the main reason that MTTR is prolonged. To shorten T1 (fault alarm notification time), T3 (initial fault judgment time), and T5 (troubleshooting time) effectively, two things are needed. First, the UPS system must support remote fault alarms, so that when a fault occurs it can promptly report the fault information to off-site operation and maintenance personnel through effective remote alarm channels. Second, after-sales service engineers must be able to examine the fault through direct, objective means, so that they obtain correct and complete information and avoid the distortion and omissions introduced by human factors.

　　Giving a UPS system functions such as remote alarm, remote testing, remote fault diagnosis, and remote repair requires new power management technologies, including a range of accessories and software products. The following describes the fault repair process once these technologies are adopted and shows how strongly power management technology affects the availability of UPS systems.

　　The UPS system is fitted with a remote alarm management card that the system administrator configures. Once configured, the card automatically checks the UPS at regular intervals according to those settings. When it detects a potential problem or a failure, it immediately sends an alarm notification to the operation and maintenance personnel by telephone, pager, email, or mobile text message, so that the failure can be prevented or reported promptly to the manufacturer's after-sales service department. This shortens the alarm time T1 to a matter of minutes.

　　After receiving the alarm, the UPS maintenance personnel immediately notify the manufacturer's after-sales service staff. The manufacturer's service engineers can then access the faulty UPS directly over the telephone network or the Internet, test and diagnose it remotely, and download its operating parameters and operating history. All of this is done by the service engineers themselves without user involvement, which avoids interference from human factors and makes the initial judgment of the fault more accurate. This greatly shortens the initial fault judgment time T3 and lays the foundation for shortening the troubleshooting time T5.

　　Once the fault has been identified, the service engineers act according to the situation. If the fault is caused only by improperly set system parameters, it can be cleared by adjusting those parameters remotely. If on-site troubleshooting is required, the engineers can bring the necessary spare parts with them. Because the initial judgment is relatively accurate, T5 is shortened accordingly. The overall MTTR is thus greatly reduced, which can significantly improve the availability of the system.
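　　The periodic check-and-notify behavior described above can be sketched roughly as follows. This is a minimal illustration, not the logic of any real management card: the `UpsStatus` fields, the mode names, the overload threshold, and the `read_status` and `notify` callbacks are all assumptions standing in for whatever interface and message channel a real product provides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List


@dataclass
class UpsStatus:
    """Hypothetical snapshot of a UPS; field names are illustrative only."""
    timestamp: datetime
    mode: str            # assumed values: "online", "bypass", "battery"
    load_percent: float


def find_alarms(status: UpsStatus, overload_percent: float = 100.0) -> List[str]:
    """Return a human-readable alarm message for each abnormal condition."""
    alarms = []
    if status.mode != "online":
        alarms.append(f"UPS is in {status.mode} mode")
    if status.load_percent > overload_percent:
        alarms.append(f"load at {status.load_percent:.0f}% exceeds limit")
    return alarms


def poll_once(read_status: Callable[[], UpsStatus],
              notify: Callable[[str], None]) -> int:
    """Read one snapshot, send every alarm found, return how many were sent."""
    status = read_status()
    alarms = find_alarms(status)
    for message in alarms:
        notify(f"[{status.timestamp:%Y-%m-%d %H:%M}] {message}")
    return len(alarms)


# Stand-ins for illustration: a fixed snapshot instead of a real UPS, and a
# list instead of a real email or text-message channel.
sent: List[str] = []
snapshot = UpsStatus(datetime(2024, 1, 1, 10, 0), mode="bypass", load_percent=112.0)
count = poll_once(lambda: snapshot, sent.append)
print(count, sent)
```

　　In a deployment, `poll_once` would be called on a timer and `notify` would forward each message to off-site staff. A check like this reports an overload-induced switch to bypass within one polling interval instead of relying on an audible alarm in an unattended room.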
