In the number portability system, the CSMS (centralized management system) collects the basic data and number portability rules for all operators' number portability users nationwide. It acts as both the number portability service provider and the arbitrator, so it occupies a critical position. Together with each operator's number portability production system, the CSMS serves users' number portability requests in real time and, as a carrier-grade system, must deliver 99.99% service availability around the clock (24 hours a day, 365 days a year). A high availability solution for the CSMS is therefore essential.
High availability generally means improving the availability of systems and applications by minimizing downtime. One way to improve availability is to make each computer component more reliable, but this alone is not sufficient: even a highly reliable single server remains a single point of failure. The more mature approach in the industry is therefore clustering, which adds redundant devices so that when one device stops providing service because of a fault, the others continue to provide it. In this article, high availability also covers "fast recovery": once the system has stopped and been restarted, business applications should be restored as quickly as possible.
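To make the availability target and the benefit of redundancy concrete, the following minimal sketch converts an availability figure into a yearly downtime budget and estimates the availability of a group of redundant nodes. The function names are illustrative, and the cluster formula assumes that node failures are independent and that any single surviving node can carry the service, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget per year for an availability given as a fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def cluster_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant nodes.

    Assumption: failures are independent and one surviving node is enough,
    so the cluster is down only when every node is down at the same time.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    print(f"{downtime_minutes_per_year(0.9999):.2f} minutes of downtime per year")
    print(f"{cluster_availability(0.99, 2):.4f} availability for two 99% nodes")
```

Under these assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime per year, and two redundant nodes that are each 99% available reach about 99.99% together.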
This article mainly introduces the cluster technology that can be used at various levels in CSMS to achieve overall high availability of the system.
2 Application scope of system high availability technology
In the number portability system, the core data layer from the operator interface to the CSMS mainly includes the following functional layers, and the high availability solution is mainly developed around these layers.
(1) Network layer: This is the part that connects to the operator. The main considerations are how to avoid single point failures in transmission and how to avoid single point failures in network equipment.
(2) Web server layer: How to ensure that there is no single point of failure in the Web server? If multiple Web servers are provided, how to coordinate resources between them?
(3) Application server layer: After the Web server submits a request to the application server, how can a single point of failure in the application server be avoided, and how should resources be coordinated among multiple application servers?
(4) Database server layer: When an application server submits a request to a database server, how can a single point of failure in the database server be avoided, and how should resources be coordinated among multiple database servers?
(5) Application software: Even if we take various measures, there is still a possibility of server hardware failure. How should we design our application software to ensure that the system can recover quickly after the system is restarted?
(6) Data layer: How to ensure data storage is secure and reliable?
In order to answer the above questions, we need to study and summarize various high availability technologies.
3 Research on High Availability Technology
3.1 CSMS System Architecture
Figure 1 shows the CSMS system organizational structure.
Figure 1 CSMS system organizational structure
In order to ensure high availability of the system and prevent single point failures, each functional layer of the system adopts redundant configuration on the hardware equipment, and various software solutions are designed to achieve high availability of the system.
3.2 Network Solution
In terms of network solutions, the dedicated lines between the system and each operator use 155M POS or MSTP dual optical cables for access, using the redundancy and self-healing capabilities of the transmission network to ensure high availability of the system's physical access lines. The two optical cables of each operator are connected to the two access routers of the system to avoid single point failures of router equipment as much as possible. Each router is configured with multiple network cards to access the dedicated lines of multiple operators to prevent a single card failure from affecting the access of more operators.
For the router-to-carrier links, a dynamic routing protocol is required. When the default route from a router to a carrier fails (for example, because of a line failure or a board failure), the alternative route must be advertised to all related devices, and new connections are established over the new route. For the router-to-firewall links, the VRRP protocol is used for dynamic IP address binding: the two routers present a single virtual IP address to the firewall, which is bound to the real address of one router by default. When a switchover is required, the virtual address is rebound to the real address of the other router. The firewall needs no configuration change for the switchover to complete.
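The virtual address binding described above can be pictured with a small model. This is only a sketch of the outcome of a master election by priority; real VRRP also involves advertisement timers, preemption settings and a virtual MAC address, none of which are modeled here. The router names, addresses and priorities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    real_ip: str
    priority: int
    alive: bool = True


def elect_master(routers):
    """Return the live router with the highest priority, or None if all are down."""
    live = [r for r in routers if r.alive]
    return max(live, key=lambda r: r.priority) if live else None


VIRTUAL_IP = "192.0.2.1"  # hypothetical address seen by the firewall
router_a = Router("router-a", "192.0.2.2", priority=120)  # default owner
router_b = Router("router-b", "192.0.2.3", priority=100)  # backup

if __name__ == "__main__":
    master = elect_master([router_a, router_b])
    print(f"{VIRTUAL_IP} is bound to {master.name} ({master.real_ip})")
    router_a.alive = False  # simulate a failure of the default owner
    master = elect_master([router_a, router_b])
    print(f"{VIRTUAL_IP} is bound to {master.name} ({master.real_ip})")
```

The firewall keeps addressing the same virtual IP throughout; only the binding behind it changes.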
3.3 Web Server Load Balancer Solution
After the request from the client passes through the network device, it will first reach the Web server. From the perspective of high availability design of the system, the system will deploy multiple Web servers for clustering. Clustering between Web servers includes two aspects: Web load balancing and session failover.
Load balancing can be done using a variety of technologies, such as using a hardware load balancer, or deploying load balancing software on a web server, with the web server acting as a load balancer. The main features of a load balancer include:
(1) Single point access
From the client's perspective, multiple Web servers have only one address, which is the service address of the load balancer. This has two advantages: first, the client does not need to configure multiple Web server addresses, which is more convenient; second, the address information of specific devices in the network can be shielded from the client network, which has a certain effect on network protection.
(2) Implementing a load balancing algorithm
When a client request comes in, the load balancer can decide which backend Web server to forward the request to for processing. The mainstream algorithms include: round-robin algorithm, random algorithm and weighted algorithm. Regardless of the algorithm, the load balancer always tries to make each server instance share the same pressure.
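The three mainstream algorithms can be sketched as follows. This is a minimal illustration rather than the behavior of any particular load balancer product; the class names and server names are assumptions made for the example.

```python
import itertools
import random


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class RandomBalancer:
    """Picks a server uniformly at random."""

    def __init__(self, servers, rng=None):
        self._servers = list(servers)
        self._rng = rng or random.Random()

    def pick(self):
        return self._rng.choice(self._servers)


class WeightedBalancer:
    """Picks a server at random, in proportion to its configured weight."""

    def __init__(self, weights, rng=None):
        self._servers = list(weights)
        self._weights = [weights[s] for s in self._servers]
        self._rng = rng or random.Random()

    def pick(self):
        return self._rng.choices(self._servers, weights=self._weights, k=1)[0]


if __name__ == "__main__":
    rr = RoundRobinBalancer(["web1", "web2", "web3"])
    print([rr.pick() for _ in range(4)])
    weighted = WeightedBalancer({"web1": 3, "web2": 1})
    print([weighted.pick() for _ in range(8)])
```

Weights are typically chosen to reflect server capacity, so that a more powerful server receives a proportionally larger share of requests.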
(3) Health Check
Once a web server stops working, the load balancer can detect it and stop forwarding requests to this server. Similarly, when the failed server starts working again, the load balancer can also detect it and start forwarding requests to it.
(4) Session stickiness
Most Web applications keep some session state, such as whether a process in the number portability system has ended, or whether a request message has received the corresponding ACK or response. Because the HTTP protocol itself is stateless, the session state must be recorded somewhere and associated with the client so that it can be retrieved on the next request. When balancing load, a request that belongs to an existing session should be forwarded to the server instance that handled the previous request of that session; otherwise the application may not work properly.
Because the session state is generally stored in the memory of a Web server instance, the "session stickiness" feature is very important for the load balancer. However, if a Web server fails for some reason, the session state on this server will be lost. The load balancer can detect this error and no longer forward requests to this server, but other errors may occur due to the loss of session state. Therefore, the load balancer must also have another important function "session failover".
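Health checking and session stickiness can be combined in one routing decision, as in the minimal sketch below. The class and method names are hypothetical, and the health state is set by hand here, whereas a real load balancer would update it from periodic probes.

```python
class StickyBalancer:
    """Round-robin balancer that keeps each session on one healthy server."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._affinity = {}  # session id -> server that owns the session
        self._next = 0

    def mark_down(self, server):
        """Called when a health check fails."""
        self._healthy.discard(server)

    def mark_up(self, server):
        """Called when a failed server passes its health check again."""
        if server in self._servers:
            self._healthy.add(server)

    def _round_robin(self):
        if not self._healthy:
            raise RuntimeError("no healthy web server available")
        while True:
            server = self._servers[self._next % len(self._servers)]
            self._next += 1
            if server in self._healthy:
                return server

    def route(self, session_id):
        server = self._affinity.get(session_id)
        if server not in self._healthy:  # new session, or its server is down
            server = self._round_robin()
            self._affinity[session_id] = server
        return server


if __name__ == "__main__":
    lb = StickyBalancer(["web1", "web2"])
    first = lb.route("session-1")
    print(first, lb.route("session-1"))  # same server twice
    lb.mark_down(first)
    print(lb.route("session-1"))  # rerouted to the surviving server
```

Rerouting alone does not bring the session state with it; the state held in the failed server's memory is still lost unless it was backed up elsewhere.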
(5) Session failover
The implementation mechanism of session failover is that after a Web server receives a request from a client, it backs up the session object to a certain place to ensure that the session state is not lost when the server fails.
There are different solutions for backing up session data. The more mainstream solutions include database solutions and memory replication solutions.
The database solution is to let the Web server store session data in the database at the appropriate time. When a failover occurs, another available Web server instance takes over the failed server and restores the session state from the database. The advantages of the database solution are:
●Easy to implement. Separating request processing from session backup makes the cluster more robust and easier to manage.
●Even if the entire cluster fails, session data can still be saved and can be used when the system restarts.
The disadvantage of the database solution is that database transactions consume considerable resources, so performance is limited when sessions hold a large amount of data.
The memory replication solution saves session information in the memory of the backup server instead of persisting it in the database. Compared with the database solution, this solution has higher performance, and the network communication overhead between the original server and the backup server is very small. This solution saves the stage of "restoring" session data because the session information is already in the memory of the backup server.
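The memory replication approach can be sketched as follows: every session update on the primary server is copied to a peer, and the peer promotes its copy when the primary fails. The class, attribute and key names are illustrative assumptions; real containers replicate over the network and usually send only serialized changes.

```python
import copy


class WebServer:
    def __init__(self, name):
        self.name = name
        self.sessions = {}         # sessions this server is primary for
        self.backup_sessions = {}  # replicas held on behalf of a peer
        self.backup_peer = None

    def handle(self, session_id, key, value):
        """Update session state, then replicate it to the backup peer."""
        session = self.sessions.setdefault(session_id, {})
        session[key] = value
        if self.backup_peer is not None:
            self.backup_peer.backup_sessions[session_id] = copy.deepcopy(session)

    def take_over(self, session_id):
        """Promote a replicated session to primary after the peer has failed."""
        if session_id in self.backup_sessions:
            self.sessions[session_id] = self.backup_sessions.pop(session_id)
        return self.sessions.get(session_id)


if __name__ == "__main__":
    web1, web2 = WebServer("web1"), WebServer("web2")
    web1.backup_peer = web2
    web1.handle("session-1", "flow_state", "awaiting_ack")
    web1.sessions.clear()  # simulate losing web1's memory in a crash
    print(web2.take_over("session-1"))
```

Because the replica is already in the backup server's memory, takeover needs no separate restore step. The trade-off is that replicated sessions do not survive a failure of the whole cluster, which the database solution does cover.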
3.4 Application Server Solution Based on J2EE
Before introducing the cluster solution of application servers, it is necessary to introduce J2EE, because J2EE has become a de facto standard for distributed enterprise-level application development and deployment. The cluster solution of application servers is actually implemented based on certain J2EE standards.
In J2EE, business logic is encapsulated into reusable components. Components run in component containers of distributed servers. Containers communicate through relevant protocols to implement mutual calls between components. Therefore, the communication process between clients or web servers and application servers on the network that we see is a call between components or a call from a component to a container service in J2EE implementation. This call is divided into two stages in the J2EE specification: one is to access the JNDI server to obtain the proxy (EJB Stub) of the EJB component to be called, and the other is to call the EJB component.
The cluster solutions for JNDI access are divided into shared global JNDI tree solution, independent JNDI solution and central JNDI solution with high availability. Each solution can achieve high availability of JNDI service.
In the call phase of the EJB component, the client can actually only call a local object called "Stub". This local "Stub" has the same interface as the remote EJB and acts as a proxy. Stub knows how to find the real object on the network through the RMI/IIOP protocol. There are mainly three ways to cluster solutions in the process of calling EJB Stub:
●Smart Stub: Special behaviors are added to the Stub code, but these codes are transparent to the client (the client program knows nothing about these codes). These codes contain a list of accessible target servers and can detect the failure of the target server. They also contain very complex load balancing and failure transfer logic to distribute requests.
●IIOP runtime: The logic of load balancing and failover is integrated into the IIOP runtime, which makes the Stub very small and does not involve other codes.
●LSD (Location Service Daemon): The LSD acts as a proxy for EJB clients. In this solution, the EJB client obtains a Stub through a JNDI lookup. The routing information in this Stub points to the LSD rather than to the application server that actually hosts the EJB. When the LSD receives the client's request, it distributes the request to different application server instances according to its own load balancing and failover logic.
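The Smart Stub idea, a client-side proxy that holds a list of target servers and hides failover from the caller, can be sketched in a language-neutral way. The example is in Python rather than Java and does not use RMI/IIOP; the targets are plain callables standing in for remote EJB endpoints, and every name in it is hypothetical.

```python
class ServerUnavailable(Exception):
    """Raised by a target that cannot be reached."""


class SmartStub:
    """Client-side proxy that rotates over targets and fails over on error."""

    def __init__(self, targets):
        self._targets = list(targets)
        self._next = 0

    def invoke(self, *args):
        last_error = None
        for _ in range(len(self._targets)):
            target = self._targets[self._next % len(self._targets)]
            self._next += 1  # round-robin position, shared across calls
            try:
                return target(*args)
            except ServerUnavailable as exc:
                last_error = exc  # try the next server in the list
        raise ServerUnavailable("all target servers failed") from last_error


def failed_server(number):
    """Stand-in for an application server that is down."""
    raise ServerUnavailable("connection refused")


def healthy_server(number):
    """Stand-in for an application server that answers the call."""
    return f"ported:{number}"


if __name__ == "__main__":
    stub = SmartStub([failed_server, healthy_server])
    print(stub.invoke("13800000000"))
```

Transparent retry of this kind is only safe when the remote call is idempotent, or when the stub can tell that the failed server never started processing the request.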
3.5 Database Server Solution
There are two general methods for clustering database servers: one is based on the cluster software provided by the operating system, such as various HA software; the other is the cluster software provided by the database software itself.
3.5.1 HA Software
The working process of HA software is roughly as follows:
(1) In an HA network environment, the network is divided into a TCP/IP network and a non-TCP/IP network. The TCP/IP network is the public network for communication between application clients and servers. The non-TCP/IP network is the private network of the HA software. The simplest one can be a "Heart-Beat" line. HA technology uses the private network to monitor each node in the HA environment instead of the TCP/IP communication path.
(2) On an HA network, the TCP/IP network and non-TCP/IP network on each node will continuously send and receive Keep-Alive messages. Once a certain number of packets sent to a certain HA node are lost, it can be confirmed that the other node has failed. When the main network card (Service Adapter) of a node fails, the HA agent of the node will switch the network card, transfer the IP address of the original Service Adapter to the new Standby Adapter, and transfer the Standby address to the failed network card. At the same time, the ARP of other nodes on the network will be refreshed, thus achieving the reliability guarantee of the network card.
(3) If all Keep-Alive messages on both the TCP/IP network and the non-TCP/IP network are lost, the HA software determines that the node has failed and triggers resource takeover: the resources on the shared disk array are taken over by the backup node. IP address takeover occurs at the same time: the HA software moves the Service IP Address of the failed node to the backup node, so that clients on the network continue to use the same IP address. Application takeover follows in the same way: the application restarts automatically on the takeover node, so the system continues to provide service.
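The decision logic in the three steps above can be sketched as a counter of missed Keep-Alive messages per network. The threshold, the network labels and the returned action names are assumptions made for illustration; commercial HA software uses its own timers and event names.

```python
class NodeMonitor:
    """Tracks missed Keep-Alive messages from one peer node on two networks."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = {"tcpip": 0, "private": 0}

    def record(self, network, received):
        """Reset the counter on a received message, otherwise count a miss."""
        self.missed[network] = 0 if received else self.missed[network] + 1

    def network_down(self, network):
        return self.missed[network] >= self.max_missed

    def decide(self):
        tcp_down = self.network_down("tcpip")
        private_down = self.network_down("private")
        if tcp_down and private_down:
            # Nothing is heard on either path: treat the node as failed and
            # take over its disks, service IP address and application.
            return "node_failed_takeover"
        if tcp_down:
            # The heartbeat line still works, so the node is alive: swap the
            # service address to the standby adapter instead.
            return "swap_adapter"
        if private_down:
            return "check_heartbeat_link"
        return "ok"


if __name__ == "__main__":
    monitor = NodeMonitor(max_missed=2)
    for _ in range(2):
        monitor.record("tcpip", False)
        monitor.record("private", True)
    print(monitor.decide())
    for _ in range(2):
        monitor.record("private", False)
    print(monitor.decide())
```

Requiring both networks to be silent before declaring a node dead is what lets the HA software tell an adapter failure apart from a node failure.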
3.5.2 Database Cluster Software
We take ORACLE's Real Application Cluster (RAC) software as an example to introduce the main features of database cluster software.
(1) Shared disk
The main difference from the storage method of Single-Instance Oracle is that RAC storage must store all data files in RAC in a shared device so that instances accessing the same database can share. At the same time, in order to enable each instance to operate independently and to enable other instances to find related operation traces when the system is restored, the storage structure of RAC database and single-instance database also has the following differences:
●Each Instance has its own SGA (System Global Area).
●Each Instance has its own Background Process.
●Each Instance has its own Redo Logs.
●Each Instance has its own Undo tablespace.
RAC also cannot use traditional file systems because traditional file systems do not support parallel mounting of multiple systems. Files must be stored in raw devices without any file system or in file systems that support concurrent access by multiple systems.
RAC operations require synchronization of access to shared resources in all instances. RAC uses the Global Resource Directory to record resource usage information in the Cluster Database, and the Global Cache Service (GCS) and Global Enqueue Service (GES) manage information in the GRD. After each instance performs a read or write operation, GCS or GES must synchronize it to the buffer of other instances according to a strict process.
(2) Cache Fusion
In a RAC environment, the memory structure and background processes of each instance are the same, and they look like a single system. Each instance has a buffer in the SGA, and using Cache Fusion technology, each instance uses the cache of the cluster instance to process the database as if it were a single cache. Cache Fusion technology can minimize disk I/O and optimize data reading and writing. There will be considerable network communication and CPU overhead between nodes, so the performance of a dual-node RAC will not be twice that of a single-node.
(3) Transparent application switching
When a node in the RAC cluster fails, all transactions held in memory on the failed node are lost, and Oracle transfers control of the data blocks owned by the failed node to the surviving nodes. This process is called global cache service reset. While the reset is in progress, all servers in the RAC are frozen, all applications are suspended, and GCS does not respond to requests from any node in the cluster. After the reset, Oracle reads the log records, identifies and locks the pages that need to be recovered, and performs a rollback, after which the database is available again.
3.6 System recovery plan for application software
Even if we have taken all the previous measures, we still need to consider what to do in the event that the previous solution fails, that is, an error occurs in the underlying software or hardware of the system, causing the system to restart.
Before the system is restarted, there are several processes running in the system, each of which is in a different state. The recovery plan of the application software is to ensure that these states can be restored and automatically run to the end state after the system is restarted. To this end, during the operation of the system, the status of all messages and processes needs to be saved in the database when modified, rather than just saved in the memory. When the system is recovered, it is necessary to check all messages and processes in the database that have not reached the final state and perform subsequent processing.
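A minimal sketch of this approach follows: every state change is written to the database as it happens, and after a restart the system scans for records that have not reached their final state. The example uses SQLite from the Python standard library purely for illustration; the table name, column names and the "Complete" final state label are assumptions, not the CSMS schema.

```python
import sqlite3

FINAL_STATE = "Complete"


def open_store(path=":memory:"):
    """Open the store and create the (hypothetical) message table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS np_message ("
        "id TEXT PRIMARY KEY, direction TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def save_state(conn, msg_id, direction, state):
    """Persist a state change in its own transaction before acting on it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO np_message (id, direction, state) "
            "VALUES (?, ?, ?)",
            (msg_id, direction, state),
        )


def unfinished(conn):
    """Return every message that had not reached its final state."""
    rows = conn.execute(
        "SELECT id, direction, state FROM np_message "
        "WHERE state <> ? ORDER BY id",
        (FINAL_STATE,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()
    save_state(conn, "m1", "out", "Sending")
    save_state(conn, "m2", "out", "Complete")
    save_state(conn, "m3", "in", "Processing")
    # After a restart, only m1 and m3 need further processing.
    print(unfinished(conn))
```

A real deployment would use a file-backed or server database rather than ":memory:", since an in-memory store does not survive the restart it is meant to protect against.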
After a system recovery, the CSMS proceeds as follows:
(1) Restore all messages: Restore messages sent by CSMS and restore messages received by CSMS.
(2) Resume the application process.
(3) Resume the deregistration process.
(4) Resume shutdown-related processes.
(5) Resume the audit process.
(6) Check the effective broadcast for the day.
(7) Check the synchronization for the day.
(8) Check the synchronization for the current month.
The key to system recovery is to understand the different states of each process. For example, in message recovery, for NP messages sent from CSMS, the states include:
●Init (initial).
●Sending: The message has been sent to SOA/LSMS and is waiting for ACK.
●Wait Send: The ACK timed out and the message is waiting to be resent.
●Sent (sent successfully): ACK message received.
●Complete: A reply (response/confirmation) to the NP message (request/indication) has been received and the corresponding ACK has been sent successfully.
For NP messages received by CSMS, the status includes:
●Init (initial).
●Processing: Indicates that the system is processing the NP message, which mainly includes saving the NP message into the system and selecting the processing method according to the type of the NP message.
●Processed: Indicates that the system has completed processing the NP message.
●Replying (sending reply message): The system has sent the organized NP reply message to SOA/LSMS, but no ACK has been received for the message.
●Wait Reply: ACK times out and waits for retransmission.
●Complete: The system receives the ACK information of the message.
For other system recovery processes, the methods are similar and will not be repeated here.
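The states of NP messages sent from CSMS can be written down as a small state machine, which makes the recovery decision a table lookup. The transition table is read off the state descriptions above; the recovery actions are assumptions made for illustration, since the article does not specify what CSMS does for each state after a restart.

```python
# Allowed transitions for an NP message sent from CSMS.
OUTBOUND_TRANSITIONS = {
    "Init": {"Sending"},
    "Sending": {"Sent", "Wait Send"},  # ACK received, or ACK timed out
    "Wait Send": {"Sending"},          # resend after the timeout
    "Sent": {"Complete"},              # reply received and its ACK sent
    "Complete": set(),                 # final state
}

# Hypothetical recovery action for each state found after a restart.
RECOVERY_ACTION = {
    "Init": "send",
    "Sending": "resend",
    "Wait Send": "resend",
    "Sent": "await_reply",
    "Complete": "none",
}


def advance(state, new_state):
    """Move to new_state, rejecting transitions the table does not allow."""
    if new_state not in OUTBOUND_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state


def recovery_plan(messages):
    """Map each unfinished message id to the action to take on recovery."""
    return {
        msg_id: RECOVERY_ACTION[state]
        for msg_id, state in messages.items()
        if state != "Complete"
    }


if __name__ == "__main__":
    state = advance("Init", "Sending")
    state = advance(state, "Wait Send")
    print(state)
    print(recovery_plan({"m1": "Sending", "m2": "Complete", "m3": "Sent"}))
```

The same pattern applies to received messages, with their own transition table covering Processing, Processed, Replying and Wait Reply.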
3.7 Disk Array RAID and Tape Library Backup Solutions
The final consideration for high system reliability is the storage device. With current technology, an effective storage solution can not only ensure the security and reliability of stored data, but also increase the speed of hard disk reading and writing. The commonly used technology is RAID.
RAID technology can be divided into RAID0, RAID1, RAID5, etc. according to the level. Different levels of RAID have different storage efficiency, and the time it takes to recover when a hard disk fails is also different. For specific technologies, please refer to relevant technical documents.
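As a rough illustration of how the levels differ in storage efficiency, the sketch below computes usable capacity and the number of disk failures each level survives. It assumes identical disks and treats RAID1 as a single mirror set across all disks; the function name and the returned keys are illustrative.

```python
def raid_summary(level, disks, disk_size_gb):
    """Usable capacity and tolerated disk failures for RAID 0, 1 and 5."""
    if level == 0:
        if disks < 2:
            raise ValueError("RAID0 needs at least 2 disks")
        # Striping only: full capacity, no redundancy.
        return {"usable_gb": disks * disk_size_gb, "tolerated_failures": 0}
    if level == 1:
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        # Mirroring: every disk holds the same data.
        return {"usable_gb": disk_size_gb, "tolerated_failures": disks - 1}
    if level == 5:
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        # Striping with distributed parity: one disk's worth goes to parity.
        return {"usable_gb": (disks - 1) * disk_size_gb, "tolerated_failures": 1}
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    for level in (0, 1, 5):
        print(level, raid_summary(level, disks=4, disk_size_gb=1000))
```

With four 1000 GB disks, RAID0 offers 4000 GB with no protection, RAID1 offers 1000 GB and survives three failures, and RAID5 offers 3000 GB and survives one.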
In order to further enhance the protection of data storage, the system generally has other media backup solutions, such as tape library backup. The data of the disk array is backed up to the tape library according to certain rules, which can increase the capacity of the storage device and add another layer of data protection.
4 Conclusion
As one of the important performance indicators of the number portability centralized management system, high availability is of great significance. Because high availability needs to take into account all aspects of the system, it is relatively complex. Especially today when various new IT technologies emerge in an endless stream, studying various high availability technologies and selecting appropriate high availability technology solutions should be the focus of research for system architecture designers and related technical researchers. This article is only a starting point for discussion, and it briefly analyzes and summarizes various high availability technologies of the number portability centralized management system. I believe that these high availability technologies have certain reference significance for the design of similar systems.