Application of high availability technology in number portability centralized management system

Publisher: shiwanyongbing | Last updated: 2011-04-16
1 Introduction

In the number portability system, the CSMS (centralized management system) collects the basic data and porting rules of all operators' number portability users nationwide, acting as both service provider and arbitrator for number portability, so it occupies a very important position. Together with the operators' number portability production systems, the CSMS provides users with porting application services in real time, and as a carrier-grade system it must deliver 99.99% service availability on a 24×7, year-round basis. The high availability design of the CSMS is therefore critical.

High availability generally means improving the availability of systems and applications by minimizing downtime. One way to improve availability is to raise the reliability of every component of a single computer, but this alone is insufficient: even a highly reliable single server remains a potential single point of failure. The more mature approach in the industry is therefore clustering, which adds redundant devices so that when one device stops serving because of a fault, the redundant devices continue to provide service. In this article, high availability also implies "fast recovery": once the system stops and restarts, business applications should be restored as quickly as possible.

This article mainly introduces the cluster technology that can be used at various levels in CSMS to achieve overall high availability of the system.

2 Application scope of system high availability technology

In the number portability system, the path from the operator interface to the CSMS core data layer comprises the following functional layers, and the high availability solution is developed around them.

(1) Network layer: the part that connects to the operators. The main considerations are how to avoid single points of failure in the transmission lines and in the network equipment.

(2) Web server layer: how do we ensure the Web server is not a single point of failure, and if multiple Web servers are deployed, how are resources coordinated among them?

(3) Application server layer: after the Web server submits a request to the application server, how do we avoid a single point of failure on the application server and coordinate resources among multiple application servers?

(4) Database server layer: when an application server submits a request to the database server, how do we avoid a single point of failure on the database server and coordinate resources among multiple database servers?

(5) Application software: even with all these measures, server hardware can still fail. How should the application software be designed so that the system recovers quickly after a restart?

(6) Data layer: how do we ensure that data storage is secure and reliable?

In order to answer the above questions, we need to study and summarize various high availability technologies.

3 Research on High Availability Technology

3.1 CSMS System Architecture

Figure 1 shows the CSMS system organizational structure.


Figure 1 CSMS system organizational structure

In order to ensure high availability of the system and prevent single point failures, each functional layer of the system adopts redundant configuration on the hardware equipment, and various software solutions are designed to achieve high availability of the system.

3.2 Network Solution

In terms of the network solution, the dedicated lines between the system and each operator are accessed over 155 Mbit/s POS or MSTP dual optical cables, using the redundancy and self-healing capability of the transmission network to ensure high availability of the physical access lines. Each operator's two optical cables are connected to the system's two access routers to avoid, as far as possible, a single point of failure in the router equipment. Each router is configured with multiple interface cards carrying the dedicated lines of several operators, so that a single card failure does not affect the access of too many operators.

In the router-to-operator design, a dynamic routing protocol is required: when the default route from a router to an operator fails (for example, a line or board failure), the alternative route is advertised to all related devices and new connections are established over the new route. In the router-to-firewall design, the VRRP protocol is used for dynamic IP address binding: the two routers facing the firewall share a virtual IP address, which is bound by default to the real address of one router. When a switchover is required, the virtual address is re-bound to the real address of the other router, so the switchover completes without any change on the firewall side.
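The virtual-address binding described above can be modeled in a few lines. This is a toy sketch of the VRRP idea only, not real router configuration; the router names and the address are illustrative.

```python
class VrrpPair:
    """Toy model of VRRP failover: one virtual IP shared by two real routers.

    Router names and the address are illustrative, not from the article.
    """
    def __init__(self, virtual_ip, master, backup):
        self.virtual_ip = virtual_ip
        self.alive = {master: True, backup: True}  # router name -> alive
        self.master, self.backup = master, backup

    def owner(self):
        # The virtual IP stays bound to the master while it is alive and
        # moves to the backup otherwise; the firewall only ever sees the
        # unchanged virtual address.
        return self.master if self.alive[self.master] else self.backup

    def fail(self, name):
        self.alive[name] = False

pair = VrrpPair("10.0.0.1", "routerA", "routerB")
assert pair.owner() == "routerA"
pair.fail("routerA")
assert pair.owner() == "routerB"   # switchover, virtual IP unchanged
```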

3.3 Web Server Load Balancer Solution

After the request from the client passes through the network device, it will first reach the Web server. From the perspective of high availability design of the system, the system will deploy multiple Web servers for clustering. Clustering between Web servers includes two aspects: Web load balancing and session failover.

Load balancing can be done using a variety of technologies, such as using a hardware load balancer, or deploying load balancing software on a web server, with the web server acting as a load balancer. The main features of a load balancer include:

(1) Single point access

From the client's perspective, multiple Web servers have only one address, which is the service address of the load balancer. This has two advantages: first, the client does not need to configure multiple Web server addresses, which is more convenient; second, the address information of specific devices in the network can be shielded from the client network, which has a certain effect on network protection.

(2) Implementing a load balancing algorithm

When a client request comes in, the load balancer can decide which backend Web server to forward the request to for processing. The mainstream algorithms include: round-robin algorithm, random algorithm and weighted algorithm. Regardless of the algorithm, the load balancer always tries to make each server instance share the same pressure.
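The weighted algorithm can be sketched by expanding each backend according to its weight and cycling through the result. This is a minimal illustration of the idea, not a production balancer; the backend names and weights are assumed for the example.

```python
import itertools

def weighted_round_robin(servers):
    """Yield backend names in proportion to their integer weights.

    `servers` maps a (hypothetical) backend name to its weight.
    """
    expanded = [name for name, w in servers.items() for _ in range(w)]
    return itertools.cycle(expanded)

# web1 is twice as powerful as web2, so it gets 3 of every 4 requests
rr = weighted_round_robin({"web1": 3, "web2": 1})
picks = [next(rr) for _ in range(8)]
assert picks.count("web1") == 6 and picks.count("web2") == 2
```

Setting all weights equal reduces this to the plain round-robin algorithm.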

(3) Health Check

Once a web server stops working, the load balancer can detect it and stop forwarding requests to this server. Similarly, when the failed server starts working again, the load balancer can also detect it and start forwarding requests to it.

(4) Session stickiness

All web applications have some session status, such as whether a process in the number portability system has ended, whether a request message has received the corresponding ACK information or response information, etc. Because the HTTP protocol itself is stateless, the session status needs to be recorded somewhere and associated with the client so that it can be easily retrieved the next time a request is made. When performing load balancing, for a certain session, forwarding the request to the server instance it requested last time is a good choice, otherwise, the application may not work properly.

Because the session state is generally stored in the memory of a Web server instance, the "session stickiness" feature is very important for the load balancer. However, if a Web server fails for some reason, the session state on this server will be lost. The load balancer can detect this error and no longer forward requests to this server, but other errors may occur due to the loss of session state. Therefore, the load balancer must also have another important function "session failover".

(5) Session failover

The implementation mechanism of session failover is that after a Web server receives a request from a client, it backs up the session object to a certain place to ensure that the session state is not lost when the server fails.

There are different solutions for backing up session data. The more mainstream solutions include database solutions and memory replication solutions.

The database solution is to let the Web server store session data in the database at the appropriate time. When a failover occurs, another available Web server instance takes over the failed server and restores the session state from the database. The advantages of the database solution are:

●Easy to implement. Separating request processing from session backup makes the cluster more robust and easier to manage.

●Even if the entire cluster fails, session data can still be saved and can be used when the system restarts.

The disadvantage of the database solution is that it consumes more resources, and its performance is limited when the amount of session data is large.

The memory replication solution saves session information in the memory of the backup server instead of persisting it in the database. Compared with the database solution, this solution has higher performance, and the network communication overhead between the original server and the backup server is very small. This solution saves the stage of "restoring" session data because the session information is already in the memory of the backup server.
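The memory replication idea can be sketched as two in-memory stores where every write on the primary is synchronously copied to the backup, so no restore step is needed after a crash. The class and session fields are illustrative assumptions, not the CSMS implementation.

```python
class SessionStore:
    """Minimal sketch of memory-replication session failover.

    Each write on the primary is copied to a backup instance, so the
    backup already holds the session state if the primary fails.
    """
    def __init__(self, backup=None):
        self.sessions = {}
        self.backup = backup

    def put(self, session_id, state):
        self.sessions[session_id] = state
        if self.backup is not None:
            # replicate over the (simulated) network to the backup
            self.backup.sessions[session_id] = state

    def get(self, session_id):
        return self.sessions.get(session_id)

backup = SessionStore()
primary = SessionStore(backup=backup)
primary.put("sess-42", {"ack_received": False})

# primary crashes; the load balancer fails over to the backup, which
# serves the session from memory with no "restore" stage
assert backup.get("sess-42") == {"ack_received": False}
```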

3.4 Application Server Based on J2EE Solution

Before introducing the cluster solution of application servers, it is necessary to introduce J2EE, because J2EE has become a de facto standard for distributed enterprise-level application development and deployment. The cluster solution of application servers is actually implemented based on certain J2EE standards.

In J2EE, business logic is encapsulated in reusable components that run in component containers on distributed servers; the containers communicate through relevant protocols to implement calls between components. Thus, what we observe on the network as communication between clients or Web servers and application servers is, in the J2EE implementation, a call between components or from a component to a container service. The J2EE specification divides such a call into two stages: first, accessing the JNDI server to obtain the proxy (EJB Stub) of the EJB component to be called; second, calling the EJB component itself.

The cluster solutions for JNDI access are divided into shared global JNDI tree solution, independent JNDI solution and central JNDI solution with high availability. Each solution can achieve high availability of JNDI service.

In the EJB component call phase, the client actually invokes only a local object called a "Stub". This local Stub has the same interface as the remote EJB and acts as a proxy; it knows how to find the real object on the network via the RMI/IIOP protocol. There are three main clustering approaches in the EJB Stub call process:

●Smart Stub: Special behaviors are added to the Stub code, but these codes are transparent to the client (the client program knows nothing about these codes). These codes contain a list of accessible target servers and can detect the failure of the target server. They also contain very complex load balancing and failure transfer logic to distribute requests.

●IIOP runtime: The logic of load balancing and failover is integrated into the IIOP runtime, which makes the Stub very small and does not involve other codes.

●LSD (Location Service Daemon): the LSD acts as a proxy for EJB clients. In this solution, the EJB client obtains a Stub by searching JNDI, but the routing information in this Stub points to the LSD rather than to the application server that actually hosts the EJB. After the LSD receives the client's request, it distributes it to different application server instances according to its load balancing and failover logic.
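The "smart Stub" approach can be caricatured as a client-side proxy that holds a server list, balances across it, and fails over on errors. This is a rough sketch of the idea only; the class, the `invoke` method, and the use of plain callables as "servers" are illustrative assumptions, not a real RMI/IIOP API.

```python
import random

class SmartStub:
    """Sketch of a smart Stub: the client-side proxy holds a list of
    target servers plus load-balancing and failover logic, all of it
    transparent to the client program."""
    def __init__(self, servers):
        self.servers = list(servers)   # candidate application servers
        self.down = set()              # servers detected as failed

    def invoke(self, request):
        candidates = [s for s in self.servers if s not in self.down]
        random.shuffle(candidates)     # simple random balancing policy
        for server in candidates:
            try:
                return server(request)  # attempt the "remote" call
            except ConnectionError:
                self.down.add(server)   # mark failed and try the next
        raise RuntimeError("all servers unavailable")

def healthy(req):
    return "handled:" + req

def failed(req):
    raise ConnectionError

stub = SmartStub([failed, healthy])
# the failed server, whenever tried, is skipped transparently
assert stub.invoke("lookup") == "handled:lookup"
```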

3.5 Database Server Solution

There are two general methods for clustering database servers: one is based on the cluster software provided by the operating system, such as various HA software; the other is the cluster software provided by the database software itself.

3.5.1 HA Software

The working process of HA software is roughly as follows:

(1) In an HA network environment, the network is divided into a TCP/IP network and a non-TCP/IP network. The TCP/IP network is the public network for communication between application clients and servers. The non-TCP/IP network is the private network of the HA software. The simplest one can be a "Heart-Beat" line. HA technology uses the private network to monitor each node in the HA environment instead of the TCP/IP communication path.

(2) On an HA network, each node continuously sends and receives Keep-Alive (KA) messages over both the TCP/IP and non-TCP/IP networks. Once a certain number of consecutive KA packets from a node are lost on a path, a failure on that path can be confirmed. When a node's active network card (Service Adapter) fails, the node's HA agent switches network cards: the IP address of the original Service Adapter is moved to the Standby Adapter, the standby address is moved to the failed card, and the ARP caches of other nodes on the network are refreshed. This guarantees network card reliability.

(3) If all KA messages on both the TCP/IP and non-TCP/IP networks are lost, the HA software determines that the node has failed and initiates resource takeover: the resources on the shared disk array are taken over by the backup node. At the same time, IP address takeover occurs, with the HA software moving the Service IP Address of the failed node to the backup node so that clients on the network continue to use the same IP address; and application takeover occurs, with the application automatically restarting on the takeover node, so the system can continue to provide service.
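The detection rule in steps (2) and (3) can be sketched as a monitor that declares a peer failed only when Keep-Alive packets are lost on every path. The loss threshold and network names are illustrative assumptions, not values from any HA product.

```python
class HaMonitor:
    """Sketch of Keep-Alive monitoring: a peer node is declared failed
    only after KA packets are lost on ALL networks, public and private.

    The threshold of 3 and the network names are assumptions."""
    THRESHOLD = 3  # consecutive losses before declaring a path dead

    def __init__(self, networks=("tcpip", "heartbeat")):
        self.missed = {net: 0 for net in networks}

    def keepalive(self, net, received):
        self.missed[net] = 0 if received else self.missed[net] + 1

    def peer_failed(self):
        # Takeover is triggered only when every path is dead; losing a
        # single path is handled by the adapter swap of step (2).
        return all(m >= self.THRESHOLD for m in self.missed.values())

mon = HaMonitor()
for _ in range(3):
    mon.keepalive("tcpip", received=False)
assert not mon.peer_failed()   # private heartbeat line still alive
for _ in range(3):
    mon.keepalive("heartbeat", received=False)
assert mon.peer_failed()       # now take over disk, IP and application
```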

3.5.2 Database Cluster Software

We take ORACLE's Real Application Cluster (RAC) software as an example to introduce the main features of database cluster software.

(1) Shared disk

The main difference from single-instance Oracle storage is that RAC must keep all data files on a shared device, so that every instance accessing the same database can share them. At the same time, so that each instance can operate independently and other instances can find the relevant operation traces during recovery, the storage structure of a RAC database differs from that of a single-instance database as follows:

●Each instance has its own SGA (System Global Area).

●Each instance has its own background processes.

●Each instance has its own redo logs.

●Each instance has its own undo tablespace.

RAC also cannot use traditional file systems because traditional file systems do not support parallel mounting of multiple systems. Files must be stored in raw devices without any file system or in file systems that support concurrent access by multiple systems.

RAC operation requires synchronized access to shared resources across all instances. RAC uses the Global Resource Directory (GRD) to record resource usage information in the cluster database, and the Global Cache Service (GCS) and Global Enqueue Service (GES) manage the information in the GRD. After each instance performs a read or write, GCS or GES must synchronize it to the buffers of the other instances according to a strict protocol.

(2) Cache Fusion

In a RAC environment, the memory structure and background processes of each instance are the same, and the cluster looks like a single system. Each instance has a buffer cache in its SGA, and with Cache Fusion technology the caches of all cluster instances behave as a single cache for the database. Cache Fusion minimizes disk I/O and optimizes data reads and writes, but the inter-node network communication and CPU overhead are considerable, so a two-node RAC will not perform twice as well as a single node.

(3) Transparent application switching

When a node in the RAC cluster fails, all transactions held in the failed node's memory are lost, and Oracle transfers control of the data blocks owned by the failed node to the surviving nodes; this process is called global cache service reset. While the reset is in progress, all servers in the RAC are frozen, all applications are suspended, and GCS does not respond to requests from any node in the cluster. After the reset, Oracle reads the log records, determines and locks the pages that need recovery, and performs rollback; the database then becomes available again.

3.6 System recovery plan for application software

Even if we have taken all the previous measures, we still need to consider what to do in the event that the previous solution fails, that is, an error occurs in the underlying software or hardware of the system, causing the system to restart.

Before the system is restarted, there are several processes running in the system, each of which is in a different state. The recovery plan of the application software is to ensure that these states can be restored and automatically run to the end state after the system is restarted. To this end, during the operation of the system, the status of all messages and processes needs to be saved in the database when modified, rather than just saved in the memory. When the system is recovered, it is necessary to check all messages and processes in the database that have not reached the final state and perform subsequent processing.
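The persist-and-rescan idea above can be sketched as a recovery pass that re-drives everything not yet in a final state. The record layout, state names, and handler mapping are illustrative assumptions, not the CSMS schema.

```python
# Sketch of the recovery pass: because every state change was persisted
# to the database, on restart the system re-drives anything that has
# not reached a final state. Names here are assumptions.
FINAL_STATES = {"Complete"}

def recover(records, handlers):
    """Re-dispatch every persisted message/process not in a final state;
    `handlers` maps a state name to the action that resumes it."""
    resumed = []
    for rec in records:
        if rec["state"] not in FINAL_STATES:
            handlers[rec["state"]](rec)   # e.g. resend, re-process
            resumed.append(rec["id"])
    return resumed

records = [
    {"id": 1, "state": "Complete"},   # finished: nothing to do
    {"id": 2, "state": "Sending"},    # was waiting for ACK: resend
    {"id": 3, "state": "Init"},       # never sent: send now
]
handlers = {"Sending": lambda r: None, "Init": lambda r: None}
assert recover(records, handlers) == [2, 3]
```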

The CSMS process after system recovery is as follows:

(1) Restore all messages: Restore messages sent by CSMS and restore messages received by CSMS.

(2) Resume the application process.

(3) Resume the deregistration process.

(4) Resume shutdown-related processes.

(5) Resume the audit process.

(6) Check the effective broadcast for the day.

(7) Check the synchronization for the day.

(8) Check the synchronization for the current month.

The key to system recovery is to understand the different states of each process. For example, in message recovery, for NP messages sent from CSMS, the states include:

●Init (initial).

●Sending: The message has been sent to SOA/LSMS and is waiting for ACK.

●Wait Send: ACK times out and resends.

●Sent (sent successfully): ACK message received.

●Complete: a reply (response/confirmation) to the NP message (request/indication) has been received, and the corresponding ACK has been sent successfully.

For NP messages received by CSMS, the status includes:

●Init (initial).

●Processing: Indicates that the system is processing the NP message, which mainly includes saving the NP message into the system and selecting the processing method according to the type of the NP message.

●Processed: Indicates that the system has completed processing the NP message.

●Replying (sending reply message): The system has sent the organized NP reply message to SOA/LSMS, but no ACK has been received for the message.

●Wait Reply: ACK times out and waits for retransmission.

●Complete: The system receives the ACK information of the message.
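The sent-message states listed above form a small state machine, which can be sketched as a transition table. The states mirror the list for messages sent by CSMS; the event names (`send`, `ack`, `timeout`, `reply_acked`) are assumptions for illustration.

```python
# Transition table for the sent-NP-message states described above.
# States come from the article; event names are illustrative.
TRANSITIONS = {
    ("Init", "send"):        "Sending",    # message handed to SOA/LSMS
    ("Sending", "ack"):      "Sent",       # ACK received
    ("Sending", "timeout"):  "Wait Send",  # ACK timed out, must resend
    ("Wait Send", "send"):   "Sending",    # retransmission
    ("Sent", "reply_acked"): "Complete",   # reply received, its ACK sent
}

def step(state, event):
    return TRANSITIONS[(state, event)]

# one ACK timeout, one retransmission, then normal completion
s = "Init"
for ev in ["send", "timeout", "send", "ack", "reply_acked"]:
    s = step(s, ev)
assert s == "Complete"
```

On recovery, any message found in a non-`Complete` state simply resumes from that state's pending event.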

For other system recovery processes, the methods are similar and will not be repeated here.

3.7 Disk Array RAID and Tape Library Backup Solutions

The final consideration for high system reliability is the storage device. With current technology, an effective storage solution can not only ensure the security and reliability of stored data, but also increase the speed of hard disk reading and writing. The commonly used technology is RAID.

RAID technology can be divided into RAID0, RAID1, RAID5, etc. according to the level. Different levels of RAID have different storage efficiency, and the time it takes to recover when a hard disk fails is also different. For specific technologies, please refer to relevant technical documents.
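The recovery property of RAID5 rests on XOR parity: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. A minimal pure-Python illustration of the arithmetic (not of a real RAID controller):

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as RAID5 parity does."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
parity = xor_blocks(data)            # block on the parity disk

# disk 1 fails; rebuild its block from the remaining data + parity,
# since A ^ C ^ (A ^ B ^ C) == B
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

RAID1 achieves the same protection by plain mirroring but at 50% capacity efficiency, while RAID0 stripes for speed with no redundancy at all.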

In order to further enhance the protection of data storage, the system generally has other media backup solutions, such as tape library backup. The data of the disk array is backed up to the tape library according to certain rules, which can increase the capacity of the storage device and add another layer of data protection.

4 Conclusion

As one of the important performance indicators of the number portability centralized management system, high availability is of great significance. Because high availability needs to take into account all aspects of the system, it is relatively complex. Especially today when various new IT technologies emerge in an endless stream, studying various high availability technologies and selecting appropriate high availability technology solutions should be the focus of research for system architecture designers and related technical researchers. This article is only a starting point for discussion, and it briefly analyzes and summarizes various high availability technologies of the number portability centralized management system. I believe that these high availability technologies have certain reference significance for the design of similar systems.
