A trillion-parameter large model trained on a fully domestic ten-thousand-card ("Wanka") cluster, and a central state-owned enterprise was the first to pull it off!
Jin Lei from Aofei Temple
Quantum Bit | Public Account QbitAI
The first trillion-parameter model trained on a ten-thousand-card cluster has been unlocked, by a central state-owned enterprise.
Specifically, it was the China Telecom Artificial Intelligence Research Institute (TeleAI), led by Professor Li Xuelong, its CTO, Chief Scientist, and Dean, that blazed this trail for fully domestic AI.
It is understood that the ten-thousand-card cluster used for training comes from Tianyi Cloud's domestically produced computing power pool in Shanghai Lingang. Supported by Tianyi Cloud's self-developed "Xiran Integrated Intelligent Computing Service Platform" and China Telecom AI's self-developed "Xinghai AI Platform", it achieves stable training of trillion-parameter models, averaging only 1.5 training interruptions per week, a level of cluster training stability that is internationally leading.
Building on this, TeleAI has also open-sourced the Xingchen semantic large model TeleChat2-115B, a hundred-billion-parameter model trained with a domestic deep learning framework.
TeleChat is the first open-source semantic model series from a central state-owned enterprise. Based on TeleChat, TeleChat2-115B achieves further gains through optimization of training data volume, data quality and mixing ratios, model architecture, and other dimensions!
On the September C-Eval comprehensive leaderboard for open-access models, TeleChat2-115B scored 86.9 points and took first place in one stroke!
This is not the first time TeleAI has topped an authoritative leaderboard. As early as May this year, the logical reasoning ability of its TeleChat series ranked first among open-source large models on the OpenCompass leaderboard.
In terms of application, for long-form writing the Xingchen semantic large model follows an "outline first, then full text" pattern, which is closer to how users actually write.
It is understood that it also generates text paragraph by paragraph, which suits the drafting of very long articles.
Even for extremely long meetings, the Xingchen semantic large model can generate minutes in real time, with high quality in accuracy, completeness, hallucination control, logic, and formatting.
For large electronic reports, the Xingchen semantic model also supports report generation, report querying, report summarization, and stylized imitation of a given report.
It can easily handle millions of rows of data!
How do you forge ten thousand cards and a trillion parameters?
One thing must be made clear: training a trillion-parameter model on a ten-thousand-card cluster is no easy task, and the difficulty of doing it with fully domestic hardware is obvious.
The first difficulty is improving the performance and stability of the ten-thousand-card cluster.
To improve training performance, TeleAI adopts multi-dimensional hybrid parallelism. By configuring different parallel modes, it automatically mixes data parallelism, model parallelism, and pipeline parallelism, supporting efficient distributed training of trillion-parameter models on ten-thousand-card clusters.
The following key technologies are also used in this training to further improve training performance:
- Multi-copy parallelism: input data is split along the batch-size dimension, so one copy can compute while the other communicates at the bottom layer, eliminating waiting and improving model performance.
- Communication optimization: communication time is reduced through techniques such as communication fusion and communication-subgraph extraction and reuse.
- DryRun simulation: without actually running computation, the computation graph is analyzed on a small cluster to identify performance bottlenecks, such as operator fusion, GPU memory usage, and data-flow efficiency issues, yielding optimized configurations for the ten-thousand-card run in advance.
- Flexible recomputation configuration: combined with DryRun's GPU memory analysis, configurations such as compute recomputation, communication recomputation, and targeted recomputation are used to find the best balance between memory and compute, maximizing performance within a single card's memory limit.
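The recomputation trade-off above can be pictured with a toy planner: drop (and later recompute) the largest activations until the rest fit a single card's memory budget. This is purely illustrative; `choose_recompute`, its memory model, and the greedy rule are assumptions, and real frameworks plan recomputation from full computation-graph analysis, as the DryRun step describes.

```python
def choose_recompute(act_mems, budget):
    """Greedy sketch: discard (and later recompute) the largest
    activations first until the remainder fits the per-card budget.

    act_mems: activation memory per layer; budget: per-card memory limit.
    Returns the set of layer indices marked for recomputation.
    """
    order = sorted(range(len(act_mems)),
                   key=lambda i: act_mems[i], reverse=True)
    recompute, total = set(), sum(act_mems)
    for i in order:
        if total <= budget:
            break
        total -= act_mems[i]  # this layer's activations are freed
        recompute.add(i)
    return recompute
```

Each recomputed layer trades extra forward-pass compute for freed activation memory, which is exactly the balance point the text describes.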
In the end, the domestic ten-thousand-card cluster reached more than 93% of the performance of an equivalent GPU cluster.
In addition, to improve training stability, methods such as online checkpoint-resume for the training cluster, CCAE cluster monitoring with rapid isolation of faulty nodes, and multi-level storage optimization brought the cluster to 98% stable availability, a checkpoint-resume success rate above 90%, and a single resume time of about 15 minutes.
The second difficulty is training a large model with trillions of parameters.
In training ultra-large-parameter models, TeleAI explored the scaling law by training a large number of small models, analyzed each model's noise space, and constructed positive-incentive noise to strengthen noise management during training. As the core technique for training ultra-large-parameter models, positive-incentive noise helps researchers determine the optimal model structure, improving the model's overall capability and robustness.
To this end, TeleAI adopted a “four-step” strategy.
First, in terms of model building , multiple technologies are used for optimization.
First, for position encoding, the Rotary Embedding (RoPE) method is adopted. It extrapolates well to unseen positions and combines well with attention-acceleration techniques, greatly improving the model's training speed.
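As a rough illustration of what rotary position encoding does, here is a minimal pure-Python sketch (not TeleAI's implementation; `rope` and its arguments are made up for this example). Each pair of dimensions is rotated by an angle proportional to the token position, so relative offsets show up as phase differences in attention dot products.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of one head vector by an
    angle that grows with the token position and shrinks with the
    pair index, so relative offsets become phase differences."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because each step is a pure rotation, the vector norm is unchanged, which keeps activation scales stable.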
Second, at the activation function level, the SwiGLU activation function was used instead of the GELU activation function. During the experiment, TeleAI also confirmed that SwiGLU has a better model fitting effect than other activation functions.
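A minimal sketch of the SwiGLU feed-forward gate, using the generic formulation from the literature rather than TeleChat's actual FFN code; the toy column-major weight lists are an assumption for this example.

```python
import math

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(x, W, V):
    """SwiGLU gate: swish(x @ W) elementwise-times (x @ V).

    W and V are lists of columns of equal width; the gated product
    replaces the single GELU branch of a standard feed-forward block.
    """
    a = [sum(xi * wij for xi, wij in zip(x, col)) for col in W]
    b = [sum(xi * vij for xi, vij in zip(x, col)) for col in V]
    return [swish(ai) * bi for ai, bi in zip(a, b)]
```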
Third, in the layer normalization stage, Pre-Normalization based on RMSNorm is used. Experiments show this algorithm is more stable during training.
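A minimal sketch of RMSNorm-based pre-normalization (generic formulation, not TeleChat's code; `pre_norm_block` and its sublayer argument are illustrative):

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square, with no mean
    subtraction. Cheaper than LayerNorm and empirically more stable
    in deep pre-norm transformer stacks."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    g = gain or [1.0] * len(x)
    return [gi * v / rms for gi, v in zip(g, x)]

def pre_norm_block(x, sublayer):
    # Pre-normalization: normalize the input first, then add residual.
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x)))]
```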
Fourth, the word embedding layer is decoupled from the parameters of the lm_head output layer; experiments show this enhances training stability and convergence.
Fifth, GQA (grouped-query attention) is applied to the large-parameter model (TeleChat2-115B) to improve training and inference performance. GQA greatly reduces GPU memory usage during inference, significantly improving the model's extrapolation length and inference performance.
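A back-of-the-envelope sketch of why GQA shrinks inference memory: the KV cache scales with the number of key/value heads, so sharing KV heads across query-head groups cuts it proportionally. The shapes below are hypothetical, not TeleChat2-115B's real configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    # K and V caches: two tensors per layer, fp16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

# Hypothetical shapes, for illustration only:
mha = kv_cache_bytes(n_layers=96, n_kv_heads=64, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=96, n_kv_heads=8, head_dim=128, seq_len=8192)
```

With 8x fewer KV heads, the KV cache is exactly 8x smaller, which is where the longer extrapolation length and faster inference come from.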
In addition, for basic training data construction, TeleAI uses multi-level pilot models in engineering practice to run detailed continued-training and data-adjustment experiments, fully evaluating and verifying the effectiveness of its data cleaning and data mixing strategies.
First, in terms of data cleaning, language recognition, data deduplication, text format normalization, irrelevant content filtering, and low-quality content filtering are used to improve the quality of pre-training data.
At the same time, we built a multimodal structured document parsing tool to effectively extract formulas and table contents. Experiments have found that after data cleaning, the model training loss is lower, the learning speed is faster, and 43% of training time can be saved.
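One cleaning step, deduplication, can be sketched as exact dedup after text normalization (the normalization rule and hashing choice are illustrative assumptions; production pipelines typically add fuzzy or MinHash dedup on top):

```python
import hashlib
import re

def normalize(text):
    # Collapse whitespace and lowercase before hashing, so trivially
    # differing duplicates still collide.
    return re.sub(r"\s+", " ", text).strip().lower()

def dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(normalize(d).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept
```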
Second, regarding data mixing, an online domain sampling weight adjustment algorithm is used. During the pilot model training process, the sampling weights are dynamically updated according to the sample loss distribution of different data sets to obtain the most effective data mixing strategy.
In the early stages of model training, the ratio scheme will be continuously adjusted according to changes in evaluation indicators. Experiments have shown that increasing the proportion of Chinese data and increasing the proportion of mathematics and question bank data can help improve the model's text comprehension and test-taking capabilities.
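The online domain-weight adjustment can be pictured with a generic exponentiated-gradient update, where domains with higher recent loss receive more sampling weight. TeleAI's exact algorithm is not spelled out in the article, so `update_weights`, the learning rate, and the update rule are all assumptions for illustration.

```python
import math

def update_weights(weights, losses, lr=0.5):
    """One step of a loss-driven domain-mixing update (illustrative).

    Domains whose recent loss is high (still being learned) get more
    sampling weight; the result is renormalized to sum to 1.
    """
    scores = [math.log(w) + lr * l for w, l in zip(weights, losses)]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```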
Third, in terms of data synthesis, for tasks in specific fields such as mathematics and code, we sort out a fine-grained knowledge point system and construct complex instructions to allow large models to generate synthetic data with high knowledge density, such as test question parsing process, code function explanation, code calling relationship, etc.
The next step is special optimization for SFT (supervised fine-tuning).
For low-quality filtering, indicators such as model perplexity (PPL), instruction-following difficulty (IFD), and learnability are used to measure how difficult each sample is to answer, after which samples with poorly standardized text formats or incorrect answer annotations are automatically screened out.
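IFD-style filtering can be sketched as follows, assuming we already have per-token log-probabilities of the answer with and without the instruction prepended; the thresholds and function names are illustrative, not TeleAI's published values.

```python
import math

def perplexity(logprobs):
    # PPL = exp(-mean token log-likelihood)
    return math.exp(-sum(logprobs) / len(logprobs))

def ifd(answer_lp_with_instr, answer_lp_alone):
    """Instruction-Following Difficulty: ratio of conditioned to
    unconditioned answer perplexity. A value near (or above) 1 means
    the instruction barely helps the model predict the answer, which
    flags mislabeled or low-value SFT samples."""
    return perplexity(answer_lp_with_instr) / perplexity(answer_lp_alone)

def keep_sample(with_instr, alone, low=0.1, high=1.0):
    # Illustrative thresholds: drop trivially easy and inconsistent samples.
    score = ifd(with_instr, alone)
    return low < score < high
```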
For high-quality construction, SFT data is divided into three capability dimensions (logic, cognition, and understanding) and more than 20 subcategories. Using a pre-established standard evaluation set, the high-quality data with the greatest impact on each single-capability indicator is screened out.
At the same time, a two-stage labeling scheme for question-and-answer data based on the golden template is proposed. The best template for each type of question is summarized from the dimensions of standardization, novelty, logic, richness, and completeness, and then the best answer that meets the requirements is labeled according to the template.
In terms of effect selection, based on the model perplexity indicator, we can quickly evaluate the degree of fit of different versions of the model on a small-scale validation set, so as to select the version with better performance and reduce the computational cost.
Then there is preference alignment .
To ensure the comprehensiveness and balance of instruction data to the greatest extent possible, TeleAI classified and collected instruction data sets covering a total of 300 categories. At the same time, in order to obtain higher-quality instruction data, clustering and center selection algorithms were used to select representative instructions.
Subsequently, TeleAI classified the responses from the TeleChat series models at different training stages and with different parameter sizes into three different labels: high quality, medium quality, and low quality, based on multiple dimensions such as security, factuality, and fluency, to form pair-wise data for training the reward model.
The DPO algorithm is widely used because of its simple engineering implementation and ease of training. This strategy is also adopted in the TeleChat training phase. During the data construction phase, TeleAI uses the instruction data to perform 10 to 15 inference samples on the current Chat model and uses the reward model to score each reply.
TeleAI uses the West-of-N method to construct pair data: the model's highest-scored reply is taken as the chosen response and the lowest-scored as the rejected response, ensuring strong preference differences within each pair.
During the training phase, in addition to using the conventional DPO loss function, TeleAI also discovered through experiments that introducing NLL Loss (negative log-likelihood loss) for the chosen response can effectively stabilize the effect of DPO training and prevent the probability of the chosen response from decreasing.
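The West-of-N pairing and the DPO-plus-NLL objective described above can be sketched as follows. The coefficients, function names, and reward scorer are illustrative assumptions, not TeleChat's actual training code.

```python
import math

def build_pair(replies, reward_fn):
    """West-of-N pairing: score N sampled replies with a reward model
    and pair the best (chosen) with the worst (rejected)."""
    scored = sorted(replies, key=reward_fn)
    return scored[-1], scored[0]

def dpo_nll_loss(pi_c, pi_r, ref_c, ref_r, beta=0.1, nll_coef=0.2):
    """DPO loss plus an auxiliary NLL term on the chosen response.

    Inputs are summed log-probabilities of each response under the
    policy (pi_*) and the frozen reference model (ref_*). The extra
    -log pi(chosen) term keeps the chosen response's probability from
    drifting down during DPO training, as the article describes.
    """
    margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
    dpo = math.log(1.0 + math.exp(-margin))  # -log sigmoid(margin)
    nll = -pi_c
    return dpo + nll_coef * nll
```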
Finally, there is reducing the factual hallucination of the semantic large model using knowledge graphs.
Specifically, TeleAI introduces knowledge into question prompts based on graph structured information representation: it retrieves candidate entities based on the n-gram similarity with the query, then performs random walks based on this, calculates the relevance of the walk path with the user's original question, and selects the top path content to expand to the user's original question.
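The retrieval step can be sketched as random walks from a matched entity over an adjacency-list graph, then ranking walk paths with a caller-supplied relevance scorer; entity matching and the scorer itself are left abstract here, and the function shapes are assumptions for illustration.

```python
import random

def walk_paths(graph, start, steps=2, n_walks=5, seed=0):
    """Random walks from a seed entity; each path is a candidate piece
    of structured context to append to the user's prompt."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_walks):
        node, path = start, [start]
        for _ in range(steps):
            nbrs = graph.get(node, [])
            if not nbrs:
                break
            node = rng.choice(nbrs)
            path.append(node)
        paths.append(path)
    return paths

def top_paths(paths, relevance, k=1):
    # relevance: scores a path against the user's original question
    return sorted(paths, key=relevance, reverse=True)[:k]
```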
The above is the key process by which TeleAI "refined" ten thousand cards and a trillion parameters.
But there is one more question worth discussing:
Why can China Telecom Artificial Intelligence Research Institute do it?
In fact, TeleAI's layout on the big model was not achieved overnight, but has been polished for a long time.
First, there is the seriousness of its commitment.
In addition to the Xingchen AI big model, at the Digital Technology Ecosystem Conference held in November last year, TeleAI also released 12 industry big models and launched the "Xingchen MaaS Ecosystem Service Platform" to achieve customized services.
All of this is based on China Telecom’s decade-long AI capability building.
Second, there is the underpinning of talent, without which industry leadership has no support.
In order to build the Xingchen AI model, China Telecom quickly formed a research and development team of nearly 800 people. The team members come from top domestic and foreign universities, such as Tsinghua University, Peking University, Stanford University and Columbia University, with an average age of 31.79 years old.
This group of outstanding talents will help China Telecom replace external algorithm capabilities in its internal and external businesses and achieve independent control of core algorithm capabilities.
While absorbing a wide range of basic talents, China Telecom also has a group of industry experts, among whom is Li Xuelong, who joined China Telecom Group full-time as CTO and chief scientist at the end of last year.
As a "Grand Slam" Fellow in the AI field, Li Xuelong innovatively proposed that noise analysis is a core key to solving a series of AI problems such as large models. He brought this idea into the ten-thousand-card, trillion-parameter project, and will also lead the China Telecom Artificial Intelligence Research Institute in continued basic and frontier research.
From its founding, TeleAI has focused on the two major elements of "people" and "industry".
It is understood that TeleAI has now introduced many professors from top overseas universities, CTOs or scientists from well-known domestic companies, young talents from scientific research institutions, and talented students with high-impact open source results.
And it is not limited to AI and large models: China Telecom has invested in many technologies and gained advantages over its peers. This is exactly what the "industry" element reflects.
For example, in quantum communication, China Telecom recently launched the "Tianyan" quantum computing cloud platform with "quantum superiority" capabilities. It has also previously opened the country's largest, most user-friendly and most comprehensive quantum secure communication metropolitan area network, and has taken the lead in formulating 5 of the 7 quantum communication industry standards (including group standards) that were first initiated by central enterprises.
For example, in the new generation of information and communication technology, China Telecom has achieved full commercialization of "mobile phone direct connection to satellite" and released the world's first operational-level product that supports consumer-grade 5G terminals directly connected to satellites for two-way voice and text messaging.
From this we can see that China Telecom is no longer a traditional operator in everyone's eyes, and its investment in cutting-edge technologies is much deeper than we realize.
It is then not hard to understand why TeleAI could be the first to achieve ten thousand cards and a trillion parameters.
-over-