Reading notes on Chapters 4-6 of "Large Language Model: Principles and Engineering Practice": pre-training, supervised fine-tuning, and RLHF

 

As a data mining practitioner and AI enthusiast, reading Chapters 4-6 of "Large Language Model: Principles and Engineering Practice" gave me a deeper understanding of the core technologies of large language models. These three chapters cover pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF), which together form the complete training chain of a large language model, from "capability acquisition" to "task adaptation" to "value alignment".

  1. Pre-training: Building the basic capabilities of large language models

Pre-training is a critical stage for large language models to acquire basic language capabilities. Chapter 4 introduces the core elements of pre-training in detail, including model architecture, training objectives, and data allocation strategies.

1.1 Model Architecture

Large language models generally use the Transformer architecture, but each has its own characteristics in implementation. The GPT series, for example, uses a decoder-only structure, which makes it perform well on generation tasks. In contrast, models such as T5 use an encoder-decoder structure, which may be more advantageous for certain tasks.

It is worth noting that as the model size increases, some seemingly minor architectural adjustments may have a significant impact. For example, GPT-3 introduces alternating dense and sparse self-attention layers, a design that greatly reduces computational complexity while maintaining the expressiveness of the model.

1.2 Training Objectives

The pre-training objective of a large language model is usually to maximize the likelihood of the sequence. Specifically, for an input sequence x = (x_1, x_2, ..., x_T), the model minimizes the following loss function:

$$
L(\theta) = -\sum_{t=1}^T \log p_\theta(x_t|x_{<t})
$$

Here, θ denotes the model parameters, and p_θ(x_t | x_{<t}) is the probability the model assigns to the next token given all preceding tokens.

This seemingly simple objective function actually contains profound linguistic insights. By predicting the next token, the model is forced to learn the grammatical rules, semantic relations, and contextual dependencies of the language, thereby gaining powerful language understanding and generation capabilities.
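
To make this objective concrete, here is a minimal PyTorch sketch (my own illustration, not code from the book) that computes the next-token loss over a batch of token sequences:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids):
    """Negative log-likelihood of each token given all previous tokens.

    logits:    (batch, seq_len, vocab_size) raw scores from the model
    input_ids: (batch, seq_len) token ids of the same sequence
    """
    # Predict token t+1 from positions <= t: drop the last logit, shift labels right.
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shift_labels = input_ids[:, 1:]    # targets are the next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy usage with random logits standing in for a real model's output.
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)
input_ids = torch.randint(0, vocab, (batch, seq_len))
print(next_token_loss(logits, input_ids))
```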

1.3 Data Allocation Strategy

The quality and diversity of pre-training data directly affect the performance of the model.

Here it is worth mentioning scaling laws, which study how various factors affect the loss of large models. The commonly cited form is a power law:

$$
L(x) = \left(\frac{x_c}{x}\right)^{\alpha_x}
$$

Here x can refer to the model size, the amount of pre-training data, the number of training steps, or the amount of computation, while x_c and α_x are constants fitted empirically for each factor.
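
To illustrate what "predictable" means in practice, the sketch below fits such a power law to a handful of made-up (data size, loss) points and extrapolates; both the numbers and the irreducible-loss term L_inf are assumptions for demonstration, not values from the book:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, x_c, alpha, l_inf):
    # L(x) = (x_c / x)^alpha + l_inf: loss falls as a power of the scale factor x
    return (x_c / x) ** alpha + l_inf

# Hypothetical (pre-training tokens, validation loss) pairs, for illustration only.
tokens = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss = np.array([3.9, 3.5, 3.2, 3.0, 2.85])

params, _ = curve_fit(power_law, tokens, loss,
                      p0=[1e9, 0.1, 2.0],
                      bounds=([1e6, 0.01, 0.0], [1e12, 1.0, 5.0]))
x_c, alpha, l_inf = params
print(f"fitted alpha = {alpha:.3f}")
print(f"extrapolated loss at 1e11 tokens: {power_law(1e11, *params):.3f}")
```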

The proportions of the data mixture affect the model loss in a quantitatively predictable way.

To discover such data mixing laws, two challenges need to be addressed:

(i) Multivariate: for a data mixing law with K data domains, the mixing ratio has K−1 degrees of freedom, and accordingly there are K−1 variables in the objective function. The increase in variables complicates identifying the functional form.

(ii) Non-monotonicity: a strictly monotonic relationship between the loss and the proportion of any single domain would imply that an extreme, unbalanced mixture minimizes the loss, which contradicts practice. Therefore, unlike existing scaling laws, where the loss decreases monotonically with the scale of the relevant factor, a data mixing law must accommodate non-monotonic functional forms.

As the saying goes, "garbage in, garbage out." Based on my own practical experience, the three largest components of a pre-training mixture are web data (more than 60%), high-quality or vertical-domain data, and code data. Web data supplies broad knowledge and information and improves generalization; high-quality or vertical-domain data raises performance in specific fields and deepens the model's knowledge; code data helps the model understand and generate code and cultivates logical, structured thinking.
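
A common way to implement such a mixture is to sample each training example from the domains according to fixed weights. Below is a minimal sketch; the 60/25/15 split and the domain names are my own illustrative choices, not figures from the book:

```python
import random

# Illustrative mixing weights: web text, high-quality/vertical-domain text, code.
DOMAIN_WEIGHTS = {"web": 0.60, "curated": 0.25, "code": 0.15}

def sample_domain(weights=DOMAIN_WEIGHTS):
    """Pick the data domain for the next training example, proportional to its weight."""
    domains, probs = zip(*weights.items())
    return random.choices(domains, weights=probs, k=1)[0]

# Rough check that the empirical mix matches the target proportions.
counts = {d: 0 for d in DOMAIN_WEIGHTS}
for _ in range(100_000):
    counts[sample_domain()] += 1
print({d: round(c / 100_000, 3) for d, c in counts.items()})
```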

In fact, Section 3.4 of this book mentions that high-quality language data is expected to be exhausted around 2026, the so-called "data shortage". Facing this scarcity, innovative strategies become especially important. I looked further into several strategies aimed at broadening data sources and enhancing data diversity, thereby improving the effectiveness and breadth of model training.

(1) Data augmentation

Expanding the dataset through technical means has two core directions: data generation and data transformation. Data generation uses techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs) to synthesize new samples that follow the characteristics of the existing data while differing from it, expanding the data scale. Data transformation, on the other hand, increases diversity by applying varied operations to existing data, such as geometric transformations and color adjustments for images, or word-order changes and synonym substitution for text.
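
For text, the transformation side can be as simple as random synonym substitution plus a local word-order shuffle. The sketch below is my own toy example, with a hand-written synonym table standing in for a real thesaurus or paraphrasing model:

```python
import random

# Tiny hand-written synonym table; a real pipeline would use a thesaurus or a model.
SYNONYMS = {
    "big": ["large", "huge"],
    "fast": ["quick", "rapid"],
    "model": ["network", "system"],
}

def augment(sentence, swap_prob=0.3):
    """Randomly replace words with synonyms and optionally swap one adjacent word pair."""
    words = sentence.split()
    words = [
        random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < swap_prob else w
        for w in words
    ]
    if len(words) > 1 and random.random() < swap_prob:
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(augment("the big model is fast"))
```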

(2) Cross-domain data migration and sharing

Transfer learning moves the rich knowledge and core features learned from a data-rich source domain to a target task with limited data, achieving "knowledge transfer". In practice, the model is first pre-trained on a large general-purpose dataset and then fine-tuned on the small target dataset, so that the universal features extracted during pre-training effectively boost performance on the small dataset. In addition, data reuse plays an important role across tasks and domains: provided the tasks are related, collected data can be shared among several of them, for example text classification data assisting a sentiment analysis task in natural language processing.

(3) Multimodal data integration

Integrating multiple types of data modalities, including text, images, audio, video, etc., is a key way to improve the comprehensive analysis capabilities of the model. The complementarity and mutual verification of multimodal data essentially enrich the information level that the model can learn, which not only improves the generalization ability of the model, but also enhances its adaptability and robustness to complex environments.

(4) AI-driven synthetic data applications

Using artificial intelligence to synthesize data provides an innovative solution to data scarcity. High-quality synthetic data can effectively compensate for the lack of real data, especially for datasets that are difficult or expensive to collect. Autonomous driving is a typical example: training on synthetic data in a virtual environment greatly reduces reliance on expensive and risky real road tests.

  2. Supervised fine-tuning: the key to task adaptation

Chapter 5 discusses in detail supervised fine-tuning techniques, a key step in adapting large language models to specific tasks.

2.1 Fine-tuning method

(1) Full parameter fine-tuning: Although it has the best effect, it requires huge computing resources.

(2) Adapter fine-tuning: inserting small trainable modules between Transformer layers.

(3) Prefix fine-tuning: Add a trainable prefix vector before the input sequence.

(4) Prompt tuning: adapting to downstream tasks by optimizing continuous prompt vectors.

(5) Low-rank adaptation (LoRA): achieves parameter-efficient fine-tuning by adding a low-rank update to the original weight matrix W:

$$
W = W_0 + BA
$$

Here, W_0 is the frozen pre-trained weight, and B ∈ R^{d×r} and A ∈ R^{r×k} are low-rank matrices with r << min(d, k). This method greatly reduces the number of trainable parameters while maintaining strong model performance.
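
A minimal LoRA layer can be written directly from this formula. The sketch below is my own simplified PyTorch version, not the reference implementation: it freezes W_0 and trains only A and B.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x (W0 + BA)^T: frozen pretrained weight W0 plus a trainable low-rank update BA."""

    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        d, k = pretrained.weight.shape            # W0 ∈ R^{d×k}
        self.base = pretrained
        for p in self.base.parameters():          # freeze W0 (and its bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A ∈ R^{r×k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B ∈ R^{d×r}, zero init so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only A and B receive gradients; for a 768x768 layer with r=8 that is about 12k parameters.
layer = LoRALinear(nn.Linear(768, 768), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
```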

2.2 Fine-tuning strategy

Several advanced fine-tuning strategies are also discussed in the book:

(1) Hybrid fine-tuning: Fine-tuning on multiple tasks simultaneously helps improve the generalization ability of the model.

(2) In-context learning: exploit the model's in-context learning capability, guiding it to complete a task by providing a small number of examples in the prompt.

(3) Chain-of-thought reasoning: explicitly guide the model to reason step by step, improving its ability to solve complex problems (a prompt sketch appears below).

The core idea of these strategies is to make full use of the generalization and fast learning capabilities of large language models so that they can better adapt to various downstream tasks.
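
As a concrete illustration of the last two strategies, the snippet below builds a few-shot chain-of-thought prompt; the worked example and its reasoning steps are ones I invented to show the structure, not examples from the book:

```python
# A few-shot chain-of-thought prompt; the solved example teaches the model the
# expected format, and the trailing cue invites step-by-step reasoning.
COT_PROMPT = """\
Q: A shop sells pens at 3 yuan each. How much do 4 pens cost?
A: Each pen costs 3 yuan. 4 pens cost 4 * 3 = 12 yuan. The answer is 12.

Q: A train travels 60 km per hour. How far does it go in 2.5 hours?
A: Let's think step by step."""

print(COT_PROMPT)  # send this string to the model; it continues with its own reasoning
```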

  3. RLHF: A New Paradigm for Value Alignment

The RLHF technique introduced in Chapter 6 is the key to ensuring that the output of large language models conforms to human values. The core of RLHF is to build a reward model that can simulate human preferences and use this model to guide the optimization of the language model.

3.1 Reward Model

The reward model is trained by minimizing the negative log-likelihood of human preference comparisons:

$$
L_{RM} = -\frac{1}{N}\sum_{i=1}^N \log \frac{e^{r_\theta(x_i, y_i^w)}}{e^{r_\theta(x_i, y_i^w)} + e^{r_\theta(x_i, y_i^l)}}
$$

Here, r_θ(x, y) is the reward model's score for input x and output y, and y^w and y^l denote the response humans preferred (the "winner") and the one they did not (the "loser"), respectively.
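
In code, this pairwise objective reduces to a logistic loss on the score difference r_θ(x, y^w) − r_θ(x, y^l). Here is a minimal PyTorch sketch (my own illustration, with fixed scores standing in for a real reward model's outputs):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """-log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch.

    Equivalent to the softmax form above when each comparison has two candidates.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scores the reward model assigned to preferred / dispreferred responses.
chosen = torch.tensor([1.2, 0.4, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.5])
print(reward_model_loss(chosen, rejected))
```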

3.2 RLHF algorithm

RLHF uses the PPO (Proximal Policy Optimization) algorithm to optimize the language model. Its objective function is:

$$
L_{PPO}(\theta) = \mathbb{E}_{(x,y)\sim \pi_{\theta_{old}}}\left[\min\left( r_t(\theta) A_t,\; \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \right)\right]
$$

Here, r_t(θ) = π_θ(y_t | x, y_{<t}) / π_{θ_{old}}(y_t | x, y_{<t}) is the importance sampling ratio and A_t is the advantage function.

The subtlety of this objective is that it not only encourages the model to produce outputs with higher reward, but also limits the size of each policy update through the clip operation, thereby keeping training stable.
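
At the token level, the clipped surrogate can be written in a few lines. The sketch below is a simplified illustration of my own, not a full RLHF trainer; it assumes per-token log-probabilities and advantage estimates are already available:

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO surrogate loss (to be minimized, hence the leading minus sign).

    logprobs / old_logprobs: per-token log-probabilities under the current and
    the sampling policy; advantages: per-token advantage estimates A_t.
    """
    ratio = torch.exp(logprobs - old_logprobs)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with made-up values for a handful of tokens.
logprobs = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
old_logprobs = torch.tensor([-1.1, -0.7, -1.8])
advantages = torch.tensor([0.5, -0.2, 1.0])
loss = ppo_clip_loss(logprobs, old_logprobs, advantages)
loss.backward()
print(loss.item())
```

In full pipelines, a per-token KL penalty against the frozen reference policy is typically folded into the reward before the advantages A_t are computed, which connects to the KL penalty term mentioned below.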

Implementing RLHF faces many challenges, such as data, hardware, and methodological bottlenecks. Techniques that can alleviate these problems, such as contrastive losses and KL penalty terms, can be introduced into the training process. The application of these techniques reflects the close combination of theory and engineering in large language model training.

The training of a large language model is a complex process with multiple stages and multiple objectives. Pre-training gives the model basic language capabilities, fine-tuning adapts it to specific tasks, and RLHF ensures that its output is consistent with human values. These three stages are closely linked to build a powerful and controllable AI system.

We must also be aware that current technology is far from perfect. How to better utilize large-scale unlabeled data, how to improve the interpretability and controllability of models, and how to use more data while protecting privacy are all directions worthy of our in-depth exploration.

The development of large language models is not only a technological advancement, but also a continuous expansion of the boundaries of human cognition. As physicist Richard Feynman said, "If you think you understand quantum mechanics, you probably don't really understand it." Similarly, for large language models, our current understanding may only be the tip of the iceberg. But it is this unknown that inspires our passion for exploration. Every in-depth study and every practical attempt is a step towards a deeper understanding.

In this era of rapid development of AI, it is crucial to maintain a continuous learning attitude. Let us, with a thirst for knowledge and curiosity about technology, climb to the top in this technological revolution that is changing human society.

In the next 15 days, I will further carry out the third part of this reading plan - carefully reading chapters 7-8 of this book.

Latest reply

The training of a large language model is a complex process with multiple stages and multiple objectives. (Reply published on 2024-9-10 07:29)

 
 
 
