In an earlier post we introduced the convolutional neural network, one of the most fundamental models in machine learning, and ran a handwriting-recognition example with it. This time we turn to natural language processing.
Natural language processing (NLP) refers to the study of the interaction between computers and humans using natural language. In practice, it is very common to use natural language processing techniques to process and analyze text data, such as semantic recognition and machine translation.
Human language is an abstract information symbol that contains rich semantic information, and humans can easily understand its meaning. However, computers can only process numerical information and cannot directly understand human language, so human language needs to be converted into numerical form.
NLP enables computers and digital devices to recognize, understand, and generate text and speech by combining computational linguistics (rule-based modeling of human language) with statistical modeling, machine learning (ML), and deep learning.
NLP research has ushered in the era of generative AI, from the communication skills of large language models (LLMs) to the ability of image-generation models to understand requests. NLP is already part of many people's daily lives: it powers search engines, customer-service chatbots, voice-operated GPS systems, and the digital assistants on smartphones.
The figure shows a typical natural language processing architecture: the text first goes through pre-training, is then processed by a deep neural network, and finally supports the intended application (such as sentiment analysis or natural language inference).
For pre-training, the first step is to process the natural language and convert it into vectors that a machine learning model can work with.
Natural language is a complex system used to express human thought, and within this system words are the basic units of meaning. As the name suggests, word vectors are vectors used to represent the meaning of words; they can also be regarded as feature vectors or representations of words. The technique of mapping words to real-valued vectors is called word embedding, and word2vec is one widely used method for learning such embeddings.
Why do we need to encode words (word vectorization)?
The input of any mathematical model needs to be numerical, because computers can only work with numbers. Words are abstract summaries of human language, which computers cannot understand directly. In natural language processing we are dealing with text, which cannot be fed into a mathematical model as-is, so we need to encode the text and represent each word with a vector.
How can words be represented well? First, we need to understand a basic assumption in the NLP field: the distributional hypothesis. The distributional hypothesis says that words appearing in similar contexts have similar meanings; in other words, the meaning of a word is determined by its context. For example, if "apple" appears alongside words such as "banana", "pear", and "one pound", it most likely refers to a kind of fruit; if "apple" appears alongside words such as "mobile phone", "Xiaomi", and "iPad", it most likely refers to a technology product or company, far removed from the fruit sense.
For example, when words are converted into 50-dimensional word vectors, the vectors of the related words "man" and "boy" turn out to be quite similar.
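Similarity between word vectors is usually measured with cosine similarity. Here is a minimal sketch of such a comparison; the 50-dimensional vectors are random placeholders standing in for real trained embeddings:

import torch
import torch.nn.functional as F

# Hypothetical 50-dimensional embeddings; in practice these would come
# from a trained model such as word2vec or GloVe.
vec_man = torch.randn(50)
vec_boy = torch.randn(50)

# Cosine similarity: close to 1 means the vectors point in nearly the same direction.
similarity = F.cosine_similarity(vec_man.unsqueeze(0), vec_boy.unsqueeze(0))
print(f"cosine similarity(man, boy) = {similarity.item():.4f}")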
Word2vec has two model variants: CBOW and Skip-gram.
CBOW model: predicts the current word from its context words.
Skip-gram model: predicts the context words from the current word.
Here we focus on the CBOW (Continuous Bag of Words) model, which predicts the current word based on its context words.
That is, it uses the context words w(t-2), w(t-1), w(t+1), w(t+2) to predict the center word w(t).
In the CBOW model, the inputs are the one-hot vectors of the context words, which are multiplied by the input weight matrix; the resulting vectors are averaged to form the hidden-layer vector, which is then multiplied by the output weight matrix to produce the score vector of the target word. Training continuously minimizes the loss between this output and the true target word. In the Skip-gram model the roles are reversed: the input is the vector of the target word and the outputs are the vectors of the context words. (The input layer generally uses one-hot encoding to convert text into vectors.)
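For comparison with the CBOW code later in this post, here is a minimal Skip-gram sketch under the same one-hot setup; the class name SkipGram and its layer shapes are my own illustration rather than part of the original code:

import torch.nn as nn

class SkipGram(nn.Module):
    # Minimal Skip-gram: predict each context word from the center word.
    def __init__(self, voc_size, embedding_size):
        super(SkipGram, self).__init__()
        self.input_to_hidden = nn.Linear(voc_size, embedding_size, bias=False)
        self.hidden_to_output = nn.Linear(embedding_size, voc_size, bias=False)

    def forward(self, X):  # X: [1, voc_size], one-hot vector of the center word
        hidden = self.input_to_hidden(X)        # [1, embedding_size]
        output = self.hidden_to_output(hidden)  # [1, voc_size], scores over the vocabulary
        return output

Training would then iterate over (center word, context word) pairs and apply a cross-entropy loss once per context word.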
What is one-hot? One-hot encoding, also known as 1-of-N encoding, is a method of representing discrete (categorical) variables.
For example, take the sentence "I drink coffee everyday". After segmenting it into words and one-hot encoding each one, the result is:
- I: [1, 0, 0, 0]
- drink: [0, 1, 0, 0]
- coffee: [0, 0, 1, 0]
- everyday: [0, 0, 0, 1]
We choose "coffee" as the center word and set the window size to 2. That is, we want to predict one word from the context words "I", "drink", and "everyday", and we hope that predicted word is "coffee", as sketched below.
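As a concrete sketch of this example (the tensors below are written out by hand for illustration; they are not produced by the code later in this post):

import torch

# One-hot vectors for the vocabulary ["I", "drink", "coffee", "everyday"]
one_hot = {
    "I":        torch.tensor([1., 0., 0., 0.]),
    "drink":    torch.tensor([0., 1., 0., 0.]),
    "coffee":   torch.tensor([0., 0., 1., 0.]),
    "everyday": torch.tensor([0., 0., 0., 1.]),
}

# CBOW training pair with "coffee" as the center word and window size 2
context_words = ["I", "drink", "everyday"]
target_word = "coffee"

# The CBOW input is the stack of context one-hot vectors;
# the training target is the vocabulary index of "coffee".
X = torch.stack([one_hot[w] for w in context_words])  # shape [3, 4]
y = torch.tensor([2])                                 # index of "coffee"
print(X.shape, y)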
The figure below shows the word2vec process.
Whether the CBOW or the Skip-gram model is used, word2vec generally produces high-quality word vector representations. The following figure visualizes 128-dimensional Skip-gram word vectors trained on a vocabulary of 50,000 words, compressed into a 2-dimensional space:
It can be seen that words with similar meanings are mostly grouped together, which also shows that word2vec is a reliable way to represent words as vectors.
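Compressing high-dimensional embeddings into two dimensions for this kind of plot is commonly done with t-SNE or PCA. Below is a minimal sketch using scikit-learn's TSNE, with a random placeholder matrix standing in for the trained 128-dimensional vectors:

import numpy as np
from sklearn.manifold import TSNE

# Placeholder: 500 words x 128-dimensional embeddings (random, for illustration only)
embeddings = np.random.randn(500, 128)

# Reduce to 2 dimensions for plotting
tsne = TSNE(n_components=2, random_state=0)
embeddings_2d = tsne.fit_transform(embeddings)
print(embeddings_2d.shape)  # (500, 2)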
Let's run the Word2vec model based on CBOW
1. Define a list of sentences, which will be used to train the CBOW model later
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F

# Define a list of sentences that will be used to train the CBOW and Skip-Gram models later
sentences = ["Kage is Teacher", "Mazong is Boss", "Niuzong is Boss",
             "Xiaobing is Student", "Xiaoxue is Student",]
# Join all sentences together, then split on spaces into individual words
words = ' '.join(sentences).split()
# Build the vocabulary by removing duplicate words
word_list = list(set(words))
# Create a dictionary that maps each word to a unique index
word_to_idx = {word: idx for idx, word in enumerate(word_list)}
# Create a dictionary that maps each index back to the corresponding word
idx_to_word = {idx: word for idx, word in enumerate(word_list)}
voc_size = len(word_list)  # Calculate the size of the vocabulary
print("Vocabulary:", word_list)  # Output the vocabulary
print("Dictionary from vocabulary to index:", word_to_idx)  # Output the word-to-index dictionary
print("Dictionary from index to vocabulary:", idx_to_word)  # Output the index-to-word dictionary
print("Vocabulary size:", voc_size)  # Output the vocabulary size
Output:
2. Generate CBOW training data
Code:
# Generate CBOW training data
def create_cbow_dataset(sentences, window_size=2):
    data = []  # Initialize the data list
    for sentence in sentences:
        sentence = sentence.split()  # Split the sentence into a list of words
        for idx, word in enumerate(sentence):  # Traverse the words and their indices
            # Get the context words: the window_size words before and after the current word
            context_words = sentence[max(idx - window_size, 0):idx] \
                + sentence[idx + 1:min(idx + window_size + 1, len(sentence))]
            # Use the current word and its context words as one training sample
            data.append((word, context_words))
    return data

# Use the function to create the CBOW training data
cbow_data = create_cbow_dataset(sentences)
# Print unencoded CBOW data samples (the first three)
print("CBOW data sample (unencoded):", cbow_data[:3])
Output:
3. Define One-Hot Encoding Function
def one_hot_encoding(word, word_to_idx):
    tensor = torch.zeros(len(word_to_idx))  # Create an all-zero tensor with the same length as the vocabulary
    tensor[word_to_idx[word]] = 1  # Set the position of the corresponding word's index to 1
    return tensor  # Return the resulting one-hot vector
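A quick sanity check of the function (this mirrors the one-hot demonstration included in the full code listing at the end of this post):

word_example = "Teacher"
print("Word before one-hot encoding:", word_example)
print("Vector after one-hot encoding:", one_hot_encoding(word_example, word_to_idx))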
4. Define the CBOW model
# Define the CBOW model
import torch.nn as nn  # Import the neural network module
class CBOW(nn.Module):
    def __init__(self, voc_size, embedding_size):
        super(CBOW, self).__init__()
        # Linear layer from vocabulary size to embedding size (weight matrix)
        self.input_to_hidden = nn.Linear(voc_size, embedding_size, bias=False)
        # Linear layer from embedding size to vocabulary size (weight matrix)
        self.hidden_to_output = nn.Linear(embedding_size, voc_size, bias=False)

    def forward(self, X):  # X: [num_context_words, voc_size]
        # Generate embeddings: [num_context_words, embedding_size]
        embeddings = self.input_to_hidden(X)
        # Hidden layer: mean of the context embeddings: [embedding_size]
        hidden_layer = torch.mean(embeddings, dim=0)
        # Output layer: [1, voc_size]
        output_layer = self.hidden_to_output(hidden_layer.unsqueeze(0))
        return output_layer

embedding_size = 2  # Set the size of the embedding layer; 2 is chosen here for easy visualization
cbow_model = CBOW(voc_size, embedding_size)  # Instantiate the CBOW model
print("CBOW model:", cbow_model)
5. Train the CBOW model
# Train the CBOW model
learning_rate = 0.001  # Set the learning rate
epochs = 1000  # Set the number of training epochs
criterion = nn.CrossEntropyLoss()  # Define the cross-entropy loss function
import torch.optim as optim  # Import the optimizer module
optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)  # Stochastic gradient descent optimizer

# Start the training loop
loss_values = []  # Used to store the average loss value of each recorded epoch
for epoch in range(epochs):
    loss_sum = 0  # Initialize the accumulated loss
    for target, context_words in cbow_data:
        # Convert the context words to one-hot vectors and stack them
        X = torch.stack([one_hot_encoding(word, word_to_idx) for word in context_words]).float()
        # Convert the target word to its index value
        y_true = torch.tensor([word_to_idx[target]], dtype=torch.long)
        y_pred = cbow_model(X)  # Compute the prediction
        loss = criterion(y_pred, y_true)  # Compute the loss
        loss_sum += loss.item()  # Accumulate the loss
        optimizer.zero_grad()  # Clear the gradients
        loss.backward()  # Backward propagation
        optimizer.step()  # Update the parameters
    if (epoch + 1) % 100 == 0:  # Every 100 epochs, print and record the average loss
        print(f"Epoch: {epoch+1}, Loss: {loss_sum / len(cbow_data)}")
        loss_values.append(loss_sum / len(cbow_data))

# Plot the training loss curve
import matplotlib.pyplot as plt  # Import matplotlib
plt.rcParams["font.family"] = ['SimHei']  # Set the font family
plt.rcParams['font.sans-serif'] = ['SimHei']  # Set the sans-serif font
plt.rcParams['axes.unicode_minus'] = False  # Display the minus sign correctly
plt.plot(range(1, epochs // 100 + 1), loss_values)  # Plot the recorded losses
plt.title('Training Loss Curve')  # Title
plt.xlabel('Rounds')  # X-axis label (in units of 100 epochs)
plt.ylabel('Loss')  # Y-axis label
plt.show()  # Display the figure
6. Output the word embeddings learned by CBOW
# Output the word embeddings learned by CBOW
print("CBOW word embedding:")
for word, idx in word_to_idx.items():  # Output the embedding vector of each word
    # input_to_hidden.weight has shape [embedding_size, voc_size], so column idx is the embedding of the word with index idx
    print(f"{word}: {cbow_model.input_to_hidden.weight[:,idx].detach().numpy()}")
Output:
CBOW word embedding:
Niuzong: [0.46508402 0.55232465]
Teacher: [0.24856524 0.62238467]
is: [-0.6280461 -0.5844824]
Mazong: [0.15402862 0.36817124]
Xiaobing: [0.67069155 0.09598981]
Boss: [1.1241493 0.4596834]
Student: [0.44188187 0.6775399]
Kage: [0.5566621 0.48963603]
Xiaoxue: [0.8823291 0.12908652]
7. Vector Visualization
fig, ax = plt.subplots()
for word, idx in word_to_idx.items():
    # Get the embedding vector for each word
    vec = cbow_model.input_to_hidden.weight[:, idx].detach().numpy()
    ax.scatter(vec[0], vec[1])  # Plot the point of the embedding vector
    ax.annotate(word, (vec[0], vec[1]), fontsize=12)  # Add the word label next to the point
plt.title('2D word embedding')  # Figure title
plt.xlabel('Vector dimension 1')  # X-axis label
plt.ylabel('Vector dimension 2')  # Y-axis label
plt.show()  # Display the figure
Output:
You can see that words with similar meanings are placed in similar positions.
That's all for this post.
The complete code is listed below:
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
import torch.optim as optim

# Define a list of sentences that will be used to train the CBOW and Skip-Gram models
sentences = ["Kage is Teacher", "Mazong is Boss", "Niuzong is Boss",
             "Xiaobing is Student", "Xiaoxue is Student",]
# Join all sentences together, then split on spaces into individual words
words = ' '.join(sentences).split()
# Build the vocabulary by removing duplicate words
word_list = list(set(words))
# Create a dictionary that maps each word to a unique index
word_to_idx = {word: idx for idx, word in enumerate(word_list)}
# Create a dictionary that maps each index back to the corresponding word
idx_to_word = {idx: word for idx, word in enumerate(word_list)}
voc_size = len(word_list)  # Calculate the size of the vocabulary

# Generate CBOW training data
def create_cbow_dataset(sentences, window_size=2):
    data = []  # Initialize the data list
    for sentence in sentences:
        sentence = sentence.split()  # Split the sentence into a list of words
        for idx, word in enumerate(sentence):  # Traverse the words and their indices
            # Get the context words: the window_size words before and after the current word
            context_words = sentence[max(idx - window_size, 0):idx] \
                + sentence[idx + 1:min(idx + window_size + 1, len(sentence))]
            # Use the current word and its context words as one training sample
            data.append((word, context_words))
    return data

# Use the function to create the CBOW training data
cbow_data = create_cbow_dataset(sentences)
# Print unencoded CBOW data samples (the first three)
print("CBOW data sample (unencoded):", cbow_data[:3])

# Define the one-hot encoding function
def one_hot_encoding(word, word_to_idx):
    tensor = torch.zeros(len(word_to_idx))  # Create an all-zero tensor with the same length as the vocabulary
    tensor[word_to_idx[word]] = 1  # Set the position of the corresponding word's index to 1
    return tensor  # Return the resulting one-hot vector

# Show the data before and after one-hot encoding
word_example = "Teacher"
print("Word before one-hot encoding:", word_example)
print("Vector after one-hot encoding:", one_hot_encoding(word_example, word_to_idx))

# Define the CBOW model
import torch.nn as nn  # Import the neural network module
class CBOW(nn.Module):
    def __init__(self, voc_size, embedding_size):
        super(CBOW, self).__init__()
        # Linear layer from vocabulary size to embedding size (weight matrix)
        self.input_to_hidden = nn.Linear(voc_size, embedding_size, bias=False)
        # Linear layer from embedding size to vocabulary size (weight matrix)
        self.hidden_to_output = nn.Linear(embedding_size, voc_size, bias=False)

    def forward(self, X):  # X: [num_context_words, voc_size]
        # Generate embeddings: [num_context_words, embedding_size]
        embeddings = self.input_to_hidden(X)
        # Hidden layer: mean of the context embeddings: [embedding_size]
        hidden_layer = torch.mean(embeddings, dim=0)
        # Output layer: [1, voc_size]
        output_layer = self.hidden_to_output(hidden_layer.unsqueeze(0))
        return output_layer

embedding_size = 2  # Set the size of the embedding layer; 2 is chosen here for easy visualization
cbow_model = CBOW(voc_size, embedding_size)  # Instantiate the CBOW model
print("CBOW model:", cbow_model)

# Train the CBOW model
learning_rate = 0.001  # Set the learning rate
epochs = 1000  # Set the number of training epochs
criterion = nn.CrossEntropyLoss()  # Define the cross-entropy loss function
optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)  # Stochastic gradient descent optimizer

# Start the training loop
loss_values = []  # Used to store the average loss value of each recorded epoch
for epoch in range(epochs):
    loss_sum = 0  # Initialize the accumulated loss
    for target, context_words in cbow_data:
        # Convert the context words to one-hot vectors and stack them
        X = torch.stack([one_hot_encoding(word, word_to_idx) for word in context_words]).float()
        # Convert the target word to its index value
        y_true = torch.tensor([word_to_idx[target]], dtype=torch.long)
        y_pred = cbow_model(X)  # Compute the prediction
        loss = criterion(y_pred, y_true)  # Compute the loss
        loss_sum += loss.item()  # Accumulate the loss
        optimizer.zero_grad()  # Clear the gradients
        loss.backward()  # Backward propagation
        optimizer.step()  # Update the parameters
    if (epoch + 1) % 100 == 0:  # Every 100 epochs, print and record the average loss
        print(f"Epoch: {epoch+1}, Loss: {loss_sum / len(cbow_data)}")
        loss_values.append(loss_sum / len(cbow_data))

import matplotlib.pyplot as plt

# Output the word embeddings learned by CBOW
print("CBOW word embedding:")
for word, idx in word_to_idx.items():  # Output the embedding vector of each word
    print(f"{word}: {cbow_model.input_to_hidden.weight[:,idx].detach().numpy()}")

# Plot the 2D word embeddings
fig, ax = plt.subplots()
for word, idx in word_to_idx.items():
    # Get the embedding vector for each word
    vec = cbow_model.input_to_hidden.weight[:, idx].detach().numpy()
    ax.scatter(vec[0], vec[1])  # Plot the point of the embedding vector
    ax.annotate(word, (vec[0], vec[1]), fontsize=12)  # Add the word label next to the point
plt.title('2D word embedding')  # Figure title
plt.xlabel('Vector dimension 1')  # X-axis label
plt.ylabel('Vector dimension 2')  # Y-axis label
plt.show()  # Display the figure