Google reveals why large models can't count the letter "r": the embedding dimension is the key, not just a tokenizer issue
Cressy from Aofei Temple
Quantum Bit | Public Account QbitAI
The reason why large models can breeze through Olympiad problems yet repeatedly stumble on simple counting has been found.
A new study from Google finds that the inability to count is not simply a tokenizer problem: there is not enough room in the embedding space to store the vectors used for counting.
Counting the number of times a word appears in a sentence is a simple task that can stump many large models, and GPT-4o and Claude 3.5 are no exception.
Go a step further and ask for the word with the highest frequency, and it becomes even harder: even when the model commits to a specific count, that count is still wrong.
Some people attribute this to tokenization, which makes the "words" the model sees differ from the words we perceive, but the paper shows the real situation is not that simple.
To count words, the embedding dimension should be large enough
A Transformer's counting ability is closely tied to the relationship between its embedding dimension d and the vocabulary size m (the number of distinct words in the vocabulary, not the sequence length).
The detailed reason lies in the mechanism a Transformer uses to count word frequencies.
The Transformer exploits a particular embedding scheme and the linear structure of the embedding space to cleverly turn the counting problem into vector addition.
Specifically, each word is mapped to its own orthogonal vector; in this representation, word frequencies can be computed simply by summing these orthogonal vectors.
However, this mechanism requires every word in the vocabulary to have its own independent orthogonal vector, so the embedding dimension must be larger than the vocabulary size.
When the embedding dimension is insufficient, the word vectors can no longer stay orthogonal, and the linear superposition that implements counting breaks down.
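The idea can be illustrated with a few lines of numpy (a minimal sketch of the principle, not the paper's construction): when the embedding dimension d is at least the vocabulary size m, each word can get its own orthonormal vector, and summing the token embeddings recovers every word's count exactly; when d < m, orthogonality is impossible and the recovered counts pick up interference from other words.

```python
import numpy as np

def count_by_embedding_sum(tokens, m, d, seed=0):
    """Recover word counts by summing token embeddings and projecting back."""
    rng = np.random.default_rng(seed)
    if d >= m:
        # QR gives m mutually orthonormal d-dimensional vectors (only possible when d >= m).
        q, _ = np.linalg.qr(rng.standard_normal((d, m)))
        E = q.T                                   # (m, d), orthonormal rows
    else:
        # With d < m, m vectors in d dimensions cannot all be orthogonal.
        E = rng.standard_normal((m, d))
        E /= np.linalg.norm(E, axis=1, keepdims=True)

    summed = E[tokens].sum(axis=0)                # "counting as vector addition"
    return E @ summed                             # projection onto each word's vector

m = 8
tokens = np.array([0, 1, 1, 3, 3, 3, 7])          # true counts: w0=1, w1=2, w3=3, w7=1
print(np.round(count_by_embedding_sum(tokens, m, d=16), 2))  # d >= m: exact counts
print(np.round(count_by_embedding_sum(tokens, m, d=4), 2))   # d <  m: distorted counts
```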
When that is no longer possible, the Transformer can still count through the attention mechanism (a construction the paper calls CountAttend), but it then needs a large "inversion" MLP layer whose size grows linearly with the sequence length n.
Specifically, the model uses attention to put almost all of the weight on the occurrences of the query word, and uses the positional encoding to route that attention weight into the last element of the value vector; this element ends up holding the reciprocal of the query word's frequency.
To recover the count itself, the model therefore needs an MLP layer of size O(n) to compute the function 1/x (where x is the number of times the word appears).
Further analysis shows that no constant-depth ReLU network can approximate the 1/x function with o(n) neurons.
Therefore, for a fixed-size Transformer this route does not scale to arbitrarily long sequences: once the sequence length exceeds what was seen during training, the model's counting ability degrades sharply.
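The toy sketch below illustrates the first half of this argument (an illustration of the idea, not the paper's exact construction; the inverse-temperature beta is an assumption): when matching positions get a large attention logit, the softmax weight on each occurrence approaches 1/x, which is exactly the quantity a downstream MLP would then have to invert.

```python
import numpy as np

def count_attend_weight(tokens, query, beta=20.0):
    """Toy illustration of the CountAttend idea.

    The query attends with a large logit (beta) to every position holding the
    same token and logit 0 elsewhere.  After softmax, the weight on each
    matching position approaches 1/x, where x is the number of occurrences of
    the query word -- so attention hands the *reciprocal* of the count to the
    next layer, and inverting it is the 1/x step that needs O(n) neurons at
    constant depth.
    """
    logits = np.where(tokens == query, beta, 0.0)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights[tokens == query][0]   # attention weight on one occurrence

tokens = np.array([2, 5, 2, 2, 9, 2])    # the token "2" occurs x = 4 times
w = count_attend_weight(tokens, query=2)
print(w, 1.0 / w)                        # w ~ 1/4, so the count is ~ 1/w = 4
```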
Sequence length is not the main factor; vocabulary size is the key
To verify this conclusion, the authors conducted two experiments.
The first experiment was conducted on a Transformer model trained from scratch with the following parameters:
- A standard model consisting of two Transformer layers and four attention heads;
- Embedding dimension d ranging from 8 to 128;
- For each fixed d, vocabulary size m varying from 5 to 150, with 20 different values tested;
- Training from scratch with the Adam optimizer, batch size 16, learning rate 10^-4, and 100,000 training steps.
The training and evaluation data are generated by random sampling: n words are drawn uniformly from a vocabulary of size m to form a sequence of length n.
The sequence length is set to n = 10m, so each word appears 10 times on average, and 1,600 samples are used for testing.
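For concreteness, a data-generation sketch following this description might look as below (a reconstruction of the setup as described, not the authors' code; the function name is a placeholder):

```python
import numpy as np

def make_counting_sample(m, seq_factor=10, seed=None):
    """Generate one (sequence, query, count) sample.

    n = 10 * m tokens are sampled uniformly from a vocabulary of size m, so
    each word appears 10 times on average; the label is the number of times
    the query word occurs in the sequence.
    """
    rng = np.random.default_rng(seed)
    n = seq_factor * m
    seq = rng.integers(0, m, size=n)
    query = rng.integers(0, m)
    count = int((seq == query).sum())
    return seq, query, count

# e.g. a test set of 1,600 samples for vocabulary size m = 50
test_set = [make_counting_sample(m=50, seed=i) for i in range(1600)]
print(test_set[0][1], test_set[0][2])
```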
The authors found that as the vocabulary grows, the model's counting accuracy drops in a step-like manner, and the drop occurs precisely when the vocabulary size exceeds the embedding dimension.
To further quantify counting ability, the authors define an indicator m_thr: the critical vocabulary size at which the model's counting accuracy drops to 80%.
Intuitively, m_thr reflects the largest vocabulary the model can "bear" at a given embedding dimension; the larger m_thr, the stronger the model's counting ability.
The results show that for both the counting task (QC) and the most-frequent-word task (MFC), m_thr grows approximately linearly with the embedding dimension d.
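Reading m_thr off such a sweep is straightforward; the sketch below shows one way to do it, using made-up accuracy numbers purely as placeholders:

```python
import numpy as np

def m_threshold(vocab_sizes, accuracies, target=0.80):
    """Estimate m_thr: the largest vocabulary size at which counting accuracy
    is still at least `target` (80% in the paper's definition).  The inputs
    come from sweeping m at a fixed embedding dimension d.
    """
    vocab_sizes = np.asarray(vocab_sizes)
    accuracies = np.asarray(accuracies)
    ok = vocab_sizes[accuracies >= target]
    return int(ok.max()) if ok.size else None

# hypothetical accuracy curve for one embedding dimension (placeholder numbers)
ms   = [5, 20, 40, 60, 80, 100, 120, 150]
accs = [1.0, 0.99, 0.97, 0.91, 0.83, 0.62, 0.41, 0.30]
print(m_threshold(ms, accs))   # -> 80
```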
The second experiment was conducted on the pre-trained Gemini 1.5 model; here the authors focused on the impact of vocabulary size on counting ability.
They designed a series of counting tasks, each using a vocabulary of a different size while fixing the average number of times each word appears in the sequence.
This means that within the experimental group, a larger vocabulary also implies a longer sequence.
As a control, the authors also set up a "Binary Baseline" with a fixed vocabulary of only two words, but the same sequence lengths as the main experimental group.
This makes it possible to tell whether it is the vocabulary size or the sequence length that causes the model's counting errors.
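A sketch of how the two conditions could be constructed is shown below (the prompt wording, word list, and helper names are illustrative assumptions, not taken from the paper):

```python
import random

def build_counting_prompt(vocab, avg_occurrences=10, seed=0):
    """Experimental condition: sequence length grows with the vocabulary,
    each word appearing ~avg_occurrences times on average."""
    rng = random.Random(seed)
    words = rng.choices(vocab, k=avg_occurrences * len(vocab))
    query = rng.choice(vocab)
    prompt = (f"How many times does the word '{query}' appear "
              f"in the following list?\n" + " ".join(words))
    return prompt, words.count(query)

def build_binary_baseline(length, seed=0):
    """Control condition: only two words, but the same sequence length
    as the matched experimental prompt."""
    rng = random.Random(seed)
    words = rng.choices(["apple", "banana"], k=length)
    prompt = ("How many times does the word 'apple' appear "
              "in the following list?\n" + " ".join(words))
    return prompt, words.count("apple")

vocab = [f"word{i}" for i in range(30)]
p, y = build_counting_prompt(vocab)
bp, by = build_binary_baseline(length=10 * len(vocab))
# send p and bp to the model, then compare |prediction - truth| across conditions
```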
The results show that as the vocabulary grows, Gemini 1.5's mean absolute error on the counting task rises significantly, while the error of the Binary Baseline stays much lower.
This suggests that the growth of the vocabulary, rather than the growth of the sequence length, is the main reason the counting ability of large models deteriorates.
The authors note, however, that although the study establishes upper and lower bounds on the counting ability of large models, these bounds are not tight and are still some distance from the ideal result.
They also did not examine whether adding more Transformer layers would change the conclusion; new technical tools will be needed for further verification.
Paper address:
https://arxiv.org/abs/2407.15160
— End —