

A paper first-authored by a Chinese researcher born in the 2000s was published in Nature: large models have become less reliable for humans

Latest update time: 2024-10-03
Yishui from Aofei Temple
Quantum Bit | Public Account QbitAI

A paper first-authored by a Chinese researcher born in the 2000s has been published in Nature, and this large-model paper has sparked heated discussion.

In short, the paper finds that larger models that follow instructions more closely have become less reliable, and in some cases GPT-4 answers even less reliably than GPT-3.

The latest models, despite more computing power and more human feedback, actually became less reliable at answering than earlier models.

As soon as the conclusion came out, it drew more than 200,000 views from netizens:

It also sparked discussion on the Reddit forum.

This is reminiscent of how many supposedly expert/PhD-level models cannot answer simple questions like "which is bigger, 9.9 or 9.11?"

Regarding this phenomenon, the paper notes that it reflects a mismatch between model performance and human expectations of difficulty.

In other words, “LLMs both succeed and (more dangerously) fail where users least expect them.”

Ilya Sutskever predicted in 2022:

Perhaps over time this difference will decrease.

However, this paper found that this is not the case. Not only the GPT, LLaMA, and BLOOM series, but even OpenAI's new o1 model and Claude-3.5-Sonnet are worrying in terms of reliability.

More importantly, the paper also found that relying on human supervision to correct errors did not work.

Some netizens believe that while larger models may bring reliability issues, they also provide unprecedented functionality.

We need to focus on developing robust evaluation methods and increasing transparency.

Others say the research highlights a delicate challenge facing artificial intelligence: balancing model scaling with reliability.

Larger models are less reliable, and relying on human feedback is no longer effective

To illustrate the conclusions, the paper examines three key aspects that influence the reliability of LLMs from a human perspective:

1. Difficulty inconsistency: Do LLMs fail where humans expect them to fail?
2. Task avoidance: Do LLMs avoid answering questions that are beyond their capabilities?
3. Sensitivity to prompt wording: Is the effectiveness of question wording affected by the difficulty of the question?

More importantly, the authors also analyze historical trends and how these three aspects evolve with task difficulty.

Let’s expand on them one by one below.

For the first question, the paper focuses on the evolution of correctness relative to difficulty.

Looking at the evolution of the GPT and LLaMA families, accuracy clearly decreases for all models as difficulty increases. (Consistent with human expectations.)

However, these models still cannot solve many very simple tasks.

This means that human users cannot discover the safe operating space of LLMs and use it to ensure that the model's deployed performance will be flawless.

Surprisingly, the new LLMs mainly improve performance on difficult tasks, without significant improvements on easier tasks, for example GPT-4 compared with its predecessor GPT-3.5-turbo.

The above demonstrates an inconsistency between human difficulty expectations and model performance, and this inconsistency is exacerbated in newer models.

This also means:

There is currently no way for humans to determine the safe operating conditions under which LLMs can be trusted.

This is particularly worrisome for applications that require high reliability and a well-identified safe operating space, and it makes one reflect on whether the cutting-edge machine intelligence humanity is striving to create is really what society wants.
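To make the first point more concrete, here is a minimal sketch (not the paper's code) of how one might check whether a model's correctness tracks human difficulty expectations; the per-item records below are hypothetical.

```python
# Toy check of difficulty consistency: bin graded model answers by
# human-rated difficulty and look at accuracy per bin.
from collections import defaultdict

# Hypothetical records: human-rated difficulty (1 = easy ... 5 = hard)
# plus whether the model's answer was graded correct.
records = [
    {"difficulty": 1, "correct": True},
    {"difficulty": 1, "correct": False},   # an unexpected failure on an easy item
    {"difficulty": 3, "correct": True},
    {"difficulty": 3, "correct": False},
    {"difficulty": 5, "correct": False},
    {"difficulty": 5, "correct": True},
]

by_level = defaultdict(list)
for r in records:
    by_level[r["difficulty"]].append(r["correct"])

for level in sorted(by_level):
    outcomes = by_level[level]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"difficulty {level}: accuracy {accuracy:.2f} over {len(outcomes)} items")

# Difficulty consistency is about the easy end of this table: if accuracy at
# level 1 stays well below 1.0 as models scale, they fail where users least
# expect them to, and no safe operating region emerges.
```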

Second, regarding the paper's findings on task avoidance (avoidance usually means that the model wanders away from answering the question, or states outright "I don't know"):

Compared to earlier LLMs, the latest LLMs produce a markedly larger share of wrong or downright nonsensical answers, rather than carefully avoiding tasks that are beyond their capabilities.

This also leads to an ironic phenomenon: on some benchmarks, the error rate of the new LLMs grows even faster than the accuracy (doge).

Generally speaking, humans are more likely to be vague when faced with a difficult task.

But the actual performance of LLMs is quite different. Research shows that their avoidance behavior has no obvious correlation with difficulty.

This can easily lead to users initially over-relying on LLMs for tasks they are not good at, only to set them up for frustration in the long run.

As a result, humans still need to verify the model's outputs and catch its errors. (Which rather defeats the purpose of using LLMs to save effort.)
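As a rough illustration of how this kind of grading can work, the sketch below splits responses into correct / avoidant / incorrect; the avoidance markers and graded examples are invented for illustration and are not the paper's grading pipeline.

```python
# Toy response taxonomy: correct / avoidant / incorrect, per difficulty level.
AVOIDANCE_MARKERS = ("i don't know", "i cannot answer", "i'm not sure")

def classify(response: str, is_correct: bool) -> str:
    """Label a graded response; avoidance is detected by simple phrase matching."""
    text = response.lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if is_correct else "incorrect"

# Hypothetical graded outputs: (human difficulty, response text, graded correctness).
graded = [
    (1, "The answer is 42.", True),
    (2, "It is definitely 17.", False),   # confident nonsense on an easy item
    (4, "I don't know.", False),          # avoidance, ideally reserved for hard items
    (5, "The capital is Lima.", False),
]

for difficulty, response, is_correct in graded:
    print(f"difficulty {difficulty}: {classify(response, is_correct)}")

# The paper's finding, in these terms: as models scale up and are shaped up,
# the "incorrect" share grows while "avoidant" shows little relation to difficulty.
```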

Finally, the paper found that even though some reliability indicators improved, the model was still sensitive to small changes in the formulation of the same problem.

For example, asking "Can you answer…?" instead of "Please answer the following question…" will result in different levels of accuracy.

The analysis found that scaling-up and shaping-up alone are unlikely to fully solve the problem of prompt sensitivity, as the latest models show no significant improvement over their predecessors here.

Moreover, even when the prompt format with the best average performance is chosen, it may mainly help on high-difficulty tasks while being ineffective (with a higher error rate) on low-difficulty tasks.

This suggests that humans remain hostage to prompt engineering.
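A minimal sketch of measuring this kind of sensitivity: the same items are asked under several phrasings and accuracy is compared per phrasing. `ask_model` is a placeholder stub rather than any API from the paper; swap in your own model call.

```python
# Toy prompt-sensitivity measurement: same questions, several phrasings.
TEMPLATES = [
    "Can you answer: {question}",
    "Please answer the following question: {question}",
    "{question} Respond with only the final answer.",
]

def ask_model(prompt: str) -> str:
    # Placeholder stand-in for a real model call; replace with your API of choice.
    return "9.9 is bigger."

def accuracy_per_template(items):
    """items: dicts with 'question' and 'answer'; returns accuracy for each phrasing."""
    hits = {template: [] for template in TEMPLATES}
    for item in items:
        for template in TEMPLATES:
            reply = ask_model(template.format(question=item["question"]))
            hits[template].append(item["answer"].lower() in reply.lower())
    return {template: sum(v) / len(v) for template, v in hits.items()}

items = [{"question": "Which is bigger, 9.9 or 9.11?", "answer": "9.9"}]
print(accuracy_per_template(items))

# A large spread in accuracy across templates on the same items is exactly the
# sensitivity to wording that the paper reports, and it is worth checking
# separately for easy and hard items.
```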

Even more frightening, the paper found that human supervision could not alleviate the models' unreliability.

Based on human surveys, the paper analyzes whether humans' perception of difficulty is consistent with the models' actual performance, and whether humans can accurately evaluate model outputs.

The results show that in operating regions users consider difficult, they often judge incorrect outputs to be correct; and even for simple tasks, there is no safe operating region with both low model error and low supervision error.

The above unreliability issues exist across multiple LLM series, including GPT, LLaMA, and BLOOM; the study covers 32 models.

These models span different degrees of scaling-up (increasing computation, model size, and data) and shaping-up (e.g., instruction fine-tuning, RLHF).

In addition to the above, the authors later discovered that some of the latest and most powerful models also have the unreliability issues mentioned in this article:

These include OpenAI's o1 model, Anthropic's Claude-3.5-Sonnet, and Meta's LLaMA-3.1-405B.

An accompanying document also gives examples (see the original document for details):

In addition, so that others can check whether further models have reliability issues, the authors have open-sourced the test benchmark used in the paper, ReliabilityBench.

This is a dataset spanning five domains: simple arithmetic ("addition"), anagram solving ("anagram"), geographical knowledge ("locality"), basic and advanced science questions ("science"), and information-centric transformations ("transforms").
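For readers who want to try it themselves, below is a minimal sketch of pulling the benchmark from Hugging Face with the `datasets` library; the configuration and split names ("addition", "test") are assumptions for illustration, so check the dataset card for the actual ones.

```python
# Minimal sketch: load one ReliabilityBench domain from Hugging Face.
# NOTE: the config name "addition" and split "test" are assumptions; see
# https://huggingface.co/datasets/lexin-zhou/ReliabilityBench for the real names.
from datasets import load_dataset

dataset = load_dataset("lexin-zhou/ReliabilityBench", "addition", split="test")
print(dataset)       # inspect the available columns
print(dataset[0])    # look at a single item before wiring up an evaluation loop
```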

About the Author

The first author of the paper, Lexin Zhou, recently completed a master's degree in CS at Cambridge University (he is 24), and his research interest is the evaluation of large language models.

Prior to this, he obtained a Bachelor's degree in Data Science from the Polytechnic University of Valencia, supervised by Professor Jose Hernandez-Orallo.

His personal homepage lists several internships, including red-team testing (red-teaming consultancy) at OpenAI and Meta.

Regarding this paper, he emphasized:

A fundamental shift is needed in the design and development of general-purpose AI, especially in high-stakes domains where a predictable error distribution is critical. Until that is achieved, relying on human oversight is dangerous.

When evaluating models, taking into account what humans consider difficult and assessing the model's avoidance behavior gives a more comprehensive picture of its capabilities and risks than focusing only on performance on difficult tasks.

The paper also specifically mentions some possible causes of these unreliabilities and solutions:

On scaling-up: benchmarks in recent years have tended to add more difficult examples or to give more weight to so-called "authoritative" sources, so researchers optimize model performance on difficult tasks, causing a chronic deterioration in difficulty consistency.

On shaping-up (such as RLHF): the hired annotators tend to penalize answers that dodge the task, making models more likely to "talk nonsense" when faced with difficult problems they cannot solve.

As for how to fix these unreliabilities, the paper suggests using human difficulty expectations to better train or fine-tune models, or using task difficulty and model confidence to teach models to sidestep problems beyond their own capabilities, and so on.
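As a minimal sketch of that second idea (using task difficulty and model confidence to decide when to abstain), the snippet below answers only when confidence clears a threshold that rises with estimated difficulty; the scoring functions are hypothetical placeholders, not the paper's method.

```python
# Toy abstention policy: answer only when confidence beats a difficulty-aware threshold.
def model_confidence(question: str) -> float:
    # Placeholder: in practice this could come from token log-probabilities
    # or a calibrated self-assessment; fixed here for illustration.
    return 0.62

def estimated_difficulty(question: str) -> float:
    # Placeholder: a predictor of human-perceived difficulty in [0, 1];
    # here crudely approximated by question length, purely for illustration.
    return min(len(question) / 200.0, 1.0)

def answer_or_abstain(question: str, base_threshold: float = 0.5) -> str:
    threshold = base_threshold + 0.4 * estimated_difficulty(question)  # stricter on hard tasks
    if model_confidence(question) < threshold:
        return "I don't know."          # abstain instead of guessing
    return "<model answer would go here>"

print(answer_or_abstain("Which is bigger, 9.9 or 9.11?"))          # clears the threshold: answers
print(answer_or_abstain("Derive the full proof of the four-colour theorem "
                        "and explain each step in detail for a general graph."))  # abstains
```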

What do you think about this?

Article:
https://www.nature.com/articles/s41586-024-07930-y

Reference links:
[1] https://x.com/lexin_zhou/status/1838961179936293098
[2] https://huggingface.co/datasets/lexin-zhou/ReliabilityBench
[3] https://lexzhou.github.io/

— End —


