People who quarrel with each other get angry, but quarrels between AIs can bring safety
Text | sanman Editor: Yang Xiaofan
Report from Leiphone.com (leiphone-sz)
Leifeng.com AI Technology Review: OpenAI recently published an article outlining how to make AI systems safer by having them expose flaws in each other's claims through debate, with humans serving as the final judges. Because humans directly decide the outcome of each debate, the method can keep an AI system's values aligned with human values, which the authors argue makes the system safer. Leifeng.com AI Technology Review's full translation follows.
AI Safety via Debate
We propose a new AI safety technique that trains agents to debate topics with one another, with humans judging who wins. We think this or a similar approach could eventually help us train AI systems to perform tasks that exceed human cognitive abilities while still acting in line with human values. We outline this approach, report initial proof-of-concept experiments, and are also releasing a web page so people can experiment with the technique.
The debate method can be thought of as a game tree like the one in Go, except that the moves are debate statements and the final leaf node is judged by a human. In both debate and Go, the true answer depends on the whole tree, but a single game path played out by strong agents is evidence about that whole. For example, although amateur Go players cannot directly evaluate the quality of a professional player's individual move, they can judge the strength of professional players by the outcome of the game.
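To make the game-tree analogy concrete, here is a minimal sketch (not OpenAI's actual code) of treating a debate as a two-player zero-sum game whose leaf transcripts are scored by a human judge; `judge` and `legal_statements` are hypothetical callables supplied by the reader.

```python
# Minimal sketch: a debate viewed as a zero-sum game tree, with moves as
# statements and leaf values supplied by a human (or human-proxy) judge.

def debate_value(transcript, depth, judge, legal_statements, maximizing=True):
    """Minimax value of a debate state from the first debater's viewpoint."""
    moves = list(legal_statements(transcript))
    if depth == 0 or not moves:
        # The judge sees only the finished transcript and returns +1 if the
        # first debater's claim is accepted, -1 otherwise.
        return judge(transcript)
    values = [
        debate_value(transcript + [m], depth - 1, judge, legal_statements,
                     not maximizing)
        for m in moves
    ]
    return max(values) if maximizing else min(values)
```

In practice the tree is far too large to search exhaustively, which is why the post compares strong debaters to strong Go players: a single well-played line of argument stands in for the full tree.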
One approach to making AI agents conform to human goals and preferences is for humans to specify which behaviors are safe and useful during training. While this idea seems good, it requires humans to judge whether the behaviors exhibited by AI agents are good or bad, and in many cases, the behavior of the agent may be too complex for people to understand, or the task itself may be difficult to judge or evaluate. For example, agents running in computer security-related environments or agents coordinating large numbers of industrial robots are scenarios that humans cannot directly observe and evaluate.
How can we enable humans to effectively supervise advanced AI systems? One approach is to use AI itself to assist supervision, that is, to require the AI system to point out all the flaws in its own behavior. To achieve this goal, we redefine the learning process as a debate between two agents, and then humans judge the debate process. Even if the agent has a deeper understanding of the problem than humans, humans can judge which agent has a better argument (similar to experts and witnesses arguing to convince a jury).
Our approach frames this as a debate game between two adversarial AI agents with a specific debate format. The two agents can be trained by self-play, similar to AlphaGo Zero or the Dota 2 AI. Our hope is that, properly trained, such agents can produce value judgments far beyond human capability. If the two agents disagree about the truth, but their full reasoning is too long or complex to show to humans, the debate can narrow down to simpler and simpler factual disputes, ultimately presenting a single simple choice that a human can judge.
This is not intuitive, so let's take an example. Consider the question: "Where is the best place to go on vacation?" If one agent, Alice, does the research on our behalf and says "Alaska", it is hard to judge whether this is really the best choice. If the second agent, Bob, says "No, it should be Bali", that may sound convincing because Bali is warmer. Alice replies, "You can't go to Bali because your passport won't arrive in time", exposing a flaw in the Bali option that we might not have noticed ourselves. Then Bob counters with "Expedited passport service takes only two weeks." The debate continues until a statement is reached that the human can judge correctly, that is, until the other agent knows that further argument will not change the human's decision.
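In contrast to the full game tree sketched above, a rough sketch of the single played-out path in this example might look like the loop below; `agent_a`, `agent_b`, and `human_judge` are assumed interfaces, not part of any real API.

```python
# Illustrative sketch of the alternating debate protocol described above.

def run_debate(question, agent_a, agent_b, human_judge, max_rounds=10):
    """Alternate statements until an agent declines to continue, then judge."""
    transcript = [("question", question)]
    debaters = [("Alice", agent_a), ("Bob", agent_b)]
    for round_idx in range(max_rounds):
        name, agent = debaters[round_idx % 2]
        statement = agent.argue(transcript)   # claim, rebuttal, or None
        if statement is None:                 # agent sees no way to change the outcome
            break
        transcript.append((name, statement))
    return human_judge.decide(transcript)     # the human picks the winner
```

The key property is that the human only ever has to judge the final, narrowed-down transcript rather than the agents' full reasoning.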
Convincing a Sparse MNIST Classifier
While we believe this approach is most effective when agents converse with each other over long periods using natural language, we need to test it in simpler domains, because today's natural language modeling is not yet up to the task. It is also easier to test on tasks that do not exceed human capabilities. So we moved the test to the visual domain and reduced the difficulty of the debate from "the agent has an ability the evaluator (human) does not have" to "the agent has knowledge the evaluator (human) does not have". The evaluator's goal in the test is to guess the content of an image, but the evaluator cannot see anything other than the pixels the agents choose to reveal. The agents see the full image and play a game: they alternately reveal individual pixels to the evaluator over many rounds. Here the full image stands in for information beyond the evaluator's reach, and each revealed pixel is a point of debate.
The experiment itself may make this easier to understand. We tested the idea on the simplest dataset, MNIST. The evaluator was not a human but a classifier pre-trained to recognize MNIST digits from only 6 non-black pixels, with the pixels sampled at random for each image during pre-training. In the debate, two agents alternately chose pixels to reveal, stopping after 6 pixels had been shown (so the evaluator sees only a tiny amount of information in total). One agent was honest and tried to make the evaluator guess correctly, while the other tried to make the evaluator guess wrong. With both agents declaring in advance which digit they were arguing for, the debate raised the accuracy of the sparse classifier from 59.4% to 88.9%. In other words, the honest agent won 88.9% of the time on the MNIST test set, improving the evaluator beyond its unaided accuracy.
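The game loop below is a minimal, illustrative reconstruction of this 6-pixel debate, not OpenAI's actual code; `sparse_judge` stands for the classifier pre-trained on 6 random non-black pixels, and `honest` and `liar` are assumed pixel-choosing policies.

```python
import numpy as np

def play_pixel_debate(image, true_label, lie_label,
                      honest, liar, sparse_judge, n_pixels=6):
    """Alternate pixel reveals, then let the sparse judge pick a digit."""
    mask = np.zeros_like(image, dtype=bool)
    players = [(honest, true_label), (liar, lie_label)]
    for turn in range(n_pixels):
        policy, claimed_digit = players[turn % 2]
        y, x = policy(image, mask, claimed_digit)  # pick a non-black, unrevealed pixel
        assert image[y, x] > 0 and not mask[y, x]
        mask[y, x] = True
    revealed = np.where(mask, image, 0.0)          # the judge sees only these 6 pixels
    prediction = sparse_judge(revealed, candidates=(true_label, lie_label))
    return prediction == true_label                # True if the honest agent wins
```

Running many such games over the test set and averaging the return value would give the honest agent's win rate (88.9% in the reported experiment).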
The structure of the MNIST debate game. Debaters Alice and Bob are shown an image; Alice tries to trick the evaluator into thinking it is a 6, while Bob tries to correctly convince the evaluator that it is a 5. They alternately reveal non-black pixels to the evaluator, who correctly identifies the digit as a 5 after seeing all 6 pixels. Alice traces out the bottom curve to make the digit look like a 6 (red pixels), but Bob effectively counters Alice's strategy by revealing pixels in the top-right corner (blue pixels), which most MNIST 6s lack.
Intuitively, an image is more likely to contain pixels that convincingly demonstrate its true content than pixels that convincingly support a lie, so 6 pixels chosen half by an honest agent and half by a deceptive one are far more informative than 6 random pixels. If the evaluator is a human who can reason about a few arguments at a time but cannot sift through all possible arguments, we hope the reward mechanism will push the debating agents to do that sifting for us, even though we lack an a priori mechanism for telling good arguments from bad ones.
The image below shows a few sample games. Because wins by the deceptive agent are more interesting than honest wins, we selected the sample so that the deceiver wins 50% of the games shown (in reality, the deceiver wins only 11.1% of the time).
The above picture is a screenshot of the game, where you can see the displayed numbers and the evaluator's judgment process. Some numbers are particularly easy to confuse, such as 8 and 9.
For more game results, please refer to the game page https://blog.openai.com/debate/.
The above are confusion matrices with the diagonal set to zero, for judging 6-pixel MNIST: random pixels on the left and debate pixels on the right. The errors for 6 random pixels show no obvious pattern, while the debate errors are concentrated on the digits 8 and 9. It is especially easy to convince the evaluator that a 5 is an 8, or to trick it into mistaking a 4 for a 9. This may be caused by artificially restricting the game to non-black pixels: for some digits, the strongest evidence would be a black pixel where an 8 would have a stroke, proving the digit is not an 8, but such pixels cannot be revealed.
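As a small illustrative snippet (not from the post), the kind of error-only confusion matrix described above can be built by counting the evaluator's judgments and zeroing the diagonal:

```python
import numpy as np

def error_confusion_matrix(true_digits, judged_digits, n_classes=10):
    """Count judgments per (true digit, judged digit) pair, keeping only errors."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_digits, judged_digits):
        cm[t, p] += 1
    np.fill_diagonal(cm, 0)   # remove correct judgments so only errors remain
    return cm
```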
Cats and Dogs
The next, more complex stage of debate experiments still uses images, but images more sophisticated than digits, such as photos of cats and dogs. Such images likely require some natural language or common-sense reasoning, so we have not yet taken this step with machine learning. Instead, we have built a prototype website where humans can run such experiments, playing the roles of both the judge and the debaters. Here, debaters can talk to the evaluator in natural language, but any of their statements may be lies. Each debater may also reveal one pixel during the debate, and that pixel is guaranteed to be truthful.
Two human debaters debate, a human evaluator makes the judgement, and only the debaters can see the image. Red argues that it is a dog, and Blue says it is a cat.
In a typical debate, Alice might honestly claim that the photo shows a cat, while Bob lies that it shows a dog. Alice can say, "The center of this small rectangle is the cat's green eye." Bob cannot admit that it is indeed an eye, so he makes up another lie: "It is a dog playing in the grass, and the center of the rectangle is a patch of grass." But this lie is hard to reconcile with the surrounding scene; for example, Alice can respond, "If it were grass, then the top or bottom of this elongated rectangle should be green." The debate continues until the evaluator zooms in on a specific pixel that the two debaters disagree about but for which Bob can no longer invent a plausible story, at which point Alice reveals that pixel and wins. We have played this game informally, and although we limited the rate at which the evaluator could request information in order to be fair to the deceiver (a complete lie is genuinely hard to construct and takes time to explain), the results show that honest debaters do tend to win.
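A rough sketch of the rules of this cat-and-dog game might look like the structure below; this is an assumed reconstruction for illustration, not the actual website implementation. Statements are free text and may be false, while each debater gets exactly one truthful pixel reveal.

```python
from dataclasses import dataclass, field

@dataclass
class Debater:
    name: str
    claim: str                      # e.g. "cat" or "dog"
    pixel_revealed: bool = False

@dataclass
class CatDogDebate:
    image: list                     # 2D pixel grid; only debaters can see it
    transcript: list = field(default_factory=list)

    def say(self, debater, text):
        # Natural-language statements are unverified and may be lies.
        self.transcript.append((debater.name, text))

    def reveal_pixel(self, debater, y, x):
        # Each debater may reveal a single pixel, and the value shown is
        # always the ground truth taken from the image.
        assert not debater.pixel_revealed, "only one pixel reveal per debater"
        debater.pixel_revealed = True
        self.transcript.append((debater.name, f"pixel({y},{x}) = {self.image[y][x]}"))
```

The human evaluator reads the transcript (without seeing the image) and decides which debater's claim to believe.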
Two debaters and an evaluator playing "Cat and Dog"
Future work and limitations
Most of our paper consists of conceptual analysis, and the experiments above are very preliminary. In the future, we hope to run more difficult visual experiments and eventually experiments in natural language. We believe the final evaluator should be a human (or a model trained from human judgments), rather than a machine-learning model standing in as a proxy for humans. The agents, by contrast, should ultimately be powerful machine learning systems that can do things humans cannot directly understand. Because human judges may bring preconceptions and biases, we also think it is important to have agents debate value-laden questions, so we can test whether the debate simply reproduces the evaluations of biased humans.
Even with these improvements, the debate approach has some fundamental limitations and may need to be improved or supplemented by other methods. We would like to emphasize the following points. First, debate does not attempt to solve problems such as adversarial examples or distributional shift; it is only a way to obtain a training signal for complex goals, not a way to guarantee the robustness of those goals (that must be achieved with other techniques). Second, the method cannot guarantee that the debate will arrive at the best result or a correct statement. Although self-play has performed well in practice for AlphaGo and other game-playing AIs, we have no theoretical guarantees about its final performance. Third, because training debate agents requires more computing resources than training agents that simply produce an answer directly (even a bad or unsafe one), the debate method may fail to compete with cheaper and less safe approaches. Finally, humans are not necessarily good evaluators. Sometimes they are not smart enough to make a good judgment even after the agents have zoomed in on the simplest disputed facts; sometimes they are biased and will believe whatever they want to believe. These are empirical questions that we hope to study further.
Finally, we believe that if debate or similar approaches work, they will make future AI systems safer by keeping them aligned with human goals and values even when those systems grow too capable for humans to supervise directly. Even for weaker systems that humans can supervise, debate may make evaluation easier by reducing the sample complexity required.
The above is Leifeng.com AI Technology Review's translation of the full text. This OpenAI work offers a promising direction for the safety of AI systems. If you would like to read the original post, please visit: https://blog.openai.com/debate/