o1's complete chain of thought becomes OpenAI's number-one taboo! Ask too many times and your account may be blocked

Last updated: 2024-09-14
Mengchen, Cressy | from Aofei Temple
Quantum Bit | Public Account QbitAI

Warning! Don't ask the latest o1 model in ChatGPT how it thinks:

Try it a few times and OpenAI will send you an email threatening to revoke your access privileges.

Please cease this activity and ensure your use of ChatGPT complies with our Terms of Use. Violations of this may result in loss of OpenAI o1 access.

Less than 24 hours after the release of o1, the new large-model paradigm, many users reported receiving this warning email, sparking widespread complaints.

Some reported that a warning is triggered whenever the prompt contains keywords such as "reasoning trace" or "show your chain of thought".

Even avoiding those keywords entirely and using other tricks to coax the model into bypassing the restriction gets detected.

Some even claimed that their accounts were actually suspended for a week.

These users were trying to trick o1 into repeating its complete internal thought process, that is, all of the raw reasoning tokens.

Currently, what you can see via the expand button in the ChatGPT interface is only a summary of the original thought process.

In fact, when o1 was released, OpenAI gave its reasons for hiding the model's complete thought process.

In short: OpenAI needs to monitor the model's raw thought process internally, so no safety restrictions can be applied to those raw tokens, which makes it inappropriate to show them to users.

However, not everyone agrees with this reason.

Some pointed out that o1's thinking process is the best possible training data for other models, so OpenAI does not want this valuable data carried off by other companies.

Others think this shows that o1 really has no moat: once the thinking process is exposed, it can easily be copied by others.

And some asked: "Does this mean we just have to blindly trust the AI's answers without any explanation?"

Very little has been revealed about the technical principles behind o1; the only useful information is that it "used reinforcement learning."

In short, OpenAI is becoming less and less open.

o1 is Strawberry, but not GPT-5

It is now certain that o1 is the "Strawberry" OpenAI has been teasing for so long, or at least that it uses the approach "Strawberry" represents.

But does it count as the next-generation model GPT-5, or merely a GPT-4.x?

More and more people are beginning to suspect that it is just an engineering adjustment based on GPT-4o.

The well-known leaker account Flowers (formerly "Flowers from the future") claimed that OpenAI employees internally call o1 "4o with reasoning".

He also claimed that many OpenAI employees quietly liked this revelation, and that the screenshot above likewise came from OpenAI employees.

But Musk changed X a while ago so that no one except the original poster can see who liked a post, so the claim cannot be verified at the moment.

Flowers also asked a follow-up question in the "Ask Me Anything" event just held by the OpenAI developer account.

OpenAI employees answered many questions there, but dodged this one, even though it drew plenty of likes and sat at the top of the thread.

Even Altman himself appeared only to play riddler again, hinting that "Strawberry" has run its course and that the next new model, code-named "Orion", is on the way.

Earlier reports said that "Orion" is OpenAI's next-generation flagship model, trained on synthetic data generated by "Strawberry", i.e. o1.

Orion is also one of the "winter constellations" Altman has alluded to.

Back to the released o1: another criticism surrounding it is that it "does not follow scientific research norms."

For example, there are no references to prior work on inference-time compute, and no comparisons against the most advanced models from other companies.

In response to the previous point, some people pointed out that OpenAI is no longer a research laboratory and should be regarded as a commercial company.

Sometimes they still pretend to be a research lab in order to recruit people who want to do research.

As for the latter point, however, now that the API is out, whether o1 gets compared with other frontier models is no longer up to OpenAI, and many third-party benchmarks have already produced results.

In the $1 million ARC Prize competition run by the creator of Keras, both o1-preview and o1-mini outperformed OpenAI's own GPT-4o on the public evaluation set.

But o1-preview only managed a tie with Claude 3.5 Sonnet.

As for the coding ability o1 emphasizes, the team behind the open-source pair-programming tool aider ran a benchmark, and the o1 series did not show a clear advantage.

On the code-rewriting task, o1-preview scored 79.7 points and Claude 3.5 Sonnet scored 75.2, putting o1 ahead by 4.5 points.

But on the more practical code-editing task, o1-preview trails Claude 3.5 Sonnet by 2.2 points.

The aider team also pointed out that replacing Claude with the o1 series for programming would cost considerably more.

The team behind Devin, the "AI programmer" that has a partnership with OpenAI, was granted early access to o1.

In their tests, a base version of Devin powered by the o1 series showed significant improvements over GPT-4o.

However, there is still a sizable gap compared with the released production version of Devin, mainly because the production version was trained on proprietary data.

In addition, according to Devin's team, o1 usually backtracks and considers different options before arriving at the correct solution, and is less likely to hallucinate or make confident mistakes.

When using o1-preview, Devin is more likely to correctly diagnose the root cause of a bug rather than just treat its symptoms.

On the LiveBench leaderboard, which weights mathematics and logical reasoning more heavily, o1-preview lags in the coding category but beats Claude 3.5 Sonnet on total score by a clear margin.

The LiveBench team cautioned that these are only preliminary results, because many of the tests have prompts such as "please think step by step" built in, which is not the best way to use o1.
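
For reference, here is a minimal sketch of what prompting o1 without any chain-of-thought scaffolding might look like, assuming the standard openai Python SDK and the o1-preview model name mentioned in the article (at launch, o1 reportedly accepts neither system messages nor custom sampling parameters, so only a plain user message is sent; the task text is made up for illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Plain task statement: no "please think step by step" boilerplate.
# o1 generates its own hidden reasoning tokens before answering.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "How many liters of 30% acid must be mixed with 5 liters of 70% acid to get a 50% solution?"}
    ],
)
print(response.choices[0].message.content)
```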

In the complex-task higher-order reasoning test of SuperCLUE, a comprehensive evaluation benchmark for Chinese large models, o1-preview's reasoning ability also leads by a wide margin.

Finally, let's summarize some things to note when using the o1 model:

  • It is very expensive: 1 million output tokens cost $60, sending prices back to the GPT-3 era overnight

  • Hidden reasoning tokens are also counted as output tokens: you can't see them, but you still pay for them (see the cost sketch after this list)

  • For most tasks, start with GPT-4o and switch to o1 only if it falls short, to keep costs down

  • For coding tasks, Claude 3.5 Sonnet remains the first choice
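
To make the cost point concrete, here is a back-of-the-envelope sketch using only the $60-per-million-output-tokens figure quoted above; the 8,000 hidden reasoning tokens in the example are a made-up illustration (the API reports the actual count in the response's usage details):

```python
# Output price from the article: $60 per 1,000,000 output tokens.
OUTPUT_PRICE_PER_TOKEN = 60 / 1_000_000  # USD

def o1_output_cost(visible_tokens: int, reasoning_tokens: int) -> float:
    """Hidden reasoning tokens are billed as output, so add them to the visible answer."""
    return (visible_tokens + reasoning_tokens) * OUTPUT_PRICE_PER_TOKEN

# Example: a 500-token visible answer backed by 8,000 hidden reasoning tokens
# costs about $0.51 in output alone, 17x the cost of the visible text.
print(f"${o1_output_cost(500, 8_000):.2f}")  # -> $0.51
```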

In short, the developer community still has many questions surrounding OpenAI's new model o1.

o1 has opened up a new paradigm for high-level reasoning in AI, but it is not perfect in itself, and how to maximize its value remains to be explored.

Against this backdrop, the "Ask Me Anything" event OpenAI held received hundreds of questions within four hours.

Attached below is a selection and summary of the entire event.

OpenAI employees "answer all questions"

First, regarding this suddenly released new model, many people were curious why OpenAI gave it a name like o1.

The answer: in OpenAI's view, o1 represents a new level of AI capability, so the "counter" was reset to 1, and the "o" stands for OpenAI.

As Altman said when o1 was released, a model capable of complex reasoning marks the beginning of a new paradigm.

Regarding the preview and mini versions, OpenAI scientists also confirmed some of the netizens’ speculations:

o1-preview is a temporary version, with the full release to follow (the preview is in fact an early checkpoint of o1); the mini version, on the other hand, is not guaranteed to receive updates in the near term.

This becomes even clearer when combined with a chart previously shared by OpenAI researcher Kevin Lu.

Compared to preview, mini performs well on some tasks, especially code-related tasks, and can explore more chains of thought, but has relatively less world knowledge.

On this point, OpenAI scientist Zhao Shengjia explained that mini is a highly specialized model that focuses on only a small set of capabilities, which lets it go deeper on them.

This also clears up the riddles Altman had been posing about the matter earlier.

As for how o1 works, OpenAI scientist Noam Brown also made it clear that it is not a "system" made of a model plus CoT scaffolding, as some netizens believe, but a single model trained to generate chains of thought on its own.

However, the chain of thought produced during reasoning is hidden, and OpenAI has stated plainly that there is no plan to show those tokens to users.

The little OpenAI has revealed on this point is that the CoT-related tokens users do see are a summary, and are not guaranteed to faithfully match the actual reasoning process.

Beyond the reasoning mode itself, the Q&A also revealed that o1 can handle longer context than GPT-4o, and that this will keep increasing in the future.

In terms of performance, in OpenAI's internal tests, o1 demonstrated philosophical reasoning ability and could think about philosophical questions such as "What is life?"

The researchers also used o1 to build a GitHub bot that pings the right code owners for review.

Of course, on non-reasoning tasks such as creative writing, o1's performance is not noticeably better than GPT-4o's, and is sometimes even slightly worse.

In addition, prompted by several questions, OpenAI said it is researching, or plans to research, a number of unreleased features that netizens care about, though without committing to launch dates:

  • Tool use is not supported yet, but function calling and a code interpreter are planned (see the sketch after this list)

  • Future API updates will add structured output, system prompts, and prompt caching

  • Fine-tuning is also planned

  • API users will be able to set their own limits on inference time and token consumption

  • o1 has multimodal capability, with SOTA on datasets such as MMMU as the goal
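
As a rough illustration of the first two bullets, here is what a request might look like if the planned function-calling support ends up mirroring the tools parameter that the existing Chat Completions API already offers for GPT-4o; the get_weather function and its schema are invented for the example, and o1 did not accept this parameter at the time of writing:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool definition in the existing Chat Completions format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # made-up example function
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # a model that supports tools today; swap in o1 once it does
    messages=[{"role": "user", "content": "What's the weather in Beijing right now?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```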

In terms of performance, OpenAI is also working on reducing latency and the time required for inference.

Finally, people, API users especially, care about price: since the reasoning process is billed as output tokens, o1's pricing is on the high side.

OpenAI says it will "follow the trend of a price cut every one to two years", and that batch API pricing will also arrive once usage limits are relaxed.

Plus subscribers on the web and in the app are currently limited to 30 o1-preview and 50 o1-mini messages per week.

The good news: just this morning, because enthusiasm for o1 was so high that many people burned through their quota, OpenAI made an exception and reset it.

So, what questions or expectations do you have for o1? Feel free to discuss in the comments.

Reference links:
[1] https://x.com/SmokeAwayyy/status/1834641370486915417
[2] https://x.com/flowersslop/status/1834416138400276714
[3] https://arcprize.org/blog/openai-o1-results-arc-prize
[4] https://livebench.ai
[5] https://mp.weixin.qq.com/s/XrgkD4T2XwXhGWuPkYtLMw
[6] https://x.com/OpenAIDevs/status/1834608585151594537
[7] https://x.com/btibor91/status/1834686946846597281

-over-
