
Google boosts GPT-4's planning ability by 70 percentage points, letting it play the Game of 24 and solve crossword puzzles, in new results from the freshly merged lab

Latest update time: 2023-06-05
Cressy, reporting from Aofei Temple
Qubits | Official account QbitAI

With the help of this project, GPT-4 can handle much more complex planning tasks.

The Game of 24, creative writing, and even crossword puzzles all become a breeze.

Previously, GPT-4's success rate on the Game of 24 was only 4%; with the new method it jumps straight to 74%.

The project was jointly created by the newly merged Google DeepMind and Princeton University, and it is one of the first results published under the lab's name since the merger.

They propose a concept called "Tree of Thoughts" (ToT) as an extension of "Chain of Thought" (CoT).

Chain of Thought, also introduced by Google, has gone a long way toward unlocking large language models' (LLMs') ability to solve complex reasoning problems.

Tree of Thoughts now makes up for LLMs' weakness on problems that require lookahead, such as planning.

Solving the Game of 24 and crossword puzzles

The team tested on three tasks: the Game of 24, creative writing, and crossword puzzles.

In their view, these tasks are very hard even for the best current LLMs.

In the tests, Tree of Thoughts prompting was compared against direct prompting and Chain of Thought.

Game of 24

The team sorted 1,362 problems from a Game of 24 problem bank by difficulty.

Difficulty is judged by the time it takes a human to solve the problem.

The 100 moderately difficult problems numbered 901-1000 were then used as test data.

Whether prompted directly or with Chain of Thought, GPT-4's success rate stays below 10%.

With Tree of Thoughts prompting at breadth b=1 the success rate is 45%, and raising b to 5 pushes it to 74%.
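
For readers unfamiliar with the task: in the Game of 24, you are given four numbers and must combine each of them exactly once with +, -, * and / to reach 24. A tiny brute-force solver, unrelated to the paper's code and shown only to make clear what counts as a success, could look like this:

from itertools import combinations

def solve_24(nums, target=24, eps=1e-6):
    # Return an expression over the given numbers that evaluates to `target`, or None.
    items = [(float(n), str(n)) for n in nums]

    def search(items):
        if len(items) == 1:
            val, expr = items[0]
            return expr if abs(val - target) < eps else None
        # pick any two entries, combine them with one operation, then recurse on the rest
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[m] for m in range(len(items)) if m not in (i, j)]
            candidates = [(a + b, f"({ea}+{eb})"), (a * b, f"({ea}*{eb})"),
                          (a - b, f"({ea}-{eb})"), (b - a, f"({eb}-{ea})")]
            if abs(b) > eps:
                candidates.append((a / b, f"({ea}/{eb})"))
            if abs(a) > eps:
                candidates.append((b / a, f"({eb}/{ea})"))
            for val, expr in candidates:
                found = search(rest + [(val, expr)])
                if found:
                    return found
        return None

    return search(items)

print(solve_24([4, 9, 10, 13]))  # prints an expression equal to 24, e.g. ((4-10)*(9-13))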

Creative writing

In the creative writing task, the team randomly generated 100 sentences for testing.

Since there is no single right answer in writing, evaluation for this round was split into two parts.

The first part has a GPT-4 instance with no special tuning rate the generated text on a scale of 1-10.

The other part is a human qualitative comparison of the outputs produced under Chain of Thought versus Tree of Thoughts prompting.

The results show that, without any further refinement, GPT-4 rates the Tree of Thoughts outputs higher than those from direct prompting or Chain of Thought.

Human evaluators likewise judged the Tree of Thoughts outputs to be better than the Chain of Thought ones.

Crossword puzzles

For crossword puzzles, the team collected 156 5×5 mini-crossword games and took every fifth game numbered from 1 to 96 (20 games in total) as test data.

Another 5 games were used to construct the prompts.

The performance of each prompting method was then recorded at three levels: letters, words, and whole games.

At every one of these levels, Tree of Thoughts prompting performs best, with a letter-level success rate as high as 78%.

At the whole-game level, Tree of Thoughts succeeds only 20% of the time, but even that beats the other methods, which never solved a complete puzzle.

Turning the chain into a tree, with backtracking whenever needed

Compared with Chain of Thought, Tree of Thoughts adds a thought-decomposition step at the very beginning.

Depending on the task, a "thought" might be an intermediate equation, a writing plan, or a handful of candidate words.

The granularity of the decomposition should be moderate: small enough that the LLM can generate diverse samples, yet large enough that the LLM can evaluate how promising each one is.
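
In the Game of 24, for instance, each thought is one intermediate equation, so a complete solution is a path of three thoughts through the tree. A hypothetical illustration (not taken from the paper's prompts):

# One "thought" = one intermediate equation; a full solution is a 3-step path.
path = [
    "13 - 9 = 4   (numbers left: 4, 4, 10)",  # thought 1
    "10 - 4 = 6   (numbers left: 4, 6)",      # thought 2
    "6 * 4 = 24   (numbers left: 24)",        # thought 3
]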

The second step is thought generation, which supplies the candidate thoughts for the LLM's subsequent reasoning.

This step can run in one of two modes: sampling and proposing.

The former suits situations where the thought space is relatively rich, while the latter works better when the thought space is more constrained.
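
As a rough sketch of the difference, assuming a generic llm(prompt, n) helper that returns n completions (the helper and function names here are hypothetical, not the repo's API):

# Hypothetical helper: llm(prompt, n) returns a list of n text completions.

def generate_thoughts_sample(llm, state, k):
    # Sampling mode: draw k independent thoughts from the same prompt.
    # Works well when the thought space is rich (e.g. a paragraph of prose).
    prompt = state + "\nWrite a plan for what to do next:"
    return llm(prompt, n=k)

def generate_thoughts_propose(llm, state, k):
    # Proposing mode: ask for several distinct candidates in a single completion.
    # Avoids duplicates when the thought space is narrow (e.g. one line of an equation).
    prompt = state + f"\nPropose {k} different possible next steps, one per line:"
    return llm(prompt, n=1)[0].splitlines()[:k]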

The third step is state evaluation, which assesses how promising each partial solution is and tells the search algorithm what to process next, and in what order.

Based on these evaluations, the search can look ahead or backtrack as needed, which raises the success rate on the task.

Unlike hand-programmed heuristics or learned value functions, the team has the LLM itself perform the evaluation, which is both efficient and flexible.

As with the writing evaluation, this step can either score states independently (value) or vote across them (vote), except that here the voting is done by the LLM rather than by humans.
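
A hedged sketch of the two evaluation modes, reusing the same hypothetical llm(prompt, n) helper as above:

import re
from collections import Counter

def evaluate_value(llm, state):
    # Value mode: score each partial solution independently, e.g. on a 1-10 scale.
    prompt = state + "\nOn a scale of 1-10, how promising is this partial solution? Answer with a single number."
    reply = llm(prompt, n=1)[0]
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0

def evaluate_vote(llm, states):
    # Vote mode: show all candidates at once and ask the LLM to pick the best, several times.
    listing = "\n".join(f"{i}: {s}" for i, s in enumerate(states))
    prompt = "Candidates:\n" + listing + "\nWhich candidate is most promising? Answer with its index."
    votes = []
    for reply in llm(prompt, n=5):
        match = re.search(r"\d+", reply)
        if match and int(match.group()) < len(states):
            votes.append(int(match.group()))
    return Counter(votes).most_common(1)[0][0] if votes else 0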

The final step is to run a search algorithm within the Tree of Thoughts framework; different algorithms fit different tree structures.

The team mainly studied two search algorithms: breadth-first search (BFS) and depth-first search (DFS).
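
Putting the pieces together, a minimal breadth-first loop over thoughts (again only a sketch built on the hypothetical helpers defined above, not the paper's implementation) might look like this:

def tot_bfs(llm, problem, steps=3, k=5, breadth=5):
    # Breadth-first Tree of Thoughts: keep only the `breadth` most promising
    # partial solutions at each depth of the tree.
    frontier = [problem]
    for _ in range(steps):
        candidates = []
        for state in frontier:
            for thought in generate_thoughts_propose(llm, state, k):
                candidates.append(state + "\n" + thought)
        # rank the expanded states with the LLM-based value function, prune the rest
        candidates.sort(key=lambda s: evaluate_value(llm, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]  # the highest-ranked solution path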

How to try it yourself

First, you need an API key for OpenAI (or another LLM provider).

Of course, you can also call your own model.

Once that is ready, clone the GitHub project locally:

git clone https://github.com/kyegomez/tree-of-thoughts

Then open the directory:

cd tree-of-thoughts

Then install the OpenAI (or other provider's) Python package:

pip install openai

Then create a Python script with the following content:

from tree_of_thoughts import OpenAILanguageModel, CustomLanguageModel, TreeofThoughts, OptimizedOpenAILanguageModel, OptimizedTreeofThoughts

#v1
model = OpenAILanguageModel('api key')

#v2 parallel execution, caching, adaptive temperature
model = OptimizedOpenAILanguageModel('api key')

#choose search algorithm('BFS' or 'DFS')
search_algorithm = "BFS"

#cot or propose
strategy="cot"

# value or vote
evaluation_strategy = "value"

#create an instance of the tree of thoughts class v1
tree_of_thoughts = TreeofThoughts(model, search_algorithm)

#or v2 -> dynamic beam width: adjust the beam width [b] dynamically based on search depth and the quality of generated thoughts
tree_of_thoughts = OptimizedTreeofThoughts(model, search_algorithm)

input_problem = "What are next generation reasoning methods for Large Language Models"
k = 5       # number of candidate thoughts generated per step
T = 3       # number of steps (depth of the tree)
b = 5       # breadth: how many states are kept at each level
vth = 0.5   # value threshold used to prune low-scoring states

#call the solve method with the input problem and other params
solution = tree_of_thoughts.solve(input_problem, k, T, b, vth)

#use the solution in your production environment
print(solution)

You can also integrate your own models:

# AbstractLanguageModel is the base class exposed by the tree_of_thoughts package
from tree_of_thoughts import AbstractLanguageModel

class CustomLanguageModel(AbstractLanguageModel):
    def __init__(self, model):
        self.model = model

    def generate_thoughts(self, state, k):
        # implement thought generation with self.model:
        # return k candidate thoughts for the given state
        pass

    def evaluate_states(self, states):
        # implement state evaluation with self.model:
        # return a score for each candidate state
        pass
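
Once the two methods are filled in, the custom model should slot into the same solver as before. A rough usage sketch, where my_model is a placeholder for your own model handle and the parameters from the earlier script are assumed to still be defined:

custom_model = CustomLanguageModel(my_model)
tree_of_thoughts = TreeofThoughts(custom_model, "BFS")
solution = tree_of_thoughts.solve(input_problem, k, T, b, vth)
print(solution)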

Then just run the script.

Paper address:
https://arxiv.org/abs/2305.10601
GitHub page:
https://github.com/ysymyth/tree-of-thought-llm

-over-