
There is now a solution to future reasoning that even GPT-4V can't crack! From Huazhong University of Science and Technology & Shanghai University of Science and Technology

Latest update: 2023-12-17
Contributed by Yu En
Qubit | Public account QbitAI

Multimodal large language models demonstrate powerful image understanding and reasoning capabilities.

But it's still very difficult to get them to make predictive inferences about future events based on current observations.

Even the most powerful GPT-4V (as shown in the figure below) cannot solve this problem well.

△ GPT-4V error cases

Now, a team from Huazhong University of Science and Technology and Shanghai University of Science and Technology has proposed a learning paradigm that endows multimodal large language models with forward-looking thinking, and has built the multimodal large language model Merlin on this paradigm.

Merlin is named after the legendary wizard of Arthurian legend, famous for his powerful magic and wisdom. According to legend, Merlin had the ability to foresee the future and a deep understanding of destiny.

Let's see how it is done.

Note: Humans can infer events that are about to happen, or may happen in the near future, from the currently observed state. We call this ability forward-looking thinking.

A simple example:

When you watch an NBA game on TV, you can anticipate the likely next scene based on the state of the different players on the court.

For example, when an offensive player breaks through the defender with the ball, we have reason to judge that this player is about to rush to the basket for a layup or dunk.

For another example, when the ball carrier stops at the three-point line and faces the basket, we have reason to predict that this player is about to make a three-point shot (of course, it may also be a fake move in order to shake off the defender and break through).

The Merlin large model can make such predictions.

Method introduction

To explore how to endow multimodal large language models with forward-looking thinking, we started with an in-depth analysis of how humans predict future events.

We view humans' inferential prediction of future events as a two-stage system.

In the first stage, we observe the current scene, focusing on capturing the dynamic cues of the relevant subjects. In the second stage, the brain analyzes the subjects' behavior patterns and behavioral intentions (such as running) based on the captured dynamic cues, and then infers the events that may occur.

Mapping this onto multimodal large language models, we believe the second stage can already be completed relatively well, thanks to the powerful logical reasoning capabilities of the large language model.

So the problem lies in the first stage: current multimodal large language models cannot reliably capture the dynamic information of the relevant subjects, which limits their ability to reason about future events.

After reaching this conclusion, the next thing to do is to explore how to let the multimodal large language model learn to capture the dynamic cues of the relevant subjects from the current observation.

To achieve this goal, a straightforward solution is to have the multimodal large language model learn to predict all the information in the next frame (that is, to take reconstructing the next frame as the optimization objective).

However, on the one hand this is difficult to learn, and on the other hand images and video sequences contain a great deal of redundant visual information, which is not conducive to the model learning to capture the dynamic information of the corresponding subject.

Based on the above analysis, this paper proposes using the structured representation of "trajectory" as the optimization objective to establish the dynamic association between the past and the future. We believe using trajectories as the optimization target has the following benefits (a toy serialization is sketched after the list):

(1) As a highly structured representation, a trajectory condenses information strongly, helping the model extract the key dynamic information of a subject across continuous actions, which reduces the redundant visual information to be learned and lowers the computational cost.

(2) Trajectories naturally associate the past with the future. By learning to predict a subject's trajectory, the multimodal large language model must learn to attend accurately to the corresponding subject's position in different frames, which greatly enhances the model's alignment ability across multiple images and multiple identities (IDs).
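To make this concrete, here is a toy Python sketch of how a trajectory could be serialized as structured text, with one normalized bounding box per frame. The exact coordinate format is an assumption for illustration, not Merlin's actual format.

def serialize_trajectory(boxes, width, height):
    """boxes: list of (x1, y1, x2, y2) pixel boxes, one per frame."""
    frames = []
    for t, (x1, y1, x2, y2) in enumerate(boxes):
        nx1, ny1, nx2, ny2 = x1 / width, y1 / height, x2 / width, y2 / height
        frames.append(f"frame{t}:[{nx1:.3f},{ny1:.3f},{nx2:.3f},{ny2:.3f}]")
    return ";".join(frames)

# Example: a subject moving rightward across three 1920x1080 frames.
traj = serialize_trajectory(
    [(100, 400, 260, 700), (300, 410, 460, 710), (520, 415, 680, 715)],
    width=1920, height=1080,
)
print(traj)  # frame0:[0.052,0.370,0.135,0.648];frame1:[0.156,...];...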

Based on these advantages, we design a novel learning framework that focuses on extracting, understanding, and predicting subjects' motion trajectories from multimodal inputs (such as images, videos, and text). The details of this framework are as follows:

Inspired by the current mainstream LLM learning paradigm, we also constructed a two-stage learning paradigm, namely Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT).

In FPT, we first feed the model visual context tokens covering several frames; we then provide the initial observation of the relevant subject in the first frame (its initial position, appearance description, or action description), and ask the model to predict the entire trajectory of the corresponding subject based on this initial observation.

By learning to predict the entire trajectory, the model must learn to focus correctly on the corresponding subjects across multiple images and capture their dynamic information.
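For intuition, a hypothetical FPT training sample might look like the following; the frame placeholders, prompt wording, and coordinate format are assumptions for exposition, not the paper's exact template.

# A hypothetical FPT sample: visual context, an initial observation of the
# subject in the first frame, and the full trajectory as the target.
fpt_sample = {
    "visual_context": ["<frame0>", "<frame1>", "<frame2>", "<frame3>"],
    "prompt": (
        "Given the person at [0.052,0.370,0.135,0.648] in frame0 "
        "(a player dribbling the ball), predict the trajectory of this "
        "subject across all frames."
    ),
    "target": (
        "frame0:[0.052,0.370,0.135,0.648];"
        "frame1:[0.156,0.380,0.240,0.657];"
        "frame2:[0.271,0.384,0.354,0.662];"
        "frame3:[0.385,0.389,0.469,0.667]"
    ),
}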

In FIT, relevant user prompts are added to conduct conversations about the subjects involved.

It is worth noting that, to stimulate the model's forward-looking thinking at this stage, we also designed an instruction-interaction form with "trajectory" at its core, which we call the Trajectory Chain-of-Thought (T-CoT).

Specifically, when conversing with the model, we ask it to output the trajectories of the subjects mentioned (as shown in the figure above).

By outputting the entire trajectory, the model is forced to attend to the corresponding subjects across multiple images, providing sufficient dynamic information for subsequent reasoning about future events. For more methodological details, please read the paper.
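For a sense of the interaction, a hypothetical T-CoT dialogue turn could look like this, with the trajectory emitted before the final answer; the wording and format are illustrative, not taken from the paper.

# A hypothetical T-CoT exchange: the model outputs the subject's trajectory
# first, then grounds its future-event prediction on it.
t_cot_dialogue = [
    {"role": "user",
     "content": "What is the player in the white jersey likely to do next?"},
    {"role": "assistant",
     "content": (
         "Trajectory of the player in the white jersey: "
         "frame0:[0.41,0.35,0.50,0.78];frame1:[0.47,0.33,0.56,0.76];"
         "frame2:[0.55,0.31,0.64,0.74]. "
         "He is accelerating past the defender toward the basket, "
         "so he is likely to attempt a layup or a dunk."
     )},
]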

Data construction

After designing the learning paradigm, the next important step is to build appropriate data for the model to learn from. We carefully constructed a set of multi-task learning data based on currently available open-source data. The data distribution is as follows:

△ The data mainly include Caption, Referring, Detection, Tracking, Reasoning, and Dialogue data; * indicates data used only in the instruction fine-tuning (FIT) phase.

Notably, Merlin is the first to use FPT data constructed from tracking data to give the model trajectory perception and prediction capabilities.
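As a rough sketch of the idea (the paper's actual pipeline may differ), MOT-style tracking annotations could be converted into trajectory-prediction samples roughly as follows; the field names and prompt wording are invented here.

def tracks_to_fpt_samples(annotations, num_frames=4):
    """annotations: {track_id: [(frame_idx, x1, y1, x2, y2), ...]}"""
    samples = []
    for boxes in annotations.values():
        boxes = sorted(boxes)[:num_frames]  # sort by frame index
        if len(boxes) < num_frames:
            continue  # need a full observation window
        _, x1, y1, x2, y2 = boxes[0]        # initial observation
        target = ";".join(
            f"frame{t}:[{a},{b},{c},{d}]" for t, a, b, c, d in boxes
        )
        samples.append({
            "prompt": (f"Given the object at [{x1},{y1},{x2},{y2}] in frame0, "
                       f"predict its trajectory across all {num_frames} frames."),
            "target": target,
        })
    return samples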

On the other hand, we also propose the technique of Precise Definition of Task Prompt and Answer Format:

By explicitly telling the large model the specific task and the required output format, we avoid conflicts between multi-task learning objectives and damage to general multimodal capabilities.

Our subsequent experiments also show that this technique lets the large model balance specific multi-task abilities with general multimodal ability.
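A minimal sketch of what such precise task prompts and answer-format definitions might look like; the template names and wording are invented for illustration, not the paper's actual prompts.

# Hypothetical per-task prompt templates with explicit answer formats,
# so that multi-task outputs do not interfere with each other.
TASK_FORMATS = {
    "detection": {
        "prompt": "Detect all {category} in frame0.",
        "answer_format": "[x1,y1,x2,y2];[x1,y1,x2,y2];...",
    },
    "tracking": {
        "prompt": "Track the object at {box} in frame0 across all frames.",
        "answer_format": "frame0:[...];frame1:[...];frame2:[...]",
    },
    "reasoning": {
        "prompt": "Based on the observed frames, {question}",
        "answer_format": "free-form text, optionally preceded by a trajectory",
    },
}

def build_prompt(task, **kwargs):
    spec = TASK_FORMATS[task]
    return spec["prompt"].format(**kwargs) + " Answer format: " + spec["answer_format"]

# Example usage:
# build_prompt("tracking", box="[0.41,0.35,0.50,0.78]")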

Capability demonstration

Combining the above two-stage learning process with the constructed high-quality data, we built Merlin, a new general multimodal large language model.

Merlin supports the input of a single image or a multi-frame image sequence, and can complete a series of tasks including detection, tracking, REC, REG, and more.

At the same time, thanks to the proposed FPT and FIT, Merlin demonstrates powerful trajectory-based future reasoning capabilities. Here we select a few cases to demonstrate Merlin's abilities; for more results, please see our paper and the upcoming open demos.

Experimental analysis

To comprehensively evaluate Merlin's capabilities, we designed a series of performance-comparison and property-exploration experiments. Here we share a few of the more inspiring ones; for more experimental details, please read our paper.

1. Future Reasoning evaluation

Since the field currently lacks a mature benchmark for evaluating the future reasoning of multimodal large language models, this work builds a new Future Reasoning Benchmark based on MMBench.

On this benchmark, Merlin significantly surpassed the existing mainstream multi-modal large models and demonstrated powerful future reasoning capabilities.

2. Trajectory association and prediction evaluation

Since Merlin treats predicting the relevant subject's trajectory from initial observations as a core pre-training objective, we evaluated on the downstream tracking task to assess this learning more comprehensively.

This is because trajectory association is a core sub-task of tracking, and tracking metrics reflect, to some extent, a large model's multi-image and multi-ID alignment abilities.

From the results, Merlin, as a general multimodal large language model, even surpasses some expert models on tracking tasks. It is also worth noting that this is the first time a multimodal large language model has performed tracking-related tasks.
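For reference, the core association idea behind such an evaluation can be illustrated by scoring a predicted trajectory against ground truth with per-frame IoU; real tracking metrics such as MOTA and IDF1 are more involved, so this is only a simplified sketch.

def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def trajectory_iou(pred, gt):
    """Mean per-frame IoU; pred and gt hold one box per frame."""
    return sum(box_iou(p, g) for p, g in zip(pred, gt)) / len(gt)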

3. Hallucination evaluation

Hallucination is an important research topic in the field of large models. Since multimodal large language models introduce the visual modality, the bias caused by failing to accurately align subject descriptions with the corresponding visual information brings even more severe hallucinations.

In this paper, we conducted a hallucination evaluation of Merlin on POPE to assess the model's ability to align images and text, as shown in the table below:

It can be seen that Merlin demonstrates strong anti-hallucination capability, significantly ahead of current mainstream multimodal large language models. This shows that the forward-looking training paradigm we propose can strengthen the model's "image recognition" ability, reducing misrecognition of image content and inconsistencies between images and text.
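For context, POPE poses yes/no questions about object presence (e.g., "Is there a dog in the image?") and reports accuracy, precision, recall, and F1. A minimal scoring sketch:

def pope_scores(predictions, labels):
    """predictions, labels: lists of 'yes'/'no' strings."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    acc = (tp + tn) / len(pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}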

4. Multi-modal comprehensive performance evaluation

Merlin was also evaluated on mainstream benchmarks for comprehensive multimodal capability (including MMBench and MM-Vet) and visual question answering (including GQA and VizWiz).

The evaluation results show that Merlin achieves very competitive performance, demonstrating its strong general capabilities.

5. Visual analysis

To demonstrate Merlin's ability to capture dynamic cues more intuitively, we also conducted an interesting visualization experiment: for a specific dialogue exchange, we visualized the attention map between the word embeddings of the trajectory coordinates output by the model and the visual tokens of the multi-frame images, as shown in the figure below:

We can see that the word embeddings of the coordinates output by the model attend accurately to the corresponding target subject in the corresponding frame.

This visualization further proves that the "trajectory" is an excellent intermediate representation for helping multimodal large language models establish dynamic associations between language descriptions and the corresponding subjects across multi-frame images.

This also explains, from another perspective, why Merlin has strong comprehensive multimodal capabilities and anti-hallucination capabilities.
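The computation behind such a visualization can be sketched as follows: compare the embedding of one predicted coordinate token against each frame's visual tokens and reshape the normalized scores into a spatial heatmap. The shapes and names here are assumptions, not Merlin's internals.

import numpy as np

def coord_token_attention(coord_emb, visual_tokens, grid_hw=(16, 16)):
    """coord_emb: (d,) embedding of one predicted coordinate token.
    visual_tokens: (num_frames, h*w, d) visual token features."""
    h, w = grid_hw
    maps = []
    for frame_tokens in visual_tokens:          # (h*w, d)
        scores = frame_tokens @ coord_emb       # dot-product similarity
        scores = np.exp(scores - scores.max())  # softmax over spatial tokens
        maps.append((scores / scores.sum()).reshape(h, w))
    return maps  # one heatmap per frame

# Example with random features: 4 frames of 16x16 visual tokens, dim 64.
rng = np.random.default_rng(0)
heatmaps = coord_token_attention(rng.normal(size=64),
                                 rng.normal(size=(4, 256, 64)))
print(len(heatmaps), heatmaps[0].shape)  # 4 (16, 16)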

Reflections and summary

Merlin's work shows us the important role of the structured "trajectory" representation in endowing multimodal large language models with forward-looking thinking.

Starting from this point, we can further consider what role bounding boxes and trajectories play in the learning of multimodal large language models:

Are they merely an intermediate form, or can they serve as standalone optimization objectives?

On the other hand, is the existing coordinate encoding reasonable? Is there a representation better suited to natural language?

I think these questions currently have no standard answers and require further in-depth exploration by researchers. Finally, I hope Merlin's work can bring some new thinking and understanding to the multimodal large-model community, and we welcome everyone to keep following our work and to exchange ideas.

Paper:
https://arxiv.org/pdf/2312.00589.pdf

- End -
