Aligning multi-image scenarios with DPO! Shanghai AI Laboratory and others propose a new method that needs no manual labeling
Contributed by Liu Ziyu
Quantum Bit | Public Account QbitAI
Multi-image scenarios can also be aligned with the DPO method!
The latest work, MIA-DPO, comes from Shanghai Jiao Tong University, Shanghai AI Laboratory, CUHK, and other institutions.
It is a multi-image augmented preference alignment method for large vision-language models (LVLMs).
By expanding single-image data to multi-image data and designing three data formats: sequence, grid collage, and picture-in-picture, MIA-DPO significantly reduces the cost of data collection and annotation and is highly scalable.
Understanding multi-image context has become one of the development trends of large vision-language models, and many datasets and evaluation benchmarks have been proposed. However, the hallucination problem remains hard to avoid, and introducing multi-image data may weaken performance on single-image tasks.
Although preference alignment methods such as DPO have been proven effective in single-image scenarios, multi-image preference alignment remains a problem to be solved.
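For context, DPO (Direct Preference Optimization) trains directly on preference pairs, a chosen answer and a rejected answer for the same input, without a separate reward model. The standard objective, reproduced here for reference from the original DPO formulation (this is general notation, not MIA-DPO-specific), is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

where $x$ is the prompt, $y_w$ the chosen and $y_l$ the rejected answer, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the sigmoid, and $\beta$ a scaling hyperparameter. MIA-DPO's contribution lies in how the multi-image pairs $(x, y_w, y_l)$ are constructed, not in changing this objective.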
MIA-DPO not only solves this problem, it also requires neither manual annotation nor expensive APIs.
By analyzing how an LVLM's attention is distributed during multi-image processing, the researchers proposed an attention-aware selection method that automatically filters out incorrect answers focused on irrelevant images, and built an automated, low-cost DPO data generation pipeline for multi-image scenarios.
△ Overall introduction and experimental results of MIA-DPO.
It is worth mentioning that the paper also ranked #1 on HuggingFace Daily Papers for its day.
Multi-image reasoning is prone to hallucinations
To study the multi-image reasoning problem of LVLMs from the ground up, the researchers first explored LVLM hallucinations in multi-image scenarios. Earlier studies examined various types of single-image hallucination, such as object hallucination, in which the model describes objects that do not exist in the image. Compared with single-image hallucinations, multi-image scenarios introduce more complex hallucination types. As shown in Figure 2, the researchers divided multi-image hallucinations into two categories:
(1) Sequence Confusion
When faced with multiple images, the model may fail to identify which image the input prompt refers to. For example, in the upper example of Figure 2, the question targets image 1 (people and the sea), but the model's answer is based on image 4 (a train on tracks).
(2) Element Interference
Compared with single-image scenes, the number of visual elements in multi-image scenes increases significantly, causing LVLMs to confuse elements across images. For example, in the lower example of Figure 2, the question "What color is the car in image 2?" should be answered "white". However, the LVLM mistakenly transfers the color of the motorcycle in image 3 to the car in image 2, producing an incorrect answer.
△ Multi-image hallucinations
Detecting hallucinations with attention mechanisms
To build a visual-text alignment method that improves multi-image perception and reasoning while alleviating hallucinations, the researchers proposed using attention values as an indicator for detecting hallucinations.
The attention mechanism reveals where the model "focuses" when making decisions. The researchers observed that it provides important clues for detecting multi-image hallucinations.
Ideally, the attention values should be focused on specific regions of the input image that are relevant to the question. If the attention values are scattered or not strongly focused on the correct visual elements or regions, it indicates that the model has difficulty understanding multi-image sequences or distinguishing elements from different images.
Based on this observation, the researchers designed an attention-aware selection mechanism that uses attention values to select rejected samples containing hallucinations in the DPO algorithm. The framework of MIA-DPO is shown in Figure 3 below.
△ The overall architecture of MIA-DPO
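The selection idea can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the way attention is aggregated over heads and layers, and the 0.5 threshold are all assumptions; only the core idea, comparing the attention mass on the queried image against the other images, follows the description above.

```python
import numpy as np

def attention_ratio(attn_over_tokens, image_token_spans, target_image):
    """Fraction of image-directed attention landing on the target image.

    attn_over_tokens : 1-D array of attention weights over input tokens
                       (e.g. averaged over heads/layers for the answer tokens).
    image_token_spans: list of (start, end) token index ranges, one per image.
    target_image     : index of the image the question refers to.
    """
    per_image = np.array([attn_over_tokens[s:e].sum() for s, e in image_token_spans])
    total = per_image.sum()
    return per_image[target_image] / total if total > 0 else 0.0

def select_rejected(answers, attentions, image_token_spans, target_image,
                    threshold=0.5):
    """Keep answers whose attention is NOT focused on the queried image
    as candidate rejected samples for DPO."""
    rejected = []
    for ans, attn in zip(answers, attentions):
        if attention_ratio(attn, image_token_spans, target_image) < threshold:
            rejected.append(ans)
    return rejected
```

An answer whose attention mostly lands on an irrelevant image is treated as a likely sequence-confusion or element-interference hallucination and paired as the rejected sample; the original ground-truth answer serves as the chosen sample.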
Although the attention-aware selection mechanism is effective for constructing DPO data, it may still include a small number of noisy samples, which can adversely affect the model. To address this, the researchers introduced a post-selection step that filters out noisy samples using three indicators: (1) perplexity (PPL); (2) length ratio; (3) edit distance.
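A minimal sketch of such a post-selection filter, assuming illustrative thresholds (the actual cutoffs are not given here) and using `difflib`'s similarity ratio as a cheap stand-in for a normalized edit distance:

```python
import math
from difflib import SequenceMatcher

def perplexity(token_logprobs):
    """PPL from per-token log-probabilities of an answer."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def length_ratio(chosen, rejected):
    return len(rejected) / max(len(chosen), 1)

def edit_similarity(chosen, rejected):
    # difflib ratio in [0, 1]; higher means more similar strings
    return SequenceMatcher(None, chosen, rejected).ratio()

def keep_pair(chosen, rejected, rej_logprobs,
              max_ppl=100.0, ratio_bounds=(0.5, 2.0), max_sim=0.9):
    """Filter noisy DPO pairs: drop pairs whose rejected answer is wildly
    implausible (high PPL), oddly long/short relative to the chosen answer,
    or nearly identical to the chosen answer."""
    if perplexity(rej_logprobs) > max_ppl:
        return False
    lo, hi = ratio_bounds
    if not (lo <= length_ratio(chosen, rejected) <= hi):
        return False
    if edit_similarity(chosen, rejected) > max_sim:
        return False
    return True
```

The intuition: a rejected answer that is gibberish, far too long or short, or almost indistinguishable from the chosen answer carries little useful preference signal, so such pairs are discarded before DPO training.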
In the process of constructing DPO data, the researchers efficiently converted existing single-image datasets (such as LLaVA-665k) by introducing irrelevant images.
The advantages of this method, such as low cost, scalability, and rich data formats, enable MIA-DPO to more comprehensively alleviate various types of multi-image hallucinations that may be produced by LVLMs.
As shown in the figure below, the researchers constructed multi-image DPO data in three formats:
(1) Sequence data: Multiple images are arranged in sequence, and the question is about a specific image. The number of images ranges from 2 to 5.
(2) Grid collage data: Multiple images are combined into one image, and each image is numbered. The question is about a specific image based on the language description. The number of images ranges from 2 to 9.
(3) Picture-in-picture data: one image is scaled down and overlaid on another, and the question concerns the combined image.
△ Three data types of MIA-DPO
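The three formats can be sketched with plain NumPy arrays standing in for images. This is an illustrative reconstruction, not the authors' pipeline: function names, the grid layout, and the overlay position are assumptions.

```python
import random
import numpy as np

def make_sequence(target_img, distractors, n_total=4):
    """Sequence format: the original (target) image plus randomly sampled
    irrelevant images, shuffled. Returns the image list and the 1-based
    index of the target, which the question text should refer to."""
    imgs = [target_img] + random.sample(distractors, n_total - 1)
    random.shuffle(imgs)
    idx = next(i for i, im in enumerate(imgs) if im is target_img)
    return imgs, idx + 1

def make_grid_collage(images, cols=2):
    """Grid-collage format: tile equally sized HxWx3 uint8 arrays into one
    canvas (white background fills any empty cells)."""
    h, w, _ = images[0].shape
    rows = (len(images) + cols - 1) // cols
    canvas = np.full((rows * h, cols * w, 3), 255, dtype=np.uint8)
    for i, im in enumerate(images):
        r, c = divmod(i, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = im
    return canvas

def make_pic_in_pic(base, inset_small, pos=(10, 10)):
    """Picture-in-picture format: overlay a pre-shrunk inset onto a copy
    of the base image at the given (row, col) offset."""
    out = base.copy()
    y, x = pos
    h, w, _ = inset_small.shape
    out[y:y + h, x:x + w] = inset_small
    return out
```

Because the target image and its original question-answer pair come from an existing single-image dataset, the added distractor images create multi-image context without any new annotation, which is what keeps the pipeline cheap and scalable.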
The researchers tested MIA-DPO on several multi-image and single-image benchmarks.
Experimental results show that MIA-DPO significantly improves the multi-image perception and reasoning capabilities of the classic LLaVA-1.5 model and the stronger InternLM-XComposer2.5. As shown in the figure, LLaVA-1.5 and InternLM-XComposer2.5 achieve average performance improvements of 3% and 4.3%, respectively, across five multi-image benchmarks.
In addition, the researchers ran extensive experiments on multiple single-image benchmarks. The results show that MIA-DPO improves multi-image perception and reasoning while maintaining the model's original single-image understanding.
Finally, let me summarize.
MIA-DPO not only offers a new solution for aligning models with human preferences in multi-image scenarios, but also advances the application of LVLMs to complex multi-image tasks through its low-cost, scalable data generation method. Its success shows that preference optimization against human feedback can improve a model's multi-image perception and reasoning while preserving its original single-image performance, laying a solid foundation for future research.
Paper address:
https://arxiv.org/abs/2410.17637
Project Page:
https://liuziyu77.github.io/MIA-DPO/
Code:
https://github.com/Liuziyu77/MIA-DPO