Easily understands 4K high-definition images! This large multi-modal model automatically analyzes the content of web posters, making life much easier for office workers.
Chen Lin contributed from Aofeisi
Qubit | Official account QbitAI
A large model that can automatically analyze the content of PDFs, web pages, posters, and Excel charts couldn't be more convenient for office workers.
The InternLM-XComposer2-4KHD (abbreviated as IXC2-4KHD) model proposed by Shanghai AI Lab, the Chinese University of Hong Kong and other research institutions makes this a reality.
Compared with other multi-modal large models, whose input resolution is typically capped at around 1500x1500, this work raises the maximum input image size beyond 4K (3840x1600) resolution, and supports arbitrary aspect ratios with dynamic resolutions ranging from 336 pixels up to 4K.
Within three days of its release, the model topped Hugging Face's trending list for visual question answering models.
Easily understand 4K images
Let’s look at the effect first~
The researchers input a screenshot of the first page of a paper (ShareGPT4V: Improving Large Multi-Modal Models with Better Captions), at a resolution of 2550x3300, and asked which model achieved the highest performance on MMBench.
Note that this information does not appear in the text of the screenshot; it only appears in a rather complicated radar chart. Faced with such a tricky question, IXC2-4KHD correctly interpreted the radar chart and answered the question.
Given an even more extreme input resolution (816x5133), IXC2-4KHD easily recognizes that the image consists of 7 parts and accurately describes the textual content of each part.
The researchers then comprehensively evaluated IXC2-4KHD on 16 multi-modal benchmarks, 5 of which (DocVQA, ChartQA, InfographicVQA, TextVQA, OCRBench) focus on high-resolution image understanding.
With only 7B parameters, IXC2-4KHD matched or exceeded GPT-4V and Gemini Pro on 10 of the benchmarks, demonstrating that it is not limited to high-resolution image understanding but is capable across a wide range of tasks and scenarios.
△ The performance of IXC2-4KHD with only 7B parameters is comparable to GPT-4V and Gemini-Pro
How to achieve 4K dynamic resolution?
In order to achieve the goal of 4K dynamic resolution, IXC2-4KHD includes three main designs:
(1) Dynamic resolution training:
△ 4K resolution image processing strategy
In the IXC2-4KHD framework, the input image is randomly enlarged, with its aspect ratio preserved, to an intermediate size between its original size and a maximum area (at most 55 tiles of 336x336, equivalent to roughly 3840x1617 resolution).
The image is then automatically cut into multiple 336x336 tiles, and visual features are extracted from each tile. This dynamic resolution training strategy allows the model to adapt to visual input of any resolution, while also compensating for the scarcity of high-resolution training data.
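The resize-then-tile step can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the function name, the grid-trimming fallback, and the choice to return only the tile-grid shape (rather than actual pixels) are my own assumptions.

```python
import math
import random

TILE = 336       # tile side used by IXC2-4KHD's visual encoder
MAX_TILES = 55   # 55 tiles of 336x336, roughly a 3840x1617 budget

def dynamic_resize_and_tile(width, height, max_tiles=MAX_TILES, rng=random):
    """Pick a random intermediate size (aspect ratio preserved) between the
    original size and the maximum area, then return the 336x336 tile grid the
    resized-and-padded image would be cut into. Shapes only, no pixel ops."""
    # Largest uniform scale whose area still fits the tile budget.
    hi = math.sqrt(max_tiles * TILE * TILE / (width * height))
    lo = min(1.0, hi)               # never force upscaling past the budget
    scale = rng.uniform(lo, hi)     # the random "intermediate size"
    w, h = width * scale, height * scale
    cols = max(1, math.ceil(w / TILE))   # pad each side up to whole tiles
    rows = max(1, math.ceil(h / TILE))
    while rows * cols > max_tiles:       # ceil-padding may overshoot; trim
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return rows, cols
```

For the 2550x3300 paper screenshot mentioned above, this produces a tile grid whose total count stays within the 55-tile budget.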
Experiments show that as the upper limit of the dynamic resolution increases, the model improves steadily on high-resolution image understanding tasks (InfographicVQA, DocVQA, TextVQA), and performance still has not saturated at 4K resolution, suggesting potential for further resolution expansion.
(2) Add tile layout information:
To help the model adapt to widely varying dynamic resolutions, the researchers found that tile layout information is needed as an additional input. They adopted a simple strategy: a special 'newline' ('\n') token is inserted after each row of tiles to inform the model of the tile layout. Experiments show that adding tile layout information has little effect when the resolution range is small (HD9 means the number of tiles does not exceed 9), but brings significant performance gains for dynamic 4K resolution training.
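The newline-token strategy can be illustrated with a small sketch. The `<tile_r_c>` placeholders stand in for the visual-feature embeddings of each tile and are my own hypothetical notation; only the per-row newline token is from the paper's description.

```python
def serialize_tiles(rows, cols):
    """Flatten a rows x cols tile grid into a 1-D token sequence, inserting a
    'newline' token after each row so the language model can recover the
    2-D layout from the flat sequence."""
    NEWLINE = "\n"   # the special layout token
    seq = []
    for r in range(rows):
        # Placeholders for the visual features of each tile in this row.
        seq.extend(f"<tile_{r}_{c}>" for c in range(cols))
        seq.append(NEWLINE)   # end-of-row marker
    return seq

print(serialize_tiles(2, 3))
```

With a 2x3 grid, the sequence contains the three tiles of row 0, a newline token, the three tiles of row 1, and a final newline, so the model can infer the grid width without any extra positional scheme.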
(3) Resolution expansion at inference time:
The researchers also found that a model trained with dynamic resolution can be run at a higher resolution at inference time simply by raising the maximum tile limit, yielding additional performance gains. For example, directly evaluating a model trained with HD9 (at most 9 tiles) under HD16 brings a performance improvement of up to 8% on InfographicVQA.
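A back-of-envelope calculation (my own arithmetic, not a figure from the paper) shows what each HD-N tile limit implies as a pixel-area budget, which is all that changes between training and inference here:

```python
TILE = 336

def hd_budget(n_tiles):
    """Pixel-area budget implied by an 'HD-N' setting: N tiles of 336x336."""
    return n_tiles * TILE * TILE

# Raising the tile limit at inference time simply enlarges this budget.
for n in (9, 16, 55):
    print(f"HD{n}: {hd_budget(n):,} px budget")
```

Moving from HD9 (about 1.0 megapixels, roughly 1008x1008) to HD16 (about 1.8 megapixels) nearly doubles the area the model can see, which is consistent with the reported gains on dense-text tasks like InfographicVQA.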
IXC2-4KHD raises the resolution supported by multi-modal large models to the 4K level. The researchers note that the current strategy of supporting larger inputs by increasing the number of tiles is running into compute and GPU memory bottlenecks, and they plan to propose more efficient strategies for even higher resolution support in the future.
Paper link:
https://arxiv.org/pdf/2404.06512.pdf
Project link:
https://github.com/InternLM/InternLM-XComposer
To submit, please email ai@qbitai.com with [Submission] in the subject line, telling us who you are, where you are from, and what your contribution is, and attach a link to the paper/project homepage along with your contact information. We will try to reply promptly.