The open source version of SearchGPT is here. It can be reproduced with just two RTX 3090s and surpasses the paid version of Perplexity
Contributed by the VSA team
Quantum Bit | Public Account QbitAI
Just a few days after OpenAI launched SearchGPT, the open source version was also released.
The MMLab of CUHK, Shanghai AI Lab, and Tencent have jointly implemented Vision Search Assistant. The model design is simple and can be reproduced with only two RTX 3090s.
Vision Search Assistant (VSA) is based on the Visual Language Model (VLM) and cleverly integrates Web search capabilities into it, allowing the knowledge within the VLM to be updated in real time, making it more flexible and intelligent.
So far, VSA has been tested on general images, with good qualitative and quantitative results. Different categories of images have their own characteristics, however, and more specialized VSA applications could be built for specific image types (such as tables or medical images).
What’s even more exciting is that the potential of VSA is not limited to image processing. There is a broader space to explore, such as video, 3D models, and sound, and we look forward to pushing multimodal research to a new level.
Let VLM handle unseen images and new concepts
The emergence of large language models (LLMs) allows humans to leverage the model’s powerful zero-shot question-answering capabilities to acquire unfamiliar knowledge.
On this basis, techniques such as retrieval-augmented generation (RAG) further improve the performance of LLM in knowledge-intensive, open-domain question answering tasks. However, when VLMs are faced with unseen images and new concepts, they often fail to make good use of the latest multimodal knowledge from the Internet.
Existing Web agents mainly retrieve pages based on the user's question and summarize the HTML text returned by the search, so they have obvious limitations on tasks involving images or other visual content: the visual information is ignored or processed only superficially.
To solve this problem, the team proposed Vision Search Assistant. Built on a VLM, Vision Search Assistant can answer questions about unseen images or new concepts. Its behavior resembles how a human searches the Internet to solve a problem, and includes the following steps (sketched in code after the list):

- Understand the user's query
- Decide which objects in the image are of interest and infer the relevance between objects
- Generate query text for each object
- Analyze the search engine's returns based on the query text and inferred relevance
- Determine whether the visual and textual information obtained is sufficient to generate an answer, or whether the above process should be iterated and refined
- Combine the search results to answer the user's question
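For orientation, here is a minimal Python sketch of this loop. It takes the individual components as callables; all names and signatures are hypothetical placeholders rather than the repository's actual API.

```python
from typing import Callable, List

def vision_search_assistant(
    image,
    question: str,
    detect_regions: Callable,        # open-domain detector (which objects matter)
    describe: Callable,              # VLM: region -> object-level description
    plan_queries: Callable,          # LLM planning agent: per-object query text
    search_and_summarize: Callable,  # searching agent + web search for one query
    is_sufficient: Callable,         # stopping criterion on the gathered knowledge
    answer: Callable,                # VLM collaborative generation
    max_rounds: int = 3,
) -> str:
    """Hypothetical end-to-end loop mirroring the six steps listed above."""
    regions = detect_regions(image, question)
    descriptions = [describe(image, r, question) for r in regions]

    web_knowledge: List[str] = []
    for _ in range(max_rounds):
        queries = plan_queries(question, descriptions, web_knowledge)
        web_knowledge += [search_and_summarize(q) for q in queries]
        if is_sufficient(question, web_knowledge):
            break

    return answer(image, question, descriptions, web_knowledge)
```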
Visual content description
The visual content description module is used to extract object-level descriptions and correlations between objects in the image. Its process is shown in the figure below.
First, an open-domain detection model is used to obtain the image regions of interest. Then, for each detected region, the VLM produces an object-level textual description.
Finally, to express the visual content more comprehensively, the VLM is used to associate the different regions with one another, yielding more precise descriptions of the individual objects.
Specifically, let the user's input image be $I$ and the user's question be $Q$. An open-domain detection model $\mathrm{Det}$ is used to obtain $N$ regions of interest:

$$\{r_i\}_{i=1}^{N} = \mathrm{Det}(I, Q).$$

The pre-trained VLM is then used to describe the visual content of each region separately:

$$d_i = \mathrm{VLM}(r_i), \quad i = 1, \dots, N.$$

To associate the information from different regions and improve the accuracy of the descriptions, the description $d_i$ of region $r_i$ is spliced together with the descriptions $\{d_j\}_{j \neq i}$ of the other regions, and the VLM corrects the description of region $r_i$:

$$\hat{d}_i = \mathrm{VLM}\big(d_i, \{d_j\}_{j \neq i}, Q\big).$$

At this point, accurate descriptions $\{\hat{d}_i\}_{i=1}^{N}$ of the visual regions that are highly relevant to the user's input are obtained.
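The sketch below illustrates this two-step description process, assuming a generic `detector(image, question)` that returns bounding boxes and a `vlm(prompt, image)` callable; both are hypothetical stand-ins for the actual models used.

```python
def describe_regions(image, question, detector, vlm):
    """Sketch of the visual content description stage (hypothetical interfaces)."""
    # Open-domain detection conditioned on the user's question.
    boxes = detector(image, question)            # e.g. [(x1, y1, x2, y2), ...]
    crops = [image.crop(box) for box in boxes]   # PIL-style cropping assumed

    # Step 1: independent object-level descriptions, d_i = VLM(r_i).
    captions = [vlm("Describe this object.", crop) for crop in crops]

    # Step 2: splice each description with the others and let the VLM
    # correct it, i.e. d_i_hat = VLM(d_i, {d_j}_{j != i}, Q).
    refined = []
    for i, caption in enumerate(captions):
        others = "; ".join(c for j, c in enumerate(captions) if j != i)
        prompt = (
            f"User question: {question}\n"
            f"Descriptions of the other objects: {others}\n"
            f"Correct and refine this description: {caption}"
        )
        refined.append(vlm(prompt, crops[i]))
    return refined
```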
Web Knowledge Search: “Search Chain”
The core of Web knowledge search is an iterative algorithm called the "search chain", which aims to obtain comprehensive Web knowledge related to the visual descriptions. The process is shown in the figure below.
In Vision Search Assistant, an LLM is used to generate sub-questions related to the answer; in this role it is called the "Planning Agent". The pages returned by the search engine are analyzed, selected, and summarized by the same LLM, which in that role is called the "Searching Agent". In this way, Web knowledge related to the visual content is obtained.
Specifically, since the search is performed separately on the visual content description of each region, we take one region as an example and omit its index below. The same LLM is used in this module to construct both the planning agent and the searching agent: the planning agent controls the process of the entire search chain, while the searching agent interacts with the search engine to filter and summarize web page information.
Taking the first iteration as an example, the planning agent splits the question into search sub-questions $\{q_j\}$ and hands them to the searching agent. The searching agent submits each sub-question $q_j$ to the search engine and obtains a set of pages $P_j$; it then reads the page summaries and selects the subset $P_j^{\star} \subseteq P_j$ most relevant to the sub-question.
For these selected pages, the searching agent reads their contents in detail and summarizes them into $s_j$.
Finally, the summaries of all the sub-questions are fed back to the planning agent, which condenses them into the Web knowledge $W^{(1)}$ obtained after the first round of iteration.
The above iteration is repeated for at most a fixed number of rounds, or stops as soon as the planning agent judges that the current Web knowledge is sufficient to answer the original question; the search chain then terminates and the final Web knowledge $W$ is obtained.
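A compact sketch of one search chain follows, again with hypothetical helpers: `llm(prompt)` returns text and `web_search(query)` returns a list of (title, snippet, url) results; neither corresponds to the repository's real interface.

```python
def search_chain(question, region_description, llm, web_search,
                 max_rounds=3, top_k=3):
    """Sketch of the iterative planning/searching loop (hypothetical helpers)."""
    knowledge = ""
    for _ in range(max_rounds):
        # Planning agent: split the question into search sub-questions.
        sub_questions = llm(
            f"Question: {question}\nObject: {region_description}\n"
            f"Web knowledge so far: {knowledge}\n"
            "List the web search queries still needed, one per line."
        ).splitlines()

        summaries = []
        for q in sub_questions:
            results = web_search(q)  # [(title, snippet, url), ...]
            # Searching agent: pick the most relevant pages from their summaries.
            listing = "\n".join(f"[{i}] {t}: {s}" for i, (t, s, _) in enumerate(results))
            reply = llm(f"Query: {q}\nCandidate pages:\n{listing}\n"
                        f"Return the indices of the {top_k} most relevant pages.")
            picked = [results[int(tok)] for tok in reply.split()
                      if tok.isdigit() and int(tok) < len(results)]
            # Searching agent: summarize the selected pages
            # (only the retrieved snippets are used in this sketch).
            summaries.append(llm(f"Summarize what these pages say about '{q}':\n{picked}"))

        # Planning agent: condense the sub-question summaries into web knowledge,
        # then decide whether another round is needed.
        knowledge = llm("Combine the following into concise web knowledge for "
                        f"'{question}':\n" + "\n".join(summaries))
        verdict = llm(f"Is this knowledge sufficient to answer '{question}'? "
                      f"Reply SUFFICIENT or CONTINUE.\n{knowledge}")
        if "SUFFICIENT" in verdict:
            break
    return knowledge
```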
Collaborative Generation
Finally, based on the original image $I$, the refined visual descriptions $\{\hat{d}_i\}$, and the Web knowledge $W$, the VLM is used to answer the user's question $Q$. The process is shown in the figure below. Specifically, the final answer is

$$A = \mathrm{VLM}\big(I, \{\hat{d}_i\}_{i=1}^{N}, W, Q\big).$$
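As a final illustration, here is a minimal sketch of this last step, assuming the same hypothetical `vlm(prompt, image)` callable as in the sketches above.

```python
def collaborative_generation(image, question, refined_descriptions, web_knowledge, vlm):
    """Sketch of the final answer step: A = VLM(I, {d_i_hat}, W, Q)."""
    prompt = (
        f"Question: {question}\n"
        f"Object descriptions: {'; '.join(refined_descriptions)}\n"
        f"Web knowledge: {web_knowledge}\n"
        "Answer the question using the image together with the descriptions "
        "and the web knowledge above."
    )
    return vlm(prompt, image)
```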
Experimental Results
Visualization comparison of open set question answering
The figure below compares the open set question answering results for new events (first two rows) and new images (last two rows).
Comparing Vision Search Assistant with Qwen2-VL-72B and InternVL2-76B, it is clear that Vision Search Assistant excels at generating newer, more accurate, and more detailed results.
For example, in the first sample, Vision Search Assistant summarizes the situation of Tesla in 2024, while Qwen2-VL is limited to information from 2023, and InternVL2 explicitly states that it cannot provide real-time information about the company.
Open-Set Question Answering Evaluation
In the open-set question answering evaluation, 10 human experts carried out comparative evaluations on 100 image-text pairs collected from news published between July 15 and September 25, covering novel images and events across a wide range of domains.
The experts rated the answers along three key dimensions: factuality, relevance, and supportability.
As shown in the figure below, Vision Search Assistant performs well in all three dimensions compared to Perplexity.ai Pro and GPT-4-Web.
- Factuality: Vision Search Assistant scored 68%, outperforming Perplexity.ai Pro (14%) and GPT-4-Web (18%). This significant lead shows that Vision Search Assistant consistently provides more accurate, fact-based answers.
- Relevance: Vision Search Assistant achieved a relevance score of 80%, showing a clear advantage in providing highly relevant answers. In contrast, Perplexity.ai Pro and GPT-4-Web reached only 11% and 9%, respectively, a significant gap in retrieving timely, relevant web content.
- Supportability: Vision Search Assistant also outperformed the other models in providing sufficient evidence and justification for its responses, with a supportability score of 63%. Perplexity.ai Pro and GPT-4-Web lagged behind at 19% and 24%, respectively. These results highlight Vision Search Assistant's superior performance on open-set tasks, particularly in providing comprehensive, relevant, and well-supported answers, making it an effective approach for handling novel images and events.
Closed-Set Question Answering Evaluation
The closed-set evaluation is conducted on the LLaVA-W benchmark, which contains 60 questions covering the conversation, detail, and reasoning abilities of a VLM in the wild.
GPT-4o (0806) is used as the evaluator, with LLaVA-1.6-7B as the baseline model. The baseline is evaluated in two modes: a standard mode and a "naive search" mode that uses a simple Google image search component.
In addition, an enhanced version of LLaVA-1.6-7B equipped with a search chain module was evaluated.
As shown in the table below, Vision Search Assistant showed the strongest performance in all categories. Specifically, it scored 73.3% in the Conversation category, a slight improvement of +0.4% over the LLaVA model. In the Details category, Vision Search Assistant stood out with a score of 79.3%, +2.8% higher than the best performing LLaVA variant.
In terms of reasoning, the VSA method outperforms the best performing LLaVA model by +10.8%. This shows that Vision Search Assistant’s advanced integration of visual and textual search greatly enhances its reasoning capabilities.
The overall performance of Vision Search Assistant is 84.9%, which is +6.4% higher than the baseline model. This shows that Vision Search Assistant performs well in both dialogue and reasoning tasks, giving it a clear advantage in question-answering capabilities in the wild.
Paper: https://arxiv.org/abs/2410.21220
Homepage: https://cnzzx.github.io/VSA/
Code: https://github.com/cnzzx/VSA