
Microsoft has written a manual for GPT-4V: 166 pages of thorough, detailed explanations, complete with prompt and demo examples | Download attached

Last updated: 2023-10-05
Crexi and Xiaoxiao, from Aofei Temple
Qubit | Official Account QbitAI

A 166-page "instruction manual" for GPT-4V, the king of multimodal models, has been released, and it comes from a Microsoft team.

What kind of paper takes 166 pages to write?

It not only evaluates GPT-4V's performance on ten major tasks in detail, demonstrating everything from basic image recognition to complex logical reasoning;

it also teaches a complete set of prompting techniques for large multimodal models,

walking you step by step through writing prompts from 0 to 1, with expert-level answers that are easy to follow at a glance. It really lowers the barrier to using GPT-4V to practically nothing.

It is worth mentioning that the paper comes from an all-Chinese team: all seven authors are Chinese, and the lead is a female Principal Research Manager who has worked at Microsoft for 17 years.

Before releasing this 166-page report, they also took part in the research on OpenAI's latest DALL·E 3, so they know this field deeply.

Compared with OpenAI's own 18-page GPT-4V paper, this 166-page "user guide" was immediately regarded as a must-read for GPT-4V users:

Some netizens lamented: This is not a paper, it is almost a 166-page book.

Some netizens were already panicking after reading:

Don't just look at the details of GPT-4V's answers. I'm genuinely scared of the latent capabilities AI is showing.

So, what exactly does Microsoft's "paper" talk about, and what "potential" does it show about GPT-4V?

What does Microsoft's 166-page report say?

The method this paper uses to study GPT-4V boils down to one word: "try".

Microsoft researchers designed a series of inputs covering multiple domains, fed them to GPT-4V, and observed and recorded GPT-4V's output.
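The report probes GPT-4V qualitatively in exactly this feed-and-record fashion. As a loose sketch of what such an image-plus-text probe could look like against the public OpenAI Chat Completions API (the "gpt-4o" model name, the file names, and the questions below are illustrative assumptions, not the report's actual setup):

```python
# Minimal sketch of a "try it and record it" probing loop, assuming access
# to a vision-capable model through OpenAI's Chat Completions API.
# The model name "gpt-4o" and the file names are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probe(image_path: str, question: str) -> str:
    """Send one image plus one text prompt and return the model's reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Record outputs for a small battery of inputs, as the report does qualitatively.
for path, q in [("ct_scan.jpg", "Describe any abnormal findings."),
                ("meme.jpg", "Explain why this image is funny.")]:
    print(path, "->", probe(path, q))
```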

They then evaluated GPT-4V's ability to complete various tasks and also proposed new prompting techniques for using it, covering four major aspects:

1. How to use GPT-4V:

Five supported input types: images, sub-images, texts, scene texts, and visual pointers.

Three supported capabilities: instruction following, chain-of-thought, and in-context few-shot learning.

For example, here is the instruction-following ability GPT-4V demonstrates after the question is rephrased along chain-of-thought lines:

2. Performance of GPT-4V in 10 major tasks:

Open-world visual understanding, visual description, multimodal knowledge, commonsense, scene text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, and emotion understanding.

Among them are "image reasoning questions" of the kind that take a bit of IQ to solve:

3. Prompting techniques for large multimodal models like GPT-4V:

A new multimodal prompting technique, "visual referring prompting", is proposed: the region or task of interest is indicated by editing the input image directly, and it can be combined with other prompting techniques.
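As a loose illustration of the idea (not the report's exact procedure), the pointer can be burned into the pixels before the image is sent. Here Pillow draws a box; the file names and coordinates are made-up examples:

```python
# Rough illustration of "visual referring prompting": instead of describing
# the region of interest in words, draw the pointer onto the image itself
# and send the edited picture together with the question.
from PIL import Image, ImageDraw

def add_visual_pointer(src: str, dst: str, box: tuple[int, int, int, int]) -> None:
    """Draw a red rectangle around the region the question refers to."""
    img = Image.open(src).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline="red", width=5)
    img.save(dst)

# Hypothetical usage: box the gauge on a dashboard photo, then ask
# "What is the reading inside the red box?" with the edited image attached.
add_visual_pointer("dashboard.jpg", "dashboard_marked.jpg", (120, 80, 360, 240))
```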

4. Research directions and application potential of large multimodal models:

Two kinds of areas that multimodal-learning researchers should focus on are identified: applications (potential usage scenarios) and research directions.

For example, here is one possible application scenario the researchers identified for GPT-4V, fault detection:

But whether it's the new prompting techniques or the application scenarios, what everyone cares about most is GPT-4V's actual strength.

So this "instruction manual" goes on to spend more than 150 pages on demos, detailing GPT-4V's capabilities across all kinds of questions.

Let’s take a look at how far GPT-4V’s multi-modal capabilities have evolved today.

Masters images from specialist fields, and can pick up new knowledge on the spot

Image Identification

The most basic identification is of course a piece of cake, whether it's celebrities from technology, sports, or entertainment:

It can not only tell who these people are but also interpret what they are doing. For example, in the picture below, Jensen Huang is introducing Nvidia's new graphics-card products.

In addition to people, landmarks are also a piece of cake for GPT-4V. It can not only determine the name and location, but also give detailed introductions.

Left: Times Square in New York, right: Kinkakuji Temple in Kyoto

However, the more famous people and places are, the easier it is to judge, so more difficult pictures are needed to show the capabilities of GPT-4V.

For example, in medical imaging, for the following lung CT, GPT-4V gave this conclusion:

Consolidation and ground-glass opacities were present in multiple areas of both lungs, and there may be infection or inflammation in the lungs. There may also be a mass or nodule in the upper lobe of the right lung.

Even without being told what kind of scan it is or which body part it shows, GPT-4V can work it out on its own.

GPT-4V successfully identified the image below as a magnetic resonance imaging (MRI) scan of the brain.

It also found a large amount of fluid accumulation, which it judged to be a high-grade glioma.

A professional review confirmed that GPT-4V's conclusion was entirely correct.

Besides this "serious" content, GPT-4V has also mastered that "intangible cultural heritage" of contemporary society: memes.


Machine translation, for reference only

Not only can it interpret the jokes in memes; GPT-4V can also read the emotions conveyed by real-world human expressions.


In addition to these real images, text recognition is also an important task in machine vision.

Here, GPT-4V can recognize not only languages written in the Latin alphabet but also other languages such as Chinese, Japanese, and Greek.

Even handwritten mathematical formulas:

Image reasoning

The demos shown above, however specialized or hard to parse, still fall within the scope of recognition, and that is only the tip of the iceberg of GPT-4V's skills.

In addition to understanding the content in the picture, GPT-4V also has certain reasoning capabilities.

To put it simply, GPT-4V can find the differences between two images (although it still makes some errors).

In the following pair of pictures, GPT-4V spotted the differences in the crown and the bow.

Raise the difficulty, and GPT-4V can also solve the figure puzzles found in IQ tests.



The patterns or logical relationships in the three questions above are relatively simple, but things get harder next:

Of course, the difficulty is not in the figures themselves. Note the fourth line of text in the image: the arrangement of figures in the original question is not the one shown in the picture.

Image annotation

In addition to answering various questions with text, GPT-4V can also perform a series of operations on images.

For example, given a group photo of four AI heavyweights, we ask GPT-4V to box the people and label each with a name and a short introduction.

GPT-4V first answered these questions in text, and then produced the annotated image.
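As a rough sketch of how such an annotation could be rendered locally, assuming the model is prompted to return boxes as JSON in normalized coordinates (an assumed prompt format, not the report's exact protocol):

```python
# Turn a text-only answer into an annotated image: the model is asked to
# return named boxes as JSON in normalized [x0, y0, x1, y1] coordinates,
# which are then drawn onto the photo with Pillow.
import json
from PIL import Image, ImageDraw

def draw_named_boxes(image_path: str, model_json: str, out_path: str) -> None:
    """model_json example: '[{"name": "Yann LeCun", "box": [0.05, 0.2, 0.3, 0.9]}]'"""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for item in json.loads(model_json):
        x0, y0, x1, y1 = item["box"]
        pixel_box = (x0 * w, y0 * h, x1 * w, y1 * h)
        draw.rectangle(pixel_box, outline="red", width=4)
        draw.text((pixel_box[0], pixel_box[1] - 14), item["name"], fill="red")
    img.save(out_path)
```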

Dynamic content analysis

Beyond static content, GPT-4V can also handle dynamic analysis, although the model is not fed a video directly.

The five pictures below are taken from a tutorial video on making sushi. GPT-4V's task is to guess the order in which these pictures appear (based on understanding the content) .

The same series of pictures can be read in different ways, which is why GPT-4V also relies on the text prompt when making its judgment.

For example, in the following set of pictures, whether the person's action is to open the door or close the door will lead to completely opposite sorting results.
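A loose sketch of such a frame-ordering probe, assuming the public Chat Completions API; the frame file names, the hint text, and the "gpt-4o" model name are illustrative assumptions rather than the report's setup:

```python
# Send several shuffled frames in one message and ask for the most plausible
# temporal order, optionally with a text hint that disambiguates the intent
# (e.g. opening vs. closing the door).
import base64
from openai import OpenAI

client = OpenAI()

def order_frames(frame_paths: list[str], hint: str = "") -> str:
    content = [{"type": "text",
                "text": f"These frames are shuffled. {hint} "
                        "List the most plausible temporal order by frame index."}]
    for p in frame_paths:
        with open(p, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(order_frames(["frame_3.jpg", "frame_1.jpg", "frame_2.jpg"],
                   hint="The person is closing the door."))
```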

Of course, from how a person's state changes across several pictures, it can also infer what they are doing.

Or even predict what will happen next:

"On-site learning"

GPT-4V not only has strong visual skills; crucially, it can learn something new and put it to use right away.

For example, when GPT-4V was asked to read a car dashboard, the answer it gave at first was wrong:

The researchers then described the reading method to GPT-4V in text, but the answer was still wrong:

Next they showed GPT-4V a worked example; the answer came closer, but unfortunately the numbers were made up.

One example is admittedly rather few, but as the number of examples grows (by just one more, in fact), the effort pays off and GPT-4V gives the correct answer.
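The pattern at work here is in-context few-shot prompting: labeled example images precede the query image in the same conversation. A minimal sketch, assuming the public Chat Completions API; the file names, example labels, and "gpt-4o" model name are illustrative assumptions:

```python
# Few-shot pattern behind the dashboard demo: a couple of labeled example
# images come before the query image, so the model can imitate the reading
# procedure shown in the examples.
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

question = "What speed does the speedometer show? Read the needle carefully."
messages = []
for example_path, answer in [("dash_example1.jpg", "The needle points to 40 km/h."),
                             ("dash_example2.jpg", "The needle points to 110 km/h.")]:
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": question},
                                 image_part(example_path)]})
    messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user",
                 "content": [{"type": "text", "text": question},
                             image_part("dash_query.jpg")]})

resp = client.chat.completions.create(model="gpt-4o", messages=messages)
print(resp.choices[0].message.content)
```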

That is only a sample of GPT-4V's capabilities; it of course covers many more fields and tasks than can be shown one by one here. If you are interested, read the original report.

So, what kind of team is behind this report on GPT-4V's near-magical capabilities?

Led by a Tsinghua alumna

There are a total of 7 authors of this paper, all of whom are Chinese, 6 of whom are core authors.

The project lead, Lijuan Wang, is a Principal Research Manager in Microsoft's Cloud & AI group.

She graduated from Huazhong University of Science and Technology and received her PhD from Tsinghua University in China. She joined Microsoft Research Asia in 2006 and Microsoft Research in Redmond in 2016.

Her research field is deep learning and machine learning for multimodal perceptual intelligence, specifically including vision-language model pre-training, image captioning, object detection, and other AI technologies.

Original address:
https://arxiv.org/abs/2309.17421

- End -

"AIGC+Vertical Field Community"

Recruiting!

Partners who follow AIGC are welcome to join the AIGC+ vertical community and learn, explore and innovate AIGC together!

Please note the vertical field "education" or "advertising marketing" you want to join. To join the AIGC talent community, please note "talent" & "name-company-position".


Click here ???? Follow me and remember to star~

Three consecutive clicks of "Share", "Like" and "Watching"

Advances in cutting-edge science and technology are seen every day ~

