Google releases a new visual language model that achieves zero-shot learning and can directly handle multiple types of tasks
Xingkun from Aofei Temple
Quantum Bit Report | Public Account QbitAI
Google has launched a new weakly supervised visual language model, SimVLM, which can easily achieve zero-shot task transfer.
From describing images in words to answering questions about them, the model can do everything without any fine-tuning.
For typical visual language pre-training (VLP) models, the training dataset must contain a large number of accurate labels, and transferring the model to a new task requires re-labeling a dataset for that specific task.
In short, labeling datasets is not only time-consuming and labor-intensive, but the labels also do not carry over to other tasks.
Can we develop a VLP model that is both simple and versatile?
Google's newly developed model uses weakly supervised learning for training: by modeling a large number of weakly aligned image-text pairs, it simplifies the VLP training pipeline and greatly reduces training complexity.
SimVLM is trained end-to-end with a single prefix language modeling objective and takes raw images directly as input. These choices let the model exploit large-scale weakly labeled datasets and thus generalize better in zero-shot settings.
How is the SimVLM model implemented?
The pre-training process of the SimVLM model adopts a single objective of prefix language modeling (PrefixLM), accepting the prefix of the sequence as input and predicting its continuation through the model decoder.
For image-text pairs in the dataset, the image sequence can be regarded as the prefix of its text description.
This approach can simplify the training process and maximize the flexibility and generality of the model in adapting to different task settings.
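To make the objective concrete, below is a minimal sketch of a PrefixLM-style loss in PyTorch. It is an illustration rather than the paper's implementation: `model`, the tensor shapes, and the assumption that the decoder returns per-position vocabulary logits are hypothetical placeholders, and the bidirectional attention SimVLM applies within the prefix is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(model, patch_embeds, text_ids, text_embeds):
    """Sketch of a PrefixLM loss (illustrative, not SimVLM's actual code).

    patch_embeds: [B, P, D] image-patch embeddings used as the prefix (no loss here)
    text_ids:     [B, T]    token ids of the text continuation (targets)
    text_embeds:  [B, T, D] embeddings of those tokens (model input)
    """
    # Treat the image as the prefix of its text description: concatenate both
    # into a single sequence and let the decoder predict the next token.
    inputs = torch.cat([patch_embeds, text_embeds], dim=1)   # [B, P+T, D]
    logits = model(inputs)                                    # [B, P+T, V]

    # The loss is computed only on the text continuation: position P+t-1
    # predicts text token t, so shift the logits by one.
    P = patch_embeds.size(1)
    pred = logits[:, P - 1:-1, :]                             # [B, T, V]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1))
```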
The backbone network of the model uses the Transformer architecture, which performs well in both language and vision tasks.
To extract contextualized patches from the raw image input, the model uses a ResNet convolutional network.
As shown in the figure above, on the visual side the image is divided into multiple patches and flattened into a one-dimensional sequence, while on the text side the sentence is mapped into a sequence of representation vectors.
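As a rough illustration of this input pipeline, the sketch below uses a small convolutional stem standing in for the ResNet blocks to turn an image into a 1-D patch sequence, and an embedding table for the text side. The layer sizes, vocabulary size, and module names are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SimpleInputTokenizer(nn.Module):
    """Turns (image, text ids) into two embedding sequences of width d_model."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # Small conv stem standing in for the first ResNet blocks (sizes are illustrative).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, image, text_ids):
        feat = self.stem(image)                    # [B, D, H/4, W/4] feature map
        patches = feat.flatten(2).transpose(1, 2)  # flatten the grid -> [B, P, D] patch sequence
        tokens = self.text_embed(text_ids)         # [B, T, D] text representation vectors
        return patches, tokens

# Example: a batch of two 224x224 images and 16-token sentences.
tok = SimpleInputTokenizer()
patches, tokens = tok(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))
```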
To achieve better zero-shot generalization, the model is trained on the ALIGN dataset of roughly 1.8B noisy image-text pairs.
To compensate for the noise in that training set, training also uses the Colossal Clean Crawled Corpus (C4), about 800GB of text data.
What is the basic performance of the SimVLM model?
After the model is pre-trained, it is necessary to fine-tune the model on multimodal tasks to test its performance.
The multimodal tasks used here are: VQA, NLVR2, SNLI-VE, COCO Caption, NoCaps, and Multi30K En-De.
△ Performance metrics: BLEU-4 (B@4), METEOR (M), CIDEr (C), SPICE (S)
SimVLM is compared against existing models; the test results are shown in the table above. Three SimVLM sizes are evaluated: 86 million, 307 million, and 632 million parameters.
Across the cross-modal tasks, SimVLM performed best (and larger models did better). Except for the B@4 metric on COCO Caption, it set new SOTA results on every task, which fully demonstrates the strength of the model.
Zero-shot generalization of the SimVLM model
SimVLM performs well on these cross-modal tasks, but can it also succeed at zero-shot cross-modal transfer?
The pre-trained SimVLM models are fine-tuned only on text data, or not fine-tuned at all, and then tested on tasks such as image captioning, multilingual captioning, open-ended VQA, and visual text generation.
The test results are shown in the figure below:
Given an image and a textual prompt, the pre-trained model can predict the content of the image without fine-tuning.
In addition, without any fine-tuning the model also performs well in applications such as German caption generation, answer generation beyond the training dataset, text description grounded in image content, and open-ended visual question answering.
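For a sense of how such zero-shot prompting could work mechanically, here is a minimal greedy-decoding sketch: the image patches plus an embedded text prompt form the prefix, and the decoder extends it token by token. `model`, `embed`, the prompt ids, and `eos_id` are hypothetical placeholders, not SimVLM's actual interface.

```python
import torch

@torch.no_grad()
def zero_shot_caption(model, embed, patch_embeds, prompt_ids, eos_id, max_len=30):
    """Greedy decoding from an image prefix plus a textual prompt.

    prompt_ids is assumed to be a plain list of int token ids, e.g. for "A picture of".
    """
    generated = list(prompt_ids)                  # start from the text prompt
    for _ in range(max_len):
        text_embeds = embed(torch.tensor([generated]))           # [1, T, D]
        inputs = torch.cat([patch_embeds, text_embeds], dim=1)   # image + text prefix
        logits = model(inputs)                                    # [1, P+T, V]
        next_id = int(logits[0, -1].argmax())                     # pick the most likely token
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated  # token ids of the prompt plus the predicted continuation
```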
To quantify SimVLM's zero-shot performance, the frozen pre-trained model is used to decode on COCO Caption and NoCaps, and the results are compared with supervised baselines (Sup.).
From the comparison of results, we can see that even without supervised fine-tuning, SimVLM can achieve the quality level of supervised training.
About the author
The first author of this study is Google student researcher Wang Zirui, who is currently studying at Carnegie Mellon University. He has published many papers as the first author at top conferences such as ICLR, EMNLP, and CVPR.
By December 20, 2020, he had achieved the first SOTA result on the SuperGLUE benchmark to exceed the human score (above 90); that result has since been overtaken by the Baidu team and currently ranks second.
The SimVLM developed here likewise achieved single-model SOTA performance on 6 visual language benchmarks and demonstrated text-guided zero-shot generalization.
Reference links:
https://arxiv.org/abs/2108.10904
https://ai.googleblog.com/2021/10/simvlm-simple-visual-language-model-pre.html
http://www.cs.cmu.edu/~ziruiw/
- End -
This article is original content from [QbitAI], a signed account of NetEase News • NetEase's special content incentive plan. Reproduction without the account's authorization is prohibited.