Learning the physical world from 2 billion data points: a Transformer-based universal world model takes on video generation

Yun Zhong, reporting from Aofeisi
Qubit | WeChat official account QbitAI

Building a world model that can make videos can also be achieved through Transformer!

Researchers from Tsinghua University and GigaAI have jointly released WorldDreamer, a new universal world model for video generation.

It can handle a variety of video generation tasks in both natural and autonomous driving scenes, such as text-to-video, image-to-video, video editing, and action-to-video generation.

According to the team, WorldDreamer is the industry's first world model for general scenes built by predicting tokens.

It converts video generation into a sequence prediction task, which can fully learn the changes and motion patterns of the physical world.

Visualization experiments show that WorldDreamer has a deep understanding of the dynamic changes of the general world.

So, what video tasks can it complete, and how effective is it?

Supports a variety of video tasks

Image to Video

WorldDreamer can predict future frames based on a single image.

With only the first image as input, WorldDreamer treats the remaining video frames as masked visual tokens and predicts those tokens.
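
To make the masking scheme concrete, here is a minimal PyTorch sketch; the grid sizes and the MASK_ID value are made-up placeholders, and the actual tokenizer and token layout are those defined in the paper, not reproduced here.

```python
import torch

# Illustrative sizes and a hypothetical [MASK] id, not values from the paper.
T, H, W = 16, 16, 16      # frames, and tokens per frame after visual tokenization
MASK_ID = 8192            # assumed id reserved for the [MASK] token

def make_i2v_tokens(first_frame_tokens: torch.LongTensor) -> torch.LongTensor:
    """Image-to-video input: keep the tokens of the observed first frame and
    mark every token of the remaining T-1 frames as masked, so the model only
    has to predict the masked positions."""
    tokens = torch.full((T, H, W), MASK_ID, dtype=torch.long)
    tokens[0] = first_frame_tokens        # frame 0 is fully observed
    return tokens

# Text-to-video follows the same recipe with *all* frames masked,
# leaving the text features as the only condition:
t2v_tokens = torch.full((T, H, W), MASK_ID, dtype=torch.long)
```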

As shown in the figure below, WorldDreamer has the ability to generate high-quality movie-level videos.

The resulting video exhibits seamless frame-by-frame motion, similar to the fluid camera movements seen in real movies.

Moreover, these videos strictly adhere to the constraints of the original image, ensuring remarkable consistency in frame composition.

Text to Video

WorldDreamer can also generate videos based on text.

Given only a text input, WorldDreamer treats all video frames as masked visual tokens and predicts them.

The image below demonstrates WorldDreamer's ability to generate video from text under various style paradigms.

The resulting video adapts seamlessly to the input language, where user input can shape video content, style, and camera movement.

Video Inpainting

WorldDreamer can further implement video inpainting tasks.

Specifically, given a video, the user specifies a masked region, and the content of that region is then regenerated according to the language input.
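
Conceptually, this amounts to masking only the tokens that fall inside the user-specified region and letting the text condition decide what is predicted there. A small sketch, again with an assumed token grid and a hypothetical MASK_ID:

```python
import torch

MASK_ID = 8192   # hypothetical reserved [MASK] id

def mask_edit_region(video_tokens: torch.LongTensor,
                     region: torch.BoolTensor) -> torch.LongTensor:
    """video_tokens: (T, H, W) discrete tokens of the input clip.
    region: (H, W) boolean map of the area to edit (e.g. around the jellyfish).
    Tokens inside the region are masked in every frame; the rest of the clip
    is kept and constrains the prediction, while the text prompt steers what
    gets generated inside the region."""
    edited = video_tokens.clone()
    edited[:, region] = MASK_ID
    return edited
```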

As shown in the figure below, WorldDreamer can replace a jellyfish with a bear, or a lizard with a monkey, and the edited video stays highly consistent with the user's language description.

Video Stylization

In addition, WorldDreamer can stylize videos.

As shown in the figure below, given a video segment in which some pixels are randomly masked, WorldDreamer can change the style of the video, for example creating an autumn-themed effect according to the input text.

Action to Video

WorldDreamer can also generate videos from driving actions in autonomous driving scenarios.

As shown in the figure below, given the same initial frame and different driving strategies (such as a left turn or a right turn), WorldDreamer can generate videos that closely follow both the first-frame constraints and the driving strategy.

So, how does WorldDreamer achieve these functions?

Building a world model with a Transformer

The researchers note that current state-of-the-art video generation methods fall mainly into two categories: Transformer-based methods and diffusion model-based methods.

Using a Transformer for token prediction can efficiently learn the dynamic information in video signals and reuse the experience of the large language model community, so a Transformer-based approach is an effective path toward a general world model.

By contrast, diffusion-based methods struggle to integrate multiple modalities within a single model and are hard to scale to larger parameter counts, which makes it difficult for them to learn the changes and motion patterns of the general world.

Meanwhile, current world-model research is concentrated on games, robotics, and autonomous driving, and falls short of comprehensively capturing how the general world changes and moves.

The research team therefore proposed WorldDreamer to strengthen the learning and understanding of the general world's changes and motion patterns, thereby significantly improving its ability to generate video.

Drawing on the successful experience of large language models, WorldDreamer adopts the Transformer architecture and frames world modeling as an unsupervised visual-token prediction problem.

The specific model structure is shown in the figure below:

WorldDreamer first uses a visual tokenizer to encode visual signals (images and videos) into discrete tokens.

After being masked, these tokens are fed into the Spatial Temporal Patchwise Transformer (STPT) module proposed by the research team.

Meanwhile, text and action signals are each encoded into corresponding feature vectors and fed into STPT as multi-modal features.

Inside STPT, the visual, language, and action features interact fully, allowing the model to predict the visual tokens at the masked positions.

Ultimately, these predicted visual tokens can be used to complete a variety of video generation and video editing tasks.
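
The sketch below illustrates this pipeline in simplified form. It is not the released implementation: the sizes and MASK_ID are assumptions, a plain Transformer encoder stands in for STPT (whose spatial-temporal patchwise attention is not reproduced here), and the conditioning is injected by simple concatenation for brevity.

```python
import torch
import torch.nn as nn

# Illustrative sizes and a hypothetical [MASK] id, not values from the release.
VOCAB, MASK_ID, DIM = 8192, 8192, 512
T, H, W = 4, 8, 8                          # frames, and tokens per frame

token_emb = nn.Embedding(VOCAB + 1, DIM)   # +1 slot for the [MASK] token
# Stand-in for STPT: a plain Transformer encoder over the flattened
# (text + action + visual) token sequence.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True),
    num_layers=2,
)
head = nn.Linear(DIM, VOCAB)               # logits over the visual codebook

def predict_masked_tokens(visual_tokens, mask, text_feats, action_feats):
    """visual_tokens: (B, T*H*W) discrete ids produced by a visual tokenizer.
    mask: (B, T*H*W) bool, True where tokens are hidden and must be predicted.
    text_feats / action_feats: (B, L, DIM) conditioning features (L may be 0)."""
    x = token_emb(visual_tokens.masked_fill(mask, MASK_ID))
    x = torch.cat([text_feats, action_feats, x], dim=1)    # prepend conditions
    x = backbone(x)
    n_cond = text_feats.size(1) + action_feats.size(1)
    return head(x[:, n_cond:])                             # (B, T*H*W, VOCAB)

# Toy usage with random inputs:
B = 2
vis = torch.randint(0, VOCAB, (B, T * H * W))
msk = torch.rand(B, T * H * W) > 0.5
txt = torch.randn(B, 16, DIM)
act = torch.randn(B, 4, DIM)
logits = predict_masked_tokens(vis, msk, txt, act)
```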

Notably, when training WorldDreamer, the research team also constructed Visual-Text-Action data triplets. The training loss only involves predicting the masked visual tokens, with no additional supervision signals.

In the team's data triplets, only the visual information is required, which means WorldDreamer can still be trained even when text or action data is unavailable.

This setup not only lowers the difficulty of data collection but also lets WorldDreamer handle video generation tasks with no condition, or only a single condition, available.
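
Continuing the sketch above, the training objective can be pictured as a cross-entropy loss computed only at the masked positions, with text and action features treated as optional inputs; this is an interpretation of the description here, not the team's training code.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(visual_tokens, mask, text_feats=None, action_feats=None):
    """Cross-entropy over the masked visual tokens only. Text and action
    conditions are optional, so visual-only triplets can still be used."""
    empty = torch.zeros(visual_tokens.size(0), 0, DIM)
    logits = predict_masked_tokens(
        visual_tokens, mask,
        text_feats if text_feats is not None else empty,
        action_feats if action_feats is not None else empty,
    )
    # Only masked positions contribute; unmasked tokens act purely as context.
    return F.cross_entropy(logits[mask], visual_tokens[mask])

# Visual-only example (no text, no action), matching the triplet design:
loss = masked_token_loss(vis, msk)
```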

The research team trained WorldDreamer on a large amount of data, including 2 billion cleaned images, 10 million general-scene videos, 500,000 videos with high-quality language annotations, and nearly a thousand autonomous driving videos.

The team ran millions of training iterations on roughly 1 billion learnable parameters. After convergence, WorldDreamer gradually came to understand the changes and motion patterns of the physical world and acquired a range of video generation and editing capabilities.

Paper address:
https://arxiv.org/abs/2401.09985
Project homepage:
https://world-dreamer.github.io/
