
This video went viral online. Google has made it far too easy to fake videos with AI.

Last updated: 2023-01-16
Jin Lei and Pine, from Aofeisi
Qubit | WeChat official account QbitAI

Dear friends, AI-generated video has been pushed into the spotlight once again today.

The trigger was a video of a little penguin that someone posted online:

Does it match what you had pictured?

Overall, this AI manages seamless transitions even for such an imaginative scene prompt.

No wonder netizens who watched it exclaimed, "(Technology) is developing so fast."

With shorter prompts, Phenaki performs even better.

For example, feed Phenaki this text:

A realistic teddy bear is diving; then it slowly surfaces and walks onto the beach; then the camera zooms out to show the teddy bear walking by a bonfire on the beach.

Not enough? Let’s do another paragraph, this time with a different protagonist:

On Mars, the astronaut walked through a puddle, and his silhouette was reflected in the water; he danced next to the water; then the astronaut started walking his dog; and finally he and the dog watched Mars and fireworks together.

When Google first released Phenaki, it also demonstrated generating a video from an initial frame plus a text prompt.

For example, given a static image like this:

Then feed Phenaki a simple sentence: "the white cat touches the camera with its paw." The result:

Starting from the same picture, change the prompt to "A white cat yawns" and you get this:

Of course, you can also switch the overall style of the video at will:
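As an aside, the demos above boil down to two conditioning modes: text alone, or an initial frame plus text. Phenaki has no public API, so the sketch below is purely hypothetical, with invented names, just to make the two modes concrete:

```python
# Purely hypothetical interface; Phenaki's real API is not public.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhenakiRequest:
    prompt: str                        # e.g. "A white cat yawns"
    first_frame: Optional[str] = None  # optional conditioning image (path)

# Same starting image, different prompts -> different videos:
touch = PhenakiRequest("the white cat touches the camera with its paw", "cat.png")
yawn = PhenakiRequest("A white cat yawns", "cat.png")
```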

Netizens: will the video industry be disrupted by AI?

Besides Phenaki, Google also released Imagen Video at the time, which can generate high-definition video clips at 1280×768 resolution and 24 frames per second.

It builds on Imagen, the SOTA image generation model, and demonstrates three notable capabilities:

  • Understands and generates works in different artistic styles, such as watercolor, pixel art, and even Van Gogh's style

  • Understands the 3D structure of objects

  • Inherits Imagen's ability to accurately render text

Earlier still, Meta released Make-A-Video, which can not only generate videos from text but also produce videos from images, for example:

  • Turning a still image into a video

  • Frame interpolation: generating a video from a before-and-after pair of images

  • Generating a new video based on an original video
    ...

Some people are worried about the sudden emergence of generative video models:

Of course, some people think that the time has not yet come:

Going from 0 to 1 is always fast, but going from 1 to 100 will still take a long time.

However, some netizens are already looking forward to relying on AI to win Oscars:

How long will it take for AI to become the new video editor, or win an Oscar?

How it works

Returning to Phenaki: many netizens are curious how it generates such smooth videos from text.

Simply put, compared with previous generative video models, Phenaki pays more attention to two things: arbitrary video length and coherence.

Phenaki's ability to generate videos of arbitrary length is largely due to a new encoder-decoder architecture: C-ViViT.

It is a causal variant of ViViT capable of compressing videos into discrete embeddings.

Keep in mind that previous approaches to video compression had one of two problems: either the encoder could not compress the video along the time axis, so the generated videos were too short (e.g. VQ-GAN), or the encoder only supported a fixed video length, so the output length could not be adjusted at all (e.g. VideoVQVAE).

C-ViViT is different: it combines the advantages of both. It compresses video in both the time and space dimensions, and because it remains autoregressive in time, it can autoregressively generate videos of any length.
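To make the idea concrete, here is a minimal sketch, not Google's code, of what "compress in space, stay causal in time, quantize to discrete tokens" can look like. The patch size, codebook size, and the single attention layer are all illustrative assumptions:

```python
# Minimal C-ViViT-flavored sketch (illustrative, not Google's code):
# cut a video into spatial patches, attend over time with a *causal* mask so
# frame t never sees the future, then snap each vector to its nearest
# codebook entry to get discrete tokens.
import torch
import torch.nn.functional as F

def causal_time_mask(t: int) -> torch.Tensor:
    # True where attention is forbidden (i.e., positions in the future)
    return torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)

def encode_video(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) -> discrete token ids of shape (T, N_patches).
    Causality over T is what allows continuing generation to any length."""
    T, C, H, W = frames.shape
    p = 8                                          # spatial patch size (assumed)
    x = frames.unfold(2, p, p).unfold(3, p, p)     # (T, C, H/p, W/p, p, p)
    x = x.reshape(T, C, -1, p * p).permute(0, 2, 1, 3)
    x = x.reshape(T, x.shape[1], -1)               # (T, N, d) with d = C*p*p

    xb = x.permute(1, 0, 2)                        # (N, T, d): time attention per patch
    scores = xb @ xb.transpose(1, 2) / xb.shape[-1] ** 0.5
    scores = scores.masked_fill(causal_time_mask(T), float("-inf"))
    ctx = F.softmax(scores, dim=-1) @ xb           # causal temporal mixing
    ctx = ctx.permute(1, 0, 2)                     # back to (T, N, d)

    # vector-quantize: index of the nearest codebook entry per patch vector
    ids = torch.cdist(ctx.reshape(-1, ctx.shape[-1]), codebook).argmin(-1)
    return ids.reshape(T, -1)

# e.g. 11 RGB frames at 64x64 with a 512-entry codebook of dim 3*8*8
tokens = encode_video(torch.rand(11, 3, 64, 64), torch.randn(512, 3 * 8 * 8))
```

The real tokenizer is of course a trained network; the point of the sketch is the causal mask, which is what lets the encoder keep extending a video one step at a time.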

C-ViViT lets the model generate videos of any length, but how is the coherence of the final video guaranteed?

This relies on another important part of Phenaki: the bidirectional Transformer.

To save time, it uses a fixed number of sampling steps and predicts many video tokens in parallel, conditioned on the text prompt.

Combined with the point above, that C-ViViT compresses video in both the time and space dimensions, the compressed tokens carry temporal structure.

In other words, a Transformer trained with masking on these tokens also learns that temporal structure, so the coherence of the generated video follows naturally.
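As a sketch rather than Phenaki's actual sampler, the MaskGIT-style loop described above (a fixed number of steps, predicting every token in parallel, keeping only the most confident ones) can be written like this; the `model` callable, the mask id, and all shapes are assumptions:

```python
# MaskGIT-style parallel decoding sketch (illustrative only).
import math
import torch

MASK_ID = 0  # sentinel id reserved for a still-masked video token (assumed)

@torch.no_grad()
def sample_tokens(model, text_emb: torch.Tensor, num_tokens: int, steps: int = 12):
    """Start fully masked; at each of a FIXED number of steps, predict every
    token in parallel, commit the most confident ones, re-mask the rest."""
    tokens = torch.full((num_tokens,), MASK_ID, dtype=torch.long)
    fixed = torch.zeros(num_tokens, dtype=torch.bool)  # already-committed slots
    for s in range(1, steps + 1):
        logits = model(tokens, text_emb)               # (num_tokens, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)        # parallel prediction
        conf[fixed] = -1.0                             # never re-pick fixed slots
        # cosine schedule: commit progressively more tokens each step;
        # at s == steps the whole sequence is committed
        target = int(num_tokens * (1 - math.cos(math.pi / 2 * s / steps)))
        k = max(target - int(fixed.sum()), 0)
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
        fixed[idx] = True
    return tokens
```

Because every step looks at the whole (partially filled) token grid at once, each committed token is consistent with tokens both before and after it in time, which is where the coherence comes from.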

To learn more about Phenaki, see the project page:

Phenaki:
https://phenaki.github.io

Reference links:
[1] https://phenaki.video/
[2] https://phenaki.research.google/
[3] https://twitter.com/AiBreakfast/status/1614647018554822658
[4] https://twitter.com/EvanKirstel/status/1614676882758275072

- End -

"Artificial Intelligence" and "Smart Car" WeChat communities invite you to join!

If you're interested in artificial intelligence or smart cars, you're welcome to join the groups to exchange ideas with AI practitioners and keep up with the latest industry developments and technological progress.

PS. When adding us as a friend, please be sure to note your name, company, and position~

