This video went viral online. Google has made faking videos with AI far too easy.
Jin Lei, Pine | From Aofeisi
Qubit | WeChat official account QbitAI
Folks, AI-generated video has been thrust into the spotlight once again today.
It started when someone posted this video of a little penguin online:
This nearly 50-second video was generated from just 6 sentences!
Netizens have kept posting other creations from the same AI, one after another:
The prompt fed to it this time was also short, only 4 lines:
This smooth, coherent, "what you write is what you get" way of generating video has left many netizens marveling:
The future is here.
Some have even started stoking resentment, claiming that AI is wrecking one industry after another...
Then many people asked: "Which new AI company is this?"
However, sharp-eyed netizens discovered that it is actually an old friend:
Phenaki, the text-to-video model Google released last October.
From a single prompt, it can generate a video up to two minutes long.
Compared with Phenaki's initial release, Google has now rolled out another wave of new demos.
So let’s take a look at these new videos now~
Videos generated by typing
Unlike previous AI-generated videos, Phenaki's biggest feature is that its videos have both a storyline and real length.
For example, let us describe this scene:
In a futuristic city with bustling traffic, an alien spacecraft arrives.
As the camera zooms in, the view enters the interior of the spacecraft; the camera then moves forward along an interior corridor until an astronaut is seen typing on a keyboard in a blue room.
The camera gradually moves to the astronaut's left, revealing a blue ocean behind him with fish swimming in the water; the shot quickly zooms in and focuses on a single fish.
The camera then rises quickly out of the sea to reveal a futuristic city with towering skyscrapers, and zooms in on an office inside one of the buildings.
A lion suddenly jumps onto the desk and starts running; the camera first focuses on the lion's face, and when it zooms out again, the lion has turned into an "orc" in a suit.
Finally, the camera pulls back out of the office for a bird's-eye view of the city at sunset.
Many readers have probably already pictured this scene in their minds while reading.
Next, let’s take a look at the effect of Phenaki generation:
Does it match the picture you imagined?
Overall, the AI pulls off seamless transitions even with such an imaginative scene prompt.
No wonder netizens exclaimed after watching this video: "(Technology) is developing so fast."
With shorter prompts, Phenaki does even better.
For example, feed Phenaki this text:
A realistic teddy bear is diving; then it slowly surfaces and walks onto the beach; then the camera zooms out to show the teddy bear walking by a bonfire on the beach.
Not enough? Let’s do another paragraph, this time with a different protagonist:
On Mars, an astronaut walks through a puddle, his silhouette reflected in the water; he dances beside the water; then the astronaut starts walking his dog; finally, the astronaut and his dog watch fireworks together.
When Google first released Phenaki, it also demonstrated generating a video from an initial frame plus a prompt.
For example, given a static image like this:
Then feed Phenaki a simple sentence: "A white cat touches the camera with its paw." Here is the result:
Using the same picture, change the prompt to "A white cat yawns," and you get this:
Of course, you can also switch the overall style of the video at will:
Netizens: will the video industry be disrupted by AI?
Alongside Phenaki, Google also released Imagen Video at the time, which can generate high-definition video clips at 1280×768 resolution and 24 frames per second.
It is built on Imagen, the SOTA image generation model, and demonstrates three notable capabilities:
- It can understand and generate work in different artistic styles, such as watercolor, pixel art, and even the style of Van Gogh
- It can understand the 3D structure of objects
- It inherits Imagen's ability to render text accurately
Earlier, Meta released Make-A-Video, which can generate videos not only from text but also from images, for example:
- Turning a still image into a video
- Frame interpolation: generating a video from a pair of before-and-after images
- Generating a new video from an original video
...
Some people are worried about the sudden emergence of generative video models:
Of course, some people think that the time has not yet come:
Going from 0 to 1 is always fast, but going from 1 to 100 still takes a long time.
However, some netizens are already looking forward to relying on AI to win Oscars:
How long will it take for AI to become the new video editor, or win an Oscar?
How it works
Back to Phenaki: many netizens are curious how it manages to generate such smooth videos from text.
Simply put, compared with previous generative video models, Phenaki focuses on arbitrary video length and coherence.
Its ability to generate videos of arbitrary length comes largely from a new encoder-decoder architecture: C-ViViT.
It is a causal variant of ViViT capable of compressing videos into discrete embeddings.
Previously, video compression encoders either could not compress along the time dimension, so the generated videos ended up too short (e.g., VQ-GAN), or only supported a fixed video length, so the length of the generated video could not be adjusted (e.g., VideoVQVAE).
C-ViViT is different: it combines the advantages of both. It compresses video in both the temporal and spatial dimensions, and because it stays autoregressive in time, it can keep generating video of arbitrary length.
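To make the causal idea concrete, here is a minimal sketch of the attention pattern C-ViViT relies on (not Google's implementation; the class name, dimensions, and layer layout are illustrative assumptions): patches within a frame attend to each other spatially, while attention across time is masked so each step only sees the past, which is what lets the token sequence keep growing.

```python
# Minimal sketch of C-ViViT-style attention (illustrative assumptions:
# the class name, dimensions, and layer layout are not from the paper).
# Patches within a frame attend spatially; attention across time is causal,
# so every step depends only on past frames and the video can be extended.
import torch
import torch.nn as nn

class CausalSpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, patches, dim), the patch embeddings of a video
        b, t, p, d = x.shape

        # Spatial attention: patches of the same frame attend to each other.
        xs = self.norm_s(x.reshape(b * t, p, d))
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(b, t, p, d)

        # Temporal attention with a causal mask: each time step attends
        # only to itself and earlier time steps.
        xt = self.norm_t(x.permute(0, 2, 1, 3).reshape(b * p, t, d))
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        out = self.temporal_attn(xt, xt, xt, attn_mask=causal)[0]
        return x + out.reshape(b, p, t, d).permute(0, 2, 1, 3)

video = torch.randn(1, 11, 64, 256)              # 11 time steps, 64 patches each
print(CausalSpatioTemporalBlock()(video).shape)  # torch.Size([1, 11, 64, 256])
```

Because nothing in the temporal attention looks ahead, tokens for frames that are already encoded never change when new frames are appended, which is exactly what arbitrary-length generation needs.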
C-ViViT lets the model generate videos of any length, but how is the coherence of the final video ensured?
This relies on another key component of Phenaki: a bidirectional Transformer.
To save time, it uses a fixed number of sampling steps and can predict multiple video tokens in parallel, conditioned on the text prompt.
Combined with C-ViViT's compression of video in both the temporal and spatial dimensions, the compressed tokens carry temporal structure.
In other words, the Transformer trained with masking on these tokens also learns that temporal structure, so the coherence of the final generated video follows naturally.
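As a rough illustration of "a fixed number of sampling steps with tokens predicted in parallel," here is a minimal MaskGIT-style decoding sketch (an assumption about the general technique, not Phenaki's actual code; the step count, schedule, and names are illustrative): start with all video tokens masked, let the bidirectional Transformer predict every position at once, keep the most confident predictions, and re-mask the rest for the next step.

```python
# Minimal MaskGIT-style parallel decoding sketch (illustrative assumptions;
# not Phenaki's actual code). All video tokens start masked; the model
# predicts every position in parallel; the most confident predictions are
# kept and the rest are re-masked until the fixed number of steps runs out.
import math
import torch

def parallel_decode(transformer, num_tokens: int, mask_id: int, steps: int = 12):
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = transformer(tokens).softmax(dim=-1)    # (1, num_tokens, vocab)
        conf, pred = probs.max(dim=-1)                 # best token + its confidence
        # Positions fixed in earlier steps stay fixed and are never re-masked.
        pred = torch.where(tokens == mask_id, pred, tokens)
        conf = torch.where(tokens == mask_id, conf, torch.full_like(conf, float("inf")))
        # Cosine schedule: fewer positions stay masked as the steps progress.
        still_masked = int(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        if still_masked > 0:
            remask = conf.topk(still_masked, largest=False).indices
            pred[0, remask[0]] = mask_id
        tokens = pred
    return tokens

# Toy demo: a random "transformer" standing in for the real masked model.
VOCAB, MASK = 1024, 1024
fake_model = lambda tok: torch.randn(tok.shape[0], tok.shape[1], VOCAB)
print(parallel_decode(fake_model, num_tokens=256, mask_id=MASK).shape)  # (1, 256)
```

With a fixed step count, generation time no longer grows with the number of tokens the way one-token-at-a-time autoregressive decoding does, which is the speed benefit the article alludes to.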
If you want to learn more about Phenaki, see the project page and reference links below.
Phenaki:
https://phenaki.github.io
Reference links:
[1] https://phenaki.video/
[2] https://phenaki.research.google/
[3] https://twitter.com/AiBreakfast/status/1614647018554822658
[4] https://twitter.com/EvanKirstel/status/1614676882758275072
-over-
"Artificial Intelligence" and "Smart Car" WeChat communities invite you to join!
Friends who are interested in artificial intelligence and smart cars are welcome to join the exchange group to communicate and discuss with AI practitioners, and not miss the latest industry development & technological progress.
PS. When adding friends, please be sure to note your name-company-position~
click here