Move here, draw there, and you turn into an anime character! Black tech with a Chinese researcher on the team: real-time interactive video stylization
Yuyang and Shisan, from Aofeisi
Reported by QbitAI | WeChat official account QbitAI
Animation: you do the moving, I do the drawing.
Take the GIF below: on the left is a still image. As the artist gradually fills in its colors, the video on the right changes color in real time.
This is the latest black tech from the Czech Technical University in Prague and Snap Research: **only two stylized keyframes are needed to change the color and even the artistic style of objects in a video in real time.**
Of course, there’s more to come.
Take a cartoon portrait of yourself and edit it however you like; as you sit in front of the camera, your on-screen self changes in real time to match.
You can even paint over a picture of yourself and watch your image gradually turn into an animation.
You could say the animation simply emerges as one side moves and the other side draws.
What's more, the whole pipeline requires neither lengthy training nor a large-scale training dataset. The work has also been submitted to SIGGRAPH 2020.
So, how is such a magical effect achieved?
Interactive video stylization
First, a video sequence I consisting of N frames is given as input.
As shown in the figure below, for any frame I_i, the user can either supply a mask M_i to restrict style transfer to a region, or stylize the entire frame.
All the user needs to do is provide a few stylized keyframes S_k; their style is then transferred to the entire video sequence in a semantically meaningful way.
Unlike previous methods, frames can be stylized in any order: there is no need to wait for earlier frames to be stylized first, nor to explicitly merge stylized content from different keyframes.
In other words, the method is essentially a translation filter that quickly learns the style from a few heterogeneous hand-drawn examples S_k and "translates" it to any frame of the video sequence I.
The image translation framework is based on U-net. On top of it, the researchers adopted a patch-based training scheme and a flicker-suppression solution, which address few-shot training and temporal consistency respectively.
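To make the framework concrete, here is a minimal PyTorch sketch of what such a U-net-style translation network with a stack of ResNet blocks might look like. The channel widths, the default of 7 residual blocks, and the names `ResBlock` / `TranslationNet` are illustrative placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A plain residual block used in the network's bottleneck (placeholder design)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class TranslationNet(nn.Module):
    """U-net-like encoder/decoder that stylizes each frame (or patch) independently."""
    def __init__(self, in_ch=3, out_ch=3, base=32, n_res=7):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.res = nn.Sequential(*[ResBlock(base * 2) for _ in range(n_res)])
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.Sequential(
            nn.Conv2d(base * 3, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):                  # x: (B, 3, H, W), H and W assumed even
        e1 = self.enc1(x)                  # full-resolution features
        e2 = self.res(self.enc2(e1))       # downsampled features refined by residual blocks
        d = self.up(e2)                    # back to full resolution
        return self.dec(torch.cat([d, e1], dim=1))  # skip connection, then output image
```

Because each frame is translated independently, inference can run frame by frame in real time or across many frames in parallel.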
Patch-based training strategy
Keyframes amount to only a handful of training samples, so to avoid overfitting, the researchers adopted a patch-based training strategy.
A set of image patches (a) is randomly sampled from the original keyframe I_k, and their stylized counterparts (b) are generated by the network.
The loss of these stylized counterparts (b) with respect to the corresponding patches sampled from the stylized keyframe S_k is then computed, and the error is back-propagated.
This training scheme is not tied to any particular loss function; in this work, a combination of L1 loss, adversarial loss, and VGG loss is used.
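A rough sketch of one such patch-based training step might look like the following. Here `vgg_features` and `discriminator` stand in for a pretrained VGG feature extractor and an adversarial discriminator, `patch_size` and `n_patches` correspond loosely to W_p and N_b, and the loss weights are illustrative guesses rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def sample_patches(keyframe, styled_keyframe, patch_size, n_patches):
    """Randomly crop aligned patch pairs from the original and the stylized keyframe (both CHW tensors)."""
    _, h, w = keyframe.shape
    inputs, targets = [], []
    for _ in range(n_patches):
        y = torch.randint(0, h - patch_size + 1, (1,)).item()
        x = torch.randint(0, w - patch_size + 1, (1,)).item()
        inputs.append(keyframe[:, y:y + patch_size, x:x + patch_size])
        targets.append(styled_keyframe[:, y:y + patch_size, x:x + patch_size])
    return torch.stack(inputs), torch.stack(targets)

def generator_loss(net, discriminator, vgg_features, keyframe, styled_keyframe,
                   patch_size=32, n_patches=40):
    """One patch-based training step: stylize sampled patches and score them
    against the corresponding patches of the hand-stylized keyframe."""
    src, tgt = sample_patches(keyframe, styled_keyframe, patch_size, n_patches)
    out = net(src)                                                # stylized counterparts (b)
    l1 = F.l1_loss(out, tgt)                                      # pixel-wise L1 loss
    perceptual = F.l1_loss(vgg_features(out), vgg_features(tgt))  # VGG feature (perceptual) loss
    logits = discriminator(out)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))  # adversarial term
    return l1 + 0.1 * perceptual + 0.01 * adv                     # illustrative weights, not the paper's
```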
Hyperparameter Optimization
With overfitting addressed, another problem remains: hyperparameter optimization. Poorly chosen hyperparameters can lead to low inference quality.
The researchers used grid search to sample the 4-dimensional hyperparameter space: W_p, the training patch size; N_b, the number of patches per batch; α, the learning rate; and N_r, the number of ResNet blocks.
For each hyperparameter setting, they (1) train for a given amount of time, (2) run inference on unseen frames, and (3) compute the loss between the inferred frames (O_4) and the ground truth (GT_4).
The goal is to minimize this loss.
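Conceptually, the grid search could be sketched as follows. The grid values and the helper callables `train_for`, `infer`, and `validation_loss` are hypothetical placeholders for the training, inference, and loss routines described above, not quantities taken from the paper.

```python
import itertools

# Hypothetical search grid over the four hyperparameters named above.
GRID = {
    "patch_size": [16, 32, 64],      # W_p: training patch size
    "batch_patches": [20, 40, 80],   # N_b: patches per batch
    "lr": [1e-4, 2e-4, 4e-4],        # alpha: learning rate
    "n_res": [5, 7, 9],              # N_r: number of ResNet blocks
}

def grid_search(train_for, infer, validation_loss, budget_seconds=300):
    """Try every hyperparameter combination and keep the one whose stylized
    unseen frames are closest to the ground-truth frames."""
    best_loss, best_cfg = float("inf"), None
    for values in itertools.product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        net = train_for(cfg, budget_seconds)   # (1) train for a fixed wall-clock budget
        preds = infer(net)                     # (2) stylize held-out frames
        loss = validation_loss(preds)          # (3) compare against ground truth
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_cfg, best_loss
```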
Improving temporal consistency
Once the translation network is trained, video style transfer can run in real time, or frames can be processed in parallel on the GPU.
However, the researchers found that in many cases, video flicker remained noticeable.
The first cause is temporal noise in the original video. To address it, the researchers used a motion-compensated variant of the bilateral filter that operates in the time domain.
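As a rough illustration of the idea (not the paper's exact filter), the sketch below applies a bilateral filter along the time axis, assuming the neighboring frames have already been warped into alignment; the motion-compensation step itself, which would normally rely on optical flow, is omitted, and the sigma values are made up.

```python
import numpy as np

def temporal_bilateral(frames, t, radius=3, sigma_t=2.0, sigma_i=0.1):
    """Smooth frame t by averaging temporally nearby frames, weighting each one by
    temporal distance and per-pixel intensity similarity (bilateral weights).
    `frames` is a float array of shape (T, H, W, C) in [0, 1], assumed already motion-aligned."""
    center = frames[t]
    acc = np.zeros_like(center)
    weight_sum = np.zeros(center.shape[:2] + (1,))
    for dt in range(-radius, radius + 1):
        k = t + dt
        if k < 0 or k >= len(frames):
            continue
        diff = np.linalg.norm(frames[k] - center, axis=-1, keepdims=True)   # per-pixel color difference
        w = np.exp(-dt ** 2 / (2 * sigma_t ** 2)) * np.exp(-diff ** 2 / (2 * sigma_i ** 2))
        acc += w * frames[k]
        weight_sum += w
    return acc / np.maximum(weight_sum, 1e-8)
```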
The second cause is visual ambiguity in the stylized content. The solution is to feed the network an additional input layer that improves its discriminative ability.
This layer consists of a sparse set of random 2D Gaussians, which helps the network identify local context and suppress ambiguity.
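A toy version of such an auxiliary layer could be generated as follows; the number of Gaussians and their spread are made-up values, and the resulting map would simply be concatenated to the network's RGB input channels.

```python
import numpy as np

def random_gaussian_layer(h, w, n_gaussians=50, sigma=15.0, seed=0):
    """Build a sparse map of random 2D Gaussian blobs. Feeding the same map alongside
    every frame gives the network a stable positional cue that helps it tell apart
    visually similar regions."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:h, 0:w]
    layer = np.zeros((h, w), dtype=np.float32)
    for cy, cx in zip(rng.integers(0, h, n_gaussians), rng.integers(0, w, n_gaussians)):
        layer += np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return np.clip(layer, 0.0, 1.0)[None]  # shape (1, H, W): one extra input channel
```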
That said, the researchers also point out the method's limitations:
When new, not-yet-stylized features appear, the method usually cannot stylize them consistently; additional keyframes have to be provided to restore consistency.
High-resolution (e.g., 4K) keyframes are difficult to handle.
The motion-compensated bilateral filter and the random Gaussian layer both require access to multiple video frames, which raises the demand on computing resources and limits real-time inference on live video streams. (In the live-capture session of the demo, no temporal-consistency processing was used.)
Research team
The research was led by first author Ondřej Texler, a third-year Ph.D. student in the Department of Computer Graphics and Interaction at the Czech Technical University in Prague, where he also received his bachelor's and master's degrees.
His main research interests are computer graphics, image processing, computer vision, and deep learning.
Besides the first author, the team also includes a Chinese researcher, Menglei Chai, who received his Ph.D. from Zhejiang University and is currently a senior research scientist in the Creative Vision group at Snap Research.
His research interests lie in computer vision and computer graphics, particularly human digitization, image processing, 3D reconstruction, and physics-based animation.
Portal
Project address:
https://ondrejtexler.github.io/patch-based_training/
-over-