
New SOTA for video inpainting: inference is nearly 15 times faster, and it can make people vanish from footage at any resolution, from Nankai University | CVPR 2022

Latest update: 2022-05-18
Fengse from Aofei Temple
Quantum Bit | Public Account QbitAI

As we all know, videos can be photoshopped.

Case in point: among the papers accepted to CVPR 2022 is an editing tool that can make people vanish from a video in minutes, in all sorts of ways, without leaving a trace.



Removing watermarks and filling in missing regions is a piece of cake, and it handles videos of various resolutions.

As you can see, the model works so smoothly that it achieves SOTA performance on both benchmark datasets.

Comparison with SOTA methods

At the same time, its inference time and computational complexity are also eye-catching:

The former is nearly 15 times faster than previous methods, processing 432 × 240 video at about 0.12 seconds per frame on a Titan XP GPU; the latter is the lowest FLOPs among all compared SOTA methods.

So where does such a powerful tool come from?

Improving Optical Flow

Currently, many video inpainting algorithms rely on optical flow.

That is, they exploit how pixels change over time across the image sequence and the correlation between adjacent frames to find correspondences between the previous frame and the current one, and from that compute the motion of objects between adjacent frames.
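To make that concrete, here is a minimal sketch (not from the paper) of the classic estimate-flow-then-warp step, using OpenCV's Farneback dense flow; the function name and parameter values are illustrative choices:

```python
import cv2
import numpy as np

def warp_with_flow(prev_bgr, next_bgr):
    """Estimate dense flow from the previous frame to the next one, then
    reconstruct the previous frame by sampling the next frame along the flow."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    # One 2-D displacement vector per pixel; positional args are
    # (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    h, w = prev_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # prev(x, y) ~= next(x + u, y + v): sample `next` at the flow-shifted grid.
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(next_bgr, map_x, map_y, cv2.INTER_LINEAR)
```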

The drawback of this approach is obvious: it is computationally heavy and time-consuming, in other words inefficient.

To address this, the researchers designed three trainable modules, namely flow completion, feature propagation, and content hallucination, and proposed a flow-guided, end-to-end video inpainting framework:

E2FGVI

These three modules correspond to the three stages of previous optical flow-based methods, but can be jointly optimized to achieve a more efficient inpainting process.

Specifically, the flow completion module completes the corrupted flow of the masked video directly in a single step, instead of going through the multiple complicated stages of previous methods.
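As a rough illustration of what one-step flow completion might look like, here is a minimal sketch with a toy convolutional network; the layer sizes and class names are made up for the example and are not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class FlowCompletionNet(nn.Module):
    """Toy one-step flow completion: take a corrupted flow field plus the
    inpainting mask and predict the flow inside the missing region."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 + 1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2, 3, padding=1),   # 2 output channels: (u, v)
        )

    def forward(self, masked_flow, mask):
        # mask is 1 inside the hole and 0 elsewhere.
        pred = self.net(torch.cat([masked_flow, mask], dim=1))
        # Keep the known flow and fill only the hole in one pass.
        return masked_flow * (1 - mask) + pred * mask

# usage: flow of shape (B, 2, H, W), mask of shape (B, 1, H, W)
net = FlowCompletionNet()
completed = net(torch.zeros(1, 2, 64, 64), torch.zeros(1, 1, 64, 64))
```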

For the feature propagation module, unlike the previous pixel-level propagation, the flow-guided propagation here is carried out in feature space with the help of deformable convolution.

With additional learnable sampling offsets and feature-level operations, the propagation module relieves the burden of previously inaccurate flow estimation.
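A simplified sketch of flow-guided deformable propagation, using torchvision's DeformConv2d with sampling offsets predicted from the completed flow and the neighbor frame's features; the module and variable names here are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FlowGuidedPropagation(nn.Module):
    """Toy flow-guided feature propagation: predict per-pixel sampling offsets
    from (features, flow) and aggregate the neighbor frame's features with a
    deformable convolution."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        # 2 * k * k offset channels (x and y for each kernel location).
        self.offset_head = nn.Conv2d(channels + 2,
                                     2 * kernel_size * kernel_size,
                                     3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2)

    def forward(self, neighbor_feat, completed_flow):
        # Learnable offsets conditioned on the completed flow.
        offset = self.offset_head(
            torch.cat([neighbor_feat, completed_flow], dim=1))
        return self.deform(neighbor_feat, offset)

# usage: features (B, 64, H, W), flow (B, 2, H, W)
prop = FlowGuidedPropagation()
out = prop(torch.randn(1, 64, 60, 108), torch.randn(1, 2, 60, 108))
```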

For the content hallucination module, the researchers propose a temporal focal Transformer to effectively model long-range dependencies along both the spatial and temporal dimensions.

This module also takes both local and non-local temporal neighbors into account, yielding more temporally coherent inpainting results.
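The idea of attending to both nearby and distant frames can be sketched as follows, using plain multi-head attention as a stand-in for the paper's temporal focal Transformer; the frame selection scheme, shapes, and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

def gather_reference_frames(feats, t, local_radius=2, nonlocal_stride=5):
    """feats: (T, N, C) tokens per frame. Collect tokens from frames near the
    target index t (local neighbors) plus evenly strided distant frames
    (non-local neighbors)."""
    T = feats.shape[0]
    local = list(range(max(0, t - local_radius), min(T, t + local_radius + 1)))
    nonlocal_ = [i for i in range(0, T, nonlocal_stride) if i not in local]
    idx = sorted(set(local + nonlocal_))
    return feats[idx].flatten(0, 1)                  # (num_ref * N, C)

T, N, C = 20, 64, 128                                # frames, tokens/frame, channels
feats = torch.randn(T, N, C)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

query = feats[10].unsqueeze(0)                       # target-frame tokens: (1, N, C)
kv = gather_reference_frames(feats, 10).unsqueeze(0) # (1, num_ref * N, C)
hallucinated, _ = attn(query, kv, kv)                # (1, N, C)
```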

Authors: hoping to become the new baseline

Quantitative experiments:

The researchers conducted quantitative experiments on the YouTube-VOS and DAVIS datasets, comparing their approach with previous video inpainting methods.

As shown in the table below, E2FGVI surpasses these SOTA algorithms on all four quantitative metrics, generating restored videos with less distortion (PSNR and SSIM), greater visual plausibility (VFID), and better spatio-temporal consistency (E_warp), which verifies the superiority of the method.

In addition, E2FGVI has the lowest FLOPs (computational complexity) among the compared methods. Although training is performed on 432 × 240 videos, its HQ version supports arbitrary resolutions.
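For reference, PSNR, one of the four metrics above, is straightforward to compute per frame; a minimal sketch:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two frames (higher is better)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```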

Qualitative experiments:

The researchers first selected three of the most representative methods, namely CAP, FGVC (optical-flow-based), and FuseFormer (ICCV 2021), and compared them on object removal (the first three rows in the figure below) and missing-region completion (the last two rows).

The first three methods struggle to restore plausible details in the occluded regions, and erasing people leaves blur behind, whereas E2FGVI generates relatively realistic texture and structure.

In addition, they selected five methods for a user study, and the results show that most participants preferred the videos restored by E2FGVI.

Finally, the researchers express the hope that the proposed method can serve as a new, strong baseline for video inpainting.

About the authors

E2FGVI was jointly developed by Nankai University and HiSilicon.

The first author, Li Zhen, is a doctoral student at Nankai University; the co-first author, Lu ChengZe, is also from Nankai.

The corresponding author is Cheng Mingming, a professor at the School of Computer Science at Nankai University, whose main research areas are computer vision and graphics.

The code of E2FGVI has been open-sourced, the authors have provided a Colab implementation, and a Hugging Face demo is planned.

Paper address:
https://arxiv.org/abs/2204.02663

GitHub homepage:
https://github.com/MCG-NKU/E2FGVI

- End -

The "Artificial Intelligence" and "Smart Car" WeChat communities invite you to join!

Friends who are interested in artificial intelligence and smart cars are welcome to join us, communicate and exchange ideas with AI practitioners, and not miss the latest industry developments and technological advances.

ps. Please be sure to note your name, company and position when adding friends~


click here

