You can have Harry Potter's invisibility cloak in minutes, right on your own phone
Yuyang from Aofei Temple
Quantum Bit Report | Public Account QbitAI
How many steps does it take to perform a vanishing-person illusion?
I never expected that such a classic large-scale stage illusion could now be picked up with zero prior experience.
At the "One Thousand and One Nights" party jointly created by Kuaishou and Jiangsu Satellite TV, Di Lieba performed one on the spot.
No props, no assistants, no constraints of time or place: she simply flashed out of sight right on the live camera.
And the fluctuation of her "potential field" didn't disturb the integrity of the background one bit. (doge)
Best of all, anyone with a phone and the Kuaishou app installed can pull off the same trick.
That's right: this piece of black tech, which reproduces blockbuster-style invisibility effects in real time, is Kuaishou's newly launched AI feature, "Invisibility Magic". It is the short-video industry's first application of a video inpainting algorithm that combines single-image inpainting with inter-frame image alignment.
Not only can you "appear out of thin air" and flash a "6" gesture at the camera, you can also turn transparent and melt into thin air on the spot.
The effect is smooth enough that within just a few days of launch, Kuaishou users had already created 775,000 videos with it, making it the hottest special-effects feature in short video this year.
Real-time video inpainting on mobile devices
How do you make a person in a video invisible, in real time?
QbitAI has covered similar "invisibility" algorithms before, such as the flow-edge-guided video completion algorithm jointly developed by Virginia Tech and Facebook.
Although there are precedents in academia, bringing such technology to mobile devices still poses considerable challenges.
The main problem is compute: video inpainting involves multi-frame computation, and the deep learning models behind it are generally too heavy to run on a phone.
So how did Kuaishou do it? As usual, QbitAI will walk through it step by step.
The principle is actually simple. To erase a person from the frame, the AI must not only segment the portrait automatically but also learn to fill in the real background that the person occludes.
This involves two problems:
- inpainting the background of the portrait region in the initial frame
- filling in the background of the portrait region as the camera and subject move in subsequent frames
To solve these two problems, Kuaishou's engineers split the algorithm into two stages:
For the first frame, an on-device inpainting model "imagines" the background behind the portrait region; for subsequent frames, real-time inter-frame tracking and projection matching fill the occluded region with background that is visible elsewhere.
Image inpainting based on DeepFill
First, first-frame inpainting. In terms of model architecture, Kuaishou's engineers mainly built on the open-source DeepFill model and optimized it for their actual needs.
DeepFill is a GAN-based image inpainting method that plausibly fills in user-specified holes in an image.
On this basis, Kuaishou adopted a two-stage coarse-to-fine structure for the overall model design.
In the first stage, a lightweight coarse network performs preliminary inpainting at a small scale, producing a rough outline of the missing region.
In the second stage, the preliminary result is merged back into the original image at full scale, and a refine network generates the details of the missing region.
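To make the two-stage idea concrete, here is a minimal PyTorch sketch. It is not Kuaishou's actual model: the layer sizes are illustrative, and DeepFill's gated convolutions and contextual attention are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    # Plain convs keep the sketch short; DeepFill itself uses gated convs.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ELU())

class CoarseNet(nn.Module):
    """Stage 1: a cheap pass at small scale that roughs in the hole."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(4, 32, stride=2),            # input: RGB + hole mask
            conv_block(32, 64),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ELU(),
            nn.Conv2d(32, 3, 3, 1, 1),
        )
    def forward(self, x):
        return self.net(x)

class RefineNet(nn.Module):
    """Stage 2: a full-scale pass that adds detail inside the hole."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(4, 48, stride=2),
            conv_block(48, 96),
            nn.ConvTranspose2d(96, 48, 4, 2, 1), nn.ELU(),
            nn.Conv2d(48, 3, 3, 1, 1),
        )
    def forward(self, x):
        return self.net(x)

def inpaint(image, mask, coarse, refine):
    """image: (B,3,H,W) in [-1,1]; mask: (B,1,H,W), 1 = region to fill."""
    holed = image * (1 - mask)
    # The coarse pass runs at quarter resolution to save compute.
    small = F.interpolate(torch.cat([holed, mask], 1), scale_factor=0.25)
    rough = F.interpolate(coarse(small), size=image.shape[-2:],
                          mode="bilinear", align_corners=False)
    merged = holed + rough * mask       # paste the rough guess into the hole
    detail = refine(torch.cat([merged, mask], 1))
    return holed + detail * mask        # known pixels stay untouched
```

For a 256×256 input, `inpaint(img, mask, CoarseNet(), RefineNet())` runs end to end; the key design point is that the coarse pass works at reduced resolution, so most of the compute is spent only where detail matters.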
To make the model deployable and performant on mobile devices, the engineers further compressed its structure with pruning and distillation.
During development, the engineers also found that the larger the missing region, the less controllable the inpainting result becomes, and that L1 loss and GAN loss alone cannot effectively constrain the structural and semantic plausibility of the repaired region.
To address this, on the one hand they jointly trained a boundary-generation branch to directly constrain boundary structure, which significantly improves plausibility when large regions are missing; on the other hand, they used multi-scale prediction to constrain the model's intermediate-layer features, which noticeably improves the sharpness of the results.
For the loss function, training combined SSIM, LPIPS perceptual loss, PatchGAN loss, and a distillation loss, achieving good inpainting quality even on small models.
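As a rough illustration only (the actual implementations and weightings are internal; `lpips` and `pytorch_msssim` are stand-in open-source packages, and the weights below are made up), the combined objective might be assembled like this:

```python
import torch
import torch.nn.functional as F
import lpips                       # pip install lpips
from pytorch_msssim import ssim    # pip install pytorch-msssim

perceptual = lpips.LPIPS(net="vgg")   # pretrained perceptual-distance net

def total_loss(pred, target, disc_logits, student_feat, teacher_feat,
               w_ssim=1.0, w_perc=0.5, w_gan=0.1, w_dist=0.5):
    """SSIM + LPIPS + PatchGAN (generator side) + feature distillation."""
    l_ssim = 1 - ssim(pred, target, data_range=2.0)   # inputs in [-1, 1]
    l_perc = perceptual(pred, target).mean()
    # PatchGAN: the discriminator emits a grid of real/fake logits;
    # the generator is pushed to make every patch look "real".
    l_gan = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    # Distillation: the small student matches the big teacher's features.
    l_dist = F.mse_loss(student_feat, teacher_feat.detach())
    return w_ssim*l_ssim + w_perc*l_perc + w_gan*l_gan + w_dist*l_dist
```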
For training data, Kuaishou's engineers built a general image inpainting dataset of 1 million background images and 100,000 portrait masks, covering common settings such as homes, offices, architecture, landscapes, and virtual CG.
The background data is additionally classified by texture complexity. During training, as the network converges, the proportion of complex-texture data is gradually increased, so the model learns to inpaint backgrounds from simple to complex.
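One simple way to implement such a curriculum, sketched here under the assumption that each background image carries a precomputed texture-complexity label, is to reweight sampling as training progresses:

```python
import random

def sample_batch(simple_pool, complex_pool, step, total_steps, batch_size=16):
    """Start on mostly simple backgrounds and ramp up complex-texture data
    as the network converges. The linear ramp and 0.8 cap are illustrative."""
    p_complex = min(0.8, step / total_steps)
    batch = []
    for _ in range(batch_size):
        pool = complex_pool if random.random() < p_complex else simple_pool
        batch.append(random.choice(pool))
    return batch
```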
After this series of combined measures, the test results look like this. From left to right: input image, boundary prediction, inpainted result, and ground-truth background.
Real-time tracking and projection matching
For background restoration in subsequent frames, making good use of the background already seen requires projecting it into the current frame to repair the occluded portrait region; in other words, inter-frame image mapping.
There are currently three main ways to describe the inter-frame mapping between images: a simple global homography, grid-based local homographies, and full pixel-wise dense optical flow.
A global homography is cheap to compute but cannot describe the mapping of complex three-dimensional structure.
Pixel-wise dense optical flow yields an accurate mapping between mutually visible pixels, but it says nothing about the unknown pixels inside the portrait region, and given the compute limits of phone platforms it cannot deliver the mapping in real time.
Kuaishou therefore uses an image alignment algorithm based on grid-based local homographies, balancing computational cost against accuracy. By simultaneously optimizing the photometric error of inter-frame feature points and the deformation error of the grid, it obtains accurate inter-frame mappings at low computational cost, propagating the visible background of historical frames into the current frame in real time.
Better yet, adjusting the number of grid cells directly trades compute for mapping accuracy, making it easy to adapt the algorithm to different phone models.
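A heavily simplified sketch of the grid idea using OpenCV follows; note that this is not Kuaishou's implementation. It fits each cell's homography independently from feature matches and falls back to a global one, whereas the production algorithm jointly optimizes photometric and grid-deformation error:

```python
import cv2
import numpy as np

def grid_homographies(prev_gray, curr_gray, rows=4, cols=4):
    """One homography per grid cell, estimated from ORB feature matches."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])
    H_global, _ = cv2.findHomography(p1, p2, cv2.RANSAC, 3.0)

    h, w = prev_gray.shape[:2]
    cell_h, cell_w = h / rows, w / cols
    grid = [[H_global] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Keep only matches whose source point falls inside this cell.
            inside = [(a, b) for a, b in zip(p1, p2)
                      if c * cell_w <= a[0] < (c + 1) * cell_w
                      and r * cell_h <= a[1] < (r + 1) * cell_h]
            if len(inside) >= 8:          # enough evidence for a local fit
                src = np.float32([a for a, _ in inside])
                dst = np.float32([b for _, b in inside])
                H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
                if H is not None:
                    grid[r][c] = H        # else keep the global fallback
    return grid  # warp each cell of the stored background with its own H
```

Raising `rows` and `cols` buys mapping accuracy at the cost of compute, which is exactly the knob described above.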
Runs even on mid-range and low-end phones: real magic for Muggles
In fact, for Kuaishou’s engineers, simply achieving the effect is far from enough.
More importantly, given the huge variety of mobile hardware, the feature has to cover high-end, mid-range, and low-end phones, getting the most out of each tier of device.
On the one hand, every feature launch affects the actual experience of 400 million users, so a single change ripples through everything.
On the other hand, because of the makeup of Kuaishou's user base, the phones in users' hands span a very wide range of models, whose compute power and memory resources vary enormously.
To achieve this, Kuaishou relies on its self-developed YCNN deep learning inference engine.
Take CPUs as an example: whether the chip comes from Apple, Qualcomm, Huawei, or MediaTek, and whether it is a high-end Snapdragon 865 or a low-end Snapdragon 450 or 430, the YCNN engine can run the models on it. Likewise, on the GPU side, YCNN supports Mali, Adreno, Apple, and NVIDIA GPUs; on the NPU side, Apple Bionic, Huawei HiAI, Qualcomm SNPE, and MediaTek APU are all supported.
At the same time, the YCNN engine covers a full range of model structures and numeric precisions, supporting common CNN and RNN structures and computation in float32, float16, and uint8.
To make full use of each phone's compute, the YCNN engine also provides multiple model variants: large models designed for high-compute NPUs, small models at several levels for high-end CPUs and GPUs, and dedicated tiny models for mid- and low-end CPUs. Models are delivered over the air so that each device's best available compute is matched with the corresponding variant, striking the best balance between effect and performance and giving users the best possible experience.
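In spirit, the model delivery reduces to a capability lookup. The tier names and thresholds below are invented for illustration; the real matching logic runs inside Kuaishou's delivery system and is not public:

```python
# Hypothetical tiers: benchmark score -> model variant to deliver.
MODEL_TIERS = [
    (900, "inpaint_large_npu"),   # big model for high-compute NPUs
    (600, "inpaint_small_gpu"),   # small model for high-end CPUs/GPUs
    (0,   "inpaint_tiny_cpu"),    # dedicated tiny model for mid/low-end CPUs
]

def pick_model(device_score: int) -> str:
    """Return the model variant matching the device's benchmarked capability."""
    for threshold, name in MODEL_TIERS:
        if device_score >= threshold:
            return name
    return MODEL_TIERS[-1][1]
```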
For inference engine optimization, Kuaishou's engineers designed Metal, OpenCL, and Neon operators for the different device targets and tuned each of them specifically, maximizing use of device performance and speeding up model computation.
In addition, the YCNN engine comes with a complete AI model toolchain: PyTorch and TF/TFLite models can be converted directly into YCNN models, and it supports quantization-aware training and hardware-aware model structure search. Overall performance is roughly 10% better than comparable engines in the industry.
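For reference, here is what one of those toolchain features, quantization-aware training, looks like in stock PyTorch. The YCNN toolchain itself is not public, so this is only an analogous sketch:

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Stand-in float model wrapped with quant/dequant stubs."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.body(self.quant(x)))

model = TinyNet()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.quantization.prepare_qat(model.train())
# ... run the usual training loop on `prepared`: forward passes see
# fake-quantization that mimics uint8 rounding and clamping ...
int8_model = torch.quantization.convert(prepared.eval())
```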
The Way of Kuaishou
Finally, let’s return to AI special effects, the party, and Kuaishou itself.
We have covered Kuaishou's technology and AI effects before. This technology company, which grew rapidly on the back of short video, has brought the latest, most cutting-edge technology to more people, and through that technology has taken its users from "recording every life" to "embracing every life".
What is more commendable is Kuaishou's mentality toward cutting-edge technology: it wants every user, regardless of phone model or signal coverage, to be able to use it and experience the fun of technology without distinction.
Now this Kuaishou way is being extended offline, giving online users the chance to step onto an offline stage, appear alongside celebrities, and show themselves: from online to offline, across platforms and communities.
The "One Thousand and One Nights" super luxurious lineup gala created by Kuaishou after "nine years of hard work" is the most direct example.
On the one hand, Kuaishou and Jiangsu Satellite TV joined hands to deeply integrate big and small screens in both the show's content and its presentation. Beyond the real-time invisibility effect, there was the low-latency mic link between Huang Bo and Jay Chou, and F4 reunited on one stage with the help of virtual technology, all bringing the audience a brand-new viewing experience.
On the other hand, the star lineup shows Kuaishou's growing appeal, and the gala format that puts stars and ordinary users on the same stage has become a cultural IP unique to Kuaishou.
The data bears this out: the official Kuaishou live room for the gala reportedly drew a cumulative 90.08 million viewers, 134 million interactions, a peak of 3.15 million simultaneous online users, and 31 million reservations for the live broadcast.
Traffic and attention on this scale are, without doubt, also a way of popularizing the values behind the technology.
On stage, celebrities and Kuaishou creators collaborate across their worlds; off stage, Kuaishou holds firmly to its technical conviction of improving user experience and creating user value, so that "human fireworks" of every kind, highbrow and lowbrow alike, can add color to life through cutting-edge technologies such as AI.
This is the side of technology beyond pure rationality: using wondrous technology to break down the barriers between people.
There is an old saying that technology is the magic of Muggles.
But compared with the magic itself, the engineers who create it, and who make that magic genuinely available to everyone, are rarely pushed into the spotlight. They deserve applause and praise all the same.
Can you think of other examples of “magic”?
Finally, the technology behind this effect comes from Kuaishou's Y-tech team, which deserves a special introduction here:
This team is committed to technological innovation and business implementation in the fields of computer vision, computer graphics, machine learning, AR/VR, etc., and constantly explores the best combination of new technologies and new user experience. Currently, Y-tech has R&D teams in Beijing, Shenzhen, Hangzhou, Seattle, and Palo Alto, and most of its members come from internationally renowned companies and universities.
-over-
This article is original content from QbitAI (量子位), a signed account in the NetEase News / NetEase Hao featured-content incentive program. Reproduction without the account's authorization is prohibited.
QbitAI · Signed author on Toutiao
Tracking new trends in AI technology and products