(1) We propose NVDS, the first plug-and-play, learning-based video depth framework: it can be applied to any monocular image depth prediction model to remove temporal jitter and enhance inter-frame consistency.
(2) We introduce the Video Depth in the Wild (VDW) dataset, currently the largest and most scene-diverse natural-scene video depth dataset.
As shown in the figure below, compared with previous video depth prediction methods, NVDS achieves significantly better spatial accuracy, temporal smoothness, and inference efficiency. At the same time, the VDW dataset is the largest and most diverse natural-scene video depth dataset to date.
1. Task Background and Motivation
Video depth prediction underpins many downstream tasks, such as video bokeh rendering, 3D video synthesis, and video effects generation. An ideal video depth model must address two problems: (1) spatial accuracy of the predicted depth, and (2) temporal consistency between frames. In recent years, monocular image depth prediction has improved spatial accuracy substantially, but removing jitter and improving inter-frame temporal consistency remains difficult.
Mainstream video depth prediction methods rely on test-time training: during inference, they use geometric constraints and camera parameters to overfit a monocular image depth model to the temporal relationships of the specific test video. This has two clear drawbacks. (1) Poor robustness: accurate and reliable camera parameters are hard to obtain for many videos, so methods such as CVD and Robust-CVD can produce obvious artifacts or entirely wrong predictions. (2) Low efficiency: CVD, for example, takes more than forty minutes to process a 244-frame video on four Tesla M40 GPUs.
A natural question is therefore whether we can build a learning-based video depth method that learns the prior of temporal consistency directly from a dataset and predicts consistent results without test-time training. As with all deep learning algorithms, such a method must address two core issues: (1) a sound model design that captures inter-frame dependencies and improves prediction consistency, and (2) sufficient training data to bring out the model's best performance. Unfortunately, previous learning-based video depth methods still underperform test-time-training methods, and effective network designs remain an open research question. Meanwhile, because annotation is expensive, existing video depth datasets are limited in both data volume and scene diversity.
2. Methods and Contributions
To address the two core challenges mentioned above, we made two contributions:
(1) We propose NVDS, the first plug-and-play, learning-based video depth framework. NVDS consists of a depth predictor and a stabilization network. The stabilization network can be applied directly to any monocular image depth prediction model to remove temporal jitter and maintain inter-frame consistency. All previous learning-based video depth models are stand-alone: their spatial performance cannot benefit from state-of-the-art (SOTA) monocular image models, and conversely they cannot smooth and stabilize the many existing single-image models. NVDS breaks this barrier between monocular image depth and monocular video depth: on the one hand it benefits from various high-accuracy single-image models, and on the other it can smooth and stabilize any single-image model, so the two directions promote each other. Within the stabilization network, we use cross-attention to model the inter-frame relationship between the key frame and its reference frames, and we design a bidirectional prediction mechanism that enlarges the temporal receptive field and further improves consistency.
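The cross-attention step can be sketched in a few lines. This toy NumPy version is not the paper's network: it uses identity projections in place of the learned Q/K/V projections and made-up tensor shapes, and only illustrates how key-frame tokens aggregate context from reference-frame tokens:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(key_feat, ref_feats):
    """Toy cross-attention: key-frame tokens query reference-frame tokens.

    key_feat:  (N, d) features of the key frame (N spatial tokens)
    ref_feats: (M, d) features gathered from the reference frames
    Returns an (N, d) temporally informed feature map for the key frame.
    """
    d = key_feat.shape[-1]
    # The real network uses learned Q/K/V projections; identity
    # projections keep this sketch minimal.
    scores = key_feat @ ref_feats.T / np.sqrt(d)  # (N, M) affinities
    weights = softmax(scores, axis=-1)            # attend over reference tokens
    return weights @ ref_feats                    # aggregate reference context

rng = np.random.default_rng(0)
key = rng.standard_normal((4, 8))
refs = rng.standard_normal((6, 8))
out = cross_frame_attention(key, refs)
print(out.shape)  # (4, 8)
```

Each row of `weights` sums to one, so every key-frame token receives a convex combination of reference-frame features, which is what lets the stabilizer pull the key frame's depth toward its temporal neighbors.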
(2) We introduce the Video Depth in the Wild (VDW) dataset, currently the largest and most scene-diverse natural-scene video depth dataset. Because annotation is so costly, most existing video depth datasets cover closed scenes, and the few natural-scene datasets are far too small and homogeneous; Sintel, for instance, contains only 23 animated videos. VDW is collected from diverse sources, including movies, animations, documentaries, and web videos, and contains 14,203 videos totaling more than 200 hours and 2.23 million frames. We also designed mechanisms such as sky-segmentation voting, together with strict data screening and annotation pipelines, to ensure annotation quality. The figure below shows some examples from web videos, documentaries, animations, and movies.
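The article does not spell out the sky-segmentation voting rule; a per-pixel majority vote over masks from several segmentation models is one plausible reading. The threshold, mask shapes, and masks below are purely illustrative:

```python
import numpy as np

def sky_vote(masks, threshold=0.5):
    """Per-pixel majority vote over boolean sky masks from several models.

    A pixel is labeled sky only when at least `threshold` of the models
    agree, suppressing spurious sky regions from any single model.
    """
    stack = np.stack(masks).astype(float)   # (K, H, W) vote tensor
    return stack.mean(axis=0) >= threshold  # consensus sky mask, (H, W)

# Three hypothetical 2x2 masks that disagree on two pixels.
m1 = np.array([[1, 1], [0, 0]], dtype=bool)
m2 = np.array([[1, 0], [0, 0]], dtype=bool)
m3 = np.array([[1, 1], [1, 0]], dtype=bool)
print(sky_vote([m1, m2, m3]))
# [[ True  True]
#  [False False]]
```

Consensus labeling of this kind matters for depth annotation because sky pixels have no meaningful disparity and must be masked out or assigned a far-plane value consistently across frames.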
3. Experimental Overview: Methods
In our experiments, we achieve SOTA spatial accuracy and temporal consistency on the VDW dataset as well as the public Sintel and NYUDV2 benchmarks. VDW and Sintel are natural-scene datasets. For closed-scene data such as NYUDV2, training on the standard NYUDV2 training set instead of VDW already reaches SOTA performance, and pre-training on VDW followed by fine-tuning on NYUDV2 improves the model further.
To demonstrate the plug-and-play property, we ran experiments with three different depth predictors, and NVDS achieved significant improvements with each of them.
We also validated bidirectional inference through ablation: unidirectional (forward or backward) prediction already achieves satisfactory consistency, while bidirectional inference further enlarges the temporal receptive field and improves consistency.
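The intuition behind bidirectional inference can be shown with a toy one-dimensional experiment. This is not the paper's network; it is a stand-in using exponential smoothing on synthetic data: a forward pass sees only past frames, a backward pass sees only future frames, and fusing the two covers both directions of the timeline.

```python
import numpy as np

def smooth(seq, alpha=0.5):
    """Causal exponential smoothing: each frame sees only earlier frames."""
    out = np.empty(len(seq))
    out[0] = seq[0]
    for t in range(1, len(seq)):
        out[t] = alpha * seq[t] + (1 - alpha) * out[t - 1]
    return out

def jitter(seq):
    """Mean absolute frame-to-frame change: a crude consistency proxy."""
    return np.abs(np.diff(seq)).mean()

rng = np.random.default_rng(1)
truth = np.linspace(0.0, 1.0, 50)              # slowly varying "depth"
noisy = truth + 0.1 * rng.standard_normal(50)  # per-frame flicker

fwd = smooth(noisy)              # forward pass: past frames only
bwd = smooth(noisy[::-1])[::-1]  # backward pass: future frames only
fused = 0.5 * (fwd + bwd)        # bidirectional fusion

print(round(jitter(noisy), 3), round(jitter(fused), 3))
```

Running this prints a markedly lower jitter for the fused sequence than for the raw one, mirroring the ablation's finding that either direction alone helps and fusing both helps more.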
Some qualitative results are shown in the figure below; NVDS yields clear improvements. In each group of examples, the left side is an RGB frame and the right side is a temporal slice of the video: fewer stripes in the slice indicate better consistency and stability. For more visualizations, please see the paper, supplementary material, and results video.
4. Experimental Overview: Dataset
For VDW, we compared against existing video depth datasets; VDW is currently the largest and most scene-diverse natural-scene video depth dataset.
We also studied the effect of training the same model on different datasets. Since VDW offers the greatest volume and scene diversity, training on VDW achieved the best performance.
For dataset statistics, we plot a word cloud of object categories and charts of the semantic-category distribution, among others. For more statistics and examples, please see the paper and supplementary material.
5. Open source code and datasets
Our code is open source:
https://github.com/RaymondWang987/NVDS
We are building the official VDW website and drafting the corresponding open-source license; the dataset will be released as soon as they are ready. Because the dataset is large, building the website and transferring the data will take some time, and we will split the data and upload it in batches. VDW may be used for academic and research purposes, but not for commercial purposes.