Kechuang丨The secret of the domestic version of Sora is hidden in the large model team of Shengshu Technology
Recently, Shengshu Technology and Tsinghua University jointly released Vidu, the country's first large video-generation model built on the independently developed U-ViT architecture.
It can quickly generate 16-second videos at full-HD 1080p resolution, delivering performance comparable to Sora, particularly in multi-shot generation, spatio-temporal consistency, simulation of the real physical world, and creative capability.
Vidu holds a clear advantage in generation duration, breaking through the roughly ten-second ceiling at which domestic Sora-style models had long been stuck.
Shengshu Technology uses a fully end-to-end, single-model generation approach, producing smooth, continuous video without any frame-interpolation processing.
Specifically, Vidu can generate scenes rich in detail that obey real-world physical laws, with plausible lighting and shadow and nuanced character expressions.
At the same time, it can also generate fictional images that do not exist in the real world and create surreal content.
In terms of camera language, it is no longer limited to simple static shots: it can switch among long shots, medium shots, close shots, and close-ups, and produce effects such as long takes, focus tracking, and transitions.
It is worth mentioning that Vidu adopts a "one-step" generation approach: like Sora, the conversion from text to video is direct and continuous.
At the level of the underlying algorithm, Vidu is generated fully end-to-end by a single model, with no intermediate frame interpolation or other multi-step processing involved.
This innovative technology brings new breakthroughs and possibilities to the field of video generation.
A startup born at Tsinghua, running models and applications on two tracks
The name Vidu is not only a homophone of "video" but also carries the meaning of "We do", conveying a spirit of action and practice.
According to public information, Shengshu Technology was established in March 2023. Its core members are all from the Artificial Intelligence Research Institute of Tsinghua University and are committed to independently developing the world's leading controllable multi-modal general large model.
Zhu Jun, chief scientist of Shengshu Technology, is a professor in Tsinghua University's Department of Computer Science and deputy dean of its Artificial Intelligence Research Institute.
Tang Jiayu, CEO of Shengshu Technology, received his bachelor's and master's degrees from Tsinghua's Department of Computer Science;
CTO Bao Fan is a doctoral student in the same department and a member of Professor Zhu Jun's research group, where he works alongside Zhu to advance the company's R&D.
Shengshu Technology currently adopts a parallel strategy of model layer and application layer.
On the one hand, it is building an underlying general-purpose large model covering text, image, video, 3D and other modalities, providing model services to business (B-side) customers;
On the other hand, it builds professional applications for scenarios such as image generation and video generation, monetized through subscriptions and other means.
These applications mainly target content-creation scenarios such as game production and film and television post-production, demonstrating Shengshu Technology's dual strength in technology and market application.
Betting on the right technical route: the advantages of the fusion architecture emerge
Vidu, which Shengshu Technology recently released, and OpenAI's Sora both differ markedly from the mainstream diffusion models built on the traditional U-Net convolutional architecture: they adopt cutting-edge fusion architectures, U-ViT and DiT respectively.
This fusion architecture organically combines Diffusion (the diffusion model) with the Transformer, aiming to exploit the Transformer's scalability while retaining the diffusion model's natural strengths in processing visual data, which is why it performs so well on visual tasks.
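To make the fusion idea concrete, here is a minimal, hypothetical sketch in PyTorch (class names, sizes, and the noise schedule are all illustrative and are not taken from Shengshu's or OpenAI's code) of a standard noise-prediction diffusion training step in which the denoiser is a plain Vision-Transformer-style network over image patches rather than a U-Net:

```python
# Illustrative only: a diffusion training step with a Transformer denoiser.
import torch
import torch.nn as nn

class PatchTransformerDenoiser(nn.Module):
    """Hypothetical ViT-style denoiser: patchify -> transformer -> unpatchify."""
    def __init__(self, img_size=32, patch=4, dim=256, depth=6, heads=4):
        super().__init__()
        self.patch = patch
        n_tokens = (img_size // patch) ** 2
        self.embed = nn.Linear(patch * patch * 3, dim)        # patch embedding
        self.time_embed = nn.Linear(1, dim)                   # timestep as a token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch * 3)         # predict noise per patch

    def forward(self, x, t):
        B, C, H, W = x.shape
        p = self.patch
        # Patchify: (B, C, H, W) -> (B, num_patches, C*p*p)
        patches = x.unfold(2, p, p).unfold(3, p, p).permute(0, 2, 3, 1, 4, 5)
        patches = patches.reshape(B, -1, C * p * p)
        tokens = self.embed(patches)
        t_tok = self.time_embed(t.float().view(B, 1, 1))      # time token
        h = torch.cat([t_tok, tokens], dim=1) + self.pos
        h = self.backbone(h)
        noise_pred = self.head(h[:, 1:])                      # drop the time token
        # Unpatchify back to an image-shaped tensor
        noise_pred = noise_pred.view(B, H // p, W // p, C, p, p)
        return noise_pred.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)

# Standard diffusion training step: add noise at a random timestep, predict it.
model = PatchTransformerDenoiser()
x0 = torch.randn(8, 3, 32, 32)                  # a batch of (toy) clean images
T = 1000
alphas_bar = torch.linspace(0.9999, 0.0001, T)  # simplified noise schedule
t = torch.randint(0, T, (8,))
noise = torch.randn_like(x0)
a = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise    # forward (noising) process
loss = ((model(x_t, t) - noise) ** 2).mean()    # predict the added noise
loss.backward()
```

In Vidu- and Sora-style systems, the U-Net that usually plays the role of the denoiser is replaced by exactly this kind of Transformer backbone, which is what allows the model to be scaled up smoothly.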
Looking back at the R&D trajectory that led the Shengshu team to video generation: as early as 2017, the team released ZhuSuan (珠算, literally "abacus"), a Bayesian probabilistic machine learning platform.
ZhuSuan was one of the world's earliest programming libraries for deep probabilistic models, supporting probabilistic modeling with a variety of deep generative models, including GANs, VAEs, and flow models.
In early 2022, the team proposed the training-free inference framework Analytic-DPM, which greatly improved the sampling efficiency by directly estimating the optimal variance. Compared with the traditional model DDPM, it was accelerated by nearly 20 times.
This result was selected as an ICLR 2022 Outstanding Paper and was applied by OpenAI in the DALL·E 2 model processing strategy.
In June of the same year, the team proposed the DPM-Solver sampling algorithm, which produces high-quality samples in only 10 to 15 steps.
This result was presented as a NeurIPS 2022 oral and has been adopted by numerous open-source projects such as Stable Diffusion; it remains one of the fastest image-generation sampling algorithms available.
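As a usage illustration (assuming the Hugging Face diffusers library and the public runwayml/stable-diffusion-v1-5 checkpoint, neither of which the article mentions), this is how DPM-Solver is typically applied in practice: swap a Stable Diffusion pipeline's default scheduler for the multistep DPM-Solver and sample in roughly 10-15 steps.

```python
# Sketch of typical DPM-Solver usage with the diffusers library (requires a GPU).
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with the multistep DPM-Solver variant.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# ~15 sampling steps are typically enough for good quality with DPM-Solver,
# versus the several dozen steps older samplers need.
image = pipe(
    "an astronaut riding a horse on the moon",
    num_inference_steps=15,
).images[0]
image.save("astronaut.png")
```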
With the continuous advancement of technology, in September 2022, the team published the U-ViT paper, proposing for the first time an architectural idea that integrates the diffusion model with Transformer.
The subsequent DiT architecture also followed this innovative concept and was eventually adopted by Sora.
Compared with a traditional Transformer, U-ViT significantly speeds up training convergence by introducing "long connection" (long skip connection) technology.
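The sketch below (a simplified, hypothetical PyTorch rendering, not the official U-ViT code) shows what such a long connection looks like: each shallow transformer block is paired with a mirrored deep block, U-Net style, and the pair is joined by concatenating the token features and projecting them back to the model width with a linear layer.

```python
# Illustrative U-ViT-style backbone with long skip connections between
# shallow and deep transformer blocks.
import torch
import torch.nn as nn

class UViTBackbone(nn.Module):
    def __init__(self, dim=256, depth=8, heads=4):
        super().__init__()
        assert depth % 2 == 0
        half = depth // 2
        def block():
            return nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.in_blocks = nn.ModuleList([block() for _ in range(half)])   # shallow half
        self.out_blocks = nn.ModuleList([block() for _ in range(half)])  # deep half
        # One linear projection per long skip: maps concat(2*dim) back to dim.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(half)])

    def forward(self, tokens):
        skips = []
        for blk in self.in_blocks:
            tokens = blk(tokens)
            skips.append(tokens)              # stash shallow features
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            skip = skips.pop()                # matching shallow block, innermost first
            tokens = proj(torch.cat([tokens, skip], dim=-1))  # long connection
            tokens = blk(tokens)
        return tokens

# Example: 64 tokens of width 256 (e.g. patch + time + condition tokens).
x = torch.randn(2, 64, 256)
print(UViTBackbone()(x).shape)  # torch.Size([2, 64, 256])
```

These long connections give the deep layers direct access to low-level features, which is the property the article credits for the faster training convergence.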
In March 2023, based on the U-ViT architecture, the team trained UniDiffuser, a model with roughly one billion parameters, on the large-scale image-text dataset LAION-5B and open-sourced it.
UniDiffuser not only supports arbitrary generation and conversion between the image and text modalities; its training also verified the scalability (scaling law) of the fusion architecture on large-scale tasks, meaning every step of the fusion route had been effectively validated.
It is worth mentioning that, compared with Stable Diffusion 3, which only recently switched to the DiT architecture, UniDiffuser applied the fusion architecture to image-text models a full year earlier.
For resource reasons, unlike the Sora team, which took a high-intensity, all-in approach to long-video R&D, Shengshu Technology chose to start with 2D images and expand step by step into 3D and video.
In January this year, Shengshu Technology officially launched 4-second short-video generation. After Sora's release in February, the team moved quickly, achieving 8-second generation in March and reaching 16 seconds in April, with across-the-board improvements in both generation quality and duration.
Completed three rounds of financing and became the domestic valuation leader
In June 2023, Shengshu Technology completed its first round of financing, led by Ant Group with follow-on investment from BV Baidu Ventures and Zhuoyuan Capital.
After this round, the company's valuation reached US$100 million.
The funds were earmarked for building the core R&D team and for product development, supporting the company's sustained growth.
It is worth mentioning that Shengshu Technology was the first AIGC project Ant Group invested in after ChatGPT's release in November 2022, and the third major investment by Baidu Ventures in the field of AI content generation.
In August 2023, Shengshu Technology closed an angel+ round of tens of millions of yuan, exclusively invested by Jinqiu Fund.
The funds will be mainly used for algorithm research and development, product development and team expansion, injecting new impetus into the future development of Shengshu Technology.
By March 2024, Shengshu Technology successfully completed hundreds of millions of yuan in Series A financing.
This round was backed by new investors including Qiming Venture Partners, Datai Capital, and Zhipu AI, with continued participation from existing shareholders BV Baidu Ventures and Zhuoyuan Asia.
Across its three rounds of financing, Shengshu Technology has raised hundreds of millions of yuan in total, making it one of the highest-valued startups in China's multimodal large-model sector.
At the same time, the Shengshu team has also launched an industrial-grade, general-purpose foundation model (closed-source) built on a unified multimodal, multi-task framework, reflecting the company's depth and innovative drive in AI.
The core Shengshu team was among the first to work on multimodal large models and has extensive experience and notable results in the basic theory and algorithms of diffusion probabilistic models.
At present, Shengshu Technology is among the Chinese teams with the most published papers on diffusion probabilistic models, underscoring the company's leading position and strong R&D capability in AI.
Conclusion: broad market prospects, but sustained cultivation is still needed
Text-to-video technology is expected to drive a productivity shift in video creation, significantly lowering production costs and the barrier to creation, and is likely to be adopted first in short video and animation.
CCB International has noted that text-to-video models have broad application prospects across many industries, including but not limited to marketing and advertising, R&D and training, e-commerce and retail, and entertainment and gaming.
According to data from Bloomberg Intelligence, the global AIGC market size is expected to increase significantly from US$67 billion in 2023 to US$897 billion in 2030, indicating that the compound annual growth rate of this field will reach an astonishing 45%.
For the Chinese market, iResearch predicts that its industry scale will grow rapidly from RMB 14.3 billion in 2023 to RMB 1,144.1 billion in 2030, with a compound annual growth rate of as high as 87%.
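As a quick arithmetic check of the growth rates quoted above (seven years of compounding from 2023 to 2030), the figures can be reproduced in a few lines of Python:

```python
# Verify the quoted compound annual growth rates (CAGR) for 2023 -> 2030.
def cagr(start, end, years):
    return (end / start) ** (1 / years) - 1

print(f"Global AIGC: {cagr(67, 897, 7):.0%}")       # ~45%
print(f"China AIGC:  {cagr(14.3, 1144.1, 7):.0%}")  # ~87%
```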
This trend points to the huge potential and broad prospects for text-to-video in the Chinese market.
Sources (partial): Geek Park, "The secret of domestic Sora is hidden in this Tsinghua model team"; Heart of the Machine, "Can domestic companies make a Sora? This Tsinghua large model team offers hope"; China News Network, "China's first! Comprehensively benchmarking Sora"; Lieyun Selection, "Backed by Tsinghua, the 'strongest domestic' Sora is here"; Computing Power Leopard, "'Challenging' Sora: Tsinghua's Zhu Jun and Shengshu Technology raise hundreds of millions more, led by Qiming Venture Partners".