"
In April and May, large models are ushering in a period of "a hundred schools of thought contending"; the battle will only grow fiercer, and new players are still entering the market one after another.
"
Author | Huang Nan
Editor | Chen Caixian
Leifeng.com has learned that the multi-modal large-model startup "Sophon Engine" recently completed an angel round of financing worth 10 million yuan. The company's CEO is Gao Yizhao, born in the 1990s. Gao Yizhao is a doctoral student at Renmin University of China studying under Lu Zhiwu, who serves as an advisor to "Sophon Engine". Lu Zhiwu is also the chief AI scientist of iSoftStone.
Before ChatGPT became popular, the Beijing Zhiyuan Artificial Intelligence Research Institute took the lead in pioneering large-scale model research in China under the name "Wudao" ("Enlightenment"). At the time it gathered four major forces, led respectively by Tang Jie, Liu Zhiyuan, and Huang Minlie of Tsinghua University and Wen Jirong of Renmin University of China (for details, watch for Leifeng.com's follow-up in-depth report, "Behind-the-scenes details of Zhiyuan's development of China's large models"; interested readers are welcome to add the author's WeChat: Fiona190913).
Among them, Wen Jirong of Renmin University of China led scientists from the university's Hillhouse School of Artificial Intelligence in the multi-modal large-model direction, named "Wenlan". Lu Zhiwu was the main modeling force on the team, and his student Gao Yizhao also took part in the core research work. After "Wudao", Tang Jie, Liu Zhiyuan, and Huang Minlie all founded companies based on large-model technology, and the entry of the Renmin University camp rounds out the entrepreneurial lineup of the "Big Four" behind the Zhiyuan large models.
According to Leifeng.com, Lu Zhiwu’s team is also the first team in China to study multi-modal large models and achieve outstanding technical results.
Lu Zhiwu and Gao Yizhao started working on multi-modal large models in 2020.
In May 2020, GPT-3, developed by OpenAI, set off a huge wave in the field of artificial intelligence and drew the attention of domestic practitioners, Lu Zhiwu among them, to pre-trained large models.
Lu Zhiwu studied in the Department of Information Science, School of Mathematical Sciences, Peking University in his early years. After graduating with a master's degree, he obtained a PhD from the Department of Computer Science, City University of Hong Kong in 2011. His main research directions include machine learning and computer vision.
Lu Zhiwu
At that time, most people in China were focused on NLP, and few paid attention to multi-modal large models that extend from text to images and video.
During this period, the Hillhouse School of Artificial Intelligence at Renmin University of China established a multi-modal large-model R&D team dedicated to image-text multi-modal pre-training models. It was led by Wen Jirong, with Song Ruihua, Lu Zhiwu, and others as core members, and it was the first team in China to engage in multi-modal large-model research.
In the same year, Gao Yizhao entered the Renmin University of China’s Hillhouse School of Artificial Intelligence to pursue a doctoral degree, studying under Lu Zhiwu.
Gao Yizhao
"Sophon Engine" will launch multi-modal large models
In fact, as early as three years before ChatGPT was born, the Beijing Zhiyuan Artificial Intelligence Research Institute had taken the lead in China on large-model research under the name "Wudao". Scientists from Renmin University's Hillhouse School of Artificial Intelligence, led by Wen Jirong, formed the "Wudao·Wenlan" team to research multi-modal large models, with Lu Zhiwu serving as the main force in model development.
In March 2021, after pre-training on a dataset of 30 million image-text pairs, the first-generation "Wenlan" image-text retrieval model, BriVL, was officially launched. It is a very large-scale multi-modal pre-training model with a dual-tower structure: it encodes images and text separately and learns image-text similarity through self-supervised tasks.
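For readers unfamiliar with the term, a dual-tower model simply means two separate encoders whose outputs live in a shared embedding space. Below is a minimal PyTorch sketch of such a two-encoder image-text model; the backbone choices (ResNet-50 and a Chinese BERT), the embedding size, and the projection heads are illustrative assumptions rather than BriVL's actual configuration.

import torch
import torch.nn as nn
import torchvision.models as tvm
from transformers import BertModel

class DualTowerModel(nn.Module):
    """Two-encoder ("dual-tower") image-text model: images and text are
    encoded separately and compared in a shared embedding space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Image tower: CNN backbone followed by a linear projection head.
        resnet = tvm.resnet50(weights=None)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.image_proj = nn.Linear(2048, embed_dim)
        # Text tower: Transformer encoder followed by a linear projection head.
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, images):
        feats = self.image_encoder(images).flatten(1)          # (B, 2048)
        return nn.functional.normalize(self.image_proj(feats), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                      # [CLS] token
        return nn.functional.normalize(self.text_proj(cls), dim=-1)

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.encode_image(images)
        txt_emb = self.encode_text(input_ids, attention_mask)
        # Cosine-similarity matrix between every image and every text in the batch.
        return img_emb @ txt_emb.t()

Because the two towers are independent, image and text embeddings can be precomputed and indexed separately, which is what makes this design attractive for large-scale retrieval.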
Building on the image-text retrieval model, the research team also developed an H5 mini-application, "AI Mood Radio": simply give the AI a picture, and the model matches a suitable piece of music to it.
Three months later, Lu Zhiwu's Wenlan team released Wenlan 2.0 (BriVL-2).
Starting from the hypothesis that vision and language are only weakly correlated, the research team designed an efficient cross-modal contrastive learning strategy and built a DeepSpeed-based distributed multi-modal training framework to improve the model's expressiveness and generalization.
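A common way to realize such a cross-modal contrastive strategy is an InfoNCE-style loss in which matched image-text pairs are pulled together while every other pair in the batch serves as a negative. The sketch below is a simplified in-batch version of that general technique, not the Wenlan team's exact objective or negative-sampling scheme; in a distributed setup, the model and optimizer would additionally be handed to DeepSpeed's training engine.

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """In-batch InfoNCE-style loss over L2-normalized embeddings of shape (B, D)."""
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs sit on the diagonal; all other pairs in the batch
    # act as negatives, in both the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2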
Pre-trained on 650 million weakly correlated image-text pairs, Wenlan 2.0 has 5 billion parameters. It is currently the largest Chinese general-purpose image-text pre-training model, covers multiple fields and scenarios, and has achieved excellent performance on retrieval and generation tasks such as image-text retrieval, image captioning, and visual question answering.
During this period, Gao Yizhao was deeply involved in the image-text pre-training work for Wenlan 1.0 and 2.0, mainly responsible for data processing, model training, and evaluation.
Amid the ChatGPT boom, Lu Zhiwu and Gao Yizhao saw new opportunities for multi-modal research in the era of large models and founded the multi-modal large-model company "Sophon Engine". Drawing on their experience developing the Wenlan models, the "Sophon Engine" team officially launched a self-developed multi-modal dialogue large model on March 8 this year and released the first application-level multi-modal ChatGPT product, "Yuanchengxiang ChatImg".
"Yuanchengxiang ChatImg" has tens of billions of parameters. It mainly uses image-text pair data and VQA data as training sets, and simultaneously performs multiple tasks such as image-text matching, image-text retrieval, image description generation, and text description generation. train. According to the pictures or text input by the user, "Yuancheng Xiang ChatImg" can conduct intelligent chat, tell stories, write advertisements, etc.
Since April and May, the large models unveiled one after another have generated plenty of noise and excitement, with big manufacturers and startups refusing to yield ground. Academia entering the field of large models is a major trend; how to find one's competitiveness and position in a race that is increasingly about engineering is a question that urgently needs answering.