China's answer to GPT-4o steals the show: the country's first streaming multimodal interaction model, real-time and smooth in a live demo
Jin Lei from WAIC
Quantum Bit | Public Account QbitAI
Before GPT-4o's voice mode went live, SenseTime shipped its own "Her" first!
SenseTime just put on a live demo that brought down the house. Without further ado, let's look at the results:
Not only does the voice sound strikingly human (the audience exclaimed that it was "magnetic"), it also runs in real time and can be interrupted at any moment!
It is as if the AI had been given a pair of eyes: it can accurately describe whatever it is shown.
Even with a rough hand-drawn sketch, the AI can banter playfully with a human:
After a series of live shows, the audience burst into applause and exclaimed "Wow".
This is SenseNova 5o, part of SenseTime's newly released 5.5 series: China's first streaming, natively multimodal interaction model, with 600 billion parameters.
It is understood that this is a new mode of AI interaction that fuses all modalities (text, audio, image, and video), allowing AI to communicate with people more vividly and richly.
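SenseTime has not published the protocol behind this streaming, interruptible interaction, so the sketch below is purely illustrative (all names in it are invented). It shows the core idea behind "can be interrupted at any time": emit the reply in small chunks and cut playback the instant the user barges in.

```python
import threading

# Purely illustrative: SenseTime has not disclosed SenseNova 5o's streaming
# protocol. This toy loop demonstrates the general idea of barge-in handling:
# the reply streams out chunk by chunk, and playback stops as soon as an
# interrupt event fires.

def stream_reply(chunks, barge_in_after=None):
    """Play reply chunks; stop immediately if the interrupt event fires.

    In a real system `interrupted` would be set from a microphone thread
    the moment voice activity is detected; here we simulate a barge-in
    after `barge_in_after` chunks.
    """
    interrupted = threading.Event()
    spoken = []
    for i, chunk in enumerate(chunks):
        if interrupted.is_set():
            break                      # user started talking: cut playback
        spoken.append(chunk)
        if barge_in_after is not None and i + 1 == barge_in_after:
            interrupted.set()          # simulated voice-activity detection
    return spoken

reply = ["Sure, ", "the weather ", "today ", "is sunny."]
print(stream_reply(reply, barge_in_after=2))   # stops after two chunks
```

In a production system the event would be set by a separate voice-activity-detection thread listening to the microphone, which is what makes the interruption feel instantaneous.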
It is fair to say the movie "Her" has truly stepped into reality.
Moreover, SenseTime CEO Xu Li said on the spot that it will be available soon!
But the new AI interaction model is just a small part of SenseTime’s release this time.
Looking across the whole event, SenseTime went all-in on multimodality, with the newly released RiRiXin (SenseNova) 5.5 at its center.
Dear readers, let’s continue reading.
Computing pioneers come "alive"
You read that right. Another feat SenseTime pulled off with its new AI is "resurrecting" computing pioneers such as Turing and von Neumann.
The event also paid tribute to the late artificial intelligence scientist and SenseTime founder, Professor Tang Xiaoou. Xu Li said:
I would like to pay tribute to our founder, Professor Tang Xiaoou, for his dedication to artificial intelligence and talent cultivation, which has laid the foundation for us to stand here today and share with you some of our ideas on artificial intelligence.
Roll the clip:
The new AI behind this, named Vimi, is the first large model for controllable character video generation, built on the capabilities of RiRiXin 5.5.
It needs just one photo, in any style, to do the job; it is open to ordinary users; and the generated video can run up to a minute~
Mind you, "controllable characters" have long been a hard problem for large models. Even models like Sora struggle with imprecise motion control and unstable continuity (faces that suddenly change).
Vimi is different. It can precisely control a character's facial expressions and naturally adjust the character's pose within a bust shot.
It can also automatically generate hair, clothing, and background changes that match the character, with durations reaching the minute scale.
So in the future, if you want to star in your own blockbuster, say as the Snow Queen, all it takes is one photo:
Think that's the end? No, No, No.
Your sticker collection is about to get a lot richer.
All in all, the arrival of Vimi is good news for video creators, giving them one more high-quality AI tool to choose from.
It is worth mentioning that Vimi also received the World Artificial Intelligence Conference's (WAIC) highest honor, the "Treasure of the Hall" award.
How was it done?
On stage, SenseTime also revealed the key technology behind these results.
One aspect is architecture.
RiRiXin 5.5 adopts a hybrid mixture-of-experts architecture with device-edge-cloud collaboration, maximizing cooperation across the cloud, edge, and device to drive down inference costs.
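SenseTime has not disclosed how RiRiXin 5.5 actually splits work across cloud, edge, and device, so the following is only a toy sketch of the general idea behind such collaboration: cheap queries stay on the small on-device model, and only hard ones pay for a round trip to the larger cloud model. The scoring heuristic and keyword list are invented for illustration.

```python
# Purely illustrative: SenseTime has not published the routing logic of its
# device-edge-cloud collaboration. This toy router only demonstrates the
# general cost-saving idea: dispatch each query to the cheapest tier that
# can plausibly handle it.

def estimate_difficulty(query: str) -> float:
    """Crude difficulty score based on length and reasoning keywords."""
    q = query.lower()
    score = min(len(q) / 200.0, 1.0)
    if any(kw in q for kw in ("prove", "derive", "debug", "integrate")):
        score += 0.5   # reasoning-heavy queries favor the big cloud model
    return score

def route(query: str, threshold: float = 0.6) -> str:
    """Send easy queries to the on-device model, hard ones to the cloud."""
    return "cloud" if estimate_difficulty(query) >= threshold else "device"

print(route("What time is it?"))  # device: short, no reasoning keywords
print(route("Please prove that the sum of two odd numbers is even."))  # cloud
```

A real system would likely use a learned router rather than keywords, but the economics are the same: every query kept on-device is inference cost saved.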
The other is data.
RiRiXin 5.5 was trained on more than 10 TB of high-quality tokens, including a large amount of synthetic chain-of-thought data, comprehensively upgrading its language understanding and interaction capabilities.
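The exact format of SenseNova's synthetic chain-of-thought data is not public; as a purely illustrative sketch, the general recipe for such data is to pair each question with explicit intermediate reasoning steps rather than only the final answer. The record layout below is invented for illustration.

```python
# Purely illustrative: the format of SenseNova's synthetic chain-of-thought
# data has not been published. This sketch shows the general recipe: each
# training record carries the reasoning steps, not just the answer.

def make_cot_example(question: str, steps: list[str], answer: str) -> dict:
    """Assemble one synthetic chain-of-thought training record."""
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    return {"prompt": question, "completion": f"{reasoning}\nAnswer: {answer}"}

ex = make_cot_example(
    "What is 13 * 7?",
    ["13 * 7 = 10 * 7 + 3 * 7",
     "10 * 7 = 70 and 3 * 7 = 21",
     "70 + 21 = 91"],
    "91",
)
print(ex["completion"])
```

Training on completions like this is what teaches a model to "show its work", which is why such data is credited with the reasoning gains listed below.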
As a result, RiRiXin 5.5 improves significantly over the previous version across dimensions such as mathematics, reasoning, and programming, especially on core metrics like mathematical reasoning (↑31.5%), English comprehension (↑53.8%), and instruction following (↑26.8%).
How does this show up in practice? Authoritative leaderboards offer good evidence.
For example, in the OpenCompass evaluation, SenseNova 5.5's average score is on par with GPT-4o's, and it exceeds GPT-4o on several sub-dimensions.
Not 999, not 99, just 9.9 yuan for the whole year
In addition to multimodality, the terminal side is also one of the key focuses of SenseTime this time.
Now the new edge-side model, SenseNova 5.5 Lite, has also been upgraded across every performance metric.
Running on a flagship mobile platform, 5.5 Lite achieves a first-packet (time-to-first-response) latency of just 0.19 seconds, 40% lower than the previous version.
Its inference speed increased by 15%, reaching a processing speed of 90.2 Chinese characters per second.
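A quick back-of-the-envelope check on those figures: if a 15% speedup lands at 90.2 characters per second, the previous version must have run at roughly 90.2 / 1.15 ≈ 78.4 characters per second.

```python
# Sanity-checking the release figures: back out the pre-upgrade speed
# implied by a 15% improvement ending at 90.2 chars/s.
new_speed = 90.2                 # Chinese characters per second, per the release
old_speed = new_speed / 1.15     # implied previous-version speed
print(f"{old_speed:.1f}")        # prints 78.4
```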
In addition, SenseTime launched an edge-side model matrix, including purpose-built models such as a mini writing assistant, a summarization assistant, and an encyclopedia assistant.
These specialized models perform better in their target scenarios and can meet customers' complex business needs; customers can also choose among, or customize, different specialized models.
Moreover, the edge-side large model based on RiRiXin 5.5 manages to be "more, faster, and better" while also being cheaper: the cost can be as low as 9.9 yuan per device per year.
On the enterprise side, SenseTime has partnered with more than 3,000 enterprise customers, spanning the internet, healthcare, finance, programming, and other sectors.
And on the subject of pricing and accessibility, SenseTime's "0 Yuan GO" plan deserves a mention:
Starting today, new SenseNova users receive a bundle of free services covering API calls, migration, and training.
They also get a free package of 50 million tokens and a dedicated migration consultant, so that newcomers can settle into their new home comfortably and smoothly.
After watching the entire release of SenseTime, we still need to answer a question:
Why is it important to reinvent interaction?
Regarding this issue, SenseTime CEO Xu Li gave his interpretation:
I used to think that, hot as our industry is, it had not yet reached its super moment, because AI had not truly entered the vertical applications of an industry and triggered widespread change.
But my thinking has shifted a bit: super moments and applications should reinforce each other. Only the change in cognition that a super moment brings can ultimately drive such applications forward.
So applications may turn out to be the key to whether this era counts as artificial intelligence's super moment.
This is why SenseTime launched a streaming, natively multimodal interaction model: only with richer and more accurate multimodality, lower latency, and stronger controllability can applications be taken to the next level.
In short, the ideas are clear, technology is constantly improving, and the super moment of AI 2.0 may be accelerating towards us.
- End -