Experts’ Views | Promoting the Development of Intelligent Vision Technology in the Era of Big Models
By: Jian Ma, Vice President of Business Development, IoT Division, Arm
(Thanks to Catherine Wang, Chief Architect of Computer Vision, Arm Engineering, for her contribution to this article)
Noam Chomsky, a pioneer in linguistics and cognitive science, once said that human language is unique in the animal world. Today, with the rapid development of large language models (LLMs) and generative artificial intelligence (AI) such as GPT-3.5, GPT-4, and BERT, machines have begun to understand human language, greatly expanding what machines can do. This has also prompted people to ask: where will technology go next?
The evolution of intelligence shapes a new computing paradigm
To predict where AI is headed, we only need to look back at ourselves. We change the world through a dynamic interaction of senses, thoughts, and actions. This process involves sensing the world around us, processing information, and responding thoughtfully.
In the history of computing technology, we have watched abilities once unique to humans, perception, thinking, and action, gradually being mastered by machines. Each such transfer of capability gives birth to a new paradigm.
At the end of the 20th century, large companies like Google transformed the cost of obtaining information from a marginal cost into a fixed cost. Google invested in crawling the web and indexing information, so that for each of us searching for information, the incremental cost was almost negligible. Machines began to become our information system. This opened the Internet era and the subsequent mobile Internet era, changing the way people obtain, disseminate, and share information, and profoundly affecting business, education, entertainment, social interaction, and many other fields.
We are now witnessing a new turning point in the development of technology, where the ability to think, reason, and build models is shifting from humans to machines. OpenAI and large models have transformed the cost of producing models from a marginal cost into a fixed cost.
These large models are trained on vast amounts of text, images, and video from the Internet, covering fields such as law, medicine, science, and art. This extensive training allows them to serve as base models from which other models can be built far more easily.
This turning point will surely inspire the emergence of all kinds of models, whether cognitive models (how to observe and express), behavioral models (how to drive a car), or models for specific fields (how to design semiconductor chips). Models are carriers of knowledge, and this turning point will make models and knowledge ubiquitous. It will accelerate a new round of technological innovation and usher in a new era of machines such as self-driving cars, autonomous mobile robots, and humanoid robots, along with their applications across industries and deployment scenarios. These new paradigms will redefine the way humans and machines interact.
Multimodal LLMs and the key role of vision
Through the Transformer model and its self-attention mechanism, AI can become truly multimodal: AI systems can process inputs from multiple modalities such as speech, images, and text, just as people do.
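For readers who want to see the mechanics, here is a minimal sketch of the scaled dot-product attention at the heart of the Transformer, written in plain NumPy; the shapes and token counts are illustrative only.

```python
# Minimal sketch of scaled dot-product attention (the core of the Transformer).
# Shapes and names are illustrative; real implementations add multiple heads,
# masking, and learned projection weights.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted mix of value vectors

# Tokens from different modalities (text, image patches, audio frames) can be
# projected into the same embedding space and attended to jointly.
tokens = np.random.randn(6, 64)
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (6, 64)
```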
OpenAI's CLIP, DALL·E, Sora, and GPT-4o are among the models moving towards multimodality. CLIP learns from paired image and natural-language data, building a bridge between visual and textual information; DALL·E generates images from text descriptions; Sora generates videos from text and is expected to evolve into a world simulator. GPT-4o takes this a step further: OpenAI trained a single new model end-to-end on text, vision, and audio, with no need to convert other media to text or vice versa. All inputs and outputs are processed by the same neural network, allowing the model to integrate audio, visual, and textual information and reason across modalities in real time.
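As an illustration of how a model like CLIP bridges images and text, the following is a minimal sketch of zero-shot image classification using the publicly available CLIP checkpoint in the Hugging Face transformers library; the image file and candidate labels are placeholders.

```python
# Illustrative zero-shot image classification with CLIP via Hugging Face
# transformers (checkpoint and labels are examples, not a fixed recipe).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")            # any local test image
labels = ["a photo of a car", "a photo of a bicycle", "a photo of a pedestrian"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)   # image-text similarity -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```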
The future of multimodal AI will focus on the edge
AI innovators are constantly pushing the boundaries of where models can run. Advances in edge hardware (much of it developed and designed on the Arm platform) make this possible, while latency, privacy and security requirements, bandwidth and cost considerations, and the need to keep working offline when network connectivity is intermittent or absent make it necessary. As Sam Altman has also noted [1], edge models are critical to delivering an ideal user experience for video (what we perceive visually).
However, resource constraints, model size, and model complexity have hindered the move of multimodal AI to the edge. Addressing these challenges requires a combination of hardware advances, model optimization techniques, and innovative software solutions to bring multimodal AI into widespread use.
Recent AI developments have had a profound impact on computer vision, and this deserves particular attention. Many vision researchers and practitioners are using large models and Transformers to enhance vision capabilities. In the era of large models, vision is becoming increasingly important, for several reasons:
Machine systems must understand their surroundings through perception capabilities such as vision, which provide the safety and obstacle-avoidance functions that autonomous vehicles and robots need wherever human safety is at stake. Spatial intelligence is a hot research area for researchers such as Fei-Fei Li, known as the "godmother of AI."
Vision is essential for human-machine interaction. AI companions need not only high IQ but also high EQ. Machine vision can capture human expressions, gestures, and movements to better understand human intentions and emotions.
AI models need vision and other sensors to collect real-world data and adapt to specific environments. As AI extends from light industry into heavy industry, where digitization levels are lower, it becomes particularly important to collect datasets that capture the characteristics of the physical world, build simulation environments or digital twins of the 3D physical world, and use these to train large multimodal models so that they understand the real physical world.
Examples of vision + base models
Although ChatGPT is popular for its excellent language capabilities, as mainstream LLMs gradually become multimodal it may be more appropriate to call them "base models". The field of base models spanning multiple modalities, including vision, is developing rapidly. Here are some examples:
DINOv2
DINOv2 is an advanced self-supervised learning model developed by Meta AI. It builds on the original DINO model and was trained on a dataset of 142 million images, which improves its robustness across different visual domains. DINOv2 can segment objects without specialized training. It also produces general-purpose features that work for image-level tasks (such as image classification and video understanding) and pixel-level tasks (such as depth estimation and semantic segmentation), showing excellent generalization and versatility.
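The following is a minimal sketch of extracting general-purpose image features with DINOv2 through torch.hub, following Meta AI's public release; the image file, crop size, and printed feature dimension are illustrative.

```python
# Hedged sketch: extracting general-purpose image features with DINOv2 via
# torch.hub (repository and model names follow Meta AI's public release).
import torch
from torchvision import transforms
from PIL import Image

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),               # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("sample.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model(image)                   # one global feature vector per image
print(features.shape)                         # e.g. torch.Size([1, 384]) for ViT-S/14
```

These features can then feed lightweight heads for classification, depth estimation, or segmentation without retraining the backbone.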
Segment Anything Model (SAM)
SAM is a promptable segmentation system that generalizes zero-shot to unfamiliar objects and images without additional training. It identifies and segments objects in an image from a variety of input prompts that specify the target to be segmented, so no special training is needed for each new object or scene. According to Meta AI, SAM can generate a segmentation mask in about 50 milliseconds, making it well suited to real-time applications, and it is versatile enough to serve many fields, from medical imaging to autonomous driving.
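Below is a minimal sketch of prompt-based segmentation using Meta AI's open-source segment-anything package; the checkpoint file and the click coordinates used as a prompt are placeholders.

```python
# Hedged sketch of prompt-based segmentation with Meta AI's segment-anything
# package; the checkpoint filename and click coordinates are placeholders.
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                     # the image embedding is computed once

# A single foreground click (label 1) is the prompt that clarifies the target.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,                     # return several candidate masks
)
print(masks.shape, scores)                     # (3, H, W) boolean masks + confidences
```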
Stable Diffusion
Stable Diffusion is a generative AI model that creates images from text descriptions. It uses a technique called latent diffusion: rather than working directly in pixel space, it operates on a compressed representation of the image in latent space, which reduces the computational load and lets the model generate high-quality images faster.
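As a simple illustration, the following sketch generates an image from a text prompt with the diffusers implementation of Stable Diffusion; the checkpoint name, prompt, and step count are examples rather than recommendations.

```python
# Hedged sketch of text-to-image generation with the diffusers library;
# checkpoint name, prompt, and step count are examples only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")                           # fp16 weights assume a GPU; use float32 on CPU

image = pipe(
    "a robot arm assembling a circuit board, studio lighting",
    num_inference_steps=25,                      # fewer steps trade quality for latency
    guidance_scale=7.5,
).images[0]
image.save("robot_arm.png")
```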
Stable Diffusion can already run at the edge on smart mobile devices. The following outlines an example of the Stable Diffusion optimization process:
Stable Diffusion's original configuration (at 512×512 image resolution) is not suitable for running on mobile CPUs or NPUs.
By using a smaller U-Net architecture, fewer sampling steps, conversion to the ONNX format, quantization (from FP32 to INT8), and other techniques, it achieves a speedup of more than 60x on the CPU alone. Many of these optimization techniques and tools were developed within Arm's extensive ecosystem, and there is still room to optimize the model further.
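Two of the steps mentioned above, conversion to ONNX and INT8 quantization, can be illustrated generically with PyTorch and ONNX Runtime. The sketch below uses a small placeholder network; it is not the Stable Diffusion U-Net or Arm's actual optimization pipeline.

```python
# Generic sketch of two optimization steps mentioned above: exporting a PyTorch
# model to ONNX and applying dynamic INT8 quantization with ONNX Runtime.
# The tiny model here is a placeholder, not the Stable Diffusion U-Net.
import torch
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

# 1. Switch to the ONNX format so runtimes such as ONNX Runtime can execute it.
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, "model_fp32.onnx",
                  input_names=["input"], output_names=["output"])

# 2. Quantize FP32 weights to INT8 to shrink the model and speed up CPU inference.
quantize_dynamic("model_fp32.onnx", "model_int8.onnx",
                 weight_type=QuantType.QInt8)
print("wrote model_fp32.onnx and model_int8.onnx")
```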
Achieving an excellent visual experience with multimodal LLMs
As a member of Arm's Intelligent Vision Partner Program, Axera has deployed the DINOv2 vision Transformer at the edge on its flagship AX650N chipset. The chip combines an Arm Cortex-A55 CPU cluster for pre- and post-processing with Axera's mixed-precision NPU and AI-ISP, delivering high performance, high precision, easy deployment, and excellent energy efficiency.
The following shows the effect of running DINOv2 on AX650N:
By pre-training on large and diverse datasets, vision Transformers generalize better to new, unseen tasks, which simplifies retraining and reduces tuning time. They can be applied to a variety of tasks beyond image classification, such as object detection and segmentation, without requiring extensive architectural changes.
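As a hedged illustration of this kind of reuse, the sketch below adapts a pre-trained vision Transformer to a new classification task by replacing only its head; the timm model name and the ten-class setup are assumptions for the example, not a recommendation from the article.

```python
# Hedged sketch: reusing a pre-trained vision Transformer backbone for a new
# task by swapping only the classification head (the timm model name is an
# example; any pre-trained ViT checkpoint would do).
import timm
import torch

# Load a ViT pre-trained on a large dataset and attach a fresh 10-class head.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Optionally freeze the backbone so only the new head is tuned, shortening retraining.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False

x = torch.randn(2, 3, 224, 224)       # a dummy batch of images
logits = model(x)
print(logits.shape)                   # torch.Size([2, 10])
```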
Embracing the future of AI and human-machine interface
Thanks to the continued development of AI and LLMs, we are at a turning point in how technology and humans interact. Vision will play a key role in this evolution, giving machines the ability to understand their surroundings and operate ("survive") in the physical world, ensuring safety and enhancing interactivity. Driven by the rapid development of hardware and software, the shift to edge AI promises efficient, real-time applications.
[1] For the source, please refer to the original article.