
Introducing OpenVINO™ 2024.0: providing developers with greater performance and expanded support

Latest update time: 2024-03-15

Author | Yury Gorbachev, Intel Fellow, OpenVINO™ Product Architect

Translation | Wu Zhuo, Intel AI Software Evangelist


Hello, OpenVINO™ 2024.0

Welcome to OpenVINO™ 2024.0, where we are excited to introduce a series of enhancements designed to empower developers in the rapidly evolving field of artificial intelligence! This release improves large language model (LLM) performance with dynamic quantization, better GPU optimizations, and support for Mixture-of-Experts (MoE) architectures. OpenVINO™ 2024.0 helps developers take full advantage of AI acceleration, and we appreciate the community's continued contributions.




Improvements to large language model inference



Large language models (LLMs) show no signs of going away, and new models and use cases continue to emerge. We are continuing our mission to accelerate these models and make inference on them more affordable.



Performance and accuracy improvements


In this release, we've been working on improving LLMs' out-of-the-box performance and have made some important changes to the runtime and tooling.


First, we are introducing dynamic quantization and a cache compression mechanism for the CPU platform. KV-cache compression allows us to generate long sequences more efficiently, while dynamic quantization optimizes the compute and memory consumption of other parts of the model (the embeddings and feed-forward networks).
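For illustration, these features surface as runtime properties; a minimal sketch, assuming the property names below match your installed OpenVINO™ version (verify them against the documentation):

import openvino as ov

core = ov.Core()
# "model" is assumed to be an ov.Model you have already loaded; the two
# property names are assumptions to check against your release.
compiled = core.compile_model(model, "CPU", {
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",  # enable dynamic quantization of activations
    "KV_CACHE_PRECISION": "u8",               # store the KV cache in 8-bit precision
})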

For GPU platforms, we are also improving generation characteristics by introducing optimizations in the kernels and across the stack. We have also implemented more efficient cache handling, which facilitates the use of beam search during generation.

Second, while performance is always a topic of discussion, accuracy is equally crucial. We have improved the accuracy of the weight compression algorithms in NNCF, introduced the ability to compress weights using statistics from a dataset, and added an implementation of the AWQ algorithm to further improve accuracy. Additionally, through our integration with Hugging Face Optimum Intel, you can now compress models directly through the Transformers API as follows:
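A minimal sketch of that call (the model ID is a placeholder; substitute your own):

from optimum.intel import OVModelForCausalLM

# load_in_4bit triggers 4-bit weight compression while exporting the
# model to the OpenVINO format; the model ID is only an example.
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    load_in_4bit=True,
)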


(Code source: https://github.com/huggingface/optimum-intel/pull/538 )


Note: use the load_in_4bit option set to True and pass the quantization_config argument in the call to the from_pretrained method, which will do all the compression for you. What's more, we've added default quantization configurations for the most popular models, including Llama2, StableLM, ChatGLM, and QWEN; for these models, you don't need to pass a config at all to get 4-bit compression.


To learn more about the quality of our algorithms, you can consult the OpenVINO™ documentation: https://docs.openvino.ai/nightly/weight_compression.html

Or the NNCF documentation on GitHub: https://github.com/openvinotoolkit/nncf/blob/a917efd684c2febd05032a8f2a077595fb73481a/docs/compression_algorithms/CompressWeights.md#evaluation-results
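For example, here is a minimal sketch of data-aware 4-bit compression directly through NNCF; the dataset and awq keyword arguments are assumptions to verify against your installed NNCF version:

import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("model.xml")  # the path is a placeholder

# "samples" is a placeholder iterable of calibration inputs; the dataset
# supplies the statistics mentioned above, and awq=True enables AWQ.
calibration = nncf.Dataset(samples)
compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,            # share of the weights compressed to 4 bits
    dataset=calibration,
    awq=True,
)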


Support for Mixture-of-Experts (MoE) architectures


Mixture of Experts (MoE) represents the next major architectural evolution, bringing better accuracy and performance to LLMs. It started with Mixtral and quickly spread to more models and frameworks that allow creating MoE-based models from existing ones. Throughout the 2024.0 release, we have been working to enable these architectures and improve their performance. Not only did we implement efficient conversion of these models, but we also changed some internal structures to better handle the dynamic selection of experts at runtime.


We are working on an upgrade to Hugging Face Optimum-Intel so that the conversion of these models is transparent.




Support for new platforms and enhancements to existing ones


Wider access to Intel NPU


With the release of Intel® Core™ Ultra, our NPU accelerator is finally reaching the majority of developers. It is an evolving product from both a software and a hardware perspective, and we're excited about what it can do. You may have already seen some demos of OpenVINO™ notebooks running on the NPU:

https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/230-yolov8-optimization


In this release, we are providing NPU support when you install OpenVINO™ through our most popular distribution channel, PyPI. There are a few points to note:

  • The NPU requires drivers to be installed on your system, so if you plan on using it, make sure to follow this short guide:

    https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_npu.html

  • NPU is currently not included in the automatic device selection logic ( https://docs.openvino.ai/nightly/openvino_docs_OV_UG_supported_plugins_AUTO.html ), so if you plan to run your model on an NPU, make sure you specify the device name explicitly (e.g. NPU) as shown below:

import openvino as ov  # "model" is an ov.Model you have already read or converted

core = ov.Core()
compiled_model = core.compile_model(model=model, device_name="NPU")


Improved support for ARM CPUs


Threading is one of the things we had not implemented effectively on the ARM platform, and it was holding back our performance. We worked with the oneTBB team (the provider of our default threading engine) to change the ARM support and significantly improve performance. At the same time, after researching the precision of certain operations, we enabled fp16 as the default inference precision on ARM CPUs.

Overall, this means higher performance on ARM CPUs, and it also enables the OpenVINO streams feature ( https://docs.openvino.ai/nightly/openvino_docs_deployment_optimization_guide_tput_advanced.html#openvino-streams ), which allows higher throughput on multi-core platforms.
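As a quick illustration, throughput-oriented execution with streams can be requested through the standard performance hint; a minimal sketch, assuming "model" is an ov.Model you have already loaded:

import openvino as ov

core = ov.Core()
# THROUGHPUT asks the runtime to create multiple streams across the cores.
compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
# Alternatively, pin an explicit stream count:
# compiled = core.compile_model(model, "CPU", {"NUM_STREAMS": "4"})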




Removing some legacy components



2024.0 is our next major release, and by tradition this is when we remove obsolete components from the toolkit.


Two years ago, we substantially changed our API to keep pace with developments in the deep learning field. But to minimize the impact on existing developers and products using OpenVINO™, we kept supporting API 1.0 as well. A lot has changed since then, and we are now removing the old API entirely. More importantly, we are also removing the tools that were marked as deprecated. These include:


  • The Post-Training Optimization Tool, also known as POT

  • The Accuracy Checker framework

  • The Deployment Manager


These tools were part of the openvino-dev package, which has not been mandatory for some time. We will keep the package for those who continue to use Model Optimizer, our offline model conversion tool.


If you cannot migrate to the new API, you can most likely continue to use one of our long-term support releases, such as 2023.3.
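For reference, a minimal sketch of what the migration looks like (the model path is a placeholder):

# API 1.0 (removed in 2024.0):
# from openvino.inference_engine import IECore
# ie = IECore()
# net = ie.read_network(model="model.xml")
# exec_net = ie.load_network(network=net, device_name="CPU")

# The API 2.0 equivalent:
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")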




New and updated notebooks


We continue to showcase the most important updates in the AI field, along with how to use OpenVINO™ to accelerate these scenarios. Here is what we have been working on:


  • Mobile language assistant with MobileVLM

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/279-mobilevlm-language-assistant

  • Depth estimation with DepthAnything

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/280-depth-anything

  • Multimodal Large Language Models (MLLM) Kosmos-2

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/281-kosmos2-multimodal-large-language-model

  • Zero-shot Image Classification with SigLIP

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/282-siglip-zero-shot-image-classification

  • Personalized image generation with PhotoMaker

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/283-photo-maker

  • Voice tone cloning with OpenVoice

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/284-openvoice

  • Line-level text detection with Surya

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/285-surya-line-level-text-detection

  • Zero-shot Identity-Preserving Generation with InstantID

    https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/286-instant-id

  • The LLM chatbot and LLM RAG pipeline notebooks have been updated with integrations of new models: minicpm-2b-dpo, gemma-7b-it, qwen1.5-7b-chat, and baichuan2-7b-chat

    https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb

    https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-rag-chatbot.ipynb



Thank you to our developers and contributors


Throughout the history of OpenVINO™, we have seen many exciting projects! We decided to compile a list of awesome projects built with OpenVINO™

( https://github.com/openvinotoolkit/awesome-openvino ), and it keeps growing quickly! Create a pull request to add your project, use the "mentioned in Awesome" badge, and share your experience with us!


Our developer base is growing, and we appreciate all the changes and improvements the community is making. Amazingly, some of you have even said that you are "busy helping to improve OpenVINO™". Thank you!


An example of our contributors' work is OpenVINO™ support on the openSUSE platform:

https://en.opensuse.org/SDB:Install_OpenVINO


However, over the past few weeks we have faced a major problem: we cannot populate Good First Issues and review pull requests fast enough! We recognize this problem and will work harder to solve it; please bear with us.


Additionally, we are preparing for Google Summer of Code ( https://github.com/openvinotoolkit/openvino/discussions/categories/google-summer-of-code ) and would love to hear your project proposals! There is still time to submit your ideas before we send them out for approval.


As always, the list of contributors to this release is published on GitHub:

https://github.com/openvinotoolkit/openvino/releases/tag/2024.0.0


Notices and Disclaimers


Performance varies by use, configuration, and other factors. To learn more, visit the Performance Index site:

https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/

Performance results are based on testing as of the date shown in the configuration and may not reflect all publicly available updates. See Backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. Intel technologies may require supported hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

