To improve AI computing power, heterogeneous optimization is also key
The explosive debut of ChatGPT has amazed the public, giving ordinary people a first-hand taste of the power of artificial intelligence (AI) and a glimpse of the bright prospects of an intelligent future.
However, even as AI moves ever closer to daily life and fills people with anticipation, more sober analyses point out that large AI applications such as ChatGPT are also "gold-eating beasts." They bring not only soaring computing power consumption that makes "AI computing power is in short supply" an urgent call (one study projects that China's demand for intelligent computing power will enter the ZFLOPS range (10^21 floating-point operations per second) by 2026, reaching 1,271.4 EFLOPS, a compound annual growth rate of 52.3% from 2021 to 2026[1]), but also AI application scenarios that span cloud, edge, and terminal, making IT operating environments increasingly complex and diverse. Running the full variety of AI solutions on heterogeneous platforms in a way that is convenient, easy to use, and effectively optimized has therefore become a pressing need.
Heterogeneous computing refers to computing performed by systems built from processing units with different instruction sets and architectures. It is widely used in cloud data centers, edge computing, and other scenarios.
The rise of heterogeneous computing is closely tied to workloads, and AI is one of the scenarios that benefits from it most. Both deep learning training and deep learning inference involve massive matrix operations that demand stronger support from heterogeneous computing; and as AI applications rapidly move to the edge, the resulting cloud-edge collaboration places even higher requirements on it. Beyond raising their own computing power, computing platforms therefore also need to provide optimization strategies that help users improve the performance of their AI solutions and help AI applications cut costs and increase efficiency.
Tencent Cloud's innovative TACO Kit provides heterogeneous acceleration for AI applications
To help users cope with the challenges that increasingly complex heterogeneous environments pose to AI applications, Tencent Cloud launched the computing acceleration toolkit TACO Kit (TencentCloud Accelerated Computing Optimization Kit), which delivers a full-stack software and hardware solution on heterogeneous hardware platforms. It offers a new heterogeneous computing acceleration software service for AI solution designers, AI developers, and AI users, combining diversified heterogeneous computing power, a high-performance acceleration framework, offline virtualization technology, and a flexible business model, so that they can easily harness multiple kinds of computing power and help AI applications reduce costs and improve efficiency across all scenarios.
As the entry point for heterogeneous acceleration services, TACO Kit ships with a built-in AI inference acceleration engine, TACO Infer. Across the different training and serving frameworks, individual optimization practices and usage habits, software versions, and hardware preferences found in AI applications, it offers compute acceleration, transparent integration, robustness, and ease of use, helping users solve the pain points of deploying and running AI models in production environments in a one-stop fashion.
Figure 1 AI inference acceleration engine TACO Infer
■ Transparent integration: adapts transparently to heterogeneous chips such as CPUs, GPUs, and NPUs across platforms; respects user habits and requires no change to the model's source format; needs no IR (intermediate representation) conversion and is friendly to models without an explicit operator structure;
■ Built on native framework runtimes: runs on a variety of popular frameworks, including TensorFlow, PyTorch, and ONNX Runtime; builds on each framework's original runtime and makes full use of its extension mechanisms;
■ Seamless integration with serving frameworks: including TF Serving, Triton, TorchServe, and others.
Thanks to these characteristics, no matter the scenario or hardware platform on which users deploy their AI applications, only a simple front-end interaction is needed: TACO Kit launches the workload in the optimal mode in the background and delivers better inference performance.
This inference performance owes much to the in-depth collaboration between Intel and Tencent Cloud on TACO Kit. Specifically, Intel® Neural Compressor has been integrated into TACO Kit to substantially improve AI inference performance and accelerate the convenient, efficient rollout of a wide range of AI applications.
Intel® Neural Compressor provides optimized support to help TACO Kit accelerate inference
Intel® Neural Compressor is Intel's open-source library for neural network model compression. It provides a unified interface across multiple deep learning frameworks for mainstream compression techniques such as quantization, pruning, and knowledge distillation, and it offers the following model performance tuning capabilities:
■ An accuracy-driven automated tuning strategy that helps users quickly obtain the best quantized model (see the sketch after this list);
■ Predefined sparsity targets for generating pruned models, supporting different weight pruning algorithms;
■ Knowledge distillation, extracting knowledge from a larger network (the "teacher") to train a smaller network (the "student") with minimal accuracy loss.
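The accuracy-driven tuning flow mentioned above can be pictured with a short sketch. The snippet below is a minimal illustration assuming the Intel® Neural Compressor 2.x Python API; fp32_model, calib_loader, and eval_func are user-supplied placeholders and are not part of TACO Kit's own interface.

```python
# A minimal sketch of accuracy-driven post-training quantization, assuming the
# Intel(R) Neural Compressor 2.x Python API. fp32_model, calib_loader and
# eval_func are user-supplied placeholders, not part of TACO Kit itself.
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.config import AccuracyCriterion, TuningCriterion
from neural_compressor.quantization import fit

# Accept a quantized model only if its accuracy drops by no more than 1%
# relative to the FP32 baseline; try at most 100 tuning trials.
conf = PostTrainingQuantConfig(
    approach="static",  # static PTQ: calibrate activation ranges on sample data
    accuracy_criterion=AccuracyCriterion(criterion="relative", tolerable_loss=0.01),
    tuning_criterion=TuningCriterion(max_trials=100),
)

# fp32_model:   the original TensorFlow / PyTorch / ONNX model
# calib_loader: a small dataloader used for calibration
# eval_func:    returns a scalar accuracy score for a candidate model
q_model = fit(model=fp32_model, conf=conf,
              calib_dataloader=calib_loader, eval_func=eval_func)
q_model.save("./int8_model")
```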
Intel and Tencent Cloud integrated Intel® Neural Compressor into TACO Kit as a plug-in, allowing TACO Kit to take full advantage of the library's strengths. As shown in Figure 2, its quantization technology provides a unified model optimization API for different deep learning frameworks (such as TensorFlow, PyTorch, and ONNX Runtime), making it easy to optimize model inference by quantizing FP32 data types to INT8. At the same time, the library's built-in accuracy-aware tuning strategies can generate quantized models with better accuracy for different internal model structures, significantly lowering the technical threshold of model quantization for users and effectively improving the inference efficiency of AI models.
Figure 2 TACO Kit workflow after integrating Intel® Neural Compressor
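To show what the FP32-to-INT8 step in Figure 2 buys at inference time, the sketch below times an original and a quantized ONNX model with ONNX Runtime. The model paths and input feed are illustrative placeholders, not TACO Kit's actual interface.

```python
# Rough latency comparison of an FP32 vs. INT8 ONNX model with ONNX Runtime.
# Model paths and the input feed are placeholders for illustration only.
import time
import numpy as np
import onnxruntime as ort

def mean_latency(model_path, feed, warmup=10, iters=100):
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    for _ in range(warmup):           # warm up caches and thread pools
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(iters):
        sess.run(None, feed)
    return (time.perf_counter() - start) / iters

# Example feed for a BERT-style model; adjust input names/shapes to your model.
feed = {
    "input_ids": np.ones((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

fp32 = mean_latency("model_fp32.onnx", feed)
int8 = mean_latency("model_int8.onnx", feed)
print(f"FP32: {fp32 * 1e3:.2f} ms, INT8: {int8 * 1e3:.2f} ms, "
      f"speedup: {fp32 / int8:.2f}x")
```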
When deployed in the cloud, the quantized model gains effective hardware acceleration and higher inference efficiency through Intel® DL Boost, built into the Intel® Xeon® Scalable platform. Take the vpdpbusd instruction in the instruction set as an example: 64 multiply-add operations that previously required three instructions (vpmaddubsw, vpmaddwd, vpaddd) now need only one (vpdpbusd). The instruction also eliminates processor saturation during the computation, and intermediate values in the multiply-add can be broadcast directly from memory, so processing performance can reach up to 4 times that of the original FP32 model[2]. This provides key support for TACO Kit to accelerate inference and helps users build and deploy AI more efficiently in heterogeneous environments.
Figure 3 Intel® DL Boost (AVX-512_VNNI) technology
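The arithmetic behind that claim can be made concrete with a small sketch. The NumPy code below reproduces only the semantics of one 512-bit vpdpbusd operation (unsigned 8-bit by signed 8-bit multiply, 4-way sum, int32 accumulate); it is not the intrinsic itself, and the data values are arbitrary.

```python
# Semantics of one 512-bit vpdpbusd: for each of the 16 int32 lanes, multiply
# four unsigned 8-bit activations by four signed 8-bit weights, sum the four
# products, and add the sum to the int32 accumulator -- 64 multiply-adds that
# previously took the vpmaddubsw/vpmaddwd/vpaddd sequence.
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=64, dtype=np.uint8)    # u8 activations (64 bytes)
w = rng.integers(-128, 128, size=64, dtype=np.int8)  # s8 weights (64 bytes)
acc = np.zeros(16, dtype=np.int32)                   # 16 int32 accumulators

products = a.astype(np.int32) * w.astype(np.int32)   # widen, so nothing saturates
acc += products.reshape(16, 4).sum(axis=1)           # 4-way sums into each lane
print(acc)
```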
Solution verification demonstrates real-world performance and the advantages of heterogeneous AI acceleration
So how well does TACO Kit perform after integrating Intel® Neural Compressor? Practice speaks loudest and data is most convincing: once the kit was built, Intel and Tencent Cloud selected a number of widely used natural language processing deep learning models to verify and test its performance acceleration.
In the tests, each deep learning model was optimized with TACO Kit and then quantized to INT8 and performance-tuned with Intel® Neural Compressor, with satisfying acceleration results. As shown in Figure 4[3], with accuracy essentially unchanged, inference performance improved significantly for every model, by 55% to 139%; in the bert-base-uncased (MRPC) scenario, inference performance reached 2.39 times the baseline.
Commenting on the substantial acceleration that Intel® Neural Compressor brings to TACO Kit, Tencent Cloud heterogeneous computing expert engineer Ye Fan noted that the results of this collaboration give users in different roles convenient, easy-to-use, and effectively optimized AI acceleration capabilities on heterogeneous hardware platforms, helping AI applications reduce costs and improve efficiency across all scenarios, and that Intel® Neural Compressor is an effective technical guarantee for fully accelerating AI inference workloads in TACO Kit.
Building on this achievement, Intel and Tencent Cloud will continue to deepen their cooperation, driving TACO Infer to keep iterating on software and hardware compatibility and performance by integrating hardware vendors' optimized operators and upgrading self-developed AI compilation technology. The two parties also plan to further integrate the new acceleration technologies built into 4th Gen Intel® Xeon® Scalable processors to give users more efficient and easier-to-use heterogeneous AI acceleration, promoting AI into broader applications, helping meet the even tougher computing power challenges posed by multimodal large models, driving the in-depth evolution of intelligent applications, and providing strong digital productivity for high-quality economic and social development.
[1] "2022-2023 China Artificial Intelligence Computing Power Development Assessment Report", https://www.inspur.com/lcjtww/resource/cms/article/2448319/2734787/2022122601.pdf
[2] Test configuration: Test platform: Tencent S6 CVM instance; Operating system: CentOS 7.9.2009 (Core); System configuration: Intel(R) Xeon(R) Platinum 8374C processor @ 2.7GHz, 16 CPUs / 2 threads per core / 1 socket / 1 NUMA node, 32GB RAM; TACO version: v2.6 (ONNX Runtime v1.12.0, oneDNN v2.3.0); Workload: ONNX model
[3] Test configuration: Test platform: Tencent S6 CVM instance; Operating system: CentOS 7.9.2009 (Core); System configuration: Intel(R) Xeon(R) Platinum 8374C processor @ 2.7GHz, 16 CPUs / 2 threads per core / 1 socket / 1 NUMA node, 32GB RAM; TACO version: v2.6 (ONNX Runtime v1.12.0, oneDNN v2.3.0); Workload: ONNX model
[4] Test configuration: Test platform: Tencent S6 CVM instance; Operating system: CentOS 7.9.2009 (Core); System configuration: Intel(R) Xeon(R) Platinum 8374C processor @ 2.7GHz, 16 CPUs / 2 threads per core / 1 socket / 1 NUMA node, 32GB RAM; TACO version: v2.6 (ONNX Runtime v1.12.0, oneDNN v2.3.0); Workload: ONNX model
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands mentioned in this article are the property of their respective owners.