Arm Kleidi technology enables cost-effective automatic speech recognition on the Arm Neoverse N2 platform

Latest update time: 2024-11-08

Author: Yang Xile, Senior Software Product Manager, Arm China; Fred Jin, Senior Software Engineer

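Automatic speech recognition (ASR) technology has become part of everyday modern life. It is widely used in voice assistants, transcription services, call center analytics, and speech-to-text translation, giving industries across the board new solutions and markedly improving the user experience.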

With recent advances in machine learning (ML) and deep learning, automatic speech recognition technology has reached a new level of sophistication. Automatic speech recognition software can now understand a wide range of accents, dialects, and speaking styles with great accuracy. FunASR is an advanced open source automatic speech recognition toolkit developed by Alibaba DAMO Academy. It provides a comprehensive set of tools and models for developing and deploying automatic speech recognition systems.

FunASR is compatible with both CPU and GPU computing. While GPUs provide excellent performance for training deep learning models, CPUs are more common in edge and data center servers and are more suitable for model inference. Therefore, FunASR can perform efficient automatic speech recognition inference on CPUs and can be deployed smoothly in situations where GPU acceleration is not available (such as cost constraints, power consumption constraints, or lack of availability).

Arm Neoverse N2 is a high-performance CPU designed for cloud and edge computing. It supports a variety of cloud workloads, including artificial intelligence (AI) and ML, and adds AI-oriented features such as SVE2, the Bfloat16 (BF16) data format, and MMLA.

SVE2 enables developers to operate on larger data vectors, improving parallel processing capabilities and execution efficiency, which is particularly important for the large volume of mathematical calculations involved in the training and inference stages of AI models.

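BF16 is a relatively new floating-point format designed for AI and ML applications. It offers the same dynamic range as a 32-bit floating-point number while occupying only 16 bits of storage, which effectively reduces model size and significantly improves computational efficiency.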
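
As a rough illustration of that trade-off, the sketch below emulates BF16 in plain Python by keeping only the upper 16 bits of an FP32 bit pattern. The helper names are hypothetical, and the rounding rule (round to nearest even, with no special handling of NaN, infinity, or overflow) is an assumption made for illustration, not a description of how any particular hardware or library performs the conversion.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern for x (illustrative, finite inputs only)."""
    # Reinterpret x as an IEEE 754 float32 bit pattern.
    fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 discards.
    rounding_bias = 0x7FFF + ((fp32_bits >> 16) & 1)
    return ((fp32_bits + rounding_bias) >> 16) & 0xFFFF


def from_bf16_bits(bits: int) -> float:
    """Expand a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def bf16_round(x: float) -> float:
    """Round-trip x through the emulated BF16 format."""
    return from_bf16_bits(to_bf16_bits(x))


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1.0e30):
        print(value, "->", bf16_round(value))
```

A value such as 1.0e30 is far beyond the largest finite FP16 value (65504), yet it survives the round trip here because BF16 keeps the 8-bit FP32 exponent. The price is precision: only about two to three significant decimal digits remain.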

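MMLA is an architectural feature introduced in Armv8.6. It significantly accelerates GEMM operations. GEMM is a fundamental algorithm in ML that multiplies two input matrices to produce a single output matrix.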
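
To make the GEMM operation concrete, here is a minimal pure-Python sketch: a reference implementation and a variant that walks the output in 2x2 tiles, consuming four elements of the shared dimension per step. The tile shape is chosen to echo the 2x4 by 4x2 block that a BF16 matrix multiply-accumulate instruction is assumed to handle with 128-bit vectors. The function names are hypothetical, and optimized kernels such as those in ACL are considerably more elaborate.

```python
from typing import List

Matrix = List[List[float]]


def gemm_reference(a: Matrix, b: Matrix) -> Matrix:
    """Plain C = A x B for an (m x k) matrix A and a (k x n) matrix B."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def gemm_tiled(a: Matrix, b: Matrix, tile_mn: int = 2, tile_k: int = 4) -> Matrix:
    """Same product, computed one (tile_mn x tile_mn) output block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_mn):
        for j0 in range(0, n, tile_mn):
            for p0 in range(0, k, tile_k):
                # One tile step: accumulate a (2 x 4) by (4 x 2) partial
                # product into the 2 x 2 block of C at (i0, j0).
                for i in range(i0, min(i0 + tile_mn, m)):
                    for j in range(j0, min(j0 + tile_mn, n)):
                        c[i][j] += sum(
                            a[i][p] * b[p][j]
                            for p in range(p0, min(p0 + tile_k, k))
                        )
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(gemm_tiled(a, b))
```

The tiled loop performs the same arithmetic as the reference version. What changes is the order of the work, which is what lets a hardware instruction process a whole tile step at once.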

Arm previously launched Arm Kleidi technology, a set of enabling technologies designed specifically for developers to enhance AI performance on Arm platforms such as Arm Neoverse and Arm Cortex. Kleidi technology covers a wide range of key aspects of AI development, from frameworks to highly optimized operator libraries to a vibrant independent software vendor (ISV) ecosystem.

In this article, we will share the deployment process and benchmarking methods of FunASR inference on Alibaba’s Yitian 710 platform based on Neoverse N2. At the same time, we will conduct a comparative analysis by enabling Arm Kleidi technology, highlighting the main advantages of running FunASR inference on Yitian 710 CPU in terms of cost-effectiveness compared to other CPU- and GPU-based platforms.

Benchmark Setup

Software Version:

Ubuntu 22.04 (64-bit)

PyTorch v2.3.0

pip install funasr==0.8.8

pip install modelscope==1.10.0

Model: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

Please ensure that PyTorch and the related Python libraries ^[1] are installed on your system. If you are running on an Arm platform, you can use the PyTorch docker image ^[2] that Arm provides in the docker repository for quick evaluation.
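
Before benchmarking on an Arm instance, it can help to confirm that the CPU advertises the features discussed earlier. The sketch below parses the "Features" line of /proc/cpuinfo. It assumes a Linux AArch64 system whose kernel reports the flags as sve2, bf16, and i8mm. The function names are hypothetical, and on other platforms every flag is simply reported as absent.

```python
from typing import Dict, Iterable, Set

# Flag names as assumed to appear in /proc/cpuinfo on Linux AArch64.
WANTED_FLAGS = ("sve2", "bf16", "i8mm")


def parse_cpu_features(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'Features' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    return set()


def check_flags(cpuinfo_text: str, wanted: Iterable[str] = WANTED_FLAGS) -> Dict[str, bool]:
    """Map each wanted flag to whether the CPU reports it."""
    features = parse_cpu_features(cpuinfo_text)
    return {flag: flag in features for flag in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # Not Linux, or the file is unreadable.
    print(check_flags(text))
```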

Initialize the environment and import required dependencies

$ export OMP_NUM_THREADS=16
$ export DNNL_VERBOSE=1

import torch
import torch.autograd.profiler as profiler
import os
import random
import numpy as np
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from funasr.export.models import get_model
from modelscope.hub.snapshot_download import snapshot_download

Download and configure the model

Paraformer is an efficient automatic speech recognition model developed by Alibaba DAMO Academy as part of the FunASR open source project. It aims to improve the robustness and efficiency of end-to-end speech recognition systems. The model is based on the Transformer architecture and incorporates several innovations to improve its speech recognition performance. For benchmarking, we will use the FunASR paraformer model from the ModelScope community ^[3].

model_dir = snapshot_download('damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch', cache_dir='./', revision=None)

# Set the random seeds to 0 for reproducibility
random.seed(0)
np.random.seed(0)
torch.random.manual_seed(0)

# Build the model on the CPU from the downloaded config, weights, and CMVN file
model, asr_train_args = ASRTask.build_model_from_file(
    f'{model_dir}/config.yaml',
    f'{model_dir}/model.pb',
    f'{model_dir}/am.mvn',
    'cpu')
model = get_model(model, dict(feats_dim=560, onnx=False, model_name="model"))

Run with performance analyzer to get model inference results

The inference was run for ten iterations to obtain the average results.

batch = 64
seq_len = 93
dim = 560
speech = torch.randn((batch, seq_len, dim))
speech_lengths = torch.tensor([seq_len for _ in range(batch)], dtype=torch.int32)

with torch.no_grad():
    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for _ in range(10):
            model(speech, speech_lengths)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=200))
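
The profiler table breaks time down per operator. For a single end-to-end average over the ten iterations, a plain wall-clock loop is one option. The sketch below is a minimal version of that idea. The helper name mean_latency and the warm-up pass are assumptions for illustration and are not part of the original benchmark.

```python
import time
from typing import Callable


def mean_latency(fn: Callable[[], object], iterations: int = 10, warmup: int = 1) -> float:
    """Return the mean wall-clock seconds per call of fn over `iterations` runs."""
    # Warm-up calls are excluded from the measurement (an assumption of this sketch).
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    # Stand-in workload. In the benchmark above this would be something like
    # lambda: model(speech, speech_lengths), called inside torch.no_grad().
    print(mean_latency(lambda: sum(i * i for i in range(100_000))))
```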

Accelerating inference using the BF16 Fast Math kernel

As part of Arm Kleidi technology, the Arm Compute Library (ACL) provides optimized BF16 general matrix multiplication (GEMM) kernels that use the BF16 MMLA instructions. These instructions are supported on Neoverse N2 CPUs, and the kernels have been integrated into PyTorch through the oneDNN backend since PyTorch 2.0. The Fast Math GEMM kernels in ACL can substantially improve inference performance on CPUs.

To enable the Fast Math GEMM kernel, set the following environment variable before running inference:

$ export DNNL_DEFAULT_FPMATH_MODE=BF16
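
When comparing the default FP32 path with the BF16 Fast Math path, one cautious approach is to run each configuration in a fresh process so the variable is in place before the library initializes. The sketch below builds the two environments. The helper name benchmark_env is hypothetical, and the child command is a placeholder that only echoes the variable, standing in for an actual benchmark script.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional

FPMATH_VAR = "DNNL_DEFAULT_FPMATH_MODE"


def benchmark_env(fpmath_mode: Optional[str], base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set (or clear) the oneDNN fpmath mode variable."""
    env = dict(os.environ if base is None else base)
    if fpmath_mode is None:
        env.pop(FPMATH_VAR, None)  # Leave the library on its default FP32 path.
    else:
        env[FPMATH_VAR] = fpmath_mode
    return env


if __name__ == "__main__":
    # Placeholder child process: replace the "-c" snippet with a real
    # benchmark script when measuring performance.
    probe = "import os; print(os.environ.get('DNNL_DEFAULT_FPMATH_MODE', 'unset'))"
    for label, mode in (("FP32 default", None), ("BF16 fast math", "BF16")):
        done = subprocess.run(
            [sys.executable, "-c", probe],
            env=benchmark_env(mode),
            capture_output=True,
            text=True,
            check=True,
        )
        print(label, "->", done.stdout.strip())
```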

We found that enabling the BF16 Fast Math kernel on the Neoverse N2-based Yitian 710 platform resulted in a ~2.3x performance improvement over the default FP32 kernel.

Performance Comparison

We also compared the performance of the FunASR paraformer model on Yitian 710 with that of comparable cloud instances on Alibaba Cloud*.

Arm Neoverse N2 (Yitian 710):

ecs.c8y.4xlarge (16 vCPU + 32GB)

4th Generation Intel Xeon “Sapphire Rapids”:

ecs.c8i.4xlarge (16 vCPU + 32GB)

4th Generation AMD EPYC “Genoa”:

ecs.c8a.4xlarge (16 vCPU + 32GB)

* Yitian 710 uses the armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl docker image ^[2]; Intel Sapphire Rapids and AMD Genoa use official PyTorch v2.3.0.

We found that the Neoverse N2-based Yitian 710, paired with the BF16 Fast Math kernel, delivers up to 2.4 times the inference performance of comparable x86 cloud instances on the paraformer automatic speech recognition model.

In real-world inference deployments, cost is one of the main considerations and strongly influences whether a technology is implemented and adopted. To fully understand the total cost of ownership (TCO) of automatic speech recognition inference on CPU and GPU platforms, we also included the NVIDIA A10 GPU in the comparative analysis. Thanks to the performance and energy efficiency of Neoverse N2, the Yitian 710 platform is more cost-effective than comparable x86 instances and GPU platforms, which is also reflected in the more affordable pricing of Alibaba Cloud Yitian 710 instances.

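The benchmark results show that Yitian 710 has a significant TCO advantage for automatic speech recognition inference deployment, with cost-effectiveness 3.5 times higher than that of comparable x86 and GPU platforms.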
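
The article does not spell out how cost-effectiveness is calculated. One common way, assumed here, is throughput divided by hourly instance price, compared as a ratio between platforms. The function names in the sketch below are hypothetical, and the numbers in the usage example are placeholder values chosen for easy arithmetic, not measurements or real prices.

```python
def perf_per_cost(throughput: float, hourly_price: float) -> float:
    """Work done per unit of money: throughput divided by hourly price."""
    if hourly_price <= 0:
        raise ValueError("hourly_price must be positive")
    return throughput / hourly_price


def relative_cost_effectiveness(
    candidate_throughput: float,
    candidate_price: float,
    baseline_throughput: float,
    baseline_price: float,
) -> float:
    """How many times more cost-effective the candidate is than the baseline."""
    candidate = perf_per_cost(candidate_throughput, candidate_price)
    baseline = perf_per_cost(baseline_throughput, baseline_price)
    return candidate / baseline


if __name__ == "__main__":
    # Placeholder inputs only: twice the throughput at half the price.
    ratio = relative_cost_effectiveness(200.0, 2.0, 100.0, 4.0)
    print(f"{ratio:.1f}x")
```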

Conclusion

Alibaba's Yitian 710 based on Arm Neoverse N2 has specific ML features such as BF16 MMLA extensions, which provide excellent inference performance for the FunASR paraformer model using Arm Kleidi technology. Developers can achieve higher cost-effectiveness when building automatic speech recognition applications on Yitian 710.

Reference Links:

[1] https://pytorch.org/get-started/locally/

[2] https://hub.docker.com/r/armswdev/pytorch-arm-neoverse

[3] https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch

* This article is an original article from Arm. Please leave a message to obtain authorization and indicate the source for reprinting.
