Reduce speech transcription costs by up to 90% using conversational artificial intelligence (CAI)

Publisher: EE小广播 | Updated: 2022-08-05 | Source: EEWORLD



Introduction to conversational artificial intelligence (CAI)


What is conversational artificial intelligence (CAI)?


Conversational artificial intelligence (CAI) uses deep learning (DL), a subset of machine learning (ML), to automate speech recognition, natural language processing, and text-to-speech. The CAI process is usually described in terms of three key functional modules (a minimal pipeline sketch follows Figure 1):

1. Speech-to-text (STT), also known as automatic speech recognition (ASR)

2. Natural language processing (NLP)

3. Text-to-speech (TTS), or speech synthesis


Figure 1: Conversational AI building blocks
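
The three modules can be thought of as a simple pipeline. The sketch below is a minimal Python illustration of that structure only; the stage functions are hypothetical placeholders, not any vendor's API or the white paper's implementation.

```python
# Minimal sketch of the three-stage CAI pipeline described above.
# The stage implementations are hypothetical placeholders, not any vendor's API.

def speech_to_text(audio_pcm: bytes) -> str:
    """STT/ASR stage: transcribe raw audio into text (placeholder)."""
    raise NotImplementedError("plug in an ASR engine here")

def natural_language_processing(text: str) -> str:
    """NLP stage: interpret the transcript and form a response (placeholder)."""
    raise NotImplementedError("plug in an NLP/dialog engine here")

def text_to_speech(response_text: str) -> bytes:
    """TTS stage: synthesize audio for the response (placeholder)."""
    raise NotImplementedError("plug in a TTS engine here")

def cai_pipeline(audio_pcm: bytes) -> bytes:
    """End-to-end conversational AI: ASR -> NLP -> TTS."""
    transcript = speech_to_text(audio_pcm)
    response = natural_language_processing(transcript)
    return text_to_speech(response)
```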


This white paper details use cases for automatic speech recognition (ASR) and how Achronix can reduce the associated costs of implementing ASR solutions by up to 90%.


Market segments and application scenarios


With more than 110 million virtual assistants in use in the United States alone[1], most people are familiar with using CAI services. Key examples include voice assistants on mobile devices, such as Apple’s Siri or Amazon’s Alexa; voice search assistants on laptops, such as Microsoft’s Cortana; automated call center answering assistants; and voice-enabled devices, such as smart speakers, televisions, and cars.


The deep learning algorithms that power these CAI services can be processed locally on electronic devices or aggregated in the cloud for remote, large-scale processing. Large-scale deployments that support millions of user interactions are a huge computational processing challenge, and hyperscale providers have developed specialized chips and devices to handle these services.


Today, most small businesses can easily add voice interfaces to their products using cloud APIs from companies such as Amazon, IBM, Microsoft, and Google. However, as these workloads scale (a specific example is described later in this white paper), the cost of using these cloud APIs becomes prohibitive, forcing businesses to seek other solutions. In addition, many enterprise operations have stricter data security requirements, so the solution must remain within the enterprise's own security boundary.
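
As an illustration of the cloud-API approach, the sketch below transcribes one short clip with the Google Cloud Speech-to-Text Python client (the other providers offer similar services). It assumes the google-cloud-speech package is installed, credentials are already configured, and the audio file name is hypothetical.

```python
# Hedged example: transcribing one short clip via a cloud STT API
# (Google Cloud Speech-to-Text shown; Amazon, IBM, and Microsoft offer similar services).
# Assumes `pip install google-cloud-speech` and that credentials are already configured.
from google.cloud import speech

client = speech.SpeechClient()

with open("sample.wav", "rb") as f:          # hypothetical 16 kHz, 16-bit mono WAV
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

Per-request pricing is what makes this approach attractive at small scale and prohibitive at large scale, which motivates the on-premises accelerator discussion that follows.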


Enterprise-level CAI solutions can be used in the following application scenarios:


Automated call center

Voice and video communication platform

Health and medical services

Financial and banking services

Retail and vending equipment


ASR processing in detail


ASR is the first step in the CAI process, where speech is transcribed into text. Once the text is available, it can be processed in a variety of ways using natural language processing (NLP) algorithms. NLP tasks include key-content identification, sentiment analysis, indexing, contextualization, and analytics. In an end-to-end conversational AI system, speech synthesis is then used to generate a natural voice response.
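
As a toy illustration of what happens to a transcript after ASR, the snippet below does a simple keyword count on an example sentence. It stands in for "key-content identification" only in spirit; production CAI systems use far more sophisticated NLP models.

```python
# Toy illustration of post-ASR text processing (keyword spotting on a transcript).
# A simple frequency count, not the NLP techniques used in production CAI systems.
from collections import Counter

transcript = "i would like to check my account balance and my recent account activity"
stopwords = {"i", "to", "my", "and", "the", "a", "would", "like"}

keywords = Counter(w for w in transcript.split() if w not in stopwords)
print(keywords.most_common(3))   # e.g. [('account', 2), ('check', 1), ('balance', 1)]
```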


State-of-the-art ASR algorithms are implemented using end-to-end deep learning. In contrast to the convolutional neural networks (CNNs) common in image processing, recurrent neural networks (RNNs) dominate speech recognition. As David Petersson points out in the TechTarget article "CNN vs. RNN: How Are They Different?" [10], RNNs are better suited to temporal data and are therefore a good fit for ASR. RNN-based models require high compute power and memory bandwidth to run the neural network while meeting the strict latency targets of interactive systems: when real-time or automated responses are too slow, the interaction feels sluggish and unnatural. Low latency is often achieved only at the expense of processing efficiency, which drives up cost and can make large-scale deployment impractical.
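
For orientation, the sketch below shows the general shape of an RNN-based acoustic model in PyTorch: a stack of GRU layers followed by a per-frame projection onto the output alphabet, as in CTC-style ASR. It is a generic illustration with arbitrary layer sizes, not the Myrtle.ai or Achronix implementation.

```python
# Generic RNN-based acoustic model sketch (CTC-style), for illustration only.
# Layer sizes are arbitrary; this is not the Myrtle.ai/Achronix implementation.
import torch
import torch.nn as nn

class RNNAcousticModel(nn.Module):
    def __init__(self, n_features=80, hidden=512, n_layers=3, n_tokens=29):
        super().__init__()
        # Recurrent stack: processes the audio-feature sequence frame by frame.
        self.rnn = nn.GRU(n_features, hidden, num_layers=n_layers, batch_first=True)
        # Projects each frame's hidden state onto the output alphabet (chars + blank).
        self.classifier = nn.Linear(hidden, n_tokens)

    def forward(self, features):                 # features: (batch, time, n_features)
        hidden_states, _ = self.rnn(features)    # (batch, time, hidden)
        return self.classifier(hidden_states)    # per-frame token logits

model = RNNAcousticModel()
dummy = torch.randn(1, 200, 80)                  # batch of 1: the low-latency case
logits = model(dummy)
print(logits.shape)                              # torch.Size([1, 200, 29])
```

The sequential dependence between time steps is what makes these models bandwidth-hungry and hard to batch when latency matters, which is the efficiency problem discussed below.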


Achronix has collaborated with Myrtle.ai, a company specializing in AI inference on field-programmable gate arrays (FPGAs). Myrtle.ai uses its MAU inference acceleration engine to implement high-performance RNN-based networks on FPGAs. The design has been integrated into the Achronix Speedster®7t AC7t1500 FPGA, where it takes advantage of key features of the Speedster7t architecture (discussed later in this white paper) to greatly accelerate real-time ASR neural networks, increasing the number of real-time data streams (RTS) that can be processed by 2,500% compared to a server-class central processing unit (CPU).


Data Accelerator: How to Achieve a Balanced Allocation of Resources


Data accelerators can significantly reduce server footprint by offloading compute, network, and/or storage processing workloads typically performed by the main CPU. This white paper describes replacing up to 25 servers with one server and an Achronix ASR-based accelerator card. This architecture significantly reduces workload cost, power consumption, and latency while increasing workload throughput. However, using data acceleration hardware to achieve high performance and low latency only makes sense if the hardware is used efficiently and deployment is cost-effective.
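
To make the consolidation claim concrete, the back-of-the-envelope estimate below uses hypothetical per-server and accelerator-card costs (assumed numbers, not figures from the white paper) to show how replacing 25 CPU servers with one server plus an accelerator card can approach a 90% cost reduction.

```python
# Back-of-the-envelope server-consolidation estimate.
# All prices are hypothetical placeholders, not figures from the Achronix white paper.
cpu_servers_replaced = 25          # from the white paper: up to 25 servers -> 1 server + card
cost_per_cpu_server = 10_000       # assumed cost per CPU server
cost_accel_server = 10_000         # assumed cost of the single host server
cost_accel_card = 15_000           # assumed cost of the FPGA accelerator card

baseline = cpu_servers_replaced * cost_per_cpu_server
accelerated = cost_accel_server + cost_accel_card

savings = 1 - accelerated / baseline
print(f"baseline: ${baseline:,}, accelerated: ${accelerated:,}, savings: {savings:.0%}")
# -> baseline: $250,000, accelerated: $25,000, savings: 90%
```

Power, rack space, and cloud-API fees would shift the exact numbers, but the structure of the saving is the same: one accelerated node absorbing the work of many CPU-only nodes.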


ASR models are challenging for modern data accelerators, which often require manual tuning to achieve anything better than single-digit efficiency relative to the platform's headline performance specifications. Real-time ASR workloads require high memory bandwidth as well as high compute performance. The parameters of these large neural networks are typically stored in DDR memory on the accelerator card, and moving that data from external memory to the compute engine is the performance bottleneck for this workload, especially in real-time deployments.


Graphics processing unit (GPU) architectures are based on a data-parallel model, and small batch sizes leave GPU acceleration hardware underutilized, which increases cost and reduces efficiency. The performance figures in hardware accelerator data sheets, quoted in TOPS (tera operations per second), are not always a good indicator of real-world performance, because many devices are underutilized due to architecture-related bottlenecks. TOPS figures emphasize the raw throughput of the accelerator's compute engine but ignore key factors such as batch size, the speed and size of external memory, and the ability to move data between external memory and the compute engine. For ASR workloads, focusing on memory bandwidth and on moving data efficiently within the accelerator is a far better guide to the performance and efficiency the accelerator will actually deliver.
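
A rough roofline-style calculation illustrates why memory bandwidth, not peak TOPS, tends to bound RNN inference at small batch sizes. The model size, bandwidth, and resulting numbers below are assumed values for illustration only.

```python
# Why low-batch RNN inference is memory-bound: a rough roofline-style estimate.
# All numbers below are illustrative assumptions, not measurements from the white paper.
weights_bytes = 100e6        # assumed RNN model size: 100 MB of weights
bandwidth_Bps = 500e9        # assumed external-memory bandwidth: 500 GB/s (~4 Tbps)
batch_size = 1               # real-time, low-latency case

# At batch 1, every weight is fetched once per inference step and reused only once,
# so each step performs ~2 ops (multiply + add) per weight byte fetched.
ops_per_step = 2 * weights_bytes * batch_size
min_step_time = weights_bytes / bandwidth_Bps          # time just to stream the weights
achievable_tops = ops_per_step / min_step_time / 1e12  # memory-bound compute rate

print(f"min time per step: {min_step_time*1e3:.2f} ms")
print(f"memory-bound throughput: {achievable_tops:.1f} TOPS at batch size {batch_size}")
# A device advertising hundreds of TOPS can only sustain ~1 TOPS in this regime
# unless the batch size grows, which in turn hurts latency.
```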


Accelerators need both a large external memory and very high bandwidth. Today's high-end accelerators typically use high-performance external memory with 8 to 16 GB of capacity and up to 4 Tbps of bandwidth. The accelerator must also be able to move this data to the compute engine without degrading performance. However, no matter how the data path between high-speed memory and the compute engine is implemented, it is almost always the system performance bottleneck, especially in low-latency applications such as real-time ASR.


FPGAs are designed to provide optimal data routing between storage and compute, thus providing an excellent acceleration platform for these workloads.


Comparison of Achronix solutions with other FPGA solutions


In the field of machine learning (ML) acceleration, existing FPGA architectures claim inference rates of up to 150 TOPS. In real-world applications, however, especially latency-sensitive ones such as ASR, these FPGAs fall far short of their claimed peak rates because they cannot transfer data efficiently between the compute fabric and external memory; the performance is lost in the bottleneck between external memory and the FPGA's compute engine. The Achronix Speedster7t architecture strikes a good balance between the compute engine, the high-speed memory interface, and data transfer, allowing Speedster7t FPGAs to deliver high performance on real-time, low-latency ASR workloads while achieving 64% of the device's maximum TOPS rate.


Figure 2: Computation, storage, and data transfer capabilities of the Speedster7t device


How the Speedster7t architecture achieves higher computing efficiency


The machine learning processor (MLP) in the Speedster7t is an optimized matrix/vector multiplication block that can perform 32 multiplications and one accumulation in a single clock cycle, and it forms the foundation of the compute-engine architecture. In the AC7t1500 device, block RAM (BRAM) is co-located with all 2,560 MLP instances, which means lower latency and higher throughput.
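
As a rough sanity check of the compute fabric, the calculation below multiplies the MLP count, MACs per cycle, and an assumed clock frequency. The clock value is an assumption for illustration, not an Achronix specification.

```python
# Rough peak-throughput estimate for the MLP-based compute fabric.
# The clock frequency is an assumed value for illustration, not an Achronix spec.
mlp_count = 2560            # MLP instances in the AC7t1500 (from the text)
macs_per_cycle = 32         # 32 multiplies with accumulation per MLP per clock
ops_per_mac = 2             # count a multiply-accumulate as 2 operations
clock_hz = 750e6            # assumed MLP clock rate

peak_tops = mlp_count * macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"peak throughput at {clock_hz/1e6:.0f} MHz: {peak_tops:.1f} TOPS")
# -> peak throughput at 750 MHz: 122.9 TOPS
```

Because the MLPs are hardened blocks, they can clock faster than the surrounding FPGA fabric, which is why the achievable rate scales with the MLP clock rather than the fabric clock.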


With these key architectural units, Myrtle.ai’s MAU low-latency, high-throughput ML inference engine has been integrated into the Speedster7t FPGA device.


To build the best ASR solution, the previously mentioned Myrtle.ai MAU inference engine was integrated, using 2,000 of the 2,560 MLPs. Because the MLP is a hard block, it can run at a higher clock rate than the FPGA logic fabric itself.


Figure 3: Machine learning processor


Eight GDDR6 memory controllers are used in the AC7t1500 device, which together provide up to 4 Tbps of bidirectional bandwidth. As mentioned above, powerful computing engines and large-capacity, high-bandwidth storage rely on high-speed, low-latency, and deterministic data transfer to provide the real-time results required by low-latency ASR applications.
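
The aggregate figure decomposes neatly per controller. In the check below, the per-pin data rate and interface width are typical GDDR6 values assumed for illustration, not Achronix specifications.

```python
# Sanity check of the ~4 Tbps aggregate GDDR6 bandwidth.
# Per-pin data rate and interface width are typical GDDR6 values, assumed for illustration.
controllers = 8           # GDDR6 controllers in the AC7t1500 (from the text)
pin_rate_gbps = 16        # assumed GDDR6 data rate per pin
interface_width = 32      # assumed data pins per controller

per_controller_gbps = pin_rate_gbps * interface_width      # 512 Gbps
total_tbps = controllers * per_controller_gbps / 1000      # ~4.1 Tbps

print(f"{per_controller_gbps} Gbps per controller, {total_tbps:.1f} Tbps total")
# -> 512 Gbps per controller, 4.1 Tbps total, consistent with "up to 4 Tbps"
```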


This data then enters the Speedster7t's two-dimensional network-on-chip (2D NoC). The 2D NoC is another hard structure in the Speedster7t architecture that clocks at up to 2 GHz and interconnects all I/Os, internal hard blocks, and the FPGA logic array itself. With a total bandwidth of 20 Tbps, the 2D NoC provides the highest throughput and, with the proper implementation, the most deterministic, low-latency data transfer between external GDDR6 memory and the MLP-enabled compute engines.
