Turning intentions into action: Entering a new era of embedded voice control

Publisher: EE小广播 · Updated: 2023-12-01 · Source: EEWORLD · Author: Chris Welsh


NXP has released a speech-to-intent engine, the newest addition to its intelligent voice technology portfolio. In this blog post, we'll explore the challenges developers face when designing embedded voice controls, introduce our new Speech to Intent engine, and show how you can use it in your applications.


Hearing Your Voice: Challenges of Voice Commands in Embedded Systems


Embedded voice-controlled devices became a hot trend when companies like Amazon, Google and Apple launched their revolutionary smart speakers, although the underlying technology has been around for years. With these smart speakers, end users experienced for the first time the convenience, practicality and intuitiveness of a voice-first device. Voice is the user interface (UI) of these devices and their most important, or only, mode of interaction. Leveraging natural language understanding technology in the cloud, smart speakers let end users communicate with smart devices in natural language, so that requests, queries and commands can be understood and responded to.


To implement natural language processing, designers and end users face several challenges, such as the need for a stable, reliable network connection and the high power consumption of an always-on, always-listening device, not to mention the privacy risks that such connected devices introduce.


To address these speech engine challenges in embedded design, NXP has launched the VIT Speech to Intent engine, the latest product in its intelligent voice technology (VIT) portfolio. Learn more about VIT S2I.


Local voice control vs. cloud-based voice control


To make a device voice-controlled, engineers typically have three options: process voice locally, process it in the cloud, or a combination of the two, which we call "hybrid processing." With local voice control, end devices process all voice locally at the edge, without connecting to the cloud or remote servers for secondary processing. Cloud-based processing uses the computing power of the cloud to process the voice audio, then transmits the response generated in the cloud back to the device over the network. In hybrid processing, a local wake word engine is typically used to wake the device (for example, with "Hey NXP"), and all voice commands following that wake word are streamed to the cloud or a remote server for processing.


Local processing has the advantages of low latency, low power consumption, and network independence, but it typically only supports basic keywords and commands that require precise wording. For example, turning on the lights might require the exact phrase "Hey, NXP (wake word), turn on the lights (voice command)" and there can be no variations.


For cloud-based and hybrid systems, using cloud services increases latency but offers the advantage of running extremely complex algorithms, including natural language understanding models. Revisiting the lighting example, the system can infer the intended operation from context using almost any combination of words, such as "It's dark in here, please turn on the lights."


As mentioned earlier, a major drawback of cloud-based natural language processing is security and privacy. Simply put, this method streams voice audio over the network to a remote server for processing, but the system may also trigger accidentally and transmit unrelated audio to the cloud. Those audio streams may include personal conversations, credentials or other sensitive information.


Introduction to NXP's Intelligent Voice Technology (VIT) Speech to Intent (S2I) Engine


To address the speech engine challenges in embedded design, NXP has launched the VIT Speech to Intent engine, the latest product in its intelligent voice technology (VIT) portfolio. The S2I engine is the flagship of the VIT portfolio, which also includes the free wake word engine (WWE) and voice command engine (VCE).


Unlike systems that rely on remote cloud services, VIT S2I determines natural language intent locally. This capability comes from NXP's latest advances in neural network algorithms and machine learning models designed for embedded systems. As a result, the intent "turn on the lights" can be expressed in many different ways, such as "turn on the lights," "it's too dark" and "can you make the light brighter."


This speech-to-intent capability lets users interact with embedded systems more naturally while avoiding the latency and power consumption of cloud-connected systems. Eliminating cloud services also improves security and privacy, because all voice is processed locally on the device. In addition, pairing S2I with the NXP wake word engine enables ultra-low-power designs: the VIT S2I engine starts processing voice commands only after it hears a specific wake word.


NXP devices supporting VIT S2I include Arm® Cortex®-M-based i.MX RT crossover MCUs and RW61x MCUs, as well as Cortex-A-based i.MX 8M Mini, i.MX 8M Plus and i.MX 9x applications processors. VIT S2I currently supports English, Mandarin and Korean and will launch by the end of 2023. Online development tools for creating custom commands and training models are planned for release in 2024.


VIT Speech to Intent block diagram


How VIT Speech to Intent can add speech capabilities to your next design


The Internet of Things is evolving rapidly, and VIT S2I adapts to a wide range of applications, whether home automation, wearable electronics, automotive telematics or building access control. Consumers like using natural language to control basic device functions hands-free, and edge voice processing that eliminates cloud services not only reduces system latency but also reduces privacy and security concerns.


For devices that require a voice-first user interface, a VIT S2I system is indispensable; it can be used in smart thermostats, smart appliances, home automation, lighting control, shade control and other applications. VIT S2I also suits wearables and fitness devices, with use cases including setting reminders, controlling Bluetooth devices and monitoring health.


Enhance your applications with NXP's VIT portfolio


If you want to develop with NXP's intelligent voice technology portfolio, you are welcome to use our free VIT wake word and voice command engines, available through the MCUXpresso SDK and online model tools. These engines let you easily customize wake words and basic voice control, making them well suited to rapid prototyping and development cycles that don't involve natural language understanding. If your application requires natural language understanding capabilities, contact your local NXP representative to get started with VIT Speech to Intent.


Learn more about NXP’s speech processing portfolio and watch our VIT Speech to Intent demo.

Author:

Chris Welsh

Director, IoT Voice and Audio Business Development, Edge Processing Business Unit


A partner at Retune DSP, Chris joined NXP through its acquisition of the company in 2021. Chris focuses on creating value for customers through differentiated voice software technology and services. He brings more than 25 years of experience in the embedded voice and audio business to NXP, having held roles including engineer, business development lead, founder, general manager and senior executive. Chris holds a bachelor's degree in mechanical engineering from Purdue University and a master's degree in acoustics from Pennsylvania State University.

