Service robots exist to provide services, so users need a more convenient and natural way to interact with them than complex keyboard and button operations. Voice-based human-computer interaction is therefore an important development direction in this field. Mainstream speech recognition technology is based on statistical models; however, because training these models is algorithmically complex and computationally expensive, recognition is usually performed on industrial computers, PCs, or notebooks, which limits its range of applications. Embedded voice interaction has consequently become a hot research topic.
Compared with a PC-based speech recognition system, an embedded speech recognition system has certain limitations in computing speed and memory capacity, but it offers small size, low power consumption, high reliability, low cost, and flexible installation, making it particularly suitable for smart homes, robots, consumer electronics, and similar fields.
1 Overall Module Solution and Architecture
The basic principle of speech recognition is shown in Figure 1. Speech recognition consists of two stages: training and recognition. In both stages, the input speech must first be preprocessed and its features extracted. In the training stage, the user inputs several training utterances; after preprocessing and feature extraction, feature vector parameters are obtained, and a reference model library for the training speech is built through feature modeling. In the recognition stage, the feature vector parameters of the input speech are compared against the models in the reference library using a similarity measure, and the entry with the highest similarity is output as the recognition result. In this way, the goal of speech recognition is achieved.
Figure 1 Basic principles of speech recognition
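To make the similarity-measurement step concrete, the following minimal C sketch matches an input feature vector against a reference library by squared Euclidean distance and returns the closest entry. The vector length FEAT_DIM and all names are illustrative assumptions; practical recognizers use more sophisticated measures (e.g., DTW or HMM likelihoods), and in this module the matching is performed inside the LD3320 itself.

```c
#include <float.h>
#include <stddef.h>

#define FEAT_DIM 12   /* feature vector length (illustrative assumption) */

/* Return the index of the reference model closest to the input feature
 * vector, i.e. the entry with the highest similarity (smallest squared
 * Euclidean distance). */
static size_t match_template(const float input[FEAT_DIM],
                             const float refs[][FEAT_DIM],
                             size_t n_refs)
{
    size_t best = 0;
    float best_dist = FLT_MAX;

    for (size_t i = 0; i < n_refs; i++) {
        float dist = 0.0f;
        for (size_t j = 0; j < FEAT_DIM; j++) {
            float d = input[j] - refs[i][j];
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return best;
}
```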
Existing speech recognition technologies can be divided by recognition object into specific-person (speaker-dependent) and non-specific-person (speaker-independent) recognition. Specific-person recognition targets one particular user, while non-specific-person recognition targets the majority of users; the latter generally requires recording and training on the voices of many people, after which a high recognition rate can be achieved.
There are currently two main ways to develop an embedded voice interaction system with existing technology: calling a voice development package directly on the embedded processor, or adding a dedicated speech chip alongside the embedded processor. The first method involves a large code base and complex calculations, consumes substantial processor resources, and has a long development cycle. The second is comparatively simple: the developer only needs to focus on the interface between the speech chip and the microprocessor, so the structure is simple and easy to build, the computational burden on the microprocessor is greatly reduced, reliability is enhanced, and the development cycle is shortened.
Speech recognition technology is developing rapidly both in China and abroad. Representative domestic products for PC applications include iFlytek's InterReco 2.0, Zhongke Pattern Recognition's Pattek ASR 3.0, and Jietong Huasheng's jASR v5.5; representative embedded products include Lingyang's SPCE061A, ICRoute's LD332X, and Shanghai Huazhen Electronics' WS-117.
The speech recognition scheme in this paper takes an embedded microprocessor as its core, with a non-specific-person speech recognition chip and related circuits as the periphery. The speech recognition chip is ICRoute's LD3320.
2 Hardware Circuit Design
As shown in Figure 2, the hardware circuit mainly comprises the main-control core and the speech recognition section. Speech enters the recognition section, which transfers the processed data to the main controller over a parallel interface. After processing, the main controller sends the command data out through a USART, which can also be used to attach serial peripherals such as a speech synthesis module.
Figure 2 Hardware circuit
2.1 Speech Recognition Circuit
Figure 3 is the schematic of the speech recognition section, designed with reference to the LD3320 datasheet released by ICRoute. The LD3320 integrates a fast and stable optimized algorithm; it requires no external Flash or RAM and no advance user training or recording to perform non-specific-person speech recognition with high accuracy.
Figure 3 Schematic diagram of speech recognition
In the figure, the LD3320 is connected directly to the STM32F103C8T6 over the parallel interface, with 1 kΩ pull-up resistors. A0 distinguishes whether the bus carries data or an address. The control signals, the reset signal, and the interrupt signal INTB are connected directly to the STM32F103C8T6, with 10 kΩ pull-up resistors to help the system work stably. The LD3320 shares the same external 8 MHz clock as the STM32F103C8T6. Light-emitting diodes D1 and D2 indicate power-on after reset. MBS (pin 12) provides the microphone bias; an RC network on this pin ensures that a suitable bias voltage is supplied to the microphone.
2.2 Main Controller Circuit
The main controller in this design is ST's STM32F103C8T6. This chip is based on the 32-bit ARM Cortex-M3 RISC core, operates at up to 72 MHz, and integrates high-speed memory (64 KB of Flash and 20 KB of SRAM) along with rich enhanced I/O ports and peripherals attached to two APB buses. The STM32 series offers a 32-bit product option that combines high performance, real-time capability, low power consumption, and low-voltage operation while remaining highly integrated and easy to develop with, raising the performance and power efficiency of 32-bit MCUs to a new level.
3 Software System Design
The software system design comprises three parts: porting the embedded operating system μC/OS-II to the main control unit, designing the LD3320 speech recognition program, and designing the dialogue management unit.
3.1 Embedded Operating System μC/OS-II Porting
μC/OS-II is an open-source, portable, ROMable, customizable, preemptive real-time multitasking operating system designed specifically for embedded applications. Most of its code is written in C. It offers high execution efficiency, a small footprint, excellent real-time performance, and strong scalability; the minimal kernel can be as small as 2 KB. In μC/OS-II the concept of a task is central, and because the kernel is preemptive, the assignment of task priorities is crucial. Following a hierarchical, modular design approach, the division of tasks for the whole system is listed in Table 1.
Table 1 Main control system task priority planning
In Table 1, apart from OSTaskStat and OSTaskIdle, which are built into the system, the other seven tasks are created by the user. App_TaskStart is the first task of the system; it initializes the system clock and low-level devices, creates all events and the other user tasks, and monitors system status. App_TaskSR performs speech recognition; App_TaskCmd parses and executes the commands in the dialogue set and sends them out through USART1; App_TaskCom, a peripheral-expansion task, sends instructions or data out through USART2 and controls expansion devices such as a speech synthesizer.
App_TaskUpdate updates the dialogue set by parsing the commands and data received on USART1; App_TaskPB is a key-scanning task that monitors three independent keys, distinguishing short and long presses; App_TaskLed drives four LED indicators that show the current working status. A sketch of how these user tasks can be created is shown below.
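As a hedged illustration of how the user tasks in Table 1 might be created from App_TaskStart, the following sketch uses the standard μC/OS-II OSTaskCreate() call. The stack sizes and priority values are assumptions for illustration; the actual priorities come from Table 1, which is not reproduced here.

```c
#include "ucos_ii.h"

/* Stack sizes and priorities below are illustrative assumptions;
 * the real values are those planned in Table 1. */
#define APP_TASK_SR_STK_SIZE   128u
#define APP_TASK_CMD_STK_SIZE  128u
#define APP_TASK_SR_PRIO       5u
#define APP_TASK_CMD_PRIO      6u

static OS_STK App_TaskSRStk[APP_TASK_SR_STK_SIZE];
static OS_STK App_TaskCmdStk[APP_TASK_CMD_STK_SIZE];

static void App_TaskSR(void *p_arg);   /* speech recognition task */
static void App_TaskCmd(void *p_arg);  /* command parsing task    */

/* Called from App_TaskStart after the system clock and low-level
 * devices have been initialized. */
static void App_CreateTasks(void)
{
    OSTaskCreate(App_TaskSR,  (void *)0,
                 &App_TaskSRStk[APP_TASK_SR_STK_SIZE - 1u],
                 APP_TASK_SR_PRIO);
    OSTaskCreate(App_TaskCmd, (void *)0,
                 &App_TaskCmdStk[APP_TASK_CMD_STK_SIZE - 1u],
                 APP_TASK_CMD_PRIO);
    /* ...the remaining user tasks are created the same way... */
}
```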
3.2 Speech Recognition Programming
The design of the speech recognition program follows the LD332X development manual. This article uses interrupt mode, and the workflow is divided into: general initialization → initialization for speech recognition → writing the recognition list → starting recognition → responding to interrupts.
① General initialization and initialization for speech recognition. The initialization routine mainly performs a soft reset and configures the operating mode, clock frequency, and FIFO.
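A minimal sketch of such an initialization sequence is shown below. The low-level helpers ld_write_reg()/ld_read_reg() are hypothetical (their implementation depends on how the STM32 pins drive the LD3320 parallel bus), and the register addresses and values are assumptions modeled on ICRoute's publicly circulated demo code; verify them against the LD3320 datasheet.

```c
#include <stdint.h>

/* Hypothetical low-level parallel-bus helpers. */
extern void    ld_write_reg(uint8_t addr, uint8_t val);
extern uint8_t ld_read_reg(uint8_t addr);
extern void    ld_delay_ms(uint32_t ms);

/* General initialization: soft reset, mode, clock, FIFO.
 * Register addresses/values are assumptions based on common
 * LD3320 demo code -- check against the datasheet. */
static void ld3320_init_common(void)
{
    ld_write_reg(0x17, 0x35);   /* soft reset (assumed) */
    ld_delay_ms(10);
    ld_write_reg(0x89, 0x03);   /* analog circuit control (assumed) */
    ld_write_reg(0xCF, 0x43);   /* power management mode (assumed)  */
    /* ...clock divider and FIFO setup per the datasheet... */
}
```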
② Write the recognition list. The rule is that each recognition entry corresponds to a specific number (1 byte). Numbers may repeat and need not be consecutive, but each value must be less than 256 (00H~FFH). The chip supports up to 50 recognition entries. Each entry is written in standard lowercase Mandarin pinyin, with a single space between every two syllables. This article uses entries with distinct, non-consecutive numbers; Table 2 gives a simple example.
Table 2 Recognition list example
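A hedged sketch of loading such a list follows. The helper ld3320_add_entry() is hypothetical (ICRoute's demo code provides an equivalent routine for writing one pinyin entry and its code to the chip), and the entries themselves are illustrative examples, not the list from Table 2.

```c
#include <stdint.h>

#define LD_MAX_ENTRIES 50u   /* chip supports at most 50 entries */

/* Hypothetical helper: writes one recognition entry (a lowercase
 * pinyin string with spaces between syllables) and its 1-byte code
 * into the LD3320. */
extern void ld3320_add_entry(uint8_t code, const char *pinyin);

static void ld3320_load_list(void)
{
    /* Codes must be < 256 (00H~FFH); they may repeat and need not
     * be consecutive. These entries are illustrative. */
    ld3320_add_entry(0x01, "qian jin");          /* "forward"          */
    ld3320_add_entry(0x05, "hou tui");           /* "backward"         */
    ld3320_add_entry(0x10, "da kai dian deng");  /* "turn on the light" */
}
```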
③ Start recognition. Several related registers are set to start voice recognition; Figure 4 shows the process. The ADC channel selects the microphone (MIC) input, and the ADC gain sets the MIC volume, with a value in the range 00H~7FH (40H~6FH recommended). A larger value means a louder MIC signal and a more sensitive recognition trigger, but may cause more false recognitions; a smaller value means a quieter MIC signal, requiring the speaker to be close before recognition starts, with the advantage that distant interfering voices are ignored. This article uses a setting of 43H.
Figure 4 Recognition start process
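A sketch of the start step, reusing the assumed ld_write_reg() helper from the initialization sketch. The register addresses are assumptions taken from common LD3320 demo code; only the ADC gain value 43H is the setting actually used in this article.

```c
#include <stdint.h>

extern void ld_write_reg(uint8_t addr, uint8_t val);  /* see init sketch */

/* Start one recognition pass: set MIC volume, select the MIC input,
 * then issue the start command. Register addresses are assumptions
 * from common LD3320 demo code. */
static void ld3320_start_asr(void)
{
    ld_write_reg(0x35, 0x43);   /* ADC gain = 43H (MIC volume)     */
    ld_write_reg(0x1C, 0x0B);   /* select MIC as ADC input (assumed) */
    ld_write_reg(0xB2, 0xFF);   /* mark DSP as busy (assumed)      */
    ld_write_reg(0x37, 0x06);   /* start recognition (assumed)     */
}
```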
④ Respond to interrupts. Whenever sound is collected, an interrupt is generated regardless of whether a valid result was recognized. The interrupt routine analyzes the outcome from the register values: reading register BA gives the number of candidate answers, and register C5 holds the candidate with the highest score, i.e., the most likely correct answer.
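A hedged sketch of the interrupt handler follows. The registers 0xBA (candidate count) and 0xC5 (best-scoring answer) come directly from the description above; the handler name and the flag variables are illustrative, and which STM32 EXTI line invokes it depends on how INTB is wired.

```c
#include <stdint.h>

extern uint8_t ld_read_reg(uint8_t addr);  /* see init sketch */

volatile uint8_t g_asr_result;  /* best-scoring answer code   */
volatile uint8_t g_asr_ready;   /* set when a result arrives  */

/* Called from the STM32 EXTI interrupt handler wired to the LD3320
 * INTB pin. */
void ld3320_irq_handler(void)
{
    uint8_t n = ld_read_reg(0xBA);   /* number of candidate answers */

    if (n > 0u) {
        g_asr_result = ld_read_reg(0xC5);  /* highest-scoring answer */
        g_asr_ready  = 1u;                 /* App_TaskSR picks this up */
    }
    /* ...clear the chip's interrupt flags and restart recognition... */
}
```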
3.3 Design of Dialogue Management Unit
To facilitate dialogue management, this paper designs a dialogue management unit that stores the sentences awaiting recognition and the commands awaiting execution; it is implemented as two-dimensional arrays in the main controller. The LD3320 can hold up to 50 candidate recognition sentences at a time, and each sentence can be a single word, phrase, or short sentence of no more than 10 Chinese characters, i.e., at most 79 bytes of pinyin string. On this basis, the dialogue management arrays designed in this paper are listed in Table 3.
Table 3 Dialogue management unit arrays
The behavior array stores the behavior numbers to be executed, corresponding to the 50 speech recognition sentences: 50 groups of instructions in total, each containing up to 6 behaviors. Parallel behaviors can be grouped into a single step, and by combining multiple behaviors, more complex tasks can be completed.
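A minimal C declaration of such arrays, using the limits stated above (50 entries, 79-byte pinyin strings, 6 behaviors per group); the names and the sentinel convention are illustrative assumptions.

```c
#include <stdint.h>

#define DLG_ENTRIES     50u   /* candidate sentences per load     */
#define DLG_PINYIN_MAX  80u   /* 79-byte pinyin string + '\0'     */
#define DLG_ACTIONS_MAX 6u    /* behaviors per instruction group  */

/* Sentences awaiting recognition, indexed by entry code. */
static char    dlg_sentences[DLG_ENTRIES][DLG_PINYIN_MAX];

/* Behavior numbers to execute for each recognized sentence;
 * unused slots can hold a sentinel such as 0xFF (assumed). */
static uint8_t dlg_actions[DLG_ENTRIES][DLG_ACTIONS_MAX];
```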
4 Performance Testing and Application
To verify the speech recognition rate, stability, and response time of the designed module, corresponding tests were conducted in two environments: a quiet home and a noisy hospital. There were 8 voice commands in total, each tested 10 times, so each speaker performed 80 trials per environment; the number of successful recognitions was recorded. The test results are listed in Table 4.
Table 4 Test results
Of the three non-specific-person test speakers, speaker 1 is female and speakers 2 and 3 are male. The data show that the non-specific-person recognition rate reaches over 90% in the home environment and over 82.5% in the noisy hospital environment. In terms of recognition rate, performance in the noisy environment is lower than in the quiet one. In terms of stability, the system is more stable in the quiet environment, where the module responds correctly after a command is spoken once or at most twice; in the noisy environment stability drops, and some commands must be spoken three or more times before they are recognized accurately. In terms of real-time performance, the quiet environment guarantees a real-time response, generally within 1 s, while the response time in the noisy environment is somewhat longer.
5 Conclusion
This paper has discussed the design and implementation of an STM32-based embedded speech recognition module and described in detail the hardware circuit and software implementation of each component. Extensive experiments and practical applications show that the module offers good stability, a high recognition rate, strong noise immunity, a simple structure, and ease of use. It is highly practical and can be widely applied in fields such as service-robot smart spaces, smart homes, and consumer electronics.