Voice electronic door lock system based on 16-bit single chip microcomputer


Biometrics is a technology that authenticates identity using human biological characteristics, and it is currently recognized as one of the most convenient and secure identification technologies. Because each person's biological characteristics are unique and stable over a certain period of time, they are difficult to forge or counterfeit, which makes biometric identity authentication safe, accurate and reliable.

In the field of biometrics, voiceprint recognition, also known as speaker recognition, has attracted worldwide attention for its unique convenience, economy and accuracy, and has become an important and common security authentication method in daily life and work. Voiceprint recognition is a technology that automatically identifies a person based on the parameters in the speech waveform that reflect the speaker's physiological and behavioral characteristics.

Voiceprint recognition technology falls into two categories: speaker identification and speaker verification. The former determines which of several people uttered a given speech sample, a one-of-many selection problem; the latter confirms whether a given speech sample was uttered by a specified person, a one-to-one decision problem. Voiceprint recognition can also be divided into text-dependent and text-independent types, whose suitability depends on the specific task and application. A text-dependent system requires users to pronounce specified content, so each person's voiceprint model can be established accurately; recognition must also use the specified content, which yields better recognition results. A text-independent system does not constrain the speaker's content; its models are harder to build, but it is easier for users and has a wider range of applications.

The voice electronic door lock introduced in this article is a text-dependent speaker verification system implemented on the Lingyang 16-bit single-chip microcomputer SPCE061A. The system consists mainly of a speaker recognition module, a door lock control motor, and the door lock itself. During training, the speaker's voice enters the voice signal acquisition front-end circuit through a microphone; the voice signal processing circuit processes the collected signal, extracts the speaker's individual characteristic parameters, and stores them to form a speaker characteristic parameter database. During recognition, the voice to be recognized is matched against this database, and the door lock motor is driven through the output circuit, finally achieving control of the door lock.

1 Algorithm Principle

The principle block diagram of the speaker recognition algorithm is shown in Figure 1.

1.1 Preprocessing

(1) Denoising

The analog voice signal from the microphone is sampled and quantized to obtain a digital voice signal, and the noisy signal is then denoised to obtain clean speech. Pre-emphasis filters out low-frequency interference, in particular 50 Hz or 60 Hz power-line interference, and boosts the high-frequency part of the voice signal; it also removes DC drift, suppresses random noise, and enhances the energy of the unvoiced segments.
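As a concrete illustration, the following fixed-point C sketch implements a first-order pre-emphasis filter of the form y[n] = x[n] - a·x[n-1]. The coefficient 0.95 is an assumption here (a typical choice); the article does not state the exact value used.

```c
#include <stdint.h>

/* First-order pre-emphasis: y[n] = x[n] - a*x[n-1], with a = 0.95
 * represented as 15565/16384 in Q14 fixed point (an assumed, typical
 * coefficient). 'prev' is the last sample of the previous frame so the
 * filter can run frame by frame (e.g., 160-sample frames at 8 kHz). */
void pre_emphasis(const int16_t *x, int16_t *y, int n, int16_t prev)
{
    const int32_t a_q14 = 15565;                /* 0.95 in Q14 */
    for (int i = 0; i < n; i++) {
        int32_t last = (i == 0) ? prev : x[i - 1];
        int32_t v = (int32_t)x[i] - ((a_q14 * last) >> 14);
        if (v > 32767)  v = 32767;              /* saturate to 16 bits */
        if (v < -32768) v = -32768;
        y[i] = (int16_t)v;
    }
}
```

Because the filter's response H(z) = 1 - 0.95·z⁻¹ has a zero near DC, it attenuates DC drift and low-frequency hum while boosting the high-frequency band where unvoiced energy is concentrated.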

(2) Endpoint detection

This system uses the short-time energy and short-time zero-crossing rate of the speech signal for endpoint detection. The sampling frequency is 8 kHz and each frame is 20 ms long (160 samples); the short-time energy and zero-crossing rate are computed every 20 ms. By examining these two quantities, silent frames, white-noise frames and unvoiced frames can be eliminated, retaining the voiced signal, which is essential for extracting the pitch, LPCC and other characteristic parameters.
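The frame-level quantities can be computed as in the sketch below, assuming 16-bit samples; the two thresholds are placeholders, since in practice they are usually calibrated from the leading silence frames of each recording.

```c
#include <stdint.h>

#define FRAME_LEN 160   /* 20 ms frame at 8 kHz */

/* Short-time energy: sum of squared samples, right-shifted so the
 * 32-bit accumulator cannot overflow over 160 samples. */
uint32_t frame_energy(const int16_t *x)
{
    uint32_t e = 0;
    for (int i = 0; i < FRAME_LEN; i++)
        e += (uint32_t)(((int32_t)x[i] * x[i]) >> 6);
    return e;
}

/* Short-time zero-crossing rate: number of sign changes in the frame. */
int frame_zcr(const int16_t *x)
{
    int z = 0;
    for (int i = 1; i < FRAME_LEN; i++)
        if ((x[i] >= 0) != (x[i - 1] >= 0))
            z++;
    return z;
}

/* Voiced frames combine high energy with a low zero-crossing rate;
 * silence has low energy, and unvoiced/white-noise frames pair low
 * energy with a high ZCR. e_thresh and z_thresh are placeholders. */
int is_voiced(const int16_t *x, uint32_t e_thresh, int z_thresh)
{
    return frame_energy(x) > e_thresh && frame_zcr(x) < z_thresh;
}
```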

1.2 Feature extraction

After the speech signal has been preprocessed, the next step is to extract feature parameters. The task of feature extraction is to extract from the speech signal the basic features that characterize the speaker.

1.2.1 Selection of characteristic parameters

The features must be able to effectively distinguish different speakers and remain relatively stable to changes in the same speaker. At the same time, the feature parameters must be easy to calculate, and it is best to have an efficient and fast algorithm to ensure real-time recognition.

Speaker characteristics can be roughly divided into the following categories:

(1) Parameters derived from the physiological structure of the vocal organs such as the glottis, vocal tract and nasal cavity: for example the spectral envelope, pitch and formants. Among these, the pitch characterizes the speaker's vocal cords well and reflects individual characteristics to a large extent.

(2) Parameters obtained through linear prediction analysis based on a vocal tract model: these include the linear prediction coefficients (LPC) and the various parameters derived from them, such as linear prediction cepstral coefficients (LPCC), partial correlation coefficients, reflection coefficients, log area ratios, line spectrum pairs (LSP) and the linear prediction residual. According to previous work and actual comparative tests, LPCC parameters not only reflect the formant characteristics of the vocal tract well and give good recognition results, but can also be obtained with relatively simple and fast calculations.

(3) Characteristic parameters based on the human auditory mechanism, reflecting hearing characteristics and simulating the ear's perception of sound frequency: for example, Mel-frequency cepstral coefficients (MFCC). Compared with cepstral analysis based on linear prediction, the outstanding advantage of MFCC parameters is that they do not rely on the assumption of an all-pole speech production model. In text-independent speaker recognition systems, MFCC parameters improve recognition performance more than LPCC parameters do.

In addition, combining different characteristic parameters can improve the performance of a practical system. When the correlation between the combined parameters is small, better results are obtained, because the parameters then reflect different aspects of the speech signal.

In simulation experiments on a computer platform, actual comparison of the various parameters showed that MFCC parameters give better recognition results than LPCC parameters. For real-time processing on the SPCE061A platform, however, computing MFCC coefficients has two disadvantages compared with LPCC: first, the computation time is long; second, the accuracy is hard to guarantee, because the MFCC computation requires an FFT and logarithm operations, which constrain the dynamic range of the calculation. To keep recognition real-time, parameter accuracy would have to be sacrificed. The LPCC parameters, by contrast, have a recursive formula that guarantees both speed and accuracy, and their recognition performance meets actual needs.

This system uses pitch period and linear prediction cepstral coefficients (LPCC) as feature parameters for speaker recognition.

1.2.2 Extraction of LPCC parameters

The cepstral parameters based on linear prediction analysis (LPCC) can be obtained from the linear prediction coefficients through a simple recursion:

$$\hat{c}_1 = a_1$$

$$\hat{c}_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{c}_k\,a_{n-k}, \qquad 1 < n \le p$$

$$\hat{c}_n = \sum_{k=n-p}^{n-1} \frac{k}{n}\,\hat{c}_k\,a_{n-k}, \qquad n > p$$

where p is the order of the LPC model, which is also the number of poles of the model, the a_k are the linear prediction coefficients, and the ĉ_n are the cepstral coefficients.
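In code, the recursion maps directly onto a double loop. The following floating-point sketch is for clarity; the MCU implementation would use fixed-point arithmetic. Arrays are indexed from 1 to match the formula.

```c
/* LPC-to-LPCC conversion via the recursion above. a[1..p] holds the LPC
 * coefficients, c[1..q] receives the cepstral coefficients (q = p = 12
 * for the 12-dimensional LPCC used in this system). Index 0 is unused. */
void lpc_to_lpcc(const double *a, double *c, int p, int q)
{
    c[1] = a[1];
    for (int n = 2; n <= q; n++) {
        double sum = 0.0;
        int k0 = (n > p) ? (n - p) : 1;     /* lower limit of the sum */
        for (int k = k0; k < n; k++)
            sum += ((double)k / n) * c[k] * a[n - k];
        c[n] = (n <= p) ? a[n] + sum : sum;
    }
}
```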

(1) Determination of the LPC model order p

To make the model fit the speech production mechanism well, the order p of the LPC model should match the number of formants, and compensation for the glottal pulse shape and lip radiation should also be considered. Usually one pair of poles corresponds to one formant. A speech signal sampled at 10 kHz typically exhibits 5 formants, so p = 10 is taken; for speech sampled at 8 kHz, p = 8 can be used. In addition, to compensate for the zeros in nasal sounds and deviations caused by other factors, two more poles are usually added, giving p = 12 and p = 10 respectively. Experiments show that an LPC analysis order of p = 12 approximates the vocal tract model of most speech signals sufficiently well. Choosing a larger p improves the approximation only slightly while bringing negative effects: it increases the amount of computation and may model unnecessary detail.

(2) Determination of linear prediction coefficients


The autocorrelation equations can be solved by several recursive algorithms, such as the Durbin, Lattice and Schur algorithms. Among them, the Durbin algorithm is the most commonly used and requires the least computation to obtain the LPC coefficients; this system adopts it.
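For reference, a floating-point sketch of the Durbin (Levinson-Durbin) recursion, which solves the autocorrelation normal equations for the LPC coefficients a[1..p] given the autocorrelation values r[0..p]:

```c
/* Levinson-Durbin recursion. Returns 0 on success, -1 on numerical
 * failure. On the SPCE061A this would run in fixed point, using the
 * 16x16 multiply and inner-product instructions mentioned in section 2. */
int durbin(const double *r, double *a, int p)
{
    double tmp[16];                  /* scratch row, p <= 15 assumed */
    double e = r[0];                 /* prediction error energy */
    if (e <= 0.0 || p > 15) return -1;
    for (int i = 1; i <= p; i++) {
        double k = r[i];             /* reflection coefficient */
        for (int j = 1; j < i; j++)
            k -= a[j] * r[i - j];
        k /= e;
        a[i] = k;
        for (int j = 1; j < i; j++)  /* update earlier coefficients */
            tmp[j] = a[j] - k * a[i - j];
        for (int j = 1; j < i; j++)
            a[j] = tmp[j];
        e *= 1.0 - k * k;            /* shrink the error energy */
        if (e <= 0.0) return -1;
    }
    return 0;
}
```

With this sign convention the predictor is x̂[n] = Σ aₖ·x[n-k], so the residual used later for pitch estimation is e[n] = x[n] - Σ aₖ·x[n-k], and the coefficients feed the LPCC recursion of section 1.2.2 directly.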

1.2.3 Extraction of pitch parameters

There are many pitch estimation methods, mainly those based on the short-time autocorrelation function and on the short-time average amplitude difference function (AMDF).

(1) Pitch estimation based on short-time autocorrelation function

The short-time autocorrelation function has large peaks at integer multiples of the pitch period, so the pitch period can be estimated simply by finding the position of the first major peak.

(2) Pitch estimation based on short-time average amplitude difference function (AMDF)

Because the short-time average amplitude difference function (AMDF) has deep valleys at integer multiples of the pitch period, the pitch period can be estimated by finding the position of the first major valley. The disadvantage of this method is that when the amplitude of the speech signal changes rapidly, the valleys of the AMDF become shallower, which affects the accuracy of the pitch estimate.

In practice, the position of the first major peak (or valley) does not always coincide with the pitch period: it depends on the length of the short-time window and is affected by the formants. In general, the window should span at least two pitch periods to obtain a good estimate. The longest pitch period in speech is about 20 ms, so this system uses a 40 ms window for pitch estimation. To reduce the influence of the formants, the speech is first band-pass filtered to the range [60, 900] Hz: since the highest pitch is about 450 Hz, an upper cutoff of 900 Hz retains the first and second harmonics, while the lower cutoff of 60 Hz filters out 50 Hz power-line interference.

Both of the above methods operate on the speech signal itself. The pitch estimation method adopted by this system is as follows: first, linear prediction is applied to the band-pass-filtered short-time speech signal to obtain the prediction residual; then the autocorrelation function of the residual is computed and the position of its first major peak gives the pitch estimate for that speech segment. Experiments show that the pitch trajectory obtained from the residual is better than the one obtained directly from the speech, as shown in Figure 2, where the horizontal axis is the speech frame index and the vertical axis is 8000/f, with f the pitch frequency.
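A sketch of this residual-based estimator: r[] below is the LPC prediction residual of one band-pass-filtered 40 ms frame, and the pitch period is taken at the lag of the largest autocorrelation peak in the admissible range (taking the global maximum over the lag range is a common simplification of "first major peak").

```c
#include <stdint.h>

#define WIN_LEN 320     /* 40 ms window at 8 kHz */
#define LAG_MIN 18      /* ~8000/450: highest pitch about 450 Hz */
#define LAG_MAX 160     /* 20 ms: longest pitch period in speech */

/* Returns the pitch period in samples (f0 = 8000/lag), or 0 if no
 * positive correlation peak is found (frame treated as unvoiced). */
int pitch_period(const int16_t *r)
{
    int32_t best = 0;
    int best_lag = 0;
    for (int lag = LAG_MIN; lag <= LAG_MAX; lag++) {
        int32_t acc = 0;
        for (int i = lag; i < WIN_LEN; i++)
            acc += ((int32_t)r[i] * r[i - lag]) >> 9;  /* scaled to
                                                          avoid overflow */
        if (acc > best) {
            best = acc;
            best_lag = lag;
        }
    }
    return best_lag;
}
```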

1.3 Pattern Matching

At present, research on pattern matching methods for the various feature parameters is becoming more and more in-depth. Typical methods include vector quantization, Gaussian mixture models, hidden Markov models, dynamic time warping (DTW) and artificial neural networks.

These methods each have advantages and disadvantages. The DTW algorithm requires too much computation for template matching of longer speech, but for short utterances (effective speech length under 3 s) it is simple and effective, and its recognition rate is no lower than that of the other methods; it is particularly suitable for short-utterance, text-dependent speaker recognition systems. This system uses a DTW algorithm with relaxed endpoints, which avoids the extra computation normally caused by endpoint relaxation while also loosening the accuracy requirements on endpoint detection.

The dynamic time warping (DTW) algorithm is based on dynamic programming and solves the problem of matching utterances whose lengths and speaking rates differ from one time to another. DTW computes the similarity between two templates of different lengths, expressed as a distortion distance. Let the test template T and the reference template R contain N and M frames of speech parameters in chronological order (12-dimensional LPCC parameters in this system); the smaller the distortion distance, the closer T and R are. Mark the frame numbers n = 1..N of the test template on the horizontal axis of a rectangular coordinate system and the frame numbers m = 1..M of the reference template on the vertical axis, as shown in Figure 3. Drawing vertical and horizontal lines through these integer coordinates forms a grid. Each grid point (n, m) represents the pairing of a frame of the test template with a frame of the reference template, and carries the Euclidean distance between the two feature vectors. The DTW algorithm then amounts to finding a path through the grid points such that the sum of the distances of the nodes along the path (the distortion distance) is minimized. With relaxed endpoints, the path search principle is the same; only the number of candidate paths increases.
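The grid search can be implemented with a two-row cost array, as in the sketch below. It uses the common horizontal/vertical/diagonal local path with fixed endpoints; the relaxed-endpoint variant additionally allows the path to start and end within a few frames of the template boundaries.

```c
#define DIM   12        /* 12-dimensional LPCC frames */
#define MAX_M 256       /* assumed upper bound on reference length */

/* Squared Euclidean distance between two LPCC frames. */
static double frame_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int i = 0; i < DIM; i++) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

/* DTW distortion distance between test template t (n frames) and
 * reference template r (m frames), m <= MAX_M. Each cell keeps the
 * cheapest path cost from (0,0) to that grid point. */
double dtw(const double t[][DIM], int n, const double r[][DIM], int m)
{
    static double D[2][MAX_M];
    int cur = 0;

    D[0][0] = frame_dist(t[0], r[0]);
    for (int j = 1; j < m; j++)                   /* first row */
        D[0][j] = D[0][j - 1] + frame_dist(t[0], r[j]);

    for (int i = 1; i < n; i++) {
        cur = i & 1;
        int prev = cur ^ 1;
        D[cur][0] = D[prev][0] + frame_dist(t[i], r[0]);
        for (int j = 1; j < m; j++) {
            double best = D[prev][j];                         /* vertical */
            if (D[cur][j - 1]  < best) best = D[cur][j - 1];  /* horizontal */
            if (D[prev][j - 1] < best) best = D[prev][j - 1]; /* diagonal */
            D[cur][j] = best + frame_dist(t[i], r[j]);
        }
    }
    return D[cur][m - 1];     /* distortion distance of the best path */
}
```

The accumulated value is compared against the matching threshold; in practice the distance is often normalized by the path length (for example divided by N + M) so that one threshold works across utterances of different lengths, a detail assumed here rather than stated in the article.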

2 Hardware System

The core of the voice electronic door lock system is the speaker recognition module, which includes key input, voice signal acquisition, voice signal processing, FLASH storage expansion, speaker output, control output and an LCD module. The block diagram of the speaker recognition module is shown in Figure 4. Its core is voice signal processing, for which this system uses the Lingyang 16-bit microcontroller SPCE061A, a chip particularly well suited to digital speech recognition; the SPCE061A also controls the other components through programming.

SPCE061A is a highly cost-effective 16-bit single-chip microcomputer developed by Lingyang. Over an operating voltage range of 2.6 V to 3.6 V, its operating frequency ranges from 0.32 MHz to 49.152 MHz; this high processing speed lets it handle complex digital signals easily and quickly. Its interrupt system supports 10 interrupt vectors and 14 interrupt sources (system clock, timer/counter, time-base generator, external interrupts, key wake-up, universal asynchronous serial communication and software interrupt), making it well suited to real-time applications. It embeds 2K words of SRAM and 32K words of FLASH and provides 32 programmable multi-function I/O lines. It contains a 7-channel 10-bit general-purpose A/D converter, a single-channel sound A/D converter with built-in microphone amplifier and automatic gain control (AGC), and a dual-channel 10-bit D/A converter with audio output capability. SPCE061A is manufactured in CMOS and adds software-selectable weak-oscillation, idle and power-down modes; in standby (clock stopped) the current draw is only 2 μA at 3.6 V, greatly reducing power consumption. In addition, the μ'nSP instruction set provides fast 16-bit × 16-bit multiply and inner-product instructions, giving the chip DSP capability; this makes complex digital signal processing convenient and much cheaper than a dedicated DSP chip.


The functions performed by the various components of the speaker recognition module are as follows:

(1) Key input part: there are 16 keys, including numeric keys and training, delete, confirm and cancel keys, used for password input and working mode selection. A 4×4 matrix keyboard is connected to just the lower 8 bits of port IOA, which support key wake-up; this makes good use of the hardware resources and is flexible to program (a hypothetical scanning sketch is given after this list).

(2) Voice signal acquisition part: 8 kHz voice acquisition is performed by the SPCE061A's single-channel sound A/D converter with built-in microphone amplifier and automatic gain control (AGC).

(3) FLASH storage expansion part: used to store the speakers' individual characteristic parameter reference templates.

(4) Speaker output part: the SPCE061A's dual-channel 10-bit D/A converter with audio output capability provides voice prompts for operations such as user training and recognition.

(5) Control output part: control the door lock control motor through the programmable I/O port of SPCE061A.

(6) LCD module: used to display the working status of the system. This part is optional based on cost and actual needs.

(7) SPCE061A: The speaker's speech signal processing and the programming control of each part are completed by SPCE061A.
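As an illustration of the matrix keyboard input in part (1), a hypothetical scanning routine is sketched below: rows on IOA0-IOA3 are driven, and columns on IOA4-IOA7 are read back. The register name and address are placeholders standing in for the Sunplus SDK's own I/O definitions, and the direction/attribute setup (rows as outputs, columns as pulled-up inputs) is omitted.

```c
#include <stdint.h>

/* Placeholder for the IOA data register; a real project would use the
 * register definitions from the Sunplus SDK headers instead. */
#define IOA_DATA (*(volatile uint16_t *)0x7000)

/* Scan the 4x4 matrix: pull one row low at a time and read the columns.
 * Returns a key code 0..15, or -1 if no key is pressed (active low). */
int scan_keypad(void)
{
    for (int row = 0; row < 4; row++) {
        IOA_DATA = (uint16_t)(~(1u << row) & 0x0Fu);  /* one row low */
        uint16_t cols = (IOA_DATA >> 4) & 0x0Fu;      /* read columns */
        for (int col = 0; col < 4; col++)
            if (!(cols & (1u << col)))
                return row * 4 + col;                 /* pressed key */
    }
    return -1;
}
```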

The speaker recognition module has three working modes: training mode, authentication mode and password mode. These three modes can be selected through the working mode button.

(1) Training mode: the speaker's voice enters the voice signal acquisition front-end circuit through the microphone, and training is performed three times (a control-flow sketch of this procedure follows this list). On the first utterance, the 16-bit single-chip microcomputer SPCE061A processes the collected voice signal, extracts the speaker's individual characteristic parameters, and stores them in the external FLASH to form the speaker's characteristic parameter template. On the second utterance, the extracted characteristic parameters are matched against the template formed from the first utterance; if the matching distance is below the template update threshold, the template is updated to the average of the two parameter sets. On the third utterance, the extracted parameters are matched against the template formed from the first two utterances; if the matching distance is again below the update threshold, the template is updated to the average of the three parameter sets, yielding the speaker's final characteristic parameter template.

(2) Authentication mode: the speaker's voice is likewise captured through the microphone. The SPCE061A processes the collected signal and matches the extracted speaker feature parameters against the feature parameter template stored in the external FLASH. If the matching distance is below the authentication threshold, authentication passes; the system then checks whether the matching distance is also below the template update threshold to decide whether to update the template.

(3) Password mode: if a cold or another cause temporarily changes the speaker's voice, a long password can be used for authentication instead, so that the user is not denied access for such exceptional reasons.

In addition, each user has a short password (which the user can modify) that must be entered in both training and authentication modes in order to create or look up the feature parameter template corresponding to that user. The system also provides a super administrator with a long password, who can add or delete user templates through the keyboard.
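To summarize the training and authentication logic above, here is a control-flow sketch. All types, thresholds and helper routines are hypothetical stand-ins for the module's real routines; in a DTW system the "averaging" of templates is done along the alignment path.

```c
#include <stdbool.h>

#define UPDATE_THRESHOLD 1000.0   /* placeholder value; tuned experimentally */

/* Hypothetical stand-ins for the module's real types and routines. */
typedef struct { int frames; /* LPCC + pitch data ... */ } Features;
typedef struct { Features avg; int count; } Template;

void   record_and_extract(Features *f);     /* record, preprocess, extract */
double dtw_distance(const Features *f, const Template *t);
void   init_template(Template *t, const Features *f);
void   average_into_template(Template *t, const Features *f, int pass);
void   store_template(const Template *t);   /* write to external FLASH */

/* Training mode: three passes; the second and third utterances are
 * folded into the template only if they match it closely enough. */
bool train_user(Template *tpl)
{
    Features f;

    record_and_extract(&f);                 /* pass 1: initial template */
    init_template(tpl, &f);

    for (int pass = 2; pass <= 3; pass++) {
        record_and_extract(&f);
        if (dtw_distance(&f, tpl) >= UPDATE_THRESHOLD)
            return false;                   /* prompt the user to retry */
        average_into_template(tpl, &f, pass);
    }
    store_template(tpl);                    /* final template to FLASH */
    return true;
}
```

Authentication mode applies the same dtw_distance() against the stored template using a separate authentication threshold, updating the template afterwards when the distance also falls below the update threshold.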

3 Experimental Results

For speaker verification systems, the two most important performance measures are the false rejection rate and the false acceptance rate. The former is the error of rejecting the true speaker; the latter is the error of accepting an impostor. Both depend on the matching threshold, whose setting depends on the application and functional emphasis of the voice lock system. For door lock users such as homes and hotels, the false acceptance rate must be as low as possible, even zero; for applications such as company employee attendance, the false rejection rate must not be too high. Table 1 shows the results of 100 real-time matching trials for each of the following situations, with the threshold set for door lock use.

Table 1. Results of 100 real-time matching trials per situation

The experimental results show that the false rejection rate for the same person speaking the same text is 8%. For the same person with similar pronunciations, since the system judges the speaker, either rejection or acceptance is reasonable in this case. For the same person speaking different texts, and for other people's utterances, the false acceptance rate is zero. Multiple experiments with recorded playback never passed authentication. For door lock users this result is very satisfactory; for applications such as attendance, it can be accommodated by modifying the matching threshold.

Compared with other biometric technologies, voiceprint recognition offers, beyond not being lost or forgotten, requiring no memorization and being easy to use, the following advantages: high user acceptance, since it involves no privacy concerns and users have no psychological barrier; and low-cost voice input equipment, whereas input devices for other biometric technologies are usually expensive. Compared with door locks based on iris, fingerprint or face recognition, the voice electronic door lock system built on the SPCE061A is inexpensive, easy to use and provides good security. Extensive experimental testing shows that the system performs stably with good recognition results; the next step is a small-batch trial to find and fix remaining problems. However, when environmental noise or interfering signals are louder than the voice signal, the system cannot recognize speech correctly, so background noise handling and its engineering still need further improvement.
