Research on Speaker Recognition in Coding Domain Based on DTW

Publisher: 温暖微笑 · Last updated: 2010-10-18 · Source: 电子技术应用 (Application of Electronic Technique)

Speaker recognition, also known as speaker identification, refers to automatically determining, by analyzing and processing a speech signal, whether the speaker belongs to a recorded set of speakers and, further, which speaker it is. The basic principle of speaker recognition is shown in Figure 1.

According to the speech content, speaker recognition can be divided into two types: text-independent and text-dependent. A text-independent system places no constraint on what the speaker says; its model is harder to build, but it is convenient for users. A text-dependent system requires the user to utter specified content during both enrollment and recognition, so it can achieve better recognition results.

With the development of network technology, VoIP (Voice over IP) telephony, which transmits voice over the Internet, has developed rapidly and become an important means of daily communication. More and more users are abandoning traditional telephony and communicating through computer networks and other media. Because of the way VoIP works, voice is encoded and decoded during transmission, and a VoIP device port must process many channels of compressed voice data simultaneously. VoIP speaker recognition therefore mainly studies how to recognize speakers at high speed and low complexity directly from the decoded parameters and the compressed bitstream.

Existing research on coding-domain speaker recognition mainly focuses on extracting speech feature parameters in the coding domain. The Hong Kong Polytechnic University studied extracting information from G.729 and G.723 coded bitstreams and residuals, adopting a fractional compensation method. The University of Science and Technology of China mainly studied speaker recognition for AMR speech coding. Northwestern Polytechnical University studied compensation algorithms for the differences among speech codecs in speaker verification, as well as extracting parameters directly from the G.729 coded bitstream. The speaker model is mainly GMM-UBM (Gaussian Mixture Model-Universal Background Model), the most widely used model in traditional speaker recognition. The performance of GMM-UBM is closely tied to the number of mixture components, and while it can maintain the recognition rate, its processing speed cannot meet the needs of high-speed speaker recognition in a VoIP environment.

This paper studies real-time speaker recognition in the G.729 coding domain of VoIP voice streams and successfully applies the DTW algorithm to text-dependent real-time speaker recognition in the G.729 coding domain.

1 Feature Extraction in G.729 Coded Bitstream

1.1 G.729 Coding Principle

ITU-T announced the G.729 codec in March 1996. Its coding rate is 8 kb/s, and it uses conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). At 8 kb/s it achieves synthesized speech quality no lower than that of 32 kb/s ADPCM, with an algorithmic delay of 15 ms. Because the G.729 codec combines high voice quality with low delay, it is widely used in data communication, for example in VoIP and H.323 multimedia communication systems.

The G.729 encoding process is as follows: the input 8 kHz-sampled digital speech is first preprocessed by high-pass filtering; linear prediction analysis is then performed on each 10 ms frame to compute the 10th-order linear prediction filter coefficients, which are converted into line spectrum pair (LSP) parameters and quantized with two-stage vector quantization. The adaptive codebook is searched by minimizing the perceptually weighted error between the original and synthesized speech, and the fixed codebook uses an algebraic codebook structure. The excitation parameters (adaptive and fixed codebook parameters) are determined once per subframe (5 ms, 40 samples).
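The per-frame LPC analysis in this step can be sketched with the standard Levinson-Durbin recursion (a minimal illustration only, not the G.729 reference implementation; G.729's asymmetric windowing, lag windowing, and bandwidth expansion are omitted, and nonzero frame energy is assumed):

```python
def levinson_durbin(r, order=10):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    r     : autocorrelation sequence r[0..order] of a 10 ms speech frame
            (r[0] > 0 is assumed).
    order : prediction order (10 in G.729).
    Returns the coefficients a[0..order] of A(z) = 1 + sum_k a[k] z^-k.
    """
    a = [1.0]       # A(z) of order 0
    e = r[0]        # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for order i
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e
        # Order update of the coefficient vector
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        e *= (1.0 - k * k)
    return a
```

For an ideal first-order autocorrelation r[k] = 0.5**k, the recursion recovers a single predictor tap of -0.5 and a zero second tap, which is a convenient sanity check.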

1.2 Feature Parameter Extraction

The LSP parameters can be obtained directly from the G.729 bitstream by inverse quantization according to the quantization algorithm. Since the later stages of the speaker recognition system also need the excitation parameters, which are computed from LSPs that have been interpolated and smoothed, the dequantized LSP parameters should likewise be interpolated and smoothed so that the channel and excitation parameters in the feature vector correspond exactly.
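A minimal sketch of this interpolation step, assuming the standard G.729 scheme in which the first 5 ms subframe uses the average of the previous and current frames' dequantized LSPs and the second subframe uses the current frame's values unchanged:

```python
def interpolate_lsp(prev_lsp, curr_lsp):
    """G.729-style LSP interpolation across the two 5 ms subframes.

    prev_lsp, curr_lsp : 10 dequantized LSP values of the previous and
                         current 10 ms frames.
    Returns (sub1, sub2): subframe 1 is the average of the two frames,
    subframe 2 is the current frame's LSPs.
    """
    sub1 = [0.5 * p + 0.5 * c for p, c in zip(prev_lsp, curr_lsp)]
    sub2 = list(curr_lsp)
    return sub1, sub2
```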

This paper selects, as channel feature parameters, the LSF parameters (the arc cosines of the LSP values) of the first subframe of each G.729 coded frame, together with the LPC and LPCC parameters converted from them.
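The two conversions named here can be sketched as follows. This is a hedged illustration: the intermediate LSP-to-LPC step (via the sum and difference polynomials) is omitted, and the predictor sign convention in the LPCC recursion is an assumption of the sketch:

```python
import math

def lsp_to_lsf(lsp):
    """LSFs are the arc cosines of the cosine-domain LSP values."""
    return [math.acos(q) for q in lsp]

def lpc_to_lpcc(a, n_cep=None):
    """Standard LPC-to-cepstrum recursion.

    a     : predictor coefficients a[1..p], here in the convention
            s[n] ~ sum_k a[k] * s[n-k]  (assumption of this sketch).
    n_cep : number of cepstral coefficients to produce (default p).
    """
    p = len(a)
    n_cep = n_cep or p
    c = []
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        # c_n = a_n + sum_{k} (k/n) c_k a_{n-k}
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a single-pole predictor a = [g], the recursion reproduces the known closed form c_n = g**n / n, which makes it easy to verify.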

Reference [1] found that adding the gain parameters in the G.729 compressed frame to the recognition features degraded speaker recognition performance: when the gain parameters GA1, GB1, GA2, and GB2 were removed from the compressed-bitstream features, leaving the feature vector X = (L0, L1, L2, L3, P1, P0, P2), recognition performance improved. Therefore, the G.729 compressed-bitstream feature finally adopted in this paper is X = (L0, L1, L2, L3, P1, P0, P2), with 7 dimensions in total.
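Under the standard 80-bit G.729 frame layout, the 7-dimensional feature can be unpacked directly from a compressed frame without decoding. The bit widths below follow the published G.729 bit allocation, but they are an assumption of this sketch and should be checked against the codec actually in use:

```python
# Bit widths of the G.729 frame parameters, in transmission order
# (standard 80-bit frame; an assumption of this sketch).
G729_FIELDS = [("L0", 1), ("L1", 7), ("L2", 5), ("L3", 5),
               ("P1", 8), ("P0", 1), ("C1", 13), ("S1", 4),
               ("GA1", 3), ("GB1", 4), ("P2", 5), ("C2", 13),
               ("S2", 4), ("GA2", 3), ("GB2", 4)]

FEATURE_KEYS = ("L0", "L1", "L2", "L3", "P1", "P0", "P2")  # the 7-dim X

def unpack_frame(bits):
    """Split an 80-bit G.729 frame (sequence of 0/1 ints) into parameters."""
    assert len(bits) == 80, "G.729 frame must be 80 bits"
    params, pos = {}, 0
    for name, width in G729_FIELDS:
        value = 0
        for b in bits[pos:pos + width]:
            value = (value << 1) | b
        params[name] = value
        pos += width
    return params

def feature_vector(bits):
    """The 7-dimensional feature X = (L0, L1, L2, L3, P1, P0, P2)."""
    p = unpack_frame(bits)
    return [p[k] for k in FEATURE_KEYS]
```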

2 Dynamic Time Warping (DTW) Identification Algorithm

Dynamic Time Warping (DTW) is a nonlinear warping technique that combines time alignment with distance measurement. Based on the idea of dynamic programming, it solves the problem of matching templates of different utterance lengths.

Algorithm principle: let the test speech and the reference speech be represented by T and R respectively. To compare their similarity, the distance D[T, R] between them is calculated; the smaller the distance, the higher the similarity. In the implementation, the speech is first preprocessed, and then T and R are divided into frame sequences at the same time interval:

T = {T(1), T(2), …, T(N)}, R = {R(1), R(2), …, R(M)}

where N and M are the numbers of frames in the test and reference templates.
Then dynamic programming is used for recognition, as shown in Figure 2.

Mark the frame numbers n = 1, …, N of the test template on the horizontal axis of a two-dimensional rectangular coordinate system and the frame numbers m = 1, …, M of the reference template on the vertical axis. The horizontal and vertical lines through these integer coordinates form a grid, in which each node (n, m) represents the pairing of a frame of the test template with a frame of the reference template. The dynamic programming algorithm amounts to finding a path through the nodes of this grid; the nodes the path passes through are the frame pairs between which distances are computed.

The whole algorithm mainly boils down to computing the distance between test and reference frames and accumulating the frame distances along the selected path.
The recognition process is shown in Figure 3.
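The grid search described above can be sketched as a standard DTW dynamic program. This is a minimal illustration using the common symmetric local path (horizontal, vertical, diagonal steps); the paper does not specify its exact local-path constraints, so that choice is an assumption here:

```python
import math

def euclidean(u, v):
    """Frame-to-frame distance: Euclidean distance between feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(test, ref, dist=euclidean):
    """Dynamic time warping distance between two frame sequences.

    test : N frames of the test utterance (e.g. the 7-dim G.729 features).
    ref  : M frames of the reference template.
    D[n][m] holds the minimum accumulated distance of any path ending at
    grid node (n, m); the answer is D[N][M].
    """
    n_len, m_len = len(test), len(ref)
    INF = float("inf")
    D = [[INF] * (m_len + 1) for _ in range(n_len + 1)]
    D[0][0] = 0.0
    for n in range(1, n_len + 1):
        for m in range(1, m_len + 1):
            d = dist(test[n - 1], ref[m - 1])
            # Best predecessor: vertical, horizontal, or diagonal step
            D[n][m] = d + min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1])
    return D[n_len][m_len]
```

Identical sequences yield distance 0, and a sequence matched against a time-stretched copy of itself stays close to 0, which is exactly the length tolerance the grid search is meant to provide.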

3 Experimental results, performance analysis and conclusions

To test the recognition performance described above, a fixed-text speaker recognition experiment was conducted using 300 recordings of 30 speakers from the telephone-channel 863 corpus, in 16-bit linear PCM format. To simulate the compressed voice frames in VoIP, the original voice files were compressed with a G.729 vocoder. One file per speaker was used for training as a template, giving 30 templates in the template library and 270 test utterances. The test speech lengths ranged from 10 s to 60 s at 5 s intervals, for a total of 11 test durations. The computer used was configured with a Pentium 2.0 GHz CPU and 512 MB of memory.

In the experiment, M and N are set to 64, and by cross-matching the templates it was determined that recognition is best with a decision threshold of 0.3.
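A hedged sketch of the resulting decision rule: the accumulated DTW distance is normalized and compared with the 0.3 threshold. Normalizing by N + M is an assumption of this sketch (the paper does not state its normalization), and scalar frames are used for brevity:

```python
def dtw_distance(test, ref):
    """Minimal DTW over scalar frame sequences (helper for this sketch)."""
    INF = float("inf")
    D = [[INF] * (len(ref) + 1) for _ in range(len(test) + 1)]
    D[0][0] = 0.0
    for n in range(1, len(test) + 1):
        for m in range(1, len(ref) + 1):
            d = abs(test[n - 1] - ref[m - 1])
            D[n][m] = d + min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1])
    return D[-1][-1]

def identify(test_frames, templates, threshold=0.3):
    """Closed-set identification with rejection.

    templates : dict mapping speaker name -> template frame sequence.
    Picks the template with the smallest length-normalized DTW distance;
    rejects (returns None) if that distance exceeds the threshold.
    Normalizing by N + M is an assumption of this sketch.
    """
    best_name, best_score = None, float("inf")
    for name, ref_frames in templates.items():
        score = dtw_distance(test_frames, ref_frames) / (
            len(test_frames) + len(ref_frames))
        if score < best_score:
            best_name, best_score = name, score
    if best_score <= threshold:
        return best_name, best_score
    return None, best_score
```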

To evaluate the recognition performance of the DTW algorithm, the GMM model widely used in traditional speaker recognition was taken as a comparative baseline, with the GMM using the same bitstream features as the DTW algorithm.

Figure 4 compares the error rates of the DTW method and the GMM model (64 mixture components) for text-dependent speaker recognition on the G.729-coded 863 corpus. The horizontal axis is the test speech duration and the vertical axis the recognition error rate. The results show that in text-dependent speaker recognition, the DTW algorithm achieves a higher recognition rate than the GMM model in most cases, and its advantage becomes more obvious as the test speech lengthens.

To compare the time performance of feature extraction and the overall time performance, the experimental conditions are as follows:

(1) The speech of 50 speakers was selected for feature extraction only, with a total test speech length of about 25 minutes;

(2) The test speech was recognized both after decoding and directly in the coded bitstream, each against 10 templates;

(3) The computer configuration is: CPU Pentium 2.0 GHz, memory 512 MB.

Table 1 shows the comparison results of feature extraction time, and Table 2 shows the comparison results of speaker recognition time.

The experimental results show that feature extraction and recognition in the coded bitstream take much less time than feature extraction and recognition on decoded, reconstructed speech, meeting the needs of real-time speaker recognition.

In text-dependent speaker recognition, compared with a GMM model using the same G.729 compressed-bitstream features, the DTW method achieves both a higher recognition rate and higher processing efficiency, and can be applied to real-time VoIP network supervision.
