Voice data processing of IP telephony gateway

Publisher: JoyfulJourney | Last updated: 2006-05-07 | Source: 电子技术应用

    Abstract: An implementation method for an integrated IP telephony gateway is proposed, and the processing of voice signals inside the gateway is analyzed. The implementation of voice sampling and playback, voice compression and decompression, RTP packet encapsulation and unpacking, and the sending and receiving of IP packets is described in detail.

    Keywords: IP telephony gateway; voice compression; RTP protocol

With the rapid development of IP telephony technology, the implementation of IP telephony is moving from PC-to-PC toward Phone-to-Phone. In the Phone-to-Phone case, an IP telephony gateway is needed to connect the PSTN and the Internet, so the IP telephony gateway has become one of the hot topics in current computer and communications research. Although many manufacturers at home and abroad are developing IP telephony gateways in different ways, they share a common feature: almost all of them use their own dedicated hardware. This article proposes a method of building a hardware-integrated IP telephony gateway from common off-the-shelf boards and studies the processing and implementation of voice data in such a gateway.

The hardware composition of the integrated IP telephony gateway is shown in Figure 1. It is based on a Pentium II PC equipped with a Dialogic D/41E voice card, an LSI C6200 resource card and a D-Link network card. The D/41E voice card performs voice sampling and playback. The C6200 resource card carries a TI TMS320C6201 DSP, which performs voice compression and decompression as well as echo cancellation. The Pentium II PC implements the main functions of the H.323 protocol stack, and the network card sends and receives IP packets. The processing and implementation of voice data in the IP telephony gateway are analyzed in detail below.

1 Voice sampling and playback

In this IP telephony gateway, voice sampling and playback are performed by the Dialogic D/41E voice card; voice sampling uses the recording function provided by the voice card. During real-time voice communication, the sampled voice data is stored in the voice sampling buffer, where it waits for the voice compression thread to take it out and process it. The recording function is called as follows:

dx_reciottdata(activeChdev, &chinfo[activeChdev].iott, &tptrec[0], &xpbVox, mode);

The meaning of the input parameters of this function is as follows:

int chdev: device handle of the voice channel
DX_IOTT *iott: pointer to the voice data destination
DV_TPT *tptp: pointer to the termination parameter block
DX_XPB *xpbp: pointer to the I/O transfer parameter block
unsigned short mode: recording mode

iott is a data structure of type DX_IOTT. Its io_type field can take the value IO_DEV or IO_MEM, which specifies whether the voice data is stored in a file or in a memory buffer; io_type may additionally carry the flags IO_CONT or IO_LINK, which describe how the voice data destinations are organized. If io_type is IO_DEV, io_fhandle should be a file handle; if io_type is IO_MEM, io_fhandle should be 0, and io_bufp then points to the start address of the buffer that stores the voice data. io_offset is the address offset, and io_length specifies the size of the file or buffer. If io_type includes IO_LINK, io_nextp points to the next DX_IOTT structure holding voice data and io_prevp points to the previous one. The DX_IOTT data structure is defined as follows:

typedef struct dx_iott {
    unsigned short io_type;      /* Transfer type */
    unsigned short rfu;          /* Reserved */
    int io_fhandle;              /* File descriptor */
    char *io_bufp;               /* Pointer to base memory */
    unsigned long io_offset;     /* File/buffer offset */
    long int io_length;          /* Length of data */
    struct dx_iott *io_nextp;    /* Ptr to next DX_IOTT if IO_LINK set */
    struct dx_iott *io_prevp;    /* (Optional) ptr to previous DX_IOTT */
} DX_IOTT;

The DV_TPT data structure is used to specify the conditions that terminate a function on a given voice channel. It is defined as follows:

typedef struct DV_TPT {
    unsigned short tp_type;      /* Flags describing this structure */
    unsigned short tp_termno;    /* Termination parameter number */
    unsigned short tp_length;    /* Length of terminator */
    unsigned short tp_flags;     /* Term. parameter attributes flag */
    unsigned short tp_data;      /* Optional additional data */
    unsigned short rfu;          /* Reserved for future use */
    struct DV_TPT *tp_nextp;     /* Pointer to next term. parameter if IO_LINK is set */
} DV_TPT;

The DX_XPB data structure specifies the recording format. wFileFormat can take the values FILE_FORMAT_VOX or FILE_FORMAT_WAVE, which select the VOX or WAV file format for storing voice data. wDataFormat can take the values DATA_FORMAT_DIALOGIC_ADPCM, DATA_FORMAT_MULAW, DATA_FORMAT_ALAW or DATA_FORMAT_PCM, which select ADPCM, μ-law, A-law or linear PCM encoding of the samples. nSamplesPerSec can take the values DRT_6KHZ, DRT_8KHZ or DRT_11KHZ, which specify sampling rates of 6 kHz, 8 kHz or 11 kHz. nBitsPerSample can be 4 or 8, the number of bits per sample; if wDataFormat selects the ADPCM algorithm, nBitsPerSample can only be 4. The DX_XPB data structure is defined as follows:

typedef struct {
    USHORT wFileFormat;          /* File format */
    USHORT wDataFormat;          /* Audio data format */
    ULONG nSamplesPerSec;        /* Sampling rate */
    ULONG nBitsPerSample;        /* Bits per sample */
} DX_XPB;

mode specifies the recording mode and can take the value PM_TONE, EV_SYNC or EV_ASYNC. PM_TONE plays a 200 ms tone before recording starts. EV_SYNC means that voice sampling is performed synchronously: other functions in the same thread are blocked until the recording function has completed. EV_ASYNC means that voice sampling is performed asynchronously, so other functions in the same thread can still run as usual.

Voice playback uses the voice card's playback function, whose parameters are similar to those of the recording function. The playback function is called as follows:

dx_playiottdata(activeChdev, &chinfo[activeChdev].iott, &tptplay[0], &xpbVox, mode);
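As an illustration, the following is a minimal sketch of how these structures might be filled in to record 8 kHz voice into a memory buffer; the buffer size, channel handle, sample format and termination settings are illustrative assumptions rather than values taken from the gateway program, and dx_playiottdata() is driven with the same parameter set.

/* Minimal sketch (not the gateway's actual code): record 8 kHz voice
 * into a memory buffer on one channel. */
#include <srllib.h>
#include <dxxxlib.h>

#define REC_BUF_SIZE (8000 * 2)            /* about 2 s of 8-bit samples at 8 kHz */
static char recBuf[REC_BUF_SIZE];

int start_recording(int chdev)
{
    DX_IOTT iott;
    DV_TPT  tpt;
    DX_XPB  xpb;

    /* Destination: a memory buffer rather than a file */
    iott.io_type    = IO_MEM | IO_EOT;
    iott.io_fhandle = 0;
    iott.io_bufp    = recBuf;
    iott.io_offset  = 0;
    iott.io_length  = REC_BUF_SIZE;

    /* Terminate the recording after a maximum time (value illustrative) */
    dx_clrtpt(&tpt, 1);
    tpt.tp_type   = IO_EOT;
    tpt.tp_termno = DX_MAXTIME;
    tpt.tp_length = 200;
    tpt.tp_flags  = TF_MAXTIME;

    /* 8 kHz mu-law samples in VOX framing, one of the formats listed above */
    xpb.wFileFormat    = FILE_FORMAT_VOX;
    xpb.wDataFormat    = DATA_FORMAT_MULAW;
    xpb.nSamplesPerSec = DRT_8KHZ;
    xpb.nBitsPerSample = 8;

    /* Start sampling asynchronously; the call returns immediately */
    return dx_reciottdata(chdev, &iott, &tpt, &xpb, EV_ASYNC);
}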

2 Phone status detection

The phone status detection functions are mainly used to determine the state of the telephone line, for example whether the receiver is picked up or put down, whether there is a dial tone, whether the line is busy, or whether nobody answers. In asynchronous mode, the voice card's ATDX_CPTERM() function is used to read the call-progress result on a voice channel; this step is not required in synchronous mode. A return value of CR_CEPT indicates a special information tone, i.e. an invalid number was dialed or some other special condition was encountered. CR_NORB means there is no ringback tone, i.e. no recognizable signal pattern could be detected. CR_NOANS means no answer: the number was dialed and ringback was heard, but nobody picked up. CR_BUSY indicates a busy tone. When the return value is CR_CNCT (connected), the ATDX_CONNTYPE(chdev) function can additionally be used to determine the type of connection; the possible return values CON_CAD, CON_LPC, CON_PVD and CON_PAMD indicate a connection detected by cadence, by loop current, by positive voice detection, or by positive answering machine detection, respectively.
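A minimal sketch of checking these return values once a dial attempt has completed (the channel handle and function name are illustrative assumptions, not the gateway's actual code):

/* Classify the call-progress result on one voice channel */
#include <srllib.h>
#include <dxxxlib.h>

const char *classify_call(int chdev)
{
    switch (ATDX_CPTERM(chdev)) {
    case CR_CNCT:  return "connected";     /* ATDX_CONNTYPE(chdev) gives the connect type */
    case CR_BUSY:  return "busy tone";
    case CR_NOANS: return "no answer";
    case CR_NORB:  return "no ringback";
    case CR_CEPT:  return "special information tone";
    default:       return "unknown";
    }
}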

When making a call, first use the ATDX_HOOKST(activeChdev) function to obtain the hook state of the line. If it is on-hook, use dx_sethook(activeChdev, DX_OFFHOOK, EV_SYNC) to take the line off-hook. Then pass the required parameters to the dialing function. The parameters are passed in a DX_CAP data structure, which is defined as follows:

typedef struct DX_CAP {
    unsigned short ca_nbrdna;   // # of rings before no answer
    unsigned short ca_stdely;   // Delay after dial before analysis
    unsigned short ca_cnosig;   // Duration of no-signal timeout delay
    unsigned short ca_lcdly;    // Delay after dial before lc drop connect
    unsigned short ca_lcdly1;   // Delay after lc drop connect before msg
    unsigned short ca_hedge;    // Edge of answer to send connect message
    unsigned short ca_cnosil;   // Initial continuous noise timeout delay
    unsigned short ca_lo1tola;  // % acceptable pos. dev. of short low sig.
    unsigned short ca_lo1tolb;  // % acceptable neg. dev. of short low sig.
    unsigned short ca_lo2tola;  // % acceptable pos. dev. of long low sig.
    unsigned short ca_lo2tolb;  // % acceptable neg. dev. of long low sig.
    unsigned short ca_hi1tola;  // % acceptable pos. dev. of high signal
    unsigned short ca_hi1tolb;  // % acceptable neg. dev. of high signal
    unsigned short ca_lo1bmax;  // Max. interval for short low for busy
    unsigned short ca_lo2bmax;  // Max. interval for long low for busy
    unsigned short ca_hi1bmax;  // Max. interval for 1st high for busy
    unsigned short ca_nsbusy;   // Num. of highs after nbrdna busy check
    unsigned short ca_logltch;  // Silence deglitch duration
    unsigned short ca_higltch;  // Non-silence deglitch duration
    unsigned short ca_lo1rmax;  // Max. short low duration of double ring
} DX_CAP;

This data structure contains a large number of parameters, and in general the default values can be used. When a value needs to be changed, it can be modified through the call dialog box provided by the program; the number to be dialed is also entered in this dialog box.
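A minimal sketch of placing a call with call-progress analysis enabled and default DX_CAP parameters follows; the channel handle, number and function name are illustrative assumptions rather than the gateway's own code.

/* Go off-hook if necessary, then dial asynchronously with call analysis */
#include <srllib.h>
#include <dxxxlib.h>

int place_call(int chdev, char *number)
{
    DX_CAP cap;

    /* Take the line off-hook first if it is currently on-hook */
    if (ATDX_HOOKST(chdev) == DX_ONHOOK)
        dx_sethook(chdev, DX_OFFHOOK, EV_SYNC);

    /* Start from the default call-analysis parameters */
    dx_clrcap(&cap);

    /* Dial with call-progress analysis; the result is read later
     * with ATDX_CPTERM(), as shown in section 2 */
    return dx_dial(chdev, number, &cap, DX_CALLP | EV_ASYNC);
}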

3 Voice data compression and decompression

G.723.1 and G.729 are both speech coding standards recommended by ITU-T H.323. G.723.1 uses the MP-MLQ and ACELP algorithms at bit rates of 6.3 kbps and 5.3 kbps respectively, while G.729 uses the CS-ACELP algorithm at a bit rate of 8 kbps. Since G.723.1 is superior to G.729 in both bandwidth and voice quality, the G.723.1 compression standard is generally used in IP telephony.
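At G.723.1's 30 ms frame length (see the next paragraph), these bit rates work out to the following payload sizes per frame, a back-of-the-envelope check rather than a figure from the article:

6.3 kbit/s × 0.030 s = 189 bits, i.e. about 24 bytes per frame
5.3 kbit/s × 0.030 s = 159 bits, i.e. about 20 bytes per frame

Compared with the 240 × 16 = 3840 bits of linear PCM in the same 30 ms (128 kbit/s), this is roughly a 20:1 reduction.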

In the integrated IP telephony gateway, voice compression is performed by the TMS320C6201 DSP on the C6200 resource card. The input to the G.723.1 encoder is 16-bit linear PCM sampled at 8 kHz. The voice signal sampled by the voice card can be delivered in several formats, including 8-bit codes at 8 kHz, so it must be converted before being fed to the G.723.1 encoder. Correspondingly, the decoder output must be converted back into a data stream in a format the voice card can play. The Dialogic D/41E voice card only handles ADPCM codes; higher-end voice cards such as the D/41ESC can handle both ADPCM and linear PCM. The encoder processes one frame of data every 30 ms; each frame contains 240 samples, and each sample occupies 16 bits. The general processing flow of the G.723.1 encoder/decoder is as follows:
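As an example of this format conversion, the following sketch assumes the voice card is configured to deliver 8 kHz μ-law samples (one of the formats listed in section 1) and expands them into the 16-bit linear PCM frames the G.723.1 encoder expects; ADPCM expansion would follow the same pattern with a different per-sample decoder. The function names are illustrative, not taken from the gateway program.

/* Expand 8-bit G.711 mu-law samples into 16-bit linear PCM frames */
#include <stddef.h>

#define FRAME_SAMPLES 240                  /* 30 ms at 8 kHz */

/* Expand one mu-law byte to a 16-bit linear PCM sample */
static short ulaw_to_linear(unsigned char u)
{
    u = (unsigned char)~u;                 /* mu-law bytes are stored complemented */
    int t = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4);
    return (u & 0x80) ? (short)(0x84 - t) : (short)(t - 0x84);
}

/* Convert one 30 ms frame before handing it to the encoder on the DSP card */
void expand_frame(const unsigned char *ulaw_in, short *pcm_out)
{
    for (size_t i = 0; i < FRAME_SAMPLES; i++)
        pcm_out[i] = ulaw_to_linear(ulaw_in[i]);
}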

Each frame is first passed through a high-pass filter to remove the DC component and is then divided into 4 subframes of 60 samples each. Each subframe is fed into a 10th-order linear prediction analysis to calculate the LPC coefficients. The LPC coefficients of the last subframe are quantized with a predictive split vector quantizer (PSVQ). The unquantized LPC coefficients are used to build a short-term perceptual weighting filter, through which the whole frame is passed to obtain the perceptually weighted speech signal. For every two subframes, the perceptually weighted speech is used to compute the open-loop pitch period, with a search range of 18 to 145 samples. For each subframe, a harmonic noise shaping filter is built from the estimated pitch period. The LPC synthesis filter, the formant perceptual weighting filter and the harmonic noise shaping filter are combined into a single filter, whose impulse response is then computed. The closed-loop pitch period is calculated from the open-loop estimate and the impulse response: a fifth-order pitch predictor performs a small closed-loop search centered on the open-loop pitch period to obtain the exact pitch period, and the contribution of the pitch predictor is then subtracted from the initial target vector. Finally, the non-periodic excitation is estimated: for the high rate of 6.3 kbps, multi-pulse maximum likelihood quantization (MP-MLQ) excitation is used; for the low rate of 5.3 kbps, algebraic codebook excitation (ACELP) is used.

4 Formation and unpacking of RTP packets

The formation and unpacking of RTP packets is performed by the gateway's main CPU (the Pentium II). RTP is an IETF proposed standard (RFC 1889, with the audio/video profile defined in RFC 1890). It is an application-independent protocol specification and can be given different profiles in specific applications. Each RTP packet consists of a header and a payload; the compressed voice data is placed in the payload. The first 12 bytes of the RTP header are fixed, and their format is shown in Figure 2.

In the RTP header, the marker bit M occupies 1 bit and is used to indicate boundaries in the voice data. PT occupies 7 bits and indicates the compression type of the voice payload. The sequence number occupies 16 bits; it is incremented by 1 for every RTP packet sent, and the receiver uses it to detect packet loss and out-of-order delivery during transmission. Its initial value is chosen at random. The timestamp occupies 32 bits and gives the sampling instant of the voice data in the RTP packet; it is mainly used for synchronization and delay calculation. The synchronization source identifier (SSRC) occupies 32 bits and identifies the synchronization source. The RTP packetization process is shown in Figure 3.

When the first RTP packet is generated, the initial sequence number is a random number rather than 0; this is done for security during communication. The SSRC identifier is a 32-bit random number, and within one RTP session no two sources may use the same SSRC value. The RTP unpacking process is the reverse of the packetization process and is not described further here.
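A minimal sketch of the fixed 12-byte header and of filling it in for one outgoing voice packet follows; the struct and function names are illustrative assumptions, since the article does not show the gateway's own packetization code.

/* Fixed 12-byte RTP header, as laid out in Figure 2 */
#include <stdint.h>
#include <arpa/inet.h>                     /* htons()/htonl() for network byte order */

typedef struct {
    uint8_t  vpxcc;                        /* version (2 bits), padding, extension, CSRC count */
    uint8_t  mpt;                          /* marker bit M (1 bit) + payload type PT (7 bits) */
    uint16_t seq;                          /* sequence number, incremented per packet */
    uint32_t timestamp;                    /* sampling instant of the payload */
    uint32_t ssrc;                         /* synchronization source identifier */
} rtp_header_t;

/* Fill in the fixed header; seq, ts and ssrc are maintained by the caller
 * (seq and ssrc start from random values, as described above). */
void rtp_build_header(rtp_header_t *h, int marker, uint8_t payload_type,
                      uint16_t seq, uint32_t ts, uint32_t ssrc)
{
    h->vpxcc     = (uint8_t)(2 << 6);      /* RTP version 2, no padding/extension/CSRC */
    h->mpt       = (uint8_t)((marker ? 0x80 : 0) | (payload_type & 0x7F));
    h->seq       = htons(seq);
    h->timestamp = htonl(ts);
    h->ssrc      = htonl(ssrc);
}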

5 Sending and receiving IP packets

The H.323 protocol stack software package provides API functions that comply with the H.323 protocol. In the program, mcInitialize() and mcSetEventHandler() are executed first to establish the H.245 control channel, and then mcOpenCall() is executed to establish the H.225.0 call signaling channel. The audio codec to be used is selected in the parameters of mcOpenCall(); the available codecs are G.711 (mandatory), G.722, G.723.1, G.728 and G.729. Once the H.245 control channel and the H.225.0 call signaling channel have been established, mcSendAudio() is used to send IP packets containing compressed voice data. When the call ends, mcCloseCall() closes the H.225.0 call signaling channel and the H.245 control channel.

The integrated IP telephony gateway proposed in this article has a simple hardware structure, is easy to maintain and upgrade, and offers a high performance-to-price ratio. More importantly, it broadens the range of design approaches for IP telephony gateways. This article has also described in detail how voice is processed and implemented at each stage in the gateway. Voice compression and echo cancellation are the key voice-processing technologies in the gateway, and they involve technical issues such as the operating efficiency of the gateway's DSP, optimization of the compression algorithms, improvement of voice quality, and noise elimination.
