DSP fixed-point arithmetic operation
1 Number calibration
In fixed-point DSP chips, fixed-point numbers are used for numerical calculations, and their operands are generally represented by integers. The maximum representation range of an integer depends on the word length given by the DSP chip, which is generally 16 bits or 24 bits. Obviously, the longer the word length, the larger the range of numbers that can be represented and the higher the precision. Unless otherwise specified, this book uses 16-bit word length as an example.
The numbers of DSP chips are represented in 2's complement form. Each 16-bit number uses a sign bit to indicate the positive or negative of the number, 0 indicates a positive value, and 1 indicates a negative value. The remaining 15 bits indicate the size of the value. Therefore,
the binary number 0010000000000011b=8195
The binary number 11111111111111100b=-4
For DSP chips, the numbers involved in numerical calculations are 16-bit integers. But in many cases, the numbers in the mathematical calculation process are not necessarily integers. So, how do DSP chips handle decimals? It should be said that the DSP chip itself is powerless. Does that mean that the DSP chip cannot handle various decimals? Of course not. The key is for the programmer to determine where the decimal point of a number is located in the 16-bit position. This is the calibration of the number.
By setting the decimal point in different positions in the 16-bit number, decimals of different sizes and different precisions can be represented. There are two types of number calibration: Q representation and S representation. Table 1.1 lists the 16 Q representations and S representations of a 16-bit number and the decimal value range they can represent.
From Table 1.1, it can be seen that for the same 16-bit number, if the decimal point is set in a different position, the number it represents is also different. For example, the
hexadecimal number 2000H=8192 is represented by Q0, and
the hexadecimal number 2000H=0.25 is represented by Q15
. However, for the DSP chip, the processing method is exactly the same.
From Table 1.1, it can also be seen that the numbers represented by different Qs not only have different ranges, but also different precisions. The larger the Q is, the smaller the numerical range is, but the higher the precision is; on the contrary, the smaller the Q is, the larger the numerical range is, but the lower the precision is. For example, the numerical range of Q0 is -32768 to +32767, and its precision is 1, while the numerical range of Q15 is -1 to 0.9999695, and its precision is 1/32768=0.00003051. Therefore, for fixed-point numbers, the numerical range and precision are a contradiction. If a variable wants to represent a relatively large numerical range, it must sacrifice precision; and if the precision is to be improved, the representation range of the number will be reduced accordingly. In actual fixed-point algorithms, in order to achieve the best performance, this point must be fully considered.
The conversion relationship between floating-point numbers and fixed-point numbers can be expressed as:
Floating-point number (x) converted to fixed-point number (xq): xq=(int)x* 2Q
Fixed-point number (xq) converted to floating-point number (x): x=(float)xq*2-Q
For example, floating-point number x=0.5, calibration Q=15, then fixed-point number xq=L0.5*32768J=16384, where LJ represents rounding down. Conversely, a fixed-point number 16384 represented by Q=15, its floating-point number is 163*2-15=16384/32768=0.5. When converting floating-point numbers to fixed-point numbers, in order to reduce truncation error, 0.5 can be added before rounding.
Table 1.1 Q represents, S represents and value range
Q represents S represents decimal number range
Q15 S0.15 -1≤x≤0.9999695
Q14 S1.14 -2≤x≤1.9999390
Q13 S2.13 -4≤x≤3.9998779
Q12 S3.12 -8≤x≤7.9997559
Q11 S4.11 -16≤x≤15.9995117
Q10 S5.10 -32≤x≤31.9990234
Q9 S6.9 -64≤x≤63.9980469
Q8 S7.8 -128≤x≤127.9960938
Q7 S8.7 -256≤x≤255.9921875
Q6 S9.6 -512≤x≤511.9804375
Q5 S10.5 -1024≤x≤1023.96875
Q4 S11.4 -2048≤x≤2047.9375
Q3 S12.3 -4096≤x≤4095.875
Q2 S13.2 -8192≤x≤8191.75
Q1 S14.1 -163 84≤x≤16383.5
Q0 S15.0 -32768≤x≤32767
2 High-level language: from floating point to fixed point
When we write DSP simulation algorithms, for convenience, we usually use high-level languages (such as C language) to write simulation programs. The variables used in the program generally include both integers and floating point numbers. For example, in Example 1.1, the variable i is an integer, while pi is a floating point number, and hamwindow is a floating point array.
Example 1.1 256-point Hamming window calculation
int i; +
float pi = 3.14l59;
float hamwindow[256];
for (i = 0; i < 256; i++) hamwindow = 0.54-0.46 * cos (2.0 * pi * i / 255);
If we want to implement the above program with a certain fixed-point DSP chip, we need to rewrite the above program into an assembly language program for the DSP chip. In order to facilitate DSP program debugging and simulate the algorithm performance when implementing fixed-point DSP, it is generally necessary to rewrite the high-level language floating-point algorithm into a high-level language fixed-point algorithm before writing the DSP assembly program. Next, we discuss the fixed-point implementation of basic arithmetic operations.
2.1 C language fixed-point simulation of addition/subtraction operations
Let the expression of floating-point addition operation be:
float x, y, z;
z=x+y;
The most important point when converting floating-point addition/subtraction into fixed-point addition/subtraction is to ensure the calibration of the two operands
temp=x+temp;
z=temp>>(Qx-Qz), if Qx>=Qz
z=temp<<(Qz-Qx), if Qx<=Qz
Example 1.4 Fixed-point addition with a result exceeding 16 bits
Let x=l5000, y=20000, then the floating-point operation value is z=x+y=35000, obviously z>32767, so
Qx=1, Qy=0, Qz=0, then the fixed-point addition is:
x=30000; y=20000;
temp=20000<<1=40000;
temp=temp+x=40000+30000=70000;
z=70000L>>1=35000;
because the Q value of z is 0, the fixed-point value z=35000 is a floating-point value, where z is a long integer. When the result of addition or addition exceeds the 16-bit representation range, if the programmer can know this situation in advance and needs to maintain the calculation accuracy, the 32-bit result must be maintained. If the program is operated according to 16-bit numbers, exceeding 16 bits actually means overflow. If appropriate measures are not taken, data overflow will cause a serious deterioration in calculation accuracy. General fixed-point DSP chips do not have overflow protection function. When the overflow protection function is effective, once an overflow occurs, the result of the accumulator ACC is the maximum saturation value (overflow is 7FFFH, underflow is 8001H), thereby preventing the overflow from causing a serious deterioration in accuracy.
2.2 Fixed-point simulation of multiplication in C language
Assume that the expression of floating-point multiplication is:
float x, y, z;
z=xy;
Assume that after statistics, the calibration value of x is Qx, the calibration value of y is Qy, and the calibration value of the product z is Qz, then
z=xy
zq*2-Qx=xq*yq*2-(Qx+Qy)
zq=(xqyq)2Qz-(Qx+Qy)
So the fixed-point representation of multiplication is:
int x, y, z;
long temp;
temp=(long)x;
z=(temp*y)>>(Qx+Qy-Qz);
Example 1.5 Fixed-point multiplication.
Assume x=18.4, y=36.8, then the floating point operation value is =18.4*36.8=677.12;
according to the previous section, we have Qx=10, Qy=9, Qz=5, so
x=18841; y=18841;
temp=18841L;
z=(18841L*18841)>>(10+9-5)=354983281L>>14=21666;
because the calibration value of z is 5, the fixed point z=21666, that is, the floating point z=21666/32=677.08.
2.3 C language fixed-point simulation of division operation
Assume that the expression of floating-point division operation is:
float x, y, z;
z=x/y;
Assume that after statistics, the calibration value of dividend x is Qx, the calibration value of divisor y is Qy, and the calibration value of quotient z is Qz, then
z=x/y
zq*2-Qz=(xq*2-Qx)/(yq*2-Qy)
zq=(xq*2(Qz-Qx+Qy))/yq
So the fixed-point representation of division is:
int x, y, z;
long temp;
temp=(long)x;
z=(temp<<(Qz-Qx+Qy))/y;
Example 1.6 Fixed-point division.
Suppose x=18.4, y=36.8, the floating-point operation value is z=x/y=18.4/36.8=0.5;
according to the previous section, we have Qx=10, Qy=9, Qz=15; so
z=18841, y=18841;
temp=(long)18841;
z=(18841L<<(15-10+9)/18841=3O8690944L/18841=16384;
because the calibration value of the quotient z is 15, the fixed-point z=16384, that is, the floating-point z=16384/215=0.5.
2.4 Determination of Q value of program variables
In the examples introduced in the previous sections, since the values of x, y, and z are all known, the Q value is easy to determine when changing from floating point to fixed point. In actual DSP applications, variables are involved in the calculation in the program, so how to determine the Q value of variables in floating-point programs? From the previous analysis, we can know that determining the Q value of a variable is actually determining the dynamic range of the variable. Once the dynamic range is determined, the Q value is also determined.
Let the maximum absolute value of the variable be |max|, and note that |max| must be less than or equal to 32767. Take an integer n that satisfies
2n-1<|max|<2n
, then
2-Q=2-15*2n=2-(15-n)
Q=15-n
For example, the value of a variable is between -1 and +1, that is, |max|<1, so n=0, Q=15-n=15.
Since the Q value of a variable can be determined by determining its |max|, how is the |max| of the variable determined? Generally speaking, there are two ways to determine the |max| of a variable. One is theoretical analysis and the other is statistical analysis.
1. Theoretical analysis
The dynamic range of some variables can be determined through theoretical analysis. For example:
(1) Trigonometric function. y=sin(x) or y=cos(x). From the knowledge of trigonometric function, we know that |y|<=1.
(2) Hamming window. y(n)=0.54-0.46cos[nπn/(N-1)], 0<=n<=N-1. Because -1<=cos[2πn/(N-1)]<=1, 0.08<=y(n)<=1.0.
(3) FIR convolution. y(n)=∑h(k)x(nk), let ∑|h(k)|=1.0, and x(n) is the 12-bit quantized value of the analog signal, that is, |x(n)|<=211, then |y(n)|<=211.
(4) Theoretically, it has been proved that in the program design of autocorrelation linear prediction coding (LPC), the reflection coefficient ki satisfies the following inequality: |ki|<1.0, i=1, 2, ..., p, p is the order of LPC.
2. Statistical analysis method
For variables whose range cannot be determined theoretically, statistical analysis is generally used to determine their dynamic range. The so-called statistical analysis is to use enough input signal samples to determine the dynamic range of the variables in the program. Here, the input signal must have a certain number on the one hand, and on the other hand, it must involve various situations as much as possible. For example, in speech signal analysis, statistical analysis must collect enough speech signal samples, and the collected speech samples should include various situations as much as possible. Such as the volume, the type of sound (male voice, female voice, etc.). Only in this way can the statistical results be typical.
Of course, statistical analysis cannot cover all possible situations. Therefore, some protection measures can be taken for the statistical results when designing the program, such as appropriately sacrificing some accuracy, taking the Q value slightly larger than the statistical value, and using the overflow protection function provided by the DSP chip.
2.5 Example of C program for floating point to fixed point conversion
In this section, we use an example to illustrate the method of converting C programs from floating point to fixed point. This is a C language program for low-pass filtering of speech signals (0.3~3.4kHz). The cutoff frequency of the low-pass filter is 800Hz, and the filter uses a 19-point finite impulse response FIR filter. The sampling frequency of the speech signal is 8kHz, and each speech sample value is stored in the insp.dat file as a 16-bit integer.
Example 1.7 C language floating point program for speech signal 800Hz 19-point FIR low-pass filter.
#i nclude
const int length = 180 /*The voice frame length is 180 points = 22.5ms @ 8kHz sampling*/
void filter(int xin[], int xout[], int n, float h[]); /*Filter subroutine description*/
/*19-point filter coefficients*/
static float h[19] =
{0.01218354, -0.009012882, -0.02881839, -0.04743239, -0.04584568, -0.008692503, 0.06446265, 0.1544655,
0.2289794, 0.257883,
0.2289794, 0.1544655, 0.06446265, -0.008692503, -0.04584568,
-0.04743239, -0.02881839, -0.009012882, O.01218354};
static int xl[length+20];
/* Low pass filter Wave floating point subroutine */
void filter(int xin[], int xout[], int n, float h[])
{
int i, j;
float sum;
for(i=0; i for(i=0; i<length; i++)
{
sum=0.0;
for(j=0; j<n; j++)sum+=h[j]*x1[i-j+n-1];
xout=(int)sum;
for(i=0;i<(nl);i++)x1[ni-2]=xin[length-1-i];
}
/*Main program*/
void main()
FILE *fp1,*fp2;
int frame,indata[length],outdata[length];
fp1=fopen(insp.dat,"rb");/*Input voice file*/
fp2=fopen(Outsp.dat,"wb");/*Filtered voice file*/
frame=0;
while(feof(fp1) ==0)
{
frame++;
printf("frame=%d\n",frame);
for(i=0;i<length;i++)indata=getw(fp1); /*Get a frame of voice data*/
filter(indata,outdata,19,h);/*Call low-pass filter subroutine*/
for(i=0; i<length; i++)putw(outdata, fp2);/*Write the filtered sample values to the file*/
}
fcloseall();/*Close the file*/
return(0);
}
Example 1.8 Speech signal 800Hz 19-point FIR low-pass filter C language fixed-point program.
#i nclude
const int length=180;
void filter (int xin[], int xout[], int n, int h[]);
static int h[19]={399, -296, -945, -1555, -1503, -285, 2112, 5061, 7503, 8450, 7503, 5061, 2112,
-285, - 1503, -1555, -945, -296, 399}; /*Q15*/
static int x1[length+20];
/*Low-pass filter fixed-point subroutine*/
void filter(int xin[], int xout[], int n, int h[])
int i, j;
long sum;
for (i=0; i<length; i++)x1[n+i-111=xin];
for(i=0;i<1ength;i++)
sum=0;
for(j=0;j<n;j++)sum+=(long)h[j]*x1[i-j+n-1];
xout=sum>>15;
for(i=0;i<(n-1);i++)x1[ni-2]=xin[length-i-1];
}
The main program is exactly the same as that of floating point. "
3 DSP fixed-point arithmetic operations
The numerical representation of fixed-point DSP chips is based on 2's complement representation. Each 16-bit number is represented by l sign bits, i integer bits and 15-i decimal places. Therefore:
00000010.10100000
represents the value of:
21+2-1+2-3=2.625
. This number can be represented in Q8 format (8 decimal places), and the numerical range it represents is -128 to +l27.996. The decimal precision of a Q8 fixed-point number is 1/256=0.004.
Although special cases (such as dynamic range and accuracy requirements) must use mixed representation. However, it is more common to work with all fractional numbers represented in Q15 format or integers represented in Q0 format. This is particularly true for signal processing algorithms that are mainly multiplication and accumulation. Fractional multiplication by fractional multiplication results in fractional multiplication, and integer multiplication by integer results in integer. Of course, overflow may occur when multiplying and accumulating. In this case, the programmer should understand the physical process in mathematics to pay attention to possible overflow. Let's discuss DSP fixed-point operations for multiplication, addition and division. The assembly program takes TMS320C25 as an example.
3.1 Fixed-point multiplication
When two fixed-point numbers are multiplied, there are three cases:
1. Decimal multiplication
Example 1.9 Q15*Q15=Q30
0.5*0.5=0.25
0.1000000000000000;Q15
* 0.100000000000000;Q15
--------------------------------------------
00.0100000000000000000000000000000=0.25;Q30
After multiplying two Q15 decimals, a Q30 decimal is obtained, that is, there are two sign bits. In general, the full-precision number obtained after multiplication does not need to be retained in full, but only 16-bit single-precision numbers need to be retained. Since the high 16 bits after multiplication are less than 15 bits of small data, in order to achieve 15-bit accuracy, the product can be shifted left by one bit. The following is the TMS320C25 program for the above multiplication:
LT OP1; OP1=4000H(0.5/Q15)
MPY OP2; oP2=4000H(0.5/Ql5)
PAC
SACH ANS, 1; ANS=2000H(0.25/Q15)
|