SMS filtering system based on improved balanced Winnow algorithm-EEWORLD

Collect

Mobile phone text messages have become an important means of communication and exchange due to their advantages of being short, fast, simple and inexpensive, and are favored by many people. However, mobile phone text messages, like emails, have the problem of spam.

At present, spam SMS filtering mainly includes blacklist filtering, keyword filtering and content filtering based on text classification. Blacklist filtering and keyword filtering can quickly filter spam SMS, but these two filtering methods are actually rule-based filtering. Although some spam SMS are blocked to a certain extent, the rule-based method requires more user-defined settings and is easily counter-filtered. SMS filtering based on text classification uses common classification algorithms, such as naive Bayes, SVM, neural network, etc. Li Lu et al. applied Bayesian classification to the J2ME simulation environment and successfully filtered winning SMS and blessing SMS. Jin Zhan, Fan Jing et al. of Zhejiang University combined naive Bayes with support vector machine to solve the problem of reduced filtering performance caused by failure to update SMS features and content in time in traditional spam SMS filtering system. Wang Zhongjun conducted experimental analysis and comparison between the naive Bayes SMS filtering algorithm and the minimum risk Bayes algorithm, and concluded that the minimum risk SMS filtering algorithm has better performance.

However, the accuracy of SMS filtering depends on the quantity and quality of its training samples. These classification algorithms need to be trained and learned to establish a classifier model, so they cannot meet the real-time requirements of SMS filtering in terms of speed.

From the perspective of existing technology, the accuracy and efficiency of spam text message filtering still cannot meet actual needs.

In view of the shortcomings of existing SMS filtering technology, this paper designs an SMS filtering system for mobile terminals, combining blacklist and whitelist with content-based filtering according to the characteristics of spam SMS. This filtering method requires the ability to quickly classify SMS and meet the user's personalized requirements for SMS filtering, so that the spam SMS filtering system has better filtering performance.

The Winnow algorithm is a linear classification algorithm proposed by Nick Littlestone in 1987 and its feasibility was rigorously proved. The goal at that time was to find an algorithm whose time and space complexity is only linearly related to the number of attributes related to the classification object. The balanced Winnow algorithm is an improvement on the basic Winnow algorithm. The algorithm has the advantages of fast filtering speed, good performance, and support for feedback updates. It has a good application prospect in the field of information filtering, especially suitable for SMS filtering systems with high real-time requirements.

This paper designs and implements a text message content filtering system based on the balanced Winnow algorithm, and analyzes the application of the algorithm in the text message filtering system in detail. The training process of the classifier is divided into four parts: preprocessing, training, classification and feedback.

1 Preprocessing module

The preprocessing module includes Chinese word segmentation, feature extraction, and SMS vector representation sub-modules.

1.1 Chinese word segmentation

Chinese word segmentation is a research topic unique to Chinese. In Indo-European languages such as English and French, there is a natural segmentation between words, and generally there is no problem of word segmentation. This system uses the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) developed by the Institute of Computing Technology of the Chinese Academy of Sciences, which is currently widely used in China.

ICTCLAS 3.0 has a single-machine word segmentation speed of 996 Kb/s, a word segmentation accuracy of 98.45%, an API of no more than 200 KB, and various dictionary data of less than 3 MB after compression. It is currently a relatively good Chinese lexical analyzer.

1.2 Feature Extraction

There are many methods of feature extraction at present. Commonly used feature selection methods include: document frequency DF (Document Frequency), information gain IG (Information Gain), mutual information MI (Mutual Information), χ2 ^statistics , etc.

This paper takes the words after word segmentation as candidate features, and then uses the feature extraction algorithm to extract the most useful features for classification, and removes the candidate features that do not contribute much to classification, so as to reduce the dimension of the features. The main idea of χ2 ^is that the terms and categories conform to the χ2 ^distribution . The higher the value of the ^χ2 statistic, the smaller the independence and the stronger the correlation between the feature item and the category, that is, the greater the contribution of the feature item to this category. ^χ2 is a normalized value. This method can reduce about 50% of the vocabulary compared with other methods, and has the advantage of good classification effect. This paper uses χ2 ^statistics for feature extraction.

But it is not simply to set the weight of the feature item xi = 1 or 0, but to set xi = f(χ ² ) or 0, where χ ^{2
specifically refers to the χ}² statistic corresponding to the feature , and the corresponding relationship f depends on the actual situation. In the experiment, let (n is a positive integer, take n = 4). The experiment shows that it is better than using Boolean weight representation.

1.3 Text Vector Representation

The most widely used model is the Vector Space Model (VSM). In this paper, VSM is used to represent a short message as a vector form of (W1, W2, ..., Wk, ..., Wn). Among them: Wk (k=1, 2, ..., n) is the weight of the kth feature, and n is the number of selected features.

2 Constructing a classifier

Training the classifier is the focus of the study, and the Balanced Winnow algorithm is adopted and improved.

2.1 Winnow Classification Algorithm

The Winnow algorithm is a linear classification algorithm for binary attribute data sets. The hyperplane equation representing the classification boundary in the linear classification problem is as follows:

w0α0+w1α1+w2α2+…+wkαk=0, where: α0, α1, …, αk are the values of the attributes respectively; w0, w1, …, wk are the weights of the hyperplane. If its value is greater than 0, it is predicted as the first category, otherwise it is predicted as the second category.

The Winnow algorithm is an error-driven classification algorithm, that is, the weight vector is updated only when a misclassified instance occurs. Two learning coefficients α and β are set (where α>1, β<1), and the weights are modified by multiplying the weights by the parameters α (or β).

2.2 Balanced Winnow Classification Algorithm

The standard Winnow algorithm does not allow negative weights, so there is another version of Winnow called balanced that allows negative weights.

For the basic form of the Winnow algorithm, each dimension of the weight vector is a positive number. Balanced Winnow replaces w with w+-w-, and when , the instance is classified as this class. The weight update strategy of Balanced Winnow is:

(1) If , but the text does not belong to this category, then the weight should be reduced: for j=1,,…,d, if xj≠0, then xj≠0, w+j =βw+j, wj =αw-j, α＞1, 0＜β＜1.

(2) If but the text should belong to this category, then the weight should be increased: for j=1,2,…,d, if xj≠0, then w+j =αw+j, wj =βw-j, α＞1, 0＜β＜1.

In the experiment, the method of unifying α and β as one parameter in the literature [7] was adopted, and β = 1/α was set. This did not affect the classification effect, but effectively simplified the parameter selection. Different θ values can be determined for different categories, but experiments show that: for different categories, the same θ value is selected, and the results are almost the same. Therefore, the same θ value is taken in each independent experiment. The size is the average number of features contained in the training text, and the initial w+ and w- are all 2 and all 1 vectors respectively.

In the balanced Winnow algorithm, once the parameters α, β and threshold θ are determined, the weight vectors w+ and w- will be continuously updated during the training process to best fit this set of parameters. Therefore, the dependence on parameters is small, and few parameters need to be manually adjusted.

2.3 Removing outliers

In SMS filtering, SMS samples are collected manually or automatically. Errors are inevitable in the collection process. Therefore, there may be some sample points that are artificially misclassified in the SMS sample set, that is, outliers. These outliers will cause serious jitter in the classifier during training, reducing the performance of the classifier. Therefore, a good classifier should have the ability to identify outliers.

For the Winnow algorithm, if there are wild points in the sample, the wild points will appear outside the two classification lines with a greater probability during training and will be classified incorrectly.

These wild points have a great impact on the training process of the classifier and may cause the classifier to "overlearn". Therefore, the loss function is introduced. According to the definition of the loss function, these wild points have a large loss. Therefore, the wild point problem in the linear classifier can be handled by setting an upper bound function for the loss function, as shown in Figure 1.

Figure 1 shows two types of linearly separable cases. The solid points and hollow points in the figure represent two types of training samples, respectively. H is the classification line where the two types of samples are not incorrectly separated. H1 and H2 are two straight lines that are parallel to the classification line H and have a unit distance from the classification line H. The straight line G(t) is the upper limit of the loss function after the tth round of iteration in the balanced Winnow algorithm. The upper limit is a function of the number of iterations t, so the upper limit function corresponding to the upper limit G(t) can be recorded as g(t). As can be seen from Figure 1, the loss of misclassified samples on the lower left side of the straight line G(t) is small, and it can be considered that these misclassified samples are misclassified due to the low performance of the current classifier; the misclassified samples on the upper right side of the straight line G(t) still have a large loss after the tth round of iteration, so these misclassified samples can be considered to be wild points. According to the properties of linear classifiers and wild points, it can be seen that the upper limit function g(t) has the following properties:

(1) As the number of iterations t in the Winnow algorithm increases, the upper bound function g(t) decreases monotonically, and the rate of decrease also decreases as t increases, that is, the derivative of the upper bound function g(t) is a monotonically decreasing function; (2) The upper bound function cannot be too large or too small. If it is too large, the ability to judge outliers will be reduced, and if it is too small, normal samples will be misjudged as outliers.

According to these characteristics of the upper bound function, a linear function parallel to the classification line H can be considered as the upper bound function of the loss function. That is, g(t)=

Where: ε is a constant value; the straight line G(t) is parallel to the classification line H; η is the loss factor, also known as the learning rate, and its value can be specified when training the classifier.

In each round of training, if the G(t) value of the sample is greater than the value of the classification line and exceeds a certain threshold, and does not belong to the class, the sample is judged to have the nature of an outlier and should be removed from the training set to improve the accuracy of the next round of training. This not only effectively weakens the jitter phenomenon of the classifier, but also improves the performance of the classifier.

3 System feedback

Winnow is an online learning, error-driven classifier, suitable for combining incremental learning to solve adaptive problems and realize user personalized requirements. The balanced Winnow algorithm is another form of the basic Winnow algorithm, which also has the ability to update online. During the classifier training process, the category weight vector of the misclassified SMS is updated through α and β to update the classifier, and the bidirectional adjustment of w+ and w- in the balanced Winnow algorithm makes the algorithm training faster, which is suitable for SMS filtering systems with high requirements for real-time classification.

4 Experimental resources, analysis and evaluation

This paper completes comparative experiments based on a self-built SMS corpus, which contains 1,892 normal SMS messages and 270 spam SMS messages. The SMS corpus is randomly divided into five equal parts, four of which are used as training samples and one as a test sample.

4.1 Evaluation Metrics

The evaluation indicators of the classification system are as follows, including the accuracy and recall of the two types of text messages. Since the system goal is to filter spam text messages, a comprehensive evaluation indicator for spam text messages (F1) is added:

F1=(2×precision×recall)/(precision+recall).

4.2 Experimental Results Analysis

(1) Experiment 1: Explore the impact of the improved feature weight calculation method on the experimental results. The experimental results are shown in Table 1.

Table 1 The impact of feature weight calculation method on experimental results

Among the test samples, 22 normal text messages were mistakenly classified as spam text messages, and the normal text message recall rate was 94.2%; 8 spam text messages were mistakenly classified as normal text messages, and the accuracy rate was only 67.7%.

(2) Experiment 2: The impact of unified parameters and fixed threshold θ on the experimental results. In this experiment, α=1.5, β=1/1.5, θ=15. The experimental results are shown in Table 2.

Table 2 Influence of selected parameters on experimental results

Among the test samples, 18 normal text messages were misclassified as spam, and the recall rate of normal text messages was 96.1%. However, 44 spam text messages used in the test were correctly identified, with an accuracy rate of 71.0%. This shows that the parameters have little effect on the experimental results.

(3) Experiment 3: The influence of removing wild points on the experimental results. The experimental results are shown in Table 3.

Table 3 The effect of removing wild points on experimental results

From the analysis of the experimental results, only 12 normal text messages and 8 spam text messages were misclassified. By removing wild points, it was found that not only the jitter phenomenon was alleviated, but also the classification performance of the classifier and the recall rate of normal text messages were improved.

Balanced Winnow has a great advantage in training speed and classification speed, so it has higher practical value and is very suitable for SMS filtering. In addition, as an online learning method, Winnow can quickly update the classifier when the training set is constantly expanding. It is based on Winnow's continuous learning and adjustment mechanism that makes it very suitable for users to customize the classification standards they need. With the continuous feedback and adjustment of users, the entire system will show better and better results.

Reference address：SMS filtering system based on improved balanced Winnow algorithm

Previous article：Transplantation of Support Vector Machine Speech Recognition Algorithm on OMAP5912
Next article：Research and Design of Arbitrary Waveform Generator Using FPGA

Popular Resources
Popular amplifiers