[Mil MYS-8MMX] Mil MYS-8MMQ6-8E2D-180-C Application 2 - A Preliminary Study on NLP
Processing natural language (NL) with machines is currently a popular direction. One of its branches is enabling a machine to recognize a sentence of human speech, including its context, semantics, emotion, and so on.
The most important part is word segmentation, so today we try word segmentation on the Mil MYS-8MMQ6-8E2D-180-C.
The NLP library I tried today is jieba. First, install the library. Because a direct installation may fail with connection errors, you need to specify a mirror source:
pip3 install jieba -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Similarly, to install jieba for Python 2, the command is:
pip install jieba -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Let's try a simpler one first: "The bicycle was about to fall over, so I grabbed it by the handlebars." The Chinese character "把" appears several times in this sentence with different pronunciations and meanings: as the preposition 把 (third tone) in "grabbed", and as the noun 把 (fourth tone), referring to the handlebars of the bicycle.
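A minimal way to reproduce this looks roughly like the sketch below. The exact Chinese sentence is not quoted above, so the classic "把" tongue-twister in the sketch is only my assumption of the input; jieba.cut itself is the real API call:

# -*- coding: utf-8 -*-
from __future__ import print_function
import jieba

# Assumed test sentence: "The bicycle was about to fall over, so I grabbed
# it by the handlebars", with several readings of the character "把".
sentence = u"自行车快倒了，我一把把把把住了"
print("/".join(jieba.cut(sentence)))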
What is more interesting is that the code runs normally under Python 2 but fails under Python 3; the error seems to come from the re module.
Next, let's count the words in a famous literary work and list the 20 most frequent ones. Again, Python 3 fails, but it runs successfully under Python 2.
For our experiment we chose "War and Peace"; its content is familiar to everyone, so we will not spend words retelling it here.
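The counting itself only needs jieba plus Python's collections.Counter. A minimal sketch, assuming the text sits in a UTF-8 file whose name is a placeholder here (the attached getkeyword.py is presumably the script actually used):

# -*- coding: utf-8 -*-
from __future__ import print_function
import io
from collections import Counter
import jieba

# "war_and_peace.txt" is a placeholder file name, not the actual path used.
with io.open("war_and_peace.txt", encoding="utf-8") as f:
    text = f.read()

# Segment the whole text and print the 20 most frequent tokens.
words = jieba.lcut(text)
for word, count in Counter(words).most_common(20):
    print(word, count)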
From the above example we can see that jieba also cuts out the punctuation marks as separate tokens. Single-character tokens are not very meaningful, so we can simply discard every token of length 1 (which also removes the punctuation). Following common practice for Chinese, we can additionally apply a stop word list, which can be downloaded from https://gitee.com/chen_kailun/stopwords. There are four commonly used Chinese stop word lists (a filtering sketch follows the table):
Vocabulary name | Vocabulary file
Chinese stop word list | cn_stopwords.txt
HIT stop word list | hit_stopwords.txt
Baidu stop word list | baidu_stopwords.txt
Stop word library of the Machine Intelligence Laboratory of Sichuan University | scu_stopwords.txt
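Filtering by token length and by the Baidu list can be sketched like this (file paths are assumptions; adjust them to wherever the downloaded lists live):

# -*- coding: utf-8 -*-
from __future__ import print_function
import io
from collections import Counter
import jieba

# Load the Baidu stop word list, one word per line (path is an assumption).
with io.open("baidu_stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)

with io.open("war_and_peace.txt", encoding="utf-8") as f:
    text = f.read()

# Drop length-1 tokens (this also removes punctuation) and stop words.
words = [w for w in jieba.lcut(text) if len(w) > 1 and w not in stopwords]
for word, count in Counter(words).most_common(20):
    print(word, count)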
Select "Baidu Stop Words List" and directly call the functions textrank and extract_tags in jieba to obtain keywords and compare them with the high-frequency words we selected.
It can be seen that there is some overlap, for example "Duke" (Andrei is indeed the real protagonist), but there are also quite a few keywords that differ. How jieba selects keywords is not clear to me, but it is probably not a simple, crude pick of the most frequently occurring words.
In addition, I feel that the performance of single-board computers is still far behind laptops: the same code takes only a few seconds to a dozen or so seconds on my laptop. For comparison I again turned to the Raspberry Pi: I installed jieba under Python 2 on the Raspberry Pi 4 at hand and ran the same code:
Comparing the above two results, we find that the Raspberry Pi 4B needs only about half the time or less to do the same work (71 vs. 177, 174 vs. 466, 16 vs. 34). This differs from our earlier benchmark against the Pi, which found the MYS-8MMQ6-8E2D-180-C only slightly weaker than the Raspberry Pi 4B (see: https://en.eeworld.com/bbs/thread-1175554-1-1.html).
In addition, in the output on the MYS-8MMQ6-8E2D-180-C, "East:530" strangely became ":11679". I don't know whether this is an encoding error.
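For anyone repeating the timing comparison, a plain wall-clock wrapper is enough; the sketch below is my own, not necessarily how the attached script measures it:

# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import time
import jieba

with io.open("war_and_peace.txt", encoding="utf-8") as f:  # placeholder path
    text = f.read()

# Time only the segmentation step using wall-clock time.
start = time.time()
words = jieba.lcut(text)
print("segmentation took %.1f s" % (time.time() - start))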
Attachment: getkeyword.py