[Mil MYS-8MMX] Mil MYS-8MMQ6-8E2D-180-C Application 3 - NLP Part of Speech Analysis and Application

In the previous article, we talked about using jieba for word segmentation. In this article, we will continue to study the use of jieba.

Another important jieba feature is part-of-speech (POS) tagging, which labels each segmented word with its part of speech. Parts of speech in modern Chinese fall into four categories: content words, function words, interjections, and onomatopoeia.

Content words have actual meaning and can serve independently as sentence components; that is, they carry both lexical and grammatical meaning. They include substantives (nouns, numerals, and measure words), predicates (verbs and adjectives), modifiers (adverbs), and pronouns. The main function of pronouns is substitution: they can stand in for nouns, numerals, measure words, verbs, adjectives, and adverbs, and their grammatical function depends on what they replace.

Function words have no complete meaning of their own but carry grammatical meaning or function. They must attach to content words or sentences to express that meaning, and they cannot form a sentence alone, serve as a sentence component alone, or be reduplicated. They include relational words (conjunctions and prepositions) and auxiliary words (particles and modal particles).

Onomatopoeia and interjections are neither content words nor function words; they are classified as special word classes. They usually have no structural relationship with the other words in a sentence.

In NLP, beyond word segmentation, text can also be analyzed through part-of-speech tagging. With jieba's default dictionary, the commonly used tags are:

x: punctuation mark

eng: English word

a: adjective

n: noun

nr: personal name

ns: place name

nt: organization name

r: pronoun

t: time word

f: direction word
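As a minimal sketch of how these tags might be read in practice, the snippet below wraps jieba's `jieba.posseg.cut` and attaches a readable description to each tag. The helper names `tag_text` and `describe` and the `TAG_NAMES` table are illustrative assumptions, not part of jieba; only the tags from the list above are covered.

```python
# Readable descriptions for the tags listed above (illustrative table, not jieba's).
TAG_NAMES = {
    "x": "punctuation mark",
    "eng": "English word",
    "a": "adjective",
    "n": "noun",
    "nr": "personal name",
    "ns": "place name",
    "nt": "organization name",
    "r": "pronoun",
    "t": "time word",
    "f": "direction word",
}


def tag_text(text):
    """Return (word, flag) tuples produced by jieba's POS tagger."""
    import jieba.posseg as pseg  # third-party package: pip install jieba

    return [(pair.word, pair.flag) for pair in pseg.cut(text)]


def describe(pairs):
    """Attach a description to each (word, flag) tuple; unknown tags become 'other'."""
    return [(word, flag, TAG_NAMES.get(flag, "other")) for word, flag in pairs]
```

Calling `describe(tag_text(some_sentence))` would give one `(word, tag, description)` row per token; the exact tags depend on the dictionary jieba has loaded.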

Let's analyze War and Peace again and count how many personal names it mentions, ignoring any name that appears fewer than 15 times.
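One way this count could be done is sketched below; it is not necessarily the code used for this post. It assumes the novel is stored as a UTF-8 text file, that personal names are the tokens jieba tags `nr`, and the helper names `tag_file` and `count_names` are made up for illustration.

```python
from collections import Counter


def tag_file(path, encoding="utf-8"):
    """Yield (word, flag) tuples for every token in a text file."""
    import jieba.posseg as pseg  # third-party package: pip install jieba

    with open(path, encoding=encoding) as f:
        for line in f:
            for pair in pseg.cut(line):
                yield pair.word, pair.flag


def count_names(pairs, min_count=15, name_flag="nr"):
    """Count tokens tagged as personal names, keeping those seen at least min_count times."""
    counts = Counter(word for word, flag in pairs if flag == name_flag)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]
```

With a file path of your own, `count_names(tag_file(path))` returns the names sorted from most to least frequent.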

The results show that jieba's part-of-speech tagging is not particularly accurate: "army" and "marshal" are both tagged as personal names.

Maybe that is because the novel is a foreign work? Let's try a martial arts novel instead, such as "The Demi-Gods and Semi-Devils".

Here, not only is the part-of-speech tagging problematic, but even the word segmentation is wrong. For example, "to Xiao Feng" and "to Tong Lao" come out as single tokens, which obviously needs to be corrected.
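A rough way to surface such tokens for review is sketched below. It assumes the faulty tokens are `nr`-tagged names with a leading function character attached; the character set `LEADING_FUNCTION_CHARS` and the helper `suspicious_names` are hypothetical and chosen only for illustration. Some of these characters are also real surnames, so the output is a list of candidates to check by hand, not a list of confirmed errors.

```python
# Hypothetical set of characters that often precede a name rather than start one.
LEADING_FUNCTION_CHARS = set("对向和与跟被把")


def suspicious_names(pairs, leading=LEADING_FUNCTION_CHARS):
    """Return distinct nr-tagged tokens whose first character looks like a function word."""
    seen = []
    for word, flag in pairs:
        if flag == "nr" and len(word) > 1 and word[0] in leading and word not in seen:
            seen.append(word)  # keep first-seen order, skip duplicates
    return seen
```

The input is the same kind of `(word, flag)` sequence that `jieba.posseg.cut` yields once each pair is unpacked.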

In the next article, we will introduce a custom dictionary to achieve the same function and optimize the dictionary.

This post is from Embedded System

Latest reply published on 2021-9-6 10:07

Part-of-speech tagging, with support for marking different parts of speech: this function is very powerful.

After watching a section, I felt like I was taking a Chinese class.