[Mil MYS-8MMX] Mil MYS-8MMQ6-8E2D-180-C Application 3 - NLP Part of Speech Analysis and Application
In the previous article, we used jieba for word segmentation. In this article, we continue with jieba and look at part-of-speech tagging.
Besides segmentation, jieba has another very important feature: part-of-speech tagging, which labels each segmented word with its word class. The parts of speech in modern Chinese are divided into four categories: content words, function words, interjections, and onomatopoeia.
Content words carry actual meaning and can serve independently as sentence components; that is, they have both lexical and grammatical meaning. They include substantives (nouns, numerals, and quantifiers), predicates (verbs and adjectives), modifiers (adverbs), and pronouns. The main function of pronouns is substitution: they can replace nouns, numerals, quantifiers, verbs, adjectives, and adverbs, and their grammatical function depends on what they replace.
Function words have no complete meaning of their own but carry grammatical meaning or function. They must be attached to content words or sentences to express that grammatical meaning, and they cannot form a sentence alone, serve as a sentence component alone, or be reduplicated. They include relational words (conjunctions and prepositions) and auxiliary words (particles and modal particles).
Onomatopoeia and interjections are neither content words nor function words; they are classified as special word classes. They usually have no structural relationship with the other words in a sentence.
In NLP, text can be analyzed not only by word segmentation but also by part-of-speech tagging. With jieba's default dictionary, the commonly used tags are:
x: punctuation mark
eng: English word
a: adjective
n: noun
nr: person name
ns: place name
nt: organization name
r: pronoun
t: time word
f: direction word
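As a rough illustration of how these tags are obtained, the sketch below wraps jieba's `jieba.posseg.cut`, which yields items with `word` and `flag` attributes. The helper names `TAG_NAMES`, `tag_text`, and `describe` are made up for this example and are not part of jieba; the table covers only the tags listed above, not the full tag set.

```python
# Readable descriptions for the jieba tags listed above (not the full tag set).
# TAG_NAMES, tag_text and describe are illustrative names, not jieba APIs.
TAG_NAMES = {
    "x": "punctuation mark",
    "eng": "English word",
    "a": "adjective",
    "n": "noun",
    "nr": "person name",
    "ns": "place name",
    "nt": "organization name",
    "r": "pronoun",
    "t": "time word",
    "f": "direction word",
}


def tag_text(text):
    """Yield (word, tag) pairs using jieba's default dictionary."""
    # Imported here so describe() can be used without jieba installed.
    import jieba.posseg as pseg

    for pair in pseg.cut(text):
        yield pair.word, pair.flag


def describe(pairs):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, TAG_NAMES.get(tag, "other")) for word, tag in pairs]
```

With jieba installed, `describe(tag_text(some_sentence))` returns one `(word, tag, description)` triple per segmented word; tags outside the table are reported as "other".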
Let's analyze War and Peace again and see how many personal names are mentioned in it. Names that appear fewer than 15 times are ignored.
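A minimal sketch of this kind of count is shown here; it is not necessarily the exact script used for the results in this article. It assumes the novel is stored as a UTF-8 text file and that personal names are the words jieba tags as `nr`. The function names `count_tagged` and `tag_file` are made up for this example.

```python
from collections import Counter


def count_tagged(pairs, tag="nr", min_count=15):
    """Count words carrying `tag`; keep those seen at least `min_count` times."""
    counts = Counter(word for word, flag in pairs if flag == tag)
    # most_common() sorts by frequency, highest first.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


def tag_file(path):
    """Yield (word, tag) pairs for a UTF-8 text file, one line at a time."""
    # Imported here so count_tagged() can be used without jieba installed.
    import jieba.posseg as pseg

    with open(path, encoding="utf-8") as f:
        for line in f:
            for pair in pseg.cut(line):
                yield pair.word, pair.flag
```

With jieba installed, `count_tagged(tag_file("war_and_peace.txt"))` returns the frequent names with their counts, most frequent first. The file name is a placeholder. Processing the file line by line keeps memory use low on the board, at the cost of never matching a word that is split across a line break.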
It can be seen that jieba's part-of-speech tagging is not particularly accurate: common nouns such as "army" and "marshal" are tagged as personal names.
Maybe that is because the text is a translated foreign work? Let's try a martial arts novel, such as "The Demi-Gods and Semi-Devils".
It can be seen that not only is the part-of-speech tagging problematic, but even the word segmentation is wrong. For example, "to Xiao Feng" and "to Tong Lao" come out as single words, which obviously needs to be corrected.
In the next article, we will look at introducing a custom dictionary to perform the same analysis, and at optimizing that dictionary.