Resource alert! Someone has collected 40 Chinese NLP lexicons and put them on GitHub

Latest update time: 2018-11-16
Compiled and edited by Qian Ming (乾明)
Produced by QbitAI (量子位) | WeChat official account: QbitAI

Are you still struggling to find lexicons for Chinese NLP?

Are you still racking your brain over how to extract structured information from text?

Now there is some relief for both problems.

Someone recently compiled a repository on GitHub that brings together 40 Chinese NLP lexicons and resources covering a wide range of areas:

Chinese and English sensitive words, language detection, location and carrier lookup for Chinese and foreign mobile and landline numbers, gender inference from names, mobile phone number extraction, ID card number extraction, email extraction, a Chinese and Japanese personal-name database, a Chinese abbreviation database, and a character decomposition dictionary.

Word sentiment scores, stop words, a reactionary word list, a violence and terrorism word list, traditional/simplified Chinese conversion, approximating Chinese pronunciation with English, a Wang Feng lyrics generator, an occupation name lexicon, a synonym lexicon, and an antonym lexicon.

A negation word list, a car brand lexicon, a car parts lexicon, segmentation of run-together English text, various Chinese word vectors, a company name dictionary, an ancient poetry lexicon, an IT lexicon, a finance lexicon, and an idiom lexicon.

A place name database, a historical celebrity database, a poetry database, a medical database, a food database, a legal database, an automobile database, an animal database, a Chinese chat corpus, and Chinese rumor data.

At the time of writing, the repository has more than 700 stars on GitHub.

The person who compiled it goes by the nickname "Yang" on GitHub, and his profile notes that he is a PhD from Beihang University. He also runs a column on Zhihu where he shares short pieces on machine learning.

Yang does not just list the lexicons; he also provides usage examples for 32 of them.

For example, filtering sensitive words in Chinese and English:

 >>> f = DFAFilter()
 >>> f.add("sexy")
 >>> f.filter("hello sexy baby")
 hello **** baby

For some lexicons he also adds a note. For this one, the note reads:

Sensitive words include political terms, profanity, and other topical words. The filter works mainly by dictionary lookup (using the keyword files in the project), and the content is quite explosive...
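To make the "dictionary lookup" idea concrete, here is a minimal sketch of a trie-based keyword filter. It is not the project's DFAFilter: the class name SimpleKeywordFilter and its internals are assumptions made up for illustration, and matching is case-sensitive to keep the example short.

```python
class SimpleKeywordFilter:
    """Hypothetical trie-based keyword filter (illustrative sketch only)."""

    _END = None  # sentinel key marking the end of a keyword; never equals a character

    def __init__(self):
        self.root = {}  # nested dicts: one level per character

    def add(self, keyword):
        """Insert a keyword into the trie, one character per level."""
        node = self.root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def filter(self, text, mask="*"):
        """Replace every longest keyword match in text with mask characters."""
        out = []
        i = 0
        while i < len(text):
            node = self.root
            match_len = 0
            j = i
            # Walk the trie as far as the text allows, remembering the longest hit.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if self._END in node:
                    match_len = j - i
            if match_len:
                out.append(mask * match_len)
                i += match_len
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


f = SimpleKeywordFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))
```

Because the trie is walked character by character, the same sketch works for Chinese keywords without any word segmentation.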

Here is another example, inferring gender from a name:

pip install ngender # Probability calculated based on Naive Bayes

>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
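The install comment above says the probability comes from Naive Bayes. As a rough illustration of that idea, the sketch below scores a given name character by character with Laplace smoothing. It is not ngender's actual implementation: the functions train and guess_gender are hypothetical, and the tiny training list is made up purely for demonstration, so its numbers mean nothing beyond this example.

```python
import math
from collections import Counter

# Toy, made-up (given name, label) pairs; NOT real statistics.
TOY_NAMES = [
    ("伟", "male"), ("强", "male"), ("建国", "male"), ("志强", "male"),
    ("芳", "female"), ("丽", "female"), ("秀英", "female"), ("丽芳", "female"),
]


def train(samples):
    """Count labels and per-label character frequencies."""
    label_counts = Counter()
    char_counts = {}
    vocab = set()
    for name, label in samples:
        label_counts[label] += 1
        counter = char_counts.setdefault(label, Counter())
        for ch in name:
            counter[ch] += 1
            vocab.add(ch)
    return label_counts, char_counts, vocab


def guess_gender(given_name, model):
    """Return (label, probability) using Naive Bayes over characters."""
    label_counts, char_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        counter = char_counts[label]
        # Laplace smoothing; the extra +1 reserves mass for unseen characters.
        denom = sum(counter.values()) + len(vocab) + 1
        score = math.log(n / total)  # log prior
        for ch in given_name:
            score += math.log((counter[ch] + 1) / denom)  # log likelihood
        log_scores[label] = score
    # Normalize the log scores into posterior probabilities.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / z


model = train(TOY_NAMES)
print(guess_gender("志强", model))
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters once the counts come from a real name corpus rather than a toy list.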

The remaining 30 are omitted here. If you are interested, take a look and bookmark the repository for future reference.

Link:
https://github.com/fighting41love/funNLP

The author’s Zhihu column:
https://zhuanlan.zhihu.com/yangyangfuture
