Resource alert! Someone has collected 40 Chinese NLP lexicons and put them on GitHub
Compiled and edited by Qian Ming
Produced by QbitAI (量子位) | WeChat official account: QbitAI
Still struggling to find word lists for Chinese NLP?
Still scratching your head over how to extract structured information from text?
There is now some relief for both problems.
Recently, someone published a collection on GitHub that brings together 40 Chinese NLP lexicons and tools covering a wide range of tasks:
Chinese and English sensitive words, language detection, location/carrier lookup for Chinese and foreign mobile and landline numbers, gender inference from names, mobile phone number extraction, ID card number extraction, email extraction, a Chinese and Japanese personal name database, a Chinese abbreviation database, and a character decomposition dictionary.
Word sentiment values, stop words, a reactionary word list, a violence and terrorism word list, Traditional/Simplified Chinese conversion, approximating Chinese pronunciation with English, a Wang Feng lyrics generator, an occupation name lexicon, a synonym lexicon, and an antonym lexicon.
A negation word lexicon, a car brand lexicon, a car parts lexicon, segmentation of run-together English text, various Chinese word vectors, a company name dictionary, a classical poetry lexicon, an IT lexicon, a finance lexicon, and an idiom lexicon.
A place name lexicon, a historical figures lexicon, a poetry lexicon, a medical lexicon, a food lexicon, a legal lexicon, an automobile lexicon, an animal lexicon, a Chinese chat corpus, and Chinese rumor data.
At the time of writing, the repository has over 700 stars on GitHub.
The person who compiled it goes by "Yang" on GitHub, and his profile says he holds a PhD from Beihang University. He also runs a Zhihu column with short pieces on machine learning.
Yang provides not only the lexicons themselves but also usage instructions for 32 of them.
For example, filtering sensitive words in Chinese and English:
>>> f = DFAFilter()
>>> f.add("sexy")
>>> f.filter("hello sexy baby")
hello **** baby
He also adds notes to some entries. For the sensitive-word list, the note reads:
Sensitive words include political terms, profanity, and other topical words. The filter works mainly by dictionary lookup (the keyword files are in the project), and the contents are pretty explosive...
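To make the dictionary-lookup idea in that note concrete, here is a minimal sketch of such a filter. It is not the project's DFAFilter code: the class name `TrieFilter`, the `mask` parameter, and the matching rules (case-insensitive, longest match wins) are assumptions made for illustration. Keywords are stored in a character tree, and the text is scanned once from left to right.

```python
_END = object()  # sentinel key marking "a keyword ends at this node"


class TrieFilter:
    """Toy dictionary-based filter (hypothetical, not the project's DFAFilter)."""

    def __init__(self):
        self.root = {}

    def add(self, keyword):
        # Walk/extend the character tree, one node per character.
        node = self.root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = True

    def filter(self, text, mask="*"):
        out = []
        i = 0
        while i < len(text):
            node = self.root
            longest = 0  # length of the longest keyword starting at position i
            j = i
            while j < len(text) and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                if _END in node:
                    longest = j - i
            if longest:
                out.append(mask * longest)  # replace the matched keyword
                i += longest
            else:
                out.append(text[i])  # no keyword starts here, keep the character
                i += 1
        return "".join(out)


f = TrieFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))  # hello **** baby
```

Because matching is per character, the same sketch handles Chinese keywords without any word segmentation.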
Here is another example, guessing gender from a name:
$ pip install ngender  # probabilities are computed with Naive Bayes
>>> import ngender
>>> ngender.guess('赵本山')
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')
('female', 0.9759486128949907)
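For readers wondering what "computed with Naive Bayes" can look like here, the following is a rough sketch of a character-level Naive Bayes classifier over given names. It is not ngender's implementation: the functions `train` and `guess`, the add-one smoothing, and above all the tiny hand-made training list are assumptions for illustration only, so the probabilities it produces mean nothing outside this toy setting.

```python
import math
from collections import Counter

# Hand-made toy training data (an assumption for illustration, not real statistics).
TOY_SAMPLES = [
    ("伟", "male"), ("强", "male"), ("建军", "male"),
    ("丹丹", "female"), ("丽", "female"), ("秀英", "female"),
]


def train(samples):
    """Count, per gender, how many names were seen and how often each character occurs."""
    char_counts = {"male": Counter(), "female": Counter()}
    name_counts = Counter()
    for given_name, gender in samples:
        name_counts[gender] += 1
        char_counts[gender].update(given_name)
    return char_counts, name_counts


def guess(given_name, model):
    """Return (gender, probability), treating characters as independent given the gender."""
    char_counts, name_counts = model
    vocab = set(char_counts["male"]) | set(char_counts["female"])
    total_names = sum(name_counts.values())
    log_scores = {}
    for gender in ("male", "female"):
        total_chars = sum(char_counts[gender].values())
        score = math.log(name_counts[gender] / total_names)  # log prior
        for ch in given_name:
            # Add-one smoothing so unseen characters do not zero out the product.
            score += math.log((char_counts[gender][ch] + 1) / (total_chars + len(vocab) + 1))
        log_scores[gender] = score
    # Turn log scores into a normalized probability for the winning gender.
    top = max(log_scores.values())
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


model = train(TOY_SAMPLES)
print(guess("丹", model))
print(guess("建军", model))
```

A real model would be trained on a large name list and would typically strip the surname first; the sketch only shows the shape of the calculation.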
The other 30 are omitted here. If you are interested, take a look at the repository and bookmark it for future reference.
Link:
https://github.com/fighting41love/funNLP
The author's Zhihu column:
https://zhuanlan.zhihu.com/yangyangfuture