How can image recognition, speech recognition, and text mining be used to identify pornography?
Leifeng.com: Competition in the market for AI-based pornography detection is becoming increasingly fierce. Teams such as Tupu Technology, Alibaba Green Network, and Tencent Wanxiang Youtu already hold a large share of the market. In this environment, many companies are trying to claim a share of this red ocean by offering more comprehensive, customized services.
So what do these more comprehensive, customized services look like? Leifeng.com interviewed Lei Zhen, CEO of Jijiyuan. Lei Zhen explained AI porn detection from three angles (image recognition, speech recognition, and text mining) and also went into some of the engineering details.
What aspects are generally considered when identifying pornographic content in live broadcasts?
Pornographic content is usually identified through a combination of video screenshots, image recognition, audio review, bullet-screen monitoring, keyword extraction, and similar techniques. Before the image recognition service is formally provided to a customer, the live broadcast platform is invited to run a trial, during which platform-specific feature data is collected, such as typical broadcast backgrounds, ambient light intensity, and topic content. This data is used to train a customized model, so each live broadcast platform receives an image recognition service tailored to it.
Reviewing live video content can proceed in the following steps:
- Detect whether human body features are present in the image and count the number of people.
- Identify the gender and age range of the people in the image.
- Identify skin tone and the degree of exposure of the limbs.
- Identify body contours and analyze movements.
- Beyond image recognition, extract key features from the audio to determine whether it contains sensitive information.
- Analyze the bullet-screen text in real time to determine whether the current video contains violations, and dynamically adjust the image capture frequency accordingly.
For image recognition, the customer can set the interval at which key frames are captured from the video, from 1 second to tens of seconds. For example, the default is to capture one key frame every 5 seconds for recognition, and the interval can be tightened dynamically to one frame per second when a suspected violation triggers an alarm.
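The dynamic capture interval described above can be sketched in a few lines of Python. This is only an illustration, not the vendor's implementation: the 5-second and 1-second values come from the interview, while the function names and the idea of a per-frame boolean alarm flag are assumptions made for the example.

```python
DEFAULT_INTERVAL_S = 5.0  # default from the text: one key frame every 5 seconds
ALERT_INTERVAL_S = 1.0    # tightened interval while a suspected alarm is active


def next_interval(suspected: bool) -> float:
    """Return how many seconds to wait before capturing the next key frame."""
    return ALERT_INTERVAL_S if suspected else DEFAULT_INTERVAL_S


def capture_times(alarm_flags, start=0.0):
    """Return the timestamps (in seconds) at which frames are captured.

    alarm_flags[i] is the hypothetical classifier verdict for frame i and
    decides how long to wait before frame i + 1 is captured.
    """
    times = [start]
    for suspected in alarm_flags[:-1]:
        times.append(times[-1] + next_interval(suspected))
    return times
```

For instance, if the second and third frames raise an alarm, `capture_times([False, True, True, False])` yields captures at 0, 5, 6, and 7 seconds: the sampler speeds up as soon as something suspicious is seen.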
You just mentioned audio key feature extraction. Can you elaborate on this?
Audio analysis mainly includes the following aspects:
- Use voiceprint recognition to verify that the host in the current live broadcast room is the registered host.
- Run a keyword search on the host's speech to check for banned or sensitive words.
- Examine specific continuous speech segments to see whether they contain harmful information.
- Count how often spoken advertisements are broadcast and analyze the effectiveness of the advertising.
Whether to use dual-channel detection on both video and audio is up to the customer. For live show broadcasts, image detection usually meets most needs, while audio detection is better suited to platforms whose content is mainly voice. Combining the two greatly improves recognition accuracy and reduces the false alarm rate, but the cost rises accordingly, so customers can choose according to their business needs.
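As a rough illustration of the keyword search step in the list above, the sketch below scans a transcript for banned terms. It assumes speech recognition has already turned the audio into text, and the word list is a made-up placeholder. The `\b` word-boundary matching suits space-delimited languages such as English; Chinese text would need word segmentation or plain substring matching instead.

```python
import re

# Hypothetical placeholder list; a real deployment would load a maintained lexicon.
BANNED_WORDS = {"bannedword", "sensitive phrase"}


def find_banned_words(transcript: str, banned=BANNED_WORDS):
    """Return the sorted banned terms that occur in the transcript as whole words or phrases."""
    text = transcript.lower()
    hits = []
    for term in banned:
        # re.escape keeps any punctuation in the term from acting as regex syntax.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            hits.append(term)
    return sorted(hits)
```

The same function could be applied to bullet-screen text, since that is already plain text and needs no speech recognition step.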
What are the current accuracy, false positive rate, and recall rate? Will manual review be performed?
Currently, the accuracy of pornographic image detection on live streaming platforms exceeds 99%, with a false alarm rate below 1%, and no more than 3% of cases require manual review by the customer. We do not usually provide manual review ourselves; instead, suspected images are flagged and the customer is prompted to review them. The manually reviewed data is then collected for iterative training, which continuously improves recognition accuracy.
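To make the three metrics in the question concrete, here is a minimal sketch that computes them from confusion-matrix counts. It assumes the conventional definitions (the interview does not say exactly how the figures were measured), treats "pornographic" as the positive class, and uses made-up counts in the usage note.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, false alarm rate, and recall from confusion-matrix counts.

    tp: pornographic images correctly flagged
    fp: normal images wrongly flagged (false alarms)
    tn: normal images correctly passed
    fn: pornographic images that were missed
    Assumes both classes are non-empty, so no denominator is zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_alarm_rate": fp / (fp + tn),
        "recall": tp / (tp + fn),
    }
```

With illustrative counts `detection_metrics(tp=95, fp=8, tn=892, fn=5)`, accuracy is 0.987 and recall is 0.95. Because normal images vastly outnumber pornographic ones on a live platform, a high accuracy alone says little about how many violations are missed, which is why recall is asked about separately.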
The real-time nature of live broadcasting demands very fast image recognition. Does that require very high computing power? What processing approach is used?
Live video streaming is highly real-time and places particularly high demands on server-side image recognition speed. Besides high bandwidth, the recognition servers need strong GPU computing capability, especially when deep learning algorithms are used for model training, where a powerful GPU cluster is indispensable. In addition, by exploiting the characteristics of the fully connected layer, the restriction on training image size is removed, which speeds up algorithm processing. When capturing video frames, the capture frequency can also be adjusted dynamically: normally one frame is taken every few seconds, and when sensitive content appears the capture frequency is increased so that pornographic content is identified and an alarm raised more promptly.
How much data is needed for model training? What factors generally affect the accuracy of identification?
Taking Jijiyuan as an example, the base data set contains tens of millions of images, and 20,000 positive and negative sample images of various types are added every day for iterative training, which continuously fine-tunes and improves recognition accuracy. The base model is retrained once a week, and incremental model training is run every 1-2 days.
As for what affects identification accuracy, the main factor is insufficient data. When the samples do not fully cover the application scenarios, the trained model produces false positives, missed detections, or misclassifications. As deep learning algorithms mature, the diversity and quality of data sources have become the top priority in model construction.
In addition, hosts may deliberately try to evade detection, for example by covering sensitive areas or using picture-in-picture, which also affects the machine's recognition and judgment to some extent.
Can the machine automatically take action, such as blocking, deleting, or banning?
The pornographic image detection service is deployed in the cloud and has no network path to the customer's live broadcast room management system, so it cannot automatically block, delete, or pause a live broadcast room. However, if the customer chooses a private cloud deployment and authorizes the recognition server to access the management system, pornographic live broadcast rooms can be deleted or suspended automatically.
How much does intelligent porn detection reduce costs compared with manual review?
Take a small or medium-sized live broadcast platform with 100,000 hours of live broadcast per month as an example. With traditional manual content review, a 100-person content management team costs around 800,000 yuan per month. With AI-based content monitoring, the headcount can be cut to about 10 people and the total investment is only 100,000 to 200,000 yuan, which greatly reduces labor costs and management overhead. There are also savings on monitoring equipment, office space, and so on.
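The savings implied by these figures can be worked out directly. The sketch below uses only the numbers quoted above and assumes the 100,000 to 200,000 yuan figure is also a monthly cost, which the interview implies but does not state outright.

```python
MANUAL_COST_YUAN = 800_000               # monthly cost of the 100-person review team
AI_COST_RANGE_YUAN = (100_000, 200_000)  # total AI-assisted cost (assumed monthly)
LIVE_HOURS_PER_MONTH = 100_000


def savings_fraction(manual: float, automated: float) -> float:
    """Fraction of the manual cost that is saved by switching to the automated setup."""
    return (manual - automated) / manual


low_cost, high_cost = AI_COST_RANGE_YUAN
best_case = savings_fraction(MANUAL_COST_YUAN, low_cost)
worst_case = savings_fraction(MANUAL_COST_YUAN, high_cost)
manual_per_hour = MANUAL_COST_YUAN / LIVE_HOURS_PER_MONTH

print(f"Monthly savings: {worst_case:.1%} to {best_case:.1%}")
print(f"Manual review cost per live hour: {manual_per_hour:.0f} yuan")
```

On these numbers the reduction is between 75% and 87.5%, and the cost per hour of live content falls from 8 yuan to roughly 1 to 2 yuan.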
How do you draw the boundary between pornography and non-pornography?
First of all, building such a classification model involves manually annotating a large image data set, which introduces some subjective judgment error, but the labels stay within the range of what the general public would accept. Besides the pornographic and normal categories, there is also a third category, suspected or sexy; images are assigned to a category based on the score the model produces.
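One common way to realize a three-way split like this is to threshold a single model score. The sketch below assumes the model outputs a pornography score between 0 and 1; both thresholds are hypothetical values chosen for illustration, not figures from the interview.

```python
SUSPECT_THRESHOLD = 0.5  # hypothetical: at or above this, send to manual review
PORN_THRESHOLD = 0.9     # hypothetical: at or above this, treat as pornographic


def label_from_score(score: float) -> str:
    """Map a model score in [0, 1] to one of the three categories."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score >= PORN_THRESHOLD:
        return "pornographic"
    if score >= SUSPECT_THRESHOLD:
        return "suspected"
    return "normal"
```

Moving the two thresholds trades off false alarms against missed detections, and the width of the band between them controls how many images end up in the suspected category for a human to review.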