Research and Design of Intelligent WEB Information Extraction System

2013-09-19

Document Introduction

XML has become the standard for publishing and exchanging data on the Web. Wrapper technology provides an important implementation step for data mining, and intelligent agent technology, with its intelligence and autonomy, plays an important role in controlling and coordinating the mining process. This paper combines these three standards and technologies and applies them to Web data mining. Drawing on the J2EE three-tier architecture, it presents an implementation scheme for an intelligent Web information extraction system and briefly explains how the system processes a user's mining request, which reflects the system's strong capabilities for intelligent understanding and generalization.
With the rapid development of Internet technology, the amount of information on the Internet has grown exponentially, creating an awkward situation: on the one hand, the volume of information is enormous; on the other hand, people must spend a great deal of time and energy to find the information they need. Mining and extracting information from this mass of data is therefore of great significance. This paper proposes a system that can automatically extract data from very large, data-intensive Web sites.
Popular e-commerce and finance sites, the sites of certain scientific organizations and associations, and news and entertainment sites not only hold large amounts of information but also update their data very quickly. Most of these sites consist of many HTML pages connected by complex hyperlinks. Users navigate them entirely by left-clicking and get what they click on (which is also the starting point of the current "network desktop environment"), so obtaining information is easy and fast. However, because of the presentation logic of the Web pages themselves and the complex links between pages, it is technically very difficult to build large-scale applications or systems on top of these information sources. So, can this problem be solved by changing how the pages are represented?
Several solutions have recently been proposed, mainly from the perspective of data mining. Long-term observation shows that many current Web sites contain large numbers of pages with very similar structures, and these sites are expected to keep this structure for some time. Building on this fact, researchers have validated the relevant techniques and proposed Web wrappers [1,2,3] and wrapper libraries [4] that can extract data from HTML pages. That is, given a set of Web pages that share a common template as input, one obtains a wrapper that can extract the core data from any page in that set. These results provide part of the solution adopted in this paper. The paper combines XML [5,6,8], wrapper, and intelligent agent [7] technologies and standards and applies them to data mining. It refines the multi-agent collaboration theory of [7] and provides an implementation scheme for intelligent data extraction.
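The wrapper idea described above (pages sharing a template go in, an extractor for their varying data comes out) can be illustrated with a minimal Python sketch. This is not the method of the paper or of [1,2,3]; it is a deliberately simplified illustration that assumes every page tokenizes to the same number of tokens, so that template and data positions line up one to one. The function names (`tokenize`, `induce_wrapper`, `extract`, `to_xml`), the `record`/`field` element names, and the sample pages are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A token is either one HTML tag or one run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def induce_wrapper(pages):
    """Return the token positions whose content differs between pages.

    Simplifying assumption (not from the paper): all pages yield the same
    number of tokens, so position i means the same thing on every page.
    """
    token_lists = [tokenize(p) for p in pages]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("pages do not share a fixed-length template")
    return [i for i, column in enumerate(zip(*token_lists))
            if len(set(column)) > 1]


def extract(wrapper, page):
    """Apply the wrapper: pull the data tokens out of a same-template page."""
    tokens = tokenize(page)
    return [tokens[i] for i in wrapper]


def to_xml(values):
    """Serialize extracted values as XML (hypothetical element names)."""
    record = ET.Element("record")
    for n, value in enumerate(values):
        ET.SubElement(record, "field", index=str(n)).text = value
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical training pages that share one template.
    pages = [
        "<html><body><h1>Blue Kettle</h1><p class='price'>19.90</p></body></html>",
        "<html><body><h1>Red Toaster</h1><p class='price'>34.50</p></body></html>",
    ]
    wrapper = induce_wrapper(pages)
    unseen = "<html><body><h1>Green Mixer</h1><p class='price'>58.00</p></body></html>"
    print(to_xml(extract(wrapper, unseen)))
```

Positions where all training pages agree are treated as template and positions where they differ as data. Real wrapper induction must also cope with optional and repeated elements, which this sketch does not attempt.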
