Design of Crawler Based on HTML Parser Information Extraction

2013-09-22

Document Introduction

Whether for general or vertical search, the design of the web crawler is one of the core technologies. This paper applies the HTMLParser information extraction method to a detailed study of web crawlers for life-related vertical search engines. By analyzing in depth the tree structure of life-related website URLs, a simulation searcher was developed to collect seed page URLs; target URLs related to life-related topics were then extracted from the seed pages using the HTMLParser information extraction method. Experimental tests show that the crawler's accuracy rate reached 93.552% and its completeness rate reached 96.720%, indicating that the crawler is effective and meets the requirements of medium-sized, enterprise-level vertical search applications.

Keywords: web crawler; vertical search engine; HTMLParser

Abstract: Whether for a general search engine or a vertical search engine, the design of the web crawler is the core technology. In this article, a novel life-theme web crawler system based on HTMLParser information extraction is studied thoroughly. In this system, a simulation searcher is designed to collect seed URLs by analyzing the tree structure of life-theme websites; then, based on a discussion of HTMLParser information extraction, the target URLs that relate to the life theme are extracted from the seed pages. Empirical studies show Precision = 93.552% and Recall = 96.720%, demonstrating the system's effectiveness and meeting the requirements for general enterprise-level applications of vertical search engines.

Key words: web crawler; vertical search engine; HTMLParser
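The abstract describes two steps that a short sketch may make more concrete: extracting topic-related target URLs from a seed page, and scoring the result with precision and recall. The example below is only an illustration under stated assumptions, not the paper's implementation. It uses Python's standard-library `html.parser.HTMLParser` as a stand-in for the HTMLParser library named in the paper, and the keyword-matching rule, the class name `LinkExtractor`, and the functions `extract_target_urls` and `precision_recall` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the seed page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_target_urls(seed_html, base_url, keywords):
    """Keep links whose URL or anchor text contains a topic keyword.

    The keyword rule is an assumption made for illustration; the paper
    does not state its filtering rule in the abstract.
    """
    parser = LinkExtractor(base_url)
    parser.feed(seed_html)
    parser.close()
    targets = []
    for url, text in parser.links:
        haystack = (url + " " + text).lower()
        if any(k in haystack for k in keywords) and url not in targets:
            targets.append(url)
    return targets


def precision_recall(extracted, relevant):
    """Precision = hits / extracted, recall = hits / relevant."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Given a hand-labeled set of relevant URLs for a seed page, `precision_recall(extract_target_urls(html, base_url, keywords), relevant)` yields the two measures the abstract reports; the figures of 93.552% and 96.720% come from the paper's own experiments and are not reproduced by this sketch.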
