Abstract: Whether for general or vertical search engines, the design of the web crawler is a core technology. This paper studies in detail a life-theme web crawler for vertical search engines, built on HTMLParser information extraction. By analyzing the tree structure of life-theme website URLs, a simulation searcher is designed to collect seed page URLs; target URLs related to the life theme are then extracted from the seed pages using HTMLParser information extraction. Experiments show a precision of 93.552% and a recall of 96.720%, indicating that the crawler is effective and meets the requirements of medium-scale, enterprise-level vertical search applications.
Key words: web crawler; vertical search engine; HTMLParser
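The extraction step described in the abstract (pulling topic-related target URLs out of a seed page) can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not say which HTMLParser library or which relevance rule the authors use, so this sketch substitutes Python's standard-library `html.parser.HTMLParser` (a different library that happens to share the name) and a simple keyword match on anchor text. The names `TopicLinkExtractor` and `extract_target_urls`, the keywords, and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TopicLinkExtractor(HTMLParser):
    """Collect absolute URLs of <a> tags whose anchor text contains a keyword.

    Assumption: relevance is decided by keyword match on the anchor text.
    The paper's actual rule is not given in the abstract.
    """

    def __init__(self, base_url, keywords):
        super().__init__()
        self.base_url = base_url
        self.keywords = [k.lower() for k in keywords]
        self.targets = []   # extracted target URLs, in document order
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).lower()
            if any(k in text for k in self.keywords):
                # Resolve relative links against the seed page URL.
                self.targets.append(urljoin(self.base_url, self._href))
            self._href = None


def extract_target_urls(seed_html, base_url, keywords):
    """Return topic-related URLs found in one seed page (hypothetical helper)."""
    parser = TopicLinkExtractor(base_url, keywords)
    parser.feed(seed_html)
    parser.close()
    return parser.targets
```

For instance, calling `extract_target_urls(page_html, "http://example.com/", ["restaurant", "rent"])` on a seed page would keep only links whose visible text mentions one of those words. A production crawler would likely need a stronger relevance test than anchor-text keywords, plus URL de-duplication.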