[From Search Engine Direct]
With the rapid development of the Internet and the explosive growth of web information, finding what one needs online is like looking for a needle in a haystack. Search engine technology addresses exactly this problem by providing information retrieval services to users, and it has accordingly become a focus of research and development in both the computer industry and academia.

Search engines have been developed gradually since 1995, as web information grew rapidly. According to the article "Accessibility of Web Information" published in Science magazine in July 1999, there were more than 800 million web pages in the world, holding over 9 TB of useful data, and the total was still doubling every 4 months. Left to search such a vast ocean of information unaided, users would inevitably come away empty-handed. Search engines were created to solve this "lost in information" problem: following certain strategies, they collect and discover information on the Internet; understand, extract, organize, and process it; and provide retrieval services to users, thereby serving the purpose of information navigation. This navigation service has become one of the most important network services on the Internet, and search engine sites are accordingly known as "web portals." This article gives a brief introduction to the key technologies of search engines, as a starting point for further discussion.

1. Classification

By how they collect information and how they provide service, search engine systems fall into three categories:

1. Directory search engines: Information is collected manually or semi-automatically; after editors review it, they write summaries by hand and place the entries into a predetermined classification scheme. The information is mostly organized at the level of whole websites, offered through directory browsing and direct lookup. Because human intelligence goes into them, their information is accurate and their navigation quality high; the drawbacks are the manual intervention and heavy maintenance they require, the small volume of information, and slow updates. Representatives of this type are Yahoo, LookSmart, Open Directory, Go Guide, etc.

2. Robot search engines: A robot program called a spider automatically collects and discovers information on the Internet according to some strategy; an indexer builds an index over the collected information; and a retriever searches the index against the user's query input and returns the results to the user. The service takes the form of full-text search over web pages. The advantages are a large volume of information, timely updates, and no need for human intervention; the disadvantages are that too much is returned, much of it irrelevant, so users must filter the results themselves. Representatives of this type are AltaVista, Northern Light, Excite, Infoseek, Inktomi, FAST, Lycos, and Google; Chinese representatives include "Skynet" (Tianwang), Youyou, and OpenFind.

3. Meta search engines: This type of engine keeps no data of its own; it submits the user's query to several search engines at once, removes duplicates from the returned results, re-sorts them, and returns them to the user as its own results (the merging step is sketched below). The service again takes the form of full-text search over web pages. The advantage is that more, and more complete, results come back; the disadvantages are that the features of the underlying engines cannot be fully exploited and users still have more filtering to do. Representatives of this type are WebCrawler, InfoMarket, etc.
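To make the merging step concrete, here is a minimal sketch in Python. It is illustrative only: the engine result lists are invented, and the Borda-style scoring rule is just one simple choice of re-sorting scheme, not the method of any particular meta search engine. Duplicates are detected by URL and their scores accumulated.

```python
# Minimal sketch of the merging step in a meta search engine.
# The engines, results, and scoring scheme here are illustrative
# assumptions, not any real engine's data or API.

def merge_results(result_lists):
    """Deduplicate by URL and re-rank results returned by several engines.

    result_lists: one ranked result list per engine;
    each result is a (url, title) pair in rank order.
    """
    merged = {}
    for results in result_lists:
        for rank, (url, title) in enumerate(results):
            # Borda-style score: a hit near the top of any list earns more.
            score = len(results) - rank
            if url in merged:
                # Duplicate across engines: accumulate its score.
                merged[url] = (merged[url][0] + score, title)
            else:
                merged[url] = (score, title)
    # Sort by combined score, best first, then drop the scores.
    ranked = sorted(merged.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(url, title) for url, (score, title) in ranked]

if __name__ == "__main__":
    engine_a = [("http://a.example", "Page A"), ("http://b.example", "Page B")]
    engine_b = [("http://b.example", "Page B"), ("http://c.example", "Page C")]
    for url, title in merge_results([engine_a, engine_b]):
        print(url, title)
```

A page found by both engines (Page B above) outranks pages found by only one, which is the intuition behind re-sorting merged results.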
2. Performance indicators

Web search can be viewed as an information retrieval problem: retrieving, from a document collection made up of web pages, the documents relevant to a user's query. A search engine can therefore be measured with the traditional performance parameters of information retrieval systems: recall and precision.

Recall is the ratio of the number of relevant documents retrieved to the number of relevant documents in the whole collection; it measures how completely the system finds relevant material. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how accurate the retrieved set is. For any retrieval system the two cannot both be maximized: raising recall tends to lower precision, and raising precision tends to lower recall. For this reason the precision averaged over eleven fixed recall levels, the 11-point average precision, is commonly used to characterize a system's precision. For search engine systems in particular, recall is difficult to compute, since no engine can collect every web page; current systems therefore concentrate on precision.

Many factors affect the performance of a search engine system, the most important being its information retrieval model: how documents and queries are represented, the matching strategy used to judge the relevance of a document to a query, the method of ranking query results, and the mechanism by which users give relevance feedback.
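As a concrete illustration of these measures, the sketch below computes the recall/precision point after each position in a ranked result list for a single query, then derives the 11-point average precision, interpolating each recall level with the best precision achieved at or beyond it. The ranked list and the relevant-document set are made-up example data; as noted above, the full relevant set is in practice unknowable for the Web.

```python
# Illustrative computation of recall, precision, and 11-point average
# precision for one query. The ranked list and relevance judgments are
# made-up example data.

def eleven_point_average_precision(ranked, relevant):
    """ranked: document ids in the order the engine returned them.
    relevant: set of ids of all relevant documents in the collection."""
    # Precision/recall after each position in the ranking.
    points = []
    hits = 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0:
    # the best precision achieved at that recall level or beyond.
    total = 0.0
    for level in (i / 10 for i in range(11)):
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2", "d5"}
    print(round(eleven_point_average_precision(ranked, relevant), 3))
```

Because one relevant document (d5) is never retrieved, recall never reaches the higher levels and those levels contribute zero precision, pulling the average down; this is how the metric captures the recall/precision trade-off in a single number.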
3. Main technologies

A search engine system consists of four parts: the searcher, the indexer, the retriever, and the user interface.

1. Searcher: The searcher (the spider or robot) roams the Internet, discovering and collecting information. It is typically a program that runs around the clock and must gather as many new documents of as many types as possible, as quickly as possible. Because information on the Internet changes rapidly, it must also revisit pages already collected, refreshing stale content and weeding out dead and invalid links. Two collection strategies are in current use:

● Start from a set of seed URLs and follow the hyperlinks found there, discovering information across the Internet breadth-first, depth-first, or heuristically. The seed URLs can in principle be arbitrary, but are usually popular, link-rich sites (such as Yahoo!); a sketch of the breadth-first variant follows this section.

● Partition the Web space by domain name, IP address, or country-code domain, and make each searcher responsible for exhaustively covering one subspace.

Searchers collect information of many types, including HTML, XML, newsgroup articles, FTP files, word-processing documents, and multimedia. Implementations often use distributed and parallel computing to speed up discovery and updating; commercial search engines can discover millions of web pages every day.
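The first strategy, breadth-first traversal from seed URLs, might look like the following minimal sketch, using only the Python standard library. The seed URL and page limit are placeholders; a real spider would add robots.txt handling, politeness delays, parallel fetchers, and persistent storage of its frontier. This sketch shows only the traversal itself.

```python
# Sketch of breadth-first crawling from a set of seed URLs, using only
# the Python standard library. The seed URL and page limit below are
# placeholders, not real crawl parameters.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    """Visit pages breadth-first, following hyperlinks out from the seeds."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # dead or invalid link: skip it
        max_pages -= 1
        print("fetched:", url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # enqueue: breadth-first order

if __name__ == "__main__":
    crawl(["http://example.com/"], max_pages=3)
```

Swapping the queue (deque) for a stack would give depth-first traversal; a heuristic crawler would instead pop whichever frontier URL scores highest under some priority function.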
2. Indexer: The indexer's job is to understand the information the searcher has collected, extract index terms from it, use those terms to represent the documents, and generate the index tables for the document collection.

Index terms are of two kinds: objective terms and content terms. Objective terms are independent of a document's semantic content, such as author name, URL, update time, encoding, length, and link popularity. Content terms reflect what the document is about, such as keywords and their weights, phrases, and single words. Content terms divide further into single terms and multi-word (phrase) terms. For English, single terms are simply English words, easy to extract because spaces act as natural separators between words; for Chinese and other continuously written languages, word segmentation must be performed first.

A search engine generally assigns each single index term a weight indicating how well the term discriminates the document, and the weights also enter into computing the relevance of query results. Weights are usually derived by statistical, information-theoretic, or probabilistic methods; phrase terms are extracted by statistical, probabilistic, or linguistic methods.
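As one example of the statistical weighting methods mentioned, the sketch below computes tf-idf weights: a term's weight grows with its frequency in a document and shrinks with the number of documents containing it, so a term that appears everywhere gets weight zero, i.e. no discriminating power. The toy documents are invented, and tf-idf stands in here for whichever statistical method a given engine actually uses. Tokenization is a naive whitespace split, which suffices for English; Chinese text would need a word-segmentation step first.

```python
# Sketch of statistical index-term weighting with tf-idf. The documents
# are toy data; tf-idf is one common scheme of the statistical kind the
# text mentions, not the method of any specific search engine.
import math
from collections import Counter

def tfidf_index(docs):
    """Return, per document, a dict mapping each term to its tf-idf weight."""
    # Tokenize naively on whitespace; Chinese text would need word
    # segmentation here instead, since it has no natural separators.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    index = []
    for tokens in tokenized:
        tf = Counter(tokens)
        index.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return index

if __name__ == "__main__":
    docs = ["search engines index the web",
            "web spiders collect pages",
            "users query the index"]
    for weights in tfidf_index(docs):
        print({t: round(w, 3) for t, w in sorted(weights.items())})
```

In the toy output, a word confined to one document (such as "spiders") receives the highest weight, while a word present in every document would receive zero, matching the idea that a good index term distinguishes its document from the rest of the collection.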