[From Search Engine Direct]
With the rapid development of the Internet and the explosive growth of web information, finding what one needs online is like looking for a needle in a haystack. Search engine technology addresses exactly this problem by providing information retrieval services to users, and it has accordingly become a focus of research and development in both the computer industry and academia.

Search engines have been developed gradually since 1995, as web information grew rapidly. According to the article "Accessibility of Web Information" published in Science magazine in July 1999, there were more than 800 million web pages in the world, holding over 9 TB of useful data, and the total was still doubling every 4 months. Left to search such a vast ocean of information unaided, users would inevitably come away empty-handed. Search engines were created to solve this "lost in information" problem: following certain strategies, they collect and discover information on the Internet; understand, extract, organize, and process it; and provide retrieval services to users, thereby serving the purpose of information navigation. This navigation service has become one of the most important network services on the Internet, and search engine sites are accordingly known as "web portals." This article gives a brief introduction to the key technologies of search engines, as a starting point for further discussion.

1. Classification

By how they collect information and how they provide service, search engine systems fall into three categories:

1. Directory search engines: Information is collected manually or semi-automatically; after editors review it, they write summaries by hand and place the entries into a predetermined classification scheme. The information is mostly organized at the level of whole websites, offered through directory browsing and direct lookup. Because human intelligence goes into them, their information is accurate and their navigation quality high; the drawbacks are the manual intervention and heavy maintenance they require, the small volume of information, and slow updates. Representatives of this type are Yahoo, LookSmart, Open Directory, Go Guide, etc.

2. Robot search engines: A robot program called a spider automatically collects and discovers information on the Internet according to some strategy; an indexer builds an index over the collected information; and a retriever searches the index against the user's query input and returns the results to the user. The service takes the form of full-text search over web pages. The advantages are a large volume of information, timely updates, and no need for human intervention; the disadvantages are that too much is returned, much of it irrelevant, so users must filter the results themselves. Representatives of this type are AltaVista, Northern Light, Excite, Infoseek, Inktomi, FAST, Lycos, and Google; Chinese representatives include "Skynet" (Tianwang), Youyou, and OpenFind.

3. Meta search engines: This type of engine keeps no data of its own; it submits the user's query to several search engines at once, removes duplicates from the returned results, re-sorts them, and returns them to the user as its own results (the merging step is sketched below). The service again takes the form of full-text search over web pages. The advantage is that more, and more complete, results come back; the disadvantages are that the features of the underlying engines cannot be fully exploited and users still have more filtering to do. Representatives of this type are WebCrawler, InfoMarket, etc.
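To make the merging step concrete, here is a minimal sketch in Python. It is illustrative only: the engine result lists are invented, and the Borda-style scoring rule is just one simple choice of re-sorting scheme, not the method of any particular meta search engine. Duplicates are detected by URL and their scores accumulated.

```python
# Minimal sketch of the merging step in a meta search engine.
# The engines, results, and scoring scheme here are illustrative
# assumptions, not any real engine's data or API.

def merge_results(result_lists):
    """Deduplicate by URL and re-rank results returned by several engines.

    result_lists: one ranked result list per engine;
    each result is a (url, title) pair in rank order.
    """
    merged = {}
    for results in result_lists:
        for rank, (url, title) in enumerate(results):
            # Borda-style score: a hit near the top of any list earns more.
            score = len(results) - rank
            if url in merged:
                # Duplicate across engines: accumulate its score.
                merged[url] = (merged[url][0] + score, title)
            else:
                merged[url] = (score, title)
    # Sort by combined score, best first, then drop the scores.
    ranked = sorted(merged.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(url, title) for url, (score, title) in ranked]

if __name__ == "__main__":
    engine_a = [("http://a.example", "Page A"), ("http://b.example", "Page B")]
    engine_b = [("http://b.example", "Page B"), ("http://c.example", "Page C")]
    for url, title in merge_results([engine_a, engine_b]):
        print(url, title)
```

A page found by both engines (Page B above) outranks pages found by only one, which is the intuition behind re-sorting merged results.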
2. Performance indicators

Web search can be viewed as an information retrieval problem: retrieving, from a document collection made up of web pages, the documents relevant to a user's query. A search engine can therefore be measured with the traditional performance parameters of information retrieval systems: recall and precision.

Recall is the ratio of the number of relevant documents retrieved to the number of relevant documents in the whole collection; it measures how completely the system finds relevant material. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved; it measures how accurate the retrieved set is. For any retrieval system the two cannot both be maximized: raising recall tends to lower precision, and raising precision tends to lower recall. For this reason the precision averaged over eleven fixed recall levels, the 11-point average precision, is commonly used to characterize a system's precision. For search engine systems in particular, recall is difficult to compute, since no engine can collect every web page; current systems therefore concentrate on precision.

Many factors affect the performance of a search engine system, the most important being its information retrieval model: how documents and queries are represented, the matching strategy used to judge the relevance of a document to a query, the method of ranking query results, and the mechanism by which users give relevance feedback.
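As a concrete illustration of these measures, the sketch below computes the recall/precision point after each position in a ranked result list for a single query, then derives the 11-point average precision, interpolating each recall level with the best precision achieved at or beyond it. The ranked list and the relevant-document set are made-up example data; as noted above, the full relevant set is in practice unknowable for the Web.

```python
# Illustrative computation of recall, precision, and 11-point average
# precision for one query. The ranked list and relevance judgments are
# made-up example data.

def eleven_point_average_precision(ranked, relevant):
    """ranked: document ids in the order the engine returned them.
    relevant: set of ids of all relevant documents in the collection."""
    # Precision/recall after each position in the ranking.
    points = []
    hits = 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))  # (recall, precision)
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0:
    # the best precision achieved at that recall level or beyond.
    total = 0.0
    for level in (i / 10 for i in range(11)):
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2", "d5"}
    print(round(eleven_point_average_precision(ranked, relevant), 3))
```

Because one relevant document (d5) is never retrieved, recall never reaches the higher levels and those levels contribute zero precision, pulling the average down; this is how the metric captures the recall/precision trade-off in a single number.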
3. Main technologies

A search engine system consists of four parts: the searcher, the indexer, the retriever, and the user interface.

1. Searcher: The searcher (the spider or robot) roams the Internet, discovering and collecting information. It is typically a program that runs around the clock and must gather as many new documents of as many types as possible, as quickly as possible. Because information on the Internet changes rapidly, it must also revisit pages already collected, refreshing stale content and weeding out dead and invalid links. Two collection strategies are in current use:

● Start from a set of seed URLs and follow the hyperlinks found there, discovering information across the Internet breadth-first, depth-first, or heuristically. The seed URLs can in principle be arbitrary, but are usually popular, link-rich sites (such as Yahoo!); a sketch of the breadth-first variant follows this section.

● Partition the Web space by domain name, IP address, or country-code domain, and make each searcher responsible for exhaustively covering one subspace.

Searchers collect information of many types, including HTML, XML, newsgroup articles, FTP files, word-processing documents, and multimedia. Implementations often use distributed and parallel computing to speed up discovery and updating; commercial search engines can discover millions of web pages every day.
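The first strategy, breadth-first traversal from seed URLs, might look like the following minimal sketch, using only the Python standard library. The seed URL and page limit are placeholders; a real spider would add robots.txt handling, politeness delays, parallel fetchers, and persistent storage of its frontier. This sketch shows only the traversal itself.

```python
# Sketch of breadth-first crawling from a set of seed URLs, using only
# the Python standard library. The seed URL and page limit below are
# placeholders, not real crawl parameters.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    """Visit pages breadth-first, following hyperlinks out from the seeds."""
    frontier = deque(seeds)
    seen = set(seeds)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # dead or invalid link: skip it
        max_pages -= 1
        print("fetched:", url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # enqueue: breadth-first order

if __name__ == "__main__":
    crawl(["http://example.com/"], max_pages=3)
```

Swapping the queue (deque) for a stack would give depth-first traversal; a heuristic crawler would instead pop whichever frontier URL scores highest under some priority function.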
2. Indexer: The indexer's job is to understand the information the searcher has collected, extract index terms from it, use those terms to represent the documents, and generate the index tables for the document collection.

Index terms are of two kinds: objective terms and content terms. Objective terms are independent of a document's semantic content, such as author name, URL, update time, encoding, length, and link popularity. Content terms reflect what the document is about, such as keywords and their weights, phrases, and single words. Content terms divide further into single terms and multi-word (phrase) terms. For English, single terms are simply English words, easy to extract because spaces act as natural separators between words; for Chinese and other continuously written languages, word segmentation must be performed first.

A search engine generally assigns each single index term a weight indicating how well the term discriminates the document, and the weights also enter into computing the relevance of query results. Weights are usually derived by statistical, information-theoretic, or probabilistic methods; phrase terms are extracted by statistical, probabilistic, or linguistic methods.
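As one example of the statistical weighting methods mentioned, the sketch below computes tf-idf weights: a term's weight grows with its frequency in a document and shrinks with the number of documents containing it, so a term that appears everywhere gets weight zero, i.e. no discriminating power. The toy documents are invented, and tf-idf stands in here for whichever statistical method a given engine actually uses. Tokenization is a naive whitespace split, which suffices for English; Chinese text would need a word-segmentation step first.

```python
# Sketch of statistical index-term weighting with tf-idf. The documents
# are toy data; tf-idf is one common scheme of the statistical kind the
# text mentions, not the method of any specific search engine.
import math
from collections import Counter

def tfidf_index(docs):
    """Return, per document, a dict mapping each term to its tf-idf weight."""
    # Tokenize naively on whitespace; Chinese text would need word
    # segmentation here instead, since it has no natural separators.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    index = []
    for tokens in tokenized:
        tf = Counter(tokens)
        index.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return index

if __name__ == "__main__":
    docs = ["search engines index the web",
            "web spiders collect pages",
            "users query the index"]
    for weights in tfidf_index(docs):
        print({t: round(w, 3) for t, w in sorted(weights.items())})
```

In the toy output, a word confined to one document (such as "spiders") receives the highest weight, while a word present in every document would receive zero, matching the idea that a good index term distinguishes its document from the rest of the collection.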