About Search Engines - Part II
Published: Friday, August 20, 2004
Indexing the Web Content
Much like the index of a book, a search engine builds a catalog of the words that appear on each web page, along with details such as the number of times each word appears on that page. Indexing web content is a challenging task: assuming an average of 1000 words per page and billions of pages, the index has to cover trillions of word occurrences. Because indexes are used for keyword searches, they have to be kept in computer memory to provide quick access to the search results.
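The catalog described above is commonly called an inverted index. The following is a minimal sketch of the idea, not the structure any particular search engine uses; the function name `build_index` and the sample pages are made up for illustration.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each word to a dict of {page_id: number of occurrences}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Count how many times each word appears on this page.
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            index[word][page_id] = count
    return dict(index)


# Hypothetical sample data, for illustration only.
pages = {
    "page1": "search engines index the web",
    "page2": "the web is large and the web grows",
}
index = build_index(pages)
print(index["web"])  # {'page1': 1, 'page2': 2}
```

Looking up a word is then a single dictionary access, which is why the index, rather than the raw pages, is consulted at search time.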
Indexing starts with parsing the website content. The parser extracts the relevant information from a web page by excluding common words (such as a, an, the, also known as stop words), HTML tags, JavaScript, and other invalid characters. A good parser can also eliminate content that recurs across a website's pages (such as navigation links) so that it is not counted as part of each page's content.
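The parsing step can be sketched with Python's standard `html.parser` module. This is a simplified illustration: the stop word list is a tiny assumed sample, the names `TextExtractor` and `extract_words` are hypothetical, and removing repeated navigation links is not attempted here.

```python
import re
from html.parser import HTMLParser

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping tags and script/style contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style block.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the lowercase words of a page, minus markup and stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Keep runs of letters and digits; punctuation and other characters drop out.
    words = re.findall(r"[a-z0-9]+", text)
    return [w for w in words if w not in STOP_WORDS]


sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>The Search Engine</h1><p>An index of the web.</p></body></html>"
)
print(extract_words(sample))  # ['search', 'engine', 'index', 'web']
```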
Once indexing is complete, the results are stored in memory in sorted order, which allows the information to be retrieved quickly. Indexes are updated periodically as new content is crawled. Some indexes also help build a dictionary (lexicon) of all the words that are available for searching. A lexicon can also be used to correct mistyped words by suggesting the corrected spelling in the search results. Part of a search engine's success lies in how its indexes are built and used. Various algorithms are used to optimize these indexes so that relevant results can be found without consuming much computing power.
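A sorted lexicon and a simple spelling suggestion can be sketched as follows. This assumes a plain sorted list searched with `bisect` and uses `difflib.get_close_matches` from the standard library as a stand-in for whatever correction method a real search engine uses; the helper names and the 0.7 similarity cutoff are illustrative choices.

```python
import bisect
import difflib


def build_lexicon(words):
    """Return the distinct words in sorted order."""
    return sorted(set(words))


def in_lexicon(lexicon, word):
    """Binary search: the sorted order makes lookups fast."""
    i = bisect.bisect_left(lexicon, word)
    return i < len(lexicon) and lexicon[i] == word


def suggest(lexicon, word):
    """Return the word itself if known, else the closest known word, else None."""
    if in_lexicon(lexicon, word):
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=0.7)
    return matches[0] if matches else None


# Hypothetical lexicon, for illustration only.
lexicon = build_lexicon(["search", "engine", "index", "web", "search"])
print(lexicon)                     # ['engine', 'index', 'search', 'web']
print(suggest(lexicon, "serach"))  # search
```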
Storing the Web Content
In addition to indexing the web content, the search engine also stores the individual pages in its database. Because disk storage has become cheaper, search engines have very large storage capacity, often running into terabytes of data. However, retrieving this data quickly and efficiently requires specialized, distributed, and scalable data storage. In practice, the amount of data a search engine can usefully store is limited by the amount of data it can retrieve for search results. At the time of writing, Google can index and store about 3 billion web documents, far more than any other search engine.
Search Algorithms and Results
Once a user enters search keywords, the search engine's search algorithm looks them up in the indexes. When it finds matches, the search engine tries to present the most relevant content first. This relevance matching is achieved by various search engine algorithms and is therefore the bread and butter of a search engine's popularity. Among all the search engines on the internet, Google stands out because it provides more relevant answers to search queries. The algorithms used to find the most relevant results in a haystack of web content differ from one search engine to another. That is why the same keywords produce different results on different search engines.
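The lookup step can be sketched as follows, assuming an index that maps each word to per-page occurrence counts. Ranking by raw occurrence counts is a deliberately naive relevance measure chosen for illustration; the function name `search` and the sample index are hypothetical.

```python
def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    terms = query.lower().split()
    if not terms:
        return []
    # One {page_id: count} mapping per query word; unknown words match nothing.
    postings = [index.get(term, {}) for term in terms]
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    # Score each page by the total number of occurrences of the query words.
    scored = [(sum(p[page] for p in postings), page) for page in matching]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [page for _, page in scored]


# Hypothetical index, for illustration only.
index = {
    "web": {"page1": 1, "page2": 2, "page3": 1},
    "search": {"page1": 3, "page3": 1},
    "engine": {"page1": 2},
}
print(search(index, "web search"))  # ['page1', 'page3']
```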
Advanced search engines, like Google, use a relevance ranking system in which each web page is ranked on various factors such as:
- Content analysis: The content of each web page is evaluated for the keywords based on the number of occurrences, their position in the page (such as the title, meta tags, or headings), font size, proximity to one another, etc.
- Linking structure: The links from external pages or websites to this page are analyzed for keywords in the link text. Links from a popular website also lead to a higher ranking.
- Page ranking: This is a relative ranking of a web page based on an algorithm that is used specifically by Google. The page rank denotes the ranking of a web page based on its popularity and the quality of its links, among various other factors. The basic idea is that a page with a higher page rank is easier to find on the internet.
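The idea that links from popular pages raise a page's ranking can be sketched with a simplified, textbook-style link ranking. This is not Google's actual algorithm, which depends on many other factors; the function name `link_rank`, the damping value of 0.85, the iteration count, and the sample link graph are all assumptions made for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Estimate page popularity from a {page: [pages it links to]} graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score to all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(links)
best = max(ranks, key=ranks.get)
print(best)  # c
```

In this graph, page c scores highest because three pages link to it, and page a scores above b and d because its only incoming link comes from the popular page c.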
Conclusion
The search results decide the fate of a search engine. Different search engines try to cater to different users. Ask Jeeves is popular because it provides search results for descriptive, question-like queries. Its engine is optimized to parse the user-friendly query for keywords, which are then used internally to perform the search. The user feels as if the question was processed by a human behind the computer. Search engine technology is evolving every day, and new research is being carried out to support more concept-based and descriptive search queries. However, the same rule applies: "The search engine which provides the most relevant results will rule".
Copyright © 2003-2008 WebsiteGear Inc. All rights reserved.