Whenever people go online, one of the first pages they are likely to open is www.google.com, the search engine. Search engines are highly popular among Internet users. Searching is one of the first activities people try when they start using the Internet, and most users quickly become comfortable with it. Users paint a very rosy picture of their online search experiences: they are happy with the results they find, and nearly all report that they usually succeed in finding what they are looking for. Searchers are also very trusting of search engines. The vast majority say that search engines are a fair and unbiased source of information, and they feel confident in their own searching skills.
Technically, a search engine is the software and algorithms used to perform a search for data based on given criteria. A search engine provides links to relevant information based on your query. Whenever one comes across something new and wants to learn about it, the easiest and most reliable first step is to turn to a search engine. However, most Internet users are naïve about search engines: how they work and which search engines are available.
How a search engine works
A search engine operates in the following order:
Web Crawling
Indexing
Storing
Searching
Crawling is the method of following links on the web to different websites and gathering the contents of these websites for storage in the search engine's databases. This is done by a web crawler (sometimes also known as a web spider or web robot), an automated web browser which follows every link it sees. Usually search engines crawl only a few (three or four) levels deep from the homepage of a website. The term deep crawl denotes that the crawler or spider can index pages that are many levels deep. Google is an example of a deep crawler. Crawlers or web robots follow guidelines specified for them by the website owner using the robots exclusion protocol (robots.txt). The robots.txt file specifies the files or folders that the owner does not want the crawler to index in its database.
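The crawling rules described above (follow links, stop after a few levels, respect robots.txt) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FAKE_WEB dictionary stands in for real pages so that nothing is fetched over the network, and the names crawl, build_robots and the "SketchBot" user agent are hypothetical. The robots.txt rules are parsed with the standard library's urllib.robotparser.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
    "http://example.com/private/x": [],
}

def build_robots(lines):
    """Parse robots.txt rules given as a list of lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser

def crawl(start_url, robots, max_depth, web=FAKE_WEB, agent="SketchBot"):
    """Breadth-first crawl that stops max_depth levels below the start page."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # the site owner excluded this path
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

robots = build_robots(["User-agent: *", "Disallow: /private/"])
shallow = crawl("http://example.com/", robots, max_depth=2)
deep = crawl("http://example.com/", robots, max_depth=3)
```

With a limit of two levels the sketch visits the homepage, /a and /b; raising the limit to three also reaches /c, which is the difference between a shallow and a deeper crawl. The /private/x page is skipped in both cases because of the Disallow rule.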
The contents of each page are then analyzed to determine how the page should be indexed. Similar to the index of a book, a search engine extracts and builds a catalog of all the words that appear on each web page, along with details such as the number of times each word appears on that page. Indexes are used for searching by keywords; therefore, they have to be stored in the memory of computers to provide quick access to the search results.
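A catalog of words and per-page counts of this kind is commonly called an inverted index. The following is a minimal sketch, assuming pages are plain strings keyed by a hypothetical page identifier; the function names build_index and search, and the ranking by total term count, are illustrative choices rather than how any particular search engine works.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {page_id: number of times the word appears on that page}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][page_id] = count
    return index

def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    words = tokenize(query)
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # Sort by total occurrences (descending), then by page id for stable output.
    return sorted(candidates, key=lambda p: (-sum(index[w][p] for w in words), p))

pages = {
    "p1": "web crawler follows web links",
    "p2": "index of a book",
    "p3": "web index",
}
index = build_index(pages)
```

Here search(index, "web") returns ["p1", "p3"], since "web" appears twice on p1 and once on p3, while search(index, "web index") returns only ["p3"]. Looking a keyword up is a dictionary access rather than a scan of every page, which is why the index gives quick access to results.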
Indexing starts with parsing the website content using a parser. The parser extracts the relevant information from a web page by excluding certain common words (such as a, an, the, also known as stop words), HTML tags, JavaScript and other unwanted characters. A good parser can also eliminate commonly occurring content in the website pages (such as navigation links) so that it is not counted as part of the page's content.
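The parsing step can be sketched with Python's standard html.parser module. This is a simplified illustration, not a production parser: the stop word list contains only the three words named above, the class name TextExtractor and function extract_terms are hypothetical, and the removal of repeated navigation content is not attempted.

```python
import re
from html.parser import HTMLParser

# Only the stop words mentioned in the text; real lists are much longer.
STOP_WORDS = {"a", "an", "the"}

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and the contents of script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def extract_terms(html):
    """Return the indexable words of a page: no tags, scripts or stop words."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in STOP_WORDS]

terms = extract_terms(
    "<html><script>var x = 1;</script><p>The crawler follows a link</p></html>"
)
```

For the sample page, the resulting terms are "crawler", "follows" and "link": the script body, the tags and the stop words "the" and "a" have all been dropped before indexing.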
Once the indexing is completed, the results are stored in an index database for use in later queries. Because disk storage has become cheap, search engines can afford very large storage capacity, often running into terabytes of data. However, retrieving this data quickly and efficiently requires special distributed and scalable data storage functionality. Indexes are updated periodically as new content is crawled. Some indexes help create a dictionary (lexicon) of all words that are available for searching. A lexicon also helps in correcting mistyped words by showing the corrected versions in a search result. Part of the success of a search engine lies in how its indexes are built and used. Various algorithms are used to optimize these indexes so that relevant results are found easily without heavy use of computing resources.
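As a rough illustration of how a lexicon can help with mistyped words, the sketch below suggests the closest known word using difflib.get_close_matches from the Python standard library. The similarity cutoff of 0.75 and the names build_lexicon and suggest are assumptions made for this example; real search engines use their own, more elaborate correction methods.

```python
import difflib

def build_lexicon(indexed_words):
    """Build a sorted list of the distinct words available for searching."""
    return sorted({word.lower() for word in indexed_words})

def suggest(word, lexicon, cutoff=0.75):
    """Return the word itself if known, else the closest lexicon word, else None."""
    word = word.lower()
    if word in lexicon:
        return word
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1.
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

lexicon = build_lexicon(["search", "engine", "crawler", "index"])
corrected = suggest("crawlr", lexicon)
```

In this example the mistyped "crawlr" is mapped to "crawler", while a string that resembles nothing in the lexicon yields None, so no correction would be offered.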
In addition to indexing the web content, some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as Alta Vista, store every word of every page they find. The cached page always holds the actual search text, since it is the version that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying the principle of least astonishment: the user normally expects the search terms to be on the returned page. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere.
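The benefit of a cache can be shown with a small sketch. It assumes a hypothetical CachingIndex class that keeps a snapshot of each page's text at indexing time; the URL and page texts are made up for illustration, and this says nothing about how Google's cache is actually implemented.

```python
import re

def words(text):
    """Return the set of lowercase words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def terms_on_page(query, text):
    """True if every query word appears in the given text."""
    return words(query) <= words(text)

class CachingIndex:
    """Keeps a snapshot (cache) of each page as it looked when it was indexed."""

    def __init__(self):
        self.cache = {}  # url -> page text at indexing time

    def index_page(self, url, text):
        self.cache[url] = text

    def search(self, query):
        """Return URLs whose cached snapshot contains all query words."""
        if not words(query):
            return []
        return sorted(u for u, t in self.cache.items() if terms_on_page(query, t))

url = "http://example.com/news"
engine = CachingIndex()
engine.index_page(url, "Crawler release notes")

# The live page has changed since it was indexed.
live_text = "Welcome to our new homepage"
hits = engine.search("crawler release")
in_cache = terms_on_page("crawler release", engine.cache[url])
on_live_page = terms_on_page("crawler release", live_text)
```

The search still returns the URL because the index was built from the old content. The query words are found in the cached snapshot but not on the live page, so showing the cached copy gives the user the text that actually matched.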