A search engine crawler is a program or automated script that browses the World Wide Web methodically in order to supply a search engine with up-to-date data. Search engine crawlers go by many different names, such as web spiders and automatic indexers, but the job is the same. Web crawling starts with a set of URLs to visit, called seeds. The crawler visits each of these pages, identifies all the hyperlinks on it, and adds them to the list of URLs to crawl. URLs on this list are re-visited from time to time according to the search engine's policies. These policies differ from one search engine to another, and re-visiting can also serve as a precaution to check that pages already in the index have not turned into spam.
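The seed-and-follow loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, and an in-memory dictionary stands in for real HTTP requests so that the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is any callable that returns the HTML for a URL, or None
    when the page cannot be retrieved (an assumption of this sketch).
    """
    frontier = deque(seeds)   # the list of places still to crawl
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical three-page site used in place of the real web.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

visited = crawl(["http://example.com/"], FAKE_WEB.get)
```

Starting from a single seed, the loop discovers the other two pages through their hyperlinks and skips the link whose target cannot be fetched. A real crawler would replace `FAKE_WEB.get` with an HTTP client and would also apply the re-visit policies mentioned above.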
Search engine crawlers sometimes have a hard time keeping the index up to date because of three main characteristics of the Internet: the large volume of web pages, the fast pace and frequency at which those pages change, and the addition of dynamic pages. Together these produce a massive number of URLs to crawl and force the search engine crawler to prioritize certain web pages and hyperlinks. This prioritization can be summed up in four crawler policies that are common across search engines, though the details may differ slightly from one to the next.
The selection policy states which pages to download.
The re-visit policy indicates when the crawler should check web pages for changes.
The politeness policy tells the crawler how to avoid overloading websites while it checks their URLs.
The parallelization policy states how to coordinate distributed web crawlers.
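Each of the four policies can be reduced to a small decision function. The sketch below is one possible reading, not how any particular search engine works: the function names, the depth limit, the halve-or-double re-visit rule, the two-second delay, and the hash-based worker assignment are all assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import urlparse


def should_download(url, allowed_schemes=("http", "https"), max_depth=5):
    """Selection policy: keep only web URLs that are not nested too deeply."""
    parts = urlparse(url)
    depth = len([p for p in parts.path.split("/") if p])
    return parts.scheme in allowed_schemes and depth <= max_depth


def next_visit(last_visit, changed, interval,
               min_interval=3600, max_interval=30 * 86400):
    """Re-visit policy: come back sooner to pages that changed.

    Halves the interval when the page changed, doubles it otherwise,
    and clamps the result. Times are in seconds.
    """
    interval = interval / 2 if changed else interval * 2
    interval = max(min_interval, min(max_interval, interval))
    return last_visit + interval, interval


class PolitenessGate:
    """Politeness policy: space out requests to the same host."""

    def __init__(self, delay=2.0):
        self.delay = delay
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Return how long to wait before fetching `url`, and book the slot."""
        host = urlparse(url).netloc
        ready_at = self.next_allowed.get(host, now)
        wait = max(0.0, ready_at - now)
        self.next_allowed[host] = max(now, ready_at) + self.delay
        return wait


def assign_worker(url, num_workers):
    """Parallelization policy: send every URL of a host to the same worker."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


gate = PolitenessGate(delay=2.0)
first_wait = gate.wait_time("http://example.com/a", now=0.0)
second_wait = gate.wait_time("http://example.com/b", now=0.5)
```

Keeping all URLs of one host on one worker is a common way to let that worker enforce politeness locally, without the distributed crawlers having to share a delay table.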
Search engine crawlers generally need not only a good crawling strategy, with policies that narrow down and prioritize the web pages to be crawled, but also a highly optimized architecture. This architecture is used to build high-performance systems for search engines that are capable of downloading hundreds of millions of pages over several weeks. The design is simple to follow, but it must also be built for high performance. In a well-formed search engine crawler, web pages are fetched from the World Wide Web by a multi-threaded downloader. The URLs extracted by the downloader go into a queue, then pass through a scheduler that prioritizes them, and finally return to the multi-threaded downloader, which places the text and metadata in storage.
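The flow from downloader to queue to scheduler and back can be sketched with Python's standard thread pool and a priority queue. The names `run_pipeline`, `priority`, and `PAGES` are hypothetical, the "shallower paths first" rule is only one possible scheduling choice, and an in-memory dictionary again stands in for the web so that the example is self-contained.

```python
import heapq
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

HREF = re.compile(r'href="([^"]+)"')  # crude link pattern, enough for a sketch
TAG = re.compile(r"<[^>]+>")


def priority(url):
    """Hypothetical scheduler rule: shallower paths are crawled first."""
    return len([p for p in urlparse(url).path.split("/") if p])


def run_pipeline(seeds, fetch, workers=4, max_pages=50):
    """Downloader -> queue -> scheduler -> downloader -> storage."""
    queue = [(priority(u), u) for u in seeds]  # the scheduler's priority queue
    heapq.heapify(queue)
    seen = set(seeds)
    storage = {}  # url -> extracted text and metadata
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while queue and len(storage) < max_pages:
            # The scheduler hands the highest-priority URLs to the downloader.
            room = min(workers, max_pages - len(storage))
            batch = [heapq.heappop(queue)[1]
                     for _ in range(min(room, len(queue)))]
            # The downloader fetches the whole batch on separate threads.
            for url, html in zip(batch, pool.map(fetch, batch)):
                if html is None:
                    continue
                links = [urljoin(url, h) for h in HREF.findall(html)]
                storage[url] = {
                    "text": TAG.sub(" ", html).strip(),
                    "meta": {"outlinks": len(links), "bytes": len(html)},
                }
                # Newly found URLs go back into the queue for scheduling.
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        heapq.heappush(queue, (priority(link), link))
    return storage


# Hypothetical pages used in place of real downloads.
PAGES = {
    "http://example.com/": '<p>Home</p><a href="/a">A</a><a href="/a/deep">Deep</a>',
    "http://example.com/a": '<p>Page A</p><a href="/">Home</a>',
    "http://example.com/a/deep": "<p>Deep page</p>",
}

storage = run_pipeline(["http://example.com/"], PAGES.get)
```

Threads suit this stage because a downloader spends most of its time waiting on the network. A production system would also need per-host politeness delays and durable storage, both of which this sketch leaves out.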
There are many different professional search engine crawlers in use today, such as the Google crawler, and they gather the URLs that a search engine lists. Without search engine crawlers, there would be nothing to show on search engine results pages, and new pages would never be listed.