What is a search engine spider?


Search engine spiders, sometimes called crawlers, are programs that Internet search engines use to collect information about Web sites and individual Web pages. The search engines need information from as many sites and pages as possible; otherwise they wouldn’t know which pages to display in response to a search query or in what order.

Search engine spiders crawl through the Internet and create queues of Web sites to investigate further. When a spider visits a specific Web site, it reads through the text, hyperlinks, meta tags (elements in a page's HTML that describe the page, such as its keywords and description, for programs like spiders rather than for human readers) and code. Using this information, the spider provides a profile of the page to the search engine. The spider then follows the hyperlinks on the Web page to the pages they point to and gathers information about those pages in the same way. This is why having links on your Web page and, even better, links from other Web pages to yours is so useful in getting your Web site found by the search engines.
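The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `PAGES` dictionary is a made-up, in-memory stand-in for the Web so the example runs without network access, and the names `PageParser` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "Web" (an assumption for this sketch, not real pages).
PAGES = {
    "http://example.com/": (
        '<meta name="description" content="Home page">'
        '<a href="/a">A</a> <a href="/b">B</a>'
    ),
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the hyperlinks and named meta tags found on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own address.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")


def crawl(seed):
    """Visit pages breadth-first from the seed and build a profile of each."""
    queue = deque([seed])   # pages waiting to be investigated
    seen = {seed}           # pages already queued, so none is visited twice
    profiles = {}
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:    # unknown page: nothing to read
            continue
        parser = PageParser(url)
        parser.feed(html)
        profiles[url] = {"meta": parser.meta, "links": parser.links}
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return profiles
```

Calling `crawl("http://example.com/")` starts from one page and reaches the other two only because links point to them, which is the reason inbound links help a page get found.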

Spiders follow four basic policies when gathering information. The first, “selection,” decides which pages to download: it builds and prioritizes the queue of Web pages to be crawled and checks whether an earlier version of a page has already been downloaded. The second, “re-visitation,” decides when to go back over pages that have already been crawled to see whether they have changed. Because a spider can request pages far faster than a person can, heavy crawling can overload a Web site’s server, so the third policy, “politeness,” limits how often a spider requests pages from the same site. Lastly, “parallelization” coordinates the work of multiple spiders running at the same time so that they do not repeatedly download the same page.
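Three of these ideas can be sketched in Python. The sketch takes politeness to mean a minimum delay between requests to one host, re-visitation to mean a fixed interval between visits, and parallelization to mean routing each host to a single worker. All of these are simplifying assumptions, and the names `PolitenessGate`, `due_for_revisit` and `assign_worker` are hypothetical rather than taken from any real crawler.

```python
import zlib
from urllib.parse import urlparse


class PolitenessGate:
    """Politeness: enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # host -> time of the most recent request

    def wait_time(self, url, now):
        """Return how many seconds to wait before this URL may be fetched."""
        last = self.last_fetch.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url, now):
        self.last_fetch[urlparse(url).netloc] = now


def due_for_revisit(last_crawled, now, interval_seconds):
    """Re-visitation: a page is due once the chosen interval has passed."""
    return now - last_crawled >= interval_seconds


def assign_worker(url, num_workers):
    """Parallelization: send every URL on a host to the same worker,
    so two workers never fetch the same page."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode("utf-8")) % num_workers
```

Times are passed in as plain numbers rather than read from the system clock, which keeps the sketch easy to test. A real crawler would likely adapt the delay and the re-visit interval per site instead of using fixed values.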

