The internet contains a vast collection of information, spread across remote web servers in every part of the world. The difficulty of locating the right information on the internet led to the creation of search technology, known as the internet search engine.
A search engine can provide links to relevant information based on your requirement or query. Examples of popular internet search engines are Google and Bing.
A search engine is computer software that is continually modified to take advantage of the latest technologies and provide improved search results. Every search engine performs the same functions of collecting, organizing, indexing and serving results, but each does so in its own way, employing algorithms and techniques that are its trade secrets. In short, the functions of a search engine can be categorized as follows:
- Crawling the internet for web content.
- Indexing the web content.
- Storing the website contents.
- Search algorithms and results.
Crawling is the method of following links on the web to different websites and gathering the contents of these websites for storage in the search engine's databases. Crawling can start from a popular website containing many links, such as Yahoo!, or from existing older indexes of websites. The crawler (also known as a web robot or a web spider) is a software program that downloads web content (web pages, images, documents and other files) and then follows the hyperlinks within that content to download the linked content, which can be on the same site or on a different website.
Web crawlers are a central part of search engines. They are software programs (bots) that systematically browse the World Wide Web, typically for the purpose of web indexing. A web crawler, also called a web spider, an ant, an automatic indexer or a web scutter, is used to index or update the web content of a specific website.
Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search them much more quickly. Crawlers can also be used to validate hyperlinks and HTML code, and for web scraping.
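Following hyperlinks, as described above, begins with extracting them from a downloaded page. The sketch below is one minimal way to do that with Python's standard library; the class name LinkExtractor, the helper extract_links and the example.com-style URLs are assumptions made for illustration, not part of any particular search engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every hyperlink found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content with one same-site and one external link.
page = '<a href="/about">About</a> <a href="https://other.example/">Other</a>'
print(extract_links("https://site.example/index.html", page))
```

The first link resolves to a page on the same site and the second points to a different website, which matches the two cases mentioned above.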
A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. From there, the crawler moves from one web page to another, recursively visiting URLs from the frontier according to a set of policies:
- Policy that states which pages to index.
- Policy that states when to revisit a web page to check for changes.
- Policy to avoid overloading websites.
- Policy that states how to coordinate distributed web crawlers.
The crawling continues until it reaches a logical dead end. Such a dead end, or stop, can be a web page with no links, or the point at which a set number of levels inside the website's link structure has been reached. If a website is not linked from other websites on the internet, the crawler will be unable to locate it. Therefore, a new website that has no links from other sites has to be submitted to the search engines for crawling.
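The seed list, the crawl frontier and the two stopping conditions described above can be sketched as a breadth-first loop. This is only an illustration under stated assumptions: LINK_GRAPH is a made-up in-memory stand-in for the web, and the names crawl, get_links and max_depth are hypothetical. A real crawler would download each page and extract its links instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical link structure: each URL maps to the URLs it links to.
LINK_GRAPH = {
    "https://a.example/": ["https://a.example/1", "https://b.example/"],
    "https://a.example/1": ["https://a.example/", "https://a.example/2"],
    "https://a.example/2": ["https://a.example/3"],
    "https://a.example/3": [],
    "https://b.example/": [],  # a page with no links: a dead end
}


def crawl(seeds, get_links, max_depth=3):
    """Visit pages breadth-first, starting from the seeds.

    Stops following links from a page when it has no links or when
    max_depth levels have been reached. Returns URLs in visit order.
    """
    frontier = deque((url, 0) for url in seeds)
    seen = set(seeds)  # URLs already queued, so no page is visited twice
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # level limit reached: do not go deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


order = crawl(["https://a.example/"], lambda u: LINK_GRAPH.get(u, []), max_depth=2)
print(order)
```

With max_depth=2, the page https://a.example/3 is never visited because it sits three levels below the seed. A page that no other page links to would likewise never enter the frontier, which is why such a site has to be submitted as a seed.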
Web crawlers typically identify themselves to a web server by using the user-agent field of an HTTP request. Website administrators typically examine their web servers' logs and use the user-agent field to determine which crawlers have visited the web server and how often.
The user-agent field may include a URL where the website administrator can find more information about the crawler. Spambots and other malicious web crawlers are unlikely to place identifying information in the user-agent field, or they may disguise themselves as a browser or a well-known crawler.
It is important for web crawlers to identify themselves, because this enables website administrators to contact the crawler's owner if needed. Identification is also useful for website administrators who want to know when they may expect their website's pages to be indexed.
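As a rough illustration of this identification, the sketch below prepares an HTTP request carrying a user-agent string and shows how an administrator might pull the information URL out of such a string. The crawler name ExampleBot, its URL, and the helpers build_request and info_url are invented for this example; the "+URL" convention inside the parentheses is a common habit among crawlers, not a requirement.

```python
import re
from urllib.request import Request

# Hypothetical crawler name and information URL, for illustration only.
USER_AGENT = "ExampleBot/1.0 (+https://crawler.example/about)"


def build_request(url):
    """Prepare (but do not send) a request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def info_url(user_agent):
    """Return the '+URL' link found in a user-agent string, or None."""
    match = re.search(r"\+(https?://[^\s;)]+)", user_agent)
    return match.group(1) if match else None


request = build_request("https://site.example/")
# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
print(info_url(USER_AGENT))
```

Because any client can send any user-agent string, a value like this only identifies well-behaved crawlers; it does not prove who made the request.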
Efficiency and Guidelines
An efficient crawler crawls multiple websites at the same time, so as to collect billions of pages of website content as frequently as it can. News and media sites are crawled more frequently by search engines like Google, so that their search results deliver up-to-date news and content to users.
The crawler also does not flood a single website with a high volume of requests at the same time. Instead, it spreads the crawling over a period of time so that the website it visits does not crash. Usually search engines crawl only a few (three or four) levels deep from the homepage of a website.
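One simple way to spread requests over time is to enforce a minimum delay between requests to the same host while leaving other hosts unaffected. The sketch below assumes a fixed per-host delay; the class PolitenessScheduler, its min_delay_seconds parameter and the two-second value are hypothetical choices, and real search engines may use quite different rules.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed timestamp

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now


scheduler = PolitenessScheduler(min_delay_seconds=2.0)
first = scheduler.wait_time("https://a.example/1", now=0.0)   # new host
second = scheduler.wait_time("https://a.example/2", now=0.0)  # same host again
other = scheduler.wait_time("https://b.example/", now=0.0)    # different host
print(first, second, other)
```

In a running crawler, now would come from a clock such as time.monotonic(), and the returned value would be passed to time.sleep() or used to reorder the frontier so that other hosts are fetched in the meantime.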
Crawlers, or web robots, follow guidelines that the website owner specifies for them using the robots exclusion protocol (robots.txt). The robots.txt file specifies the files or folders that the owner does not want the crawler to fetch and index in its database.
Many search engine crawlers handle unfriendly URLs poorly, such as those generated by database-driven websites. These URLs contain parameters after the question mark. Search engines dislike such URLs because a website can overwhelm the crawler by using parameters to generate thousands of new web pages with similar content for indexing. Thus, crawlers often do not treat a change in the parameters as a new URL to spider. Search engine friendly URLs are used to work around this problem.
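The two guidelines above, honoring robots.txt and not treating every parameter change as a new page, can be sketched as a filter applied to candidate URLs. The robots.txt content, the ExampleBot name and the helpers load_rules, strip_query and filter_urls are assumptions for illustration. Dropping the whole query string is a deliberately crude rule; an actual crawler may keep parameters that are known to change the page content.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def strip_query(url):
    """Drop the query string and fragment so parameter variants collapse."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def filter_urls(urls, rules, user_agent="ExampleBot"):
    """Keep URLs allowed by robots.txt, one per parameter-free form."""
    kept, seen = [], set()
    for url in urls:
        key = strip_query(url)
        if key in seen or not rules.can_fetch(user_agent, url):
            continue
        seen.add(key)
        kept.append(url)
    return kept


candidates = [
    "https://site.example/item?id=1",
    "https://site.example/item?id=2",       # same page, different parameter
    "https://site.example/private/report",  # excluded by robots.txt
    "https://site.example/about",
]
kept = filter_urls(candidates, load_rules(ROBOTS_TXT))
print(kept)
```

Only the first item URL and the about page survive the filter: the second item URL collapses onto the first, and the private report is excluded by the owner's rules.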