Understanding the Robots File

Search engines rely on programs known as spiders or web crawlers. These programs browse the World Wide Web in a methodical, automated fashion, a process commonly known as web crawling or spidering.

Search engines use these crawlers to visit websites and index their contents. Often, however, you do not want every part of your site indexed. You may have sensitive data that you do not want search engines to expose to the world, or you may simply want to save some bandwidth by keeping images or stylesheets out of the index.

One way to tell search engines which files and folders on a website to index and which to avoid is the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed, making it an unreliable way to tell crawlers what to do. A better way to communicate your wishes to search engines is to use a robots file.
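
For reference, the Robots metatag is an ordinary HTML tag placed in a page's head section; the noindex and nofollow values shown here are the most common way of asking crawlers not to index a page or follow its links:

<meta name="robots" content="noindex, nofollow">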

The Robots File

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt, is a text file you can put on your site to tell search engine crawlers which pages you do not want them to visit and index. It is a protocol advising web crawlers and other web robots about access to all or part of a website that is otherwise publicly viewable. Robots are often used by search engines to categorize and archive websites, or by webmasters to proofread source code.

It was first proposed by Martijn Koster while working for Nexor in February 1994. Charles Stross claims to have provoked Koster to suggest robots.txt after he wrote a badly behaved web spider that caused an inadvertent denial-of-service attack on Koster's server. It quickly became a de facto standard that present and future web crawlers were expected to follow.

The Robots file is by no means mandatory for search engines, but well-behaved search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is advisory only and is not a way of preventing search engines from crawling your site.

The robots.txt file must be located in the main directory of a website (http://www.yoursite.com/robots.txt), because otherwise the user agents (search engine crawlers) will not be able to find it. User agents do not search the whole site for a file named robots.txt; they look only in the main directory. If they cannot find it there, they simply assume that the site has no robots.txt file and may therefore index everything they find along the way.
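
For example, crawlers will request and read the file at the first address below but will never look for it at the second (the subdirectory name is purely illustrative):

http://www.yoursite.com/robots.txt
http://www.yoursite.com/myfolder/robots.txt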

Structure

The structure of a robots.txt file is simple. It can contain any number of user agents and disallowed files and directories. The basic syntax is as follows:

User-agent:

Disallow:

"User-agent" are search engines' crawlers. And "Disallow" is the lists of files and directories to be excluded from indexing. In addition to "user-agent" and "disallow:" entries, comment lines ('#' sign) at the beginning of the line can also be included.

Examples

Allowing Robots to index all files:
User-agent: *
Disallow:

Not allowing Robots to index any files:
User-agent: *
Disallow: /

Not allowing Robots to index a specific file:
User-agent: *
Disallow: /directory/example.html
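
Similarly, an entire directory can be excluded, which is a common way to keep images or stylesheets out of the index (the /images/ path is only an example):

User-agent: *
Disallow: /images/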

Several major search engine crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server. Some major crawlers also support an Allow directive, which can counteract a following Disallow directive.
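
For instance, the following sketch (the ten-second delay and the file and directory names are arbitrary) asks crawlers that honour these directives to pause between requests and to index one page inside an otherwise disallowed directory:

User-agent: *
Crawl-delay: 10
Allow: /directory/public.html
Disallow: /directory/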

Common Problems

When you start writing more complicated rules in the Robots file, for example allowing different user agents access to different directories and files, problems can arise if you do not pay attention.

Common mistakes include typos: misspelled user agents, misspelled directories, missing colons, and so on. Typos can be tricky to find, but in some cases validation tools can help.

The more serious problem is logical errors, where every line is spelled correctly but the rules do not mean what you intended.
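
A common example, sketched below with placeholder paths, is forgetting that a crawler obeys only the most specific User-agent group that matches it. The intention here was to block Googlebot from both directories, but because Googlebot has its own group it ignores the general rules and can still crawl /private/:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /images/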

Tools for the Robots

The simple syntax of a robots.txt file means that you can always read it yourself to check that everything is as it should be. But if you want to make things easier, you can use a validator: a tool that reports common mistakes in syntax.
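
As a quick check you can also test a file programmatically. The sketch below uses Python's standard urllib.robotparser module; the site address and the page being tested are the placeholder URLs used earlier in this article:

from urllib import robotparser

# Load the robots.txt file from the site's main directory
rp = robotparser.RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")
rp.read()

# Check whether a generic crawler ("*") may fetch a given page
print(rp.can_fetch("*", "http://www.yoursite.com/directory/example.html"))

# Report the Crawl-delay value for generic crawlers, if any is set
print(rp.crawl_delay("*"))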

When you need a complex robots.txt file, there are tools that will generate it for you. There are also visual tools that let you point and select which files and folders are to be excluded.