The Robots Exclusion Protocol is a method that allows Web site administrators to tell visiting robots which parts of their site should not be visited. Simply put, when a robot visits a Web site, say
http://www.nederlandinternet.com/, it first checks for http://www.nederlandinternet.com/robots.txt. If it can find this
document, it will analyze its contents for records like:
User-agent: *
Disallow: /
to see if it is allowed to retrieve the document.
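The check described above can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration, not how any particular robot is implemented: the robot name "ExampleBot" is made up, and the file contents are supplied as a string instead of being downloaded.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

SITE = "http://www.nederlandinternet.com/"

# A robot looks for robots.txt at the root of the site it is visiting.
robots_url = urljoin(SITE, "/robots.txt")

# Contents the robot is assumed to have downloaded from robots_url.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is a hypothetical robot name; the "*" record applies to it.
allowed = parser.can_fetch("ExampleBot", urljoin(SITE, "/index.html"))
print(robots_url)
print(allowed)
```

With the record shown, "Disallow: /" covers every URL on the site, so the robot is refused for any page it asks about.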
The "/robots.txt" file usually contains a record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /images/
In this example, three directories are excluded for all robots. The User-agent field is where you specify whether all robots (*)
or only specific robots are to follow these instructions.
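To see what the three-directory record means in practice, the following sketch feeds it to Python's urllib.robotparser and asks about a few paths. The robot name "ExampleBot", the helper build_parser, and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

RECORD = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /images/
"""

def build_parser(text):
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(RECORD)
base = "http://www.nederlandinternet.com"

# Hypothetical paths: one outside the excluded directories, two inside.
for path in ["/index.html", "/tmp/scratch.html", "/images/mypicture.jpg"]:
    print(path, parser.can_fetch("ExampleBot", base + path))
```

The page at /index.html is reported as retrievable, while the paths under /tmp/ and /images/ are refused.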
One thing to remember is that you need a separate "Disallow" line for every URL prefix you want to exclude. You can exclude
entire directories, as shown in the example above, or specific files, as in "Disallow: /images/mypicture.jpg".
Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent
field is a special value meaning "any robot", not a pattern. Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow:
*.jpg".
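The prefix rule can be illustrated with a few lines of Python. This is a minimal sketch of literal prefix matching as described here, not the code of any real robot; the function name is_disallowed and the sample paths are made up.

```python
def is_disallowed(path, disallow_values):
    """Return True if path begins with one of the Disallow values.

    Each value is compared as a literal prefix; '*' gets no special
    treatment. Empty values are skipped because they exclude nothing.
    """
    return any(value and path.startswith(value) for value in disallow_values)

# A plain prefix excludes everything beneath it.
print(is_disallowed("/tmp/file.txt", ["/tmp/"]))

# A wildcard is taken literally, so it matches no ordinary path.
print(is_disallowed("/tmp/file.txt", ["/tmp/*"]))
print(is_disallowed("/images/mypicture.jpg", ["*.jpg"]))
```

Only the first call reports an exclusion. To exclude a single file, give its full path, and to exclude several prefixes, pass one value per "Disallow" line.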
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve.
This brings us to a discussion of META tags.