(wget) Robot Exclusion
9.1 Robot Exclusion
===================
It is extremely easy to make Wget wander aimlessly around a web site,
sucking up all the available data in the process. `wget -r SITE', and
you're set. Great? Not for the server admin.
As long as Wget is only retrieving static pages, and doing it at a
reasonable rate (see the `--wait' option), there's not much of a
problem. The trouble is that Wget can't tell the difference between the
smallest static page and the most demanding CGI. A site I know has a
section handled by a CGI Perl script that converts Info files to HTML on
the fly. The script is slow, but works well enough for human users
viewing an occasional Info file. However, when someone's recursive Wget
download stumbles upon the index page that links to all the Info files
through the script, the system is brought to its knees without providing
anything useful to the user. (This task of converting Info files could
be done locally, and Info documentation for all installed GNU software
on a system is available from the `info' command.)
To avoid this kind of accident, as well as to preserve privacy for
documents that need to be protected from well-behaved robots, the
concept of "robot exclusion" was invented. The idea is that the server
administrators and document authors can specify which portions of the
site they wish to protect from robots and which they will let robots
access.
The most popular mechanism, and the de facto standard supported by
all the major robots, is the "Robots Exclusion Standard" (RES) written
by Martijn Koster et al. in 1994. It specifies the format of a text
file containing directives that instruct the robots which URL paths to
avoid. To be found by the robots, the specifications must be placed in
`/robots.txt' in the server root, which the robots are expected to
download and parse.
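To make the file format concrete, here is a minimal sketch that parses a
small `robots.txt' with Python's standard `urllib.robotparser' module.
The file contents, the paths in it, and the robot name `ExampleBot' are
invented for illustration; this is not how Wget itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text and return a parser that can answer queries."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A static page outside the excluded paths may be fetched.
print(parser.can_fetch("ExampleBot", "http://www.server.com/index.html"))

# Anything under an excluded path prefix is off limits to the robot.
print(parser.can_fetch("ExampleBot", "http://www.server.com/cgi-bin/info2html"))
```

Each `Disallow' line names a URL path prefix, so a single entry is enough
to keep robots away from a whole directory of expensive scripts.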
Although Wget is not a web robot in the strictest sense of the word,
it can download large parts of the site without the user's
intervention to download each individual page. Because of that, Wget
honors RES when downloading recursively. For instance, when you issue:
     wget -r http://www.server.com/
First the index of `www.server.com' will be downloaded. If Wget
finds that it wants to download more documents from that server, it will
request `http://www.server.com/robots.txt' and, if found, use it for
further downloads. `robots.txt' is loaded only once per server.
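The once-per-server behavior can be pictured as a small cache keyed by
server. The sketch below is only an illustration of that idea, not Wget's
actual code: the class `RobotsCache', the callable `fetch_text', and the
stand-in `fake_fetch' are hypothetical names, and the network request is
replaced by a function that returns canned text.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Keep one parsed robots.txt per server (scheme plus host[:port])."""

    def __init__(self, fetch_text):
        # fetch_text is a hypothetical callable: it takes a robots.txt URL
        # and returns the file's text, or None if the file does not exist.
        self._fetch_text = fetch_text
        self._parsers = {}

    def allowed(self, user_agent, url):
        parts = urlsplit(url)
        server = (parts.scheme, parts.netloc)
        if server not in self._parsers:
            # First contact with this server: fetch and parse robots.txt.
            text = self._fetch_text(f"{parts.scheme}://{parts.netloc}/robots.txt")
            parser = RobotFileParser()
            # A missing robots.txt is treated as "no restrictions".
            parser.parse((text or "").splitlines())
            self._parsers[server] = parser
        return self._parsers[server].can_fetch(user_agent, url)


requests_made = []


def fake_fetch(robots_url):
    """Stand-in for a real HTTP request; records which URLs were asked for."""
    requests_made.append(robots_url)
    return "User-agent: *\nDisallow: /cgi-bin/\n"


cache = RobotsCache(fake_fetch)
cache.allowed("ExampleBot", "http://www.server.com/a.html")
cache.allowed("ExampleBot", "http://www.server.com/cgi-bin/slow")
print(requests_made)  # robots.txt was requested a single time
```

Passing the fetch function in as a parameter keeps the sketch runnable
without network access; a real client would perform an HTTP request
there.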
Until version 1.8, Wget supported the first version of the standard,
written by Martijn Koster in 1994 and available at
`http://www.robotstxt.org/wc/norobots.html'. As of version 1.8, Wget
has supported the additional directives specified in the internet draft
`<draft-koster-robots-00.txt>' titled "A Method for Web Robots
Control". The draft, which as far as I know never made it to an RFC,
is available at `http://www.robotstxt.org/wc/norobots-rfc.txt'.
This manual no longer includes the text of the Robot Exclusion
Standard.
The second, lesser-known mechanism enables the author of an individual
document to specify whether they want the links from the file to be
followed by a robot. This is achieved using the `META' tag, like this:
     <meta name="robots" content="nofollow">
This is explained in some detail at
`http://www.robotstxt.org/wc/meta-user.html'. Wget supports this
method of robot exclusion in addition to the usual `/robots.txt'
exclusion.
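As a rough illustration of how a client might detect this tag, the
following sketch uses Python's standard `html.parser' module. The names
`RobotsMetaParser' and `links_may_be_followed' are hypothetical, and the
sketch assumes that the `content' attribute is a comma-separated list of
directives; it does not reproduce Wget's own parsing.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a document carries a robots META tag with nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser delivers tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Assumption: directives are separated by commas, case-insensitive.
        directives = {item.strip().lower() for item in content.split(",")}
        if "nofollow" in directives:
            self.nofollow = True


def links_may_be_followed(html_text):
    """Return False if the document asks robots not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return not parser.nofollow


page = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(links_may_be_followed(page))
```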
If you know what you are doing and really, really wish to turn off
robot exclusion, set the `robots' variable to `off' in your `.wgetrc'.
You can achieve the same effect from the command line using the `-e'
switch, e.g. `wget -e robots=off URL...'.