There is one further use of meta tags that is of immediate interest: using them to control the behaviour of search engine spiders. At the level of the individual page, meta tags with the name attribute of robots can be used with content values of the following:
|noindex|Do not index the content of this page (i.e. do not add it to a search engine database, and do not show it in search results).|
|nofollow|Do not follow outbound links from this page.|
Together, they are typically used as follows:
<meta name="robots" content="noindex, nofollow" />
This can be used to hide information that is publicly available, but that you do not wish to appear in search results: personal information you have no problem in sharing if someone comes across it, but do not want accessible to Google, for example.
(The opposite commands, index and follow, are assumed; they do not have to be placed on a page in order to have it searched and indexed.)
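As a rough illustration of how a spider might read these values, here is a minimal Python sketch using only the standard library. The names RobotsMetaParser and robots_policy are hypothetical, chosen for this example; real search engines use their own parsers and recognise more values than the two covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of values.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return whether a page may be indexed and its links followed.

    index and follow are the assumed defaults; only an explicit
    noindex or nofollow switches them off.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


page = '<head><meta name="robots" content="noindex, nofollow" /></head>'
print(robots_policy(page))
```

A page with no robots meta tag at all yields index and follow both set to True, which matches the defaults described above.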
To control access to many pages at a time, or to folders, a different approach is used. Before they attempt to index a site, well-behaved search engine robots look for a file called robots.txt at the root of the site (i.e. alongside the index.html page). The file will usually contain two very simple lines. The first is the User-agent line:
User-agent is the name of the spider that you wish to control: Google’s, for example, is called googlebot. In this way you can command different spiders from different search engines to do different things. Typically, however, you want to command all spiders to do the same thing, and so use a wildcard:
User-agent: *
This means “the following command is true for all spiders”.
The next line is the actual command to the spider. The most common is:
Disallow:
Meaning: do not visit, and so do not index, the file specified after the colon, or files in the specified directory. To disallow everything on your site from being indexed, use a single slash: Disallow: /
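To see how these two lines are interpreted, the following minimal sketch feeds a few robots.txt files to Python's standard urllib.robotparser module and asks whether a given spider may fetch a given URL. The domain example.com, the /private/ folder and the spider name otherbot are assumptions made up for illustration; this shows how one standard-library parser reads the rules, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt, user_agent, url):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Every spider is kept out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Every spider is kept out of one folder only (hypothetical path).
BLOCK_FOLDER = """\
User-agent: *
Disallow: /private/
"""

# Only the spider named googlebot is kept out; others are unaffected.
BLOCK_GOOGLEBOT = """\
User-agent: googlebot
Disallow: /
"""

home = "http://example.com/index.html"
notes = "http://example.com/private/notes.html"

print(may_fetch(BLOCK_ALL, "googlebot", home))        # False
print(may_fetch(BLOCK_FOLDER, "googlebot", notes))    # False
print(may_fetch(BLOCK_FOLDER, "googlebot", home))     # True
print(may_fetch(BLOCK_GOOGLEBOT, "googlebot", home))  # False
print(may_fetch(BLOCK_GOOGLEBOT, "otherbot", home))   # True
```

Note that robots.txt is a request rather than an access control: a spider that chooses to ignore the file can still fetch the pages.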