
Malcolm Slade

Head of Technical SEO

Most people know what a robots.txt file is and how to implement your basic User-agent: * Disallow: restrictions. But there is a lot more to robots.txt than simply blocking a folder.

The Basic Syntax

Your instructions within your robots.txt come in blocks. Each block starts with a declaration of the particular robot or robots it applies to via their user agent (all nice robots should declare themselves via a user agent so you can handle them appropriately). So the start of a block would look like:

User-agent: Googlebot # the Google robot
User-agent: msnbot # the Live and Bing robot
User-agent: * # all robots (or at least all of them that obey)

After your user agent line come the instructions to be followed. These need to come directly below with no blank lines. There can be as many as you like, as long as each is on a new line. So with instructions, your block might look like:

User-agent: Googlebot
Disallow: /supersecretfolder/
Disallow: /worlddominationplans/
Disallow: /poemsaboutunicorns/

Comments can be added to your robots.txt file by starting them with a hash (#), e.g. # This is a comment.

Folder and file names etc. are case sensitive, as they should be... sorry, rant over.

A robot obeys only the block that matches its user agent most specifically, which can lead to unexpected behaviour. For instance:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /supersecretfolder/

would not prevent msnbot from looking in the folder, as you have already given it the key to the city in its own block and it will ignore the catch-all block entirely.
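If you did want msnbot kept out of that folder as well, the disallow has to be restated inside its own block. A minimal sketch using the same example folder:

User-agent: msnbot
Disallow: /supersecretfolder/

User-agent: *
Disallow: /supersecretfolder/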

More Advanced Usage

Noindex

Disallowing a folder or file doesn’t necessarily keep it out of the index. It simply prevents the search engine robots from grabbing the content and adding it to their index. If somebody links to your content, its location will still appear in the search engine index and relevance will be gleaned from the anchor text etc. The snippet will simply be the URL, with no other information. To prevent your folder or file from being indexed completely you need to use the Noindex directive, although this is only obeyed by Google.

User-agent: Googlebot
Noindex: /supersecretfolder/

Allow

If you have a large amount of content you don’t want crawled but somewhere in there is a piece you do want crawled, you can use the Allow directive. This directive, supported by Google and Yahoo, means you can override part of a previous disallow statement, e.g.

Disallow: /poemsaboutunicorns/ # blocks everything in that folder
Allow: /poemsaboutunicorns/notwrittenbyme/ # opens up the child folder

This is a whole lot easier than writing individual lines for each of the child folders.

Pattern Matching

One of the most useful things you can do within your robots.txt file is pattern matching. Let’s say you have a large number of pages indexed with random variables attached, all of which are creating duplicates of your homepage, and you lack the ability to apply 301s etc. You can apply a disallow statement that uses pattern matching to say “if it’s got a variable tagged on the end, don’t crawl it”. So let’s say we find a load of duplicates with the variable affid= tagged on.

User-agent: *
Disallow: /*?affid=*

would handle them all. You can also go further and block a particular file extension, or even any URL which uses a variable.

Disallow: /*.cfm$ # blocks all .cfm extension pages
Disallow: /*?* # blocks all pages that use a variable

This can be very powerful, but remember that by disallowing you are in effect causing the pages to be ignored, so no links inside will be followed and any link juice flowing to the pages will meet a dead end.

Handling HTTPS

If you have the issue of all your pages being indexed under both the HTTPS and HTTP protocols, you can tackle it using robots.txt with a little help from URL rewriting. First you need to set up two robots.txt files: one that handles your HTTP pages, named robots.txt, and another that handles your HTTPS pages, named robots2.txt. Maybe:

# robots.txt (served over HTTP)
User-agent: *
Disallow:

-----

# robots2.txt (served over HTTPS)
User-agent: *
Disallow: /
Noindex: /

You then set up a rule in your URL rewriting to say that if the request for robots.txt comes in over the HTTPS protocol, serve the robots2.txt file. Something like:

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^/robots\.txt$ /robots2.txt [NC]

Sitemaps

Although I would always recommend linking to your XML sitemap from your HTML sitemap, you should also specify your XML sitemap location within your robots.txt file. The format is simply a clean line containing Sitemap: followed by the full location of the file.

So, that’s robots.txt. You can find out a lot more at and if you know any tricks I have missed, please don’t hesitate to let me know via a comment.
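To pull these pieces together, here is a minimal sketch of a single file combining the example folders, the affid= pattern and a sitemap reference from above; the example.com domain is just a placeholder, not a real location:

# example.com is a placeholder domain
User-agent: *
Disallow: /poemsaboutunicorns/
Allow: /poemsaboutunicorns/notwrittenbyme/
Disallow: /*?affid=*

Sitemap: http://www.example.com/sitemap.xml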