Epiphany Search

What is the robots.txt file? The robots.txt file is a plain ASCII text file that sits in the root directory of your site. It is the first place every robot visits, and it gives you the ability to control which robots you allow to crawl your site and where they are allowed to crawl. The robots.txt file contains two text fields:


User-agent: *
Disallow:

The User-agent field specifies which robot the rule applies to, and the Disallow field states where that particular robot is restricted from crawling. An example:

User-agent: *
Disallow: /

Here "*" means all robots and "/" means all URLs. Robots read this as "no access for any robot to any URL": since all URL paths begin with "/", a bare "/" bans access to every URL. If you want to allow access to most areas but restrict certain parts of your site, you can specify that as follows:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /products/prices/

This robots.txt file repeats the two fields we discussed; multiple commands can be given for different user agents on different lines. The commands above ban all user agents from /products/prices/, because we might want to hide this from appearing in search engines, while allowing Googlebot, Google's robot, access to every section.

Working with the robots.txt file:

1. The robots.txt file is always named in all lowercase, e.g. robots.txt.
2. The robots.txt file is an exclusion file meant for search engine robot reference and is not required for a website to function. An empty or absent file simply means that all robots are welcome to index any part of the website, but it is regarded as best practice to have one, even if it just allows all robots into all areas.
3. Only one robots.txt file can be maintained per domain.
4. Website owners who do not have administrative rights sometimes cannot create a robots.txt file. In such situations, the Robots Meta Tag can be configured to serve the same purpose. Keep in mind, though, that questions have lately been raised about robot behaviour regarding the Robots Meta Tag, and some robots might skip it altogether. The protocol, by contrast, makes it obligatory for all robots to start with robots.txt, making it the default starting point for all robots.
There are some handy features to the robots meta tag; for instance, you can specify how often a robot should return to your site, in number of days.
5. Separate lines are required for specifying access for different user agents, and a Disallow field should not carry more than one command per line. There is no limit to the number of lines, though: both the User-agent and Disallow fields can be repeated with different commands any number of times.
6. Use lower case for all robots.txt file content.

Advantages of the robots.txt file:

The protocol demands that all search engine robots start with the robots.txt file, making it the default entry point for robots when the file is present. Specific instructions can be placed in this file to help index your site, and the major search engines will not violate the Standard for Robots Exclusion.

1. The robots.txt file can be used to keep out unwanted robots such as email retrievers and image strippers, which harvest your email address to send spam and malicious content.
2. The robots.txt file can be used to specify the directories on your server that you don't want robots to access or index, e.g. temporary, cgi, and private/back-end directories. Search engines are powerful enough that a well-structured search term can bring back information that poses a security risk for your site, so it is better to deny access to these areas.
3. An absent robots.txt file can generate a 404 error and redirect the robot to your default 404 error page. The robot could then be left with no link to follow and stop crawling.
4. The robots.txt file also helps stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly. Most search engines' help pages give specific information about controlling the frequency and speed of their bot's crawling of your site, if you believe it is eating up bandwidth and overloading your server.
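Point 4 of "Working with the robots.txt file" mentions the Robots Meta Tag as the fallback when you cannot place a file in the site root. As a minimal illustration, the tag goes in a page's head section; noindex and nofollow shown here are the standard directive values:

```html
<head>
  <!-- Ask robots not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow">
</head>
```

Unlike robots.txt, which covers whole directories in one place, this tag must be added to every individual page you want to restrict.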
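The User-agent/Disallow matching described above can be tried out directly. As a minimal sketch, Python's standard-library urllib.robotparser applies the same rules a robot would; the example.com URLs and the "SomeOtherBot" name are hypothetical, and the rules are the Googlebot example from earlier:

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text: Googlebot may crawl everything,
# while every other robot is banned from /products/prices/.
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /products/prices/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own record, whose empty Disallow allows all URLs.
print(parser.can_fetch("Googlebot", "http://example.com/products/prices/"))    # True
# Any other robot falls through to the "*" record and is blocked there.
print(parser.can_fetch("SomeOtherBot", "http://example.com/products/prices/")) # False
# URLs outside the disallowed path stay open to everyone.
print(parser.can_fetch("SomeOtherBot", "http://example.com/about.html"))       # True
```

In production a crawler would call parser.set_url() and parser.read() to fetch the live robots.txt from the site root instead of parsing a hard-coded string.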