Hide Your Sensitive Files from Prying Eyes, and Google!
06 Dec 2010
There I was, routinely doing some backlink analysis for a client, when I came across a very peculiar link. Upon further investigation I found that the link originated from a Microsoft Word document that had been indexed. I downloaded the document to find that it was a full spec for a brand new website for one of my client’s main competitors! The link came from a section of the document asking the developers for a copy of my client’s site navigation structure.
Imitation is the sincerest form of flattery and all that, but is this really the sort of information you want your competitors finding out about? After all, we are now able to position ourselves one step ahead of their new site launch! Now, I won’t name any names, but the document was hosted on the server of a well-known web development company in the UK, presumably within some sort of online client portal. They had, however, forgotten to restrict access to this particular directory. This allowed the document to be indexed, and allowed nosey parkers like myself to download it and read its sensitive contents.
So how should you secure your private files?
There are a number of ways to secure areas of your website. All of them should really be common knowledge to developers and SEOs alike, yet they all too often get overlooked in the rush to launch. Most of the following may well be ‘back to basics’ for many of you reading this, but they are all points that need considering both at the build stage of a website and throughout its ongoing development.

1) Robots Meta Tag
2) Robots.txt
3) .htaccess
- IP Restriction
- Password Protection
- Preventing Directory Listings

Robots Meta Tag

The Robots Meta Tag is the very lowest form of protection, stopping only the indexing of a web page. Implementing the Meta tag tells search engine spiders that they may see the page but should not index it, so it will not appear in search results. It won’t of course prevent users from arriving on the page, either via onsite navigation or by entering the URL directly.

Prevent search engines from indexing the page:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

Prevent search engines from both indexing the page and following its links to other pages:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Robots.txt

The robots.txt file is very similar in function to the Robots Meta Tag, but it can prevent search engine spiders from accessing whole directories of a website and their associated files, or indeed the whole website. A plain text file needs to be created, named robots.txt and placed in the root of the site. Within this file a series of directives can be listed to control which directories a search engine is allowed to see. Once again, however, this applies to search engine spiders only, so excluded directories remain fully accessible to humans.

Prevent search engines from accessing the whole server:

User-agent: *
Disallow: /

Prevent search engines from accessing the directory called “secret”:

User-agent: *
Disallow: /secret/

Please note that each protocol and port of your site must have its own robots.txt file.
If, for example, you wish to let a search engine index http:// pages but not secure https:// pages, then you would need:

For http:// pages (http://www.yoursite.co.uk/robots.txt):

User-agent: *
Allow: /

For https:// pages (https://www.yoursite.co.uk/robots.txt):

User-agent: *
Disallow: /

.htaccess

This file is extremely versatile and is the strongest line of defence in preventing prying eyes from seeing your private files and web pages. Its rules are enforced by your Apache web server itself rather than relying on the visitor’s cooperation, and so will protect your sensitive material against both search engines and humans. Some of the best uses of the .htaccess file are:

- IP Restriction

Using the .htaccess file you can restrict access to a particular IP address or range of IP addresses. This can be used so that parts of your website can only be accessed from, for example, your place of work, and allows very tightly controlled viewing permissions to be granted. As both the Google spider and outside human users will be arriving from other IP addresses, they will not be able to access the directories that you restrict. Include the following code in the .htaccess file to allow access only to IP addresses starting with 192.123 (i.e. 192.123.xxx.xxx):
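# Deny everyone by default, then allow only addresses starting with 192.123
# (no <Limit GET> wrapper, so the rule covers every request method, not just GET)
Order deny,allow
Deny from all
Allow from 192.123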
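The Order, Deny and Allow directives are the Apache 2.2 style of access control; on Apache 2.4 and later they are only available through the compatibility module mod_access_compat, and the same rule is normally written with Require. As a minimal sketch, assuming Apache 2.4 with mod_authz_host loaded and the same 192.123 prefix used above:

```apache
# Apache 2.4+: only clients whose IP address starts with 192.123 may access
# this directory; all other visitors (and search engine spiders) get 403 Forbidden
Require ip 192.123
```

Which form applies depends on the Apache version your host runs, so it is worth checking before copying either one.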
- Password Protection of Directories

In conjunction with a .htpasswd file, the .htaccess file can be set to grant access to specified directories of a site only after a valid username and password have been entered. Once again, this is a restriction on both search engines and humans. As there are several ways to create and encrypt a .htpasswd file, I won’t explain it here, however you can check out this guide for steps in creating it manually.

- Preventing Directory Listings

This is probably one of the most overlooked issues with regard to unwanted access, and is how I was able to find the new website spec for my client’s competitor. A list of a website’s directory structure can sometimes be accessed from a URL where no default file (e.g. index.html) is present to be loaded. This allows full access to all of the files and subdirectories of that website if it is not correctly managed. Include the following line in the .htaccess file to prevent access to directory listings:

Options -Indexes
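As a minimal sketch, assuming an Apache server whose configuration lets .htaccess override Options and AuthConfig, a private directory’s .htaccess could switch off listings and ask for a password at the same time. The realm name and the .htpasswd path below are hypothetical placeholders, not values from any real setup:

```apache
# Switch off automatic directory listings for this directory and everything below it
Options -Indexes

# Ask for a username and password before serving anything from this directory
AuthType Basic
AuthName "Private client area"
# Hypothetical path: keep the password file outside the publicly served web root
AuthUserFile /home/example/.htpasswd
Require valid-user
```

With something like this in place, a request for the directory should prompt for credentials, and even a logged-in visitor should receive a 403 Forbidden rather than a file listing when no index file exists.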
You now have no excuse for accidentally exposing sensitive files on your server to either search engines or humans again! Incidentally, I contacted the web development agency that leaked the website spec of my client’s competitor and they have since fixed the issue…