
The author

Malcolm Slade

Head of Technical SEO

As mentioned in a previous post (SEO Basics – Duplicate Content), search engines are constantly trying to provide a more comprehensive index with less duplication. It makes sense: if you remove five pages of identical content, you free up room for two new, original pages from another site. Search engine indexes are, after all, limited in size.

The Problem

Another major cause of duplicate content is the use of session IDs within a website. Session IDs are often used on e-commerce sites to track each user through his or her journey around the site. What basically happens is that when a user browses to the site, a unique ID is generated and attached to the root URL. The user is then forwarded to this new URL, and the unique ID is added to the internal links on that page and on subsequent pages (for example, example.com/product becomes example.com/product?sid=abc123). The issue is that each new visit creates a new set of URLs for the site, all carrying duplicate content; to a search engine, one session's URL is not the same page as another session's URL for the same content. If each time a search engine spider visits the site it finds a new set of URLs, you are asking for trouble. Sometimes the engines will index these new URLs and cause duplicate content issues; other times they will simply not index any pages at all.

The Solution

"So what can we do?" you may ask. Well, there are several solutions, some more drastic than others. Firstly, you could stop using session IDs and switch to cookies. This, however, has its own set of problems relating to security settings and the like. Alternatively, you could embed a form into your pages with hidden fields that carry the session information, although this can be a bit tricky to get right.

The best solution, however, seems to be as follows. The user agent of each visitor can be identified from the details sent to the web server. This can then be compared to a list of known agents used by search engine spiders and, if it matches one, the visitor can be 301 redirected to a clean, session-ID-free version of the site. "Googlebot" is the agent reported by Google's spider, "MSNBOT" is MSN's and "Slurp" is Yahoo's. If implemented correctly, search engine robots will always be given the same set of URLs and never spider your session-ID'd site.

Sample Pseudo Code

  • Capture user agent data
  • Assign user agent data to variable A
  • Compare variable A to list of known user agents

If true

  • Perform required action

If false

  • Continue as normal
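
To make the pseudo code concrete, here is a minimal Python sketch of the same idea. The session parameter name (sid), the example URL and the user-agent string are illustrative assumptions only; the post does not prescribe a particular platform, so the actual 301 response would be issued through whatever server or framework you use.

  from typing import Optional
  from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

  # User-agent substrings reported by the spiders named in the post.
  KNOWN_BOTS = ("Googlebot", "MSNBOT", "Slurp")

  def is_search_bot(user_agent: str) -> bool:
      """Compare the reported user agent against the list of known spiders."""
      ua = user_agent.lower()
      return any(bot.lower() in ua for bot in KNOWN_BOTS)

  def clean_redirect_target(url: str, session_param: str = "sid") -> Optional[str]:
      """Return a session-ID-free URL to 301 to, or None if the URL is already clean."""
      parts = urlsplit(url)
      query = [(k, v) for k, v in parse_qsl(parts.query) if k != session_param]
      cleaned = urlunsplit((parts.scheme, parts.netloc, parts.path,
                            urlencode(query), parts.fragment))
      return cleaned if cleaned != url else None

  # Example: a spider requesting a session-ID'd URL is sent to the clean version.
  ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  if is_search_bot(ua):
      target = clean_redirect_target("https://www.example.com/product?sid=abc123&id=42")
      if target:
          print("301 redirect to:", target)  # issue the actual 301 in your server or framework

Matching on substrings rather than exact strings keeps the check tolerant of version numbers and other details that vary within each spider's full user-agent string.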