Programmatic Web Browsing and Healthchecking with Mechanize

The author

Epiphany Search

Here's a quick little golden nugget for all you automated site scrapers out there: check out Mechanize! Mechanize is a fully programmable browser implemented (in this instance) purely in Python. It is priceless for automated site health checks, covering everything from link presence and robots.txt rules to automated form submissions and cookie handling.

Whereas we would normally parse a page as text, look for links, then write our own logic to follow each link and pull out information about the page, mechanize lets you do all of this at a much higher level. Here is just one neat trick:

    import mechanize

    # Make a browser.
    browser = mechanize.Browser()

    # Open a page.
    browser.open("http://www.epiphanysearch.co.uk/")

    # Follow the second link whose text matches 'analytics' (nr counts from 0).
    myResponse = browser.follow_link(text_regex=r"analytics", nr=1)

    # Print the title of the page.
    assert browser.viewing_html()
    print(browser.title())

...returns "Google Analytics IQ Consultants - Google Analytics Consulting Services", the title of the page linked to by the second 'analytics' link.

The browser also has a wealth of other features such as robots.txt handling, proxy handling and redirect detection. This makes short work of automated sitemap building, checking for the presence and sanity of acquired links, and ensuring our pages aren't blocked by robots.txt. With a little extra work it can also submit forms, handle cookies and deal with various types of errors and exceptions, which makes it invaluable for throwing onto a regular scheduled task for the daily weeding out of quirks and problems that manual surfing may not immediately uncover.

I'm aware I'm in a bit of a niche here, but if you write programs that interact with websites, give it a whirl! You'll not be disappointed.
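As a rough illustration of the scheduled link check described above, here is a minimal sketch rather than a definitive implementation. The function names `find_missing_links` and `run_daily_check` and the pattern list are hypothetical, made up for this example. It relies only on the browser's `open()`, `viewing_html()` and `links()` methods, and assumes each link object exposes a `text` attribute.

```python
import re


def find_missing_links(browser, page_url, expected_link_patterns):
    """Return the regex patterns that match no link text on page_url.

    `browser` is expected to behave like mechanize.Browser: it needs
    open(), viewing_html() and links(), nothing else.
    """
    browser.open(page_url)
    if not browser.viewing_html():
        raise ValueError("%s did not return HTML" % page_url)

    # Some links (e.g. image-only ones) may have no text, so fall back to "".
    link_texts = [link.text or "" for link in browser.links()]

    missing = []
    for pattern in expected_link_patterns:
        if not any(re.search(pattern, text) for text in link_texts):
            missing.append(pattern)
    return missing


def run_daily_check():
    """Hypothetical entry point for a scheduled task."""
    import mechanize  # third-party package: pip install mechanize

    browser = mechanize.Browser()
    browser.set_handle_robots(True)  # respect robots.txt while checking
    missing = find_missing_links(
        browser,
        "http://www.epiphanysearch.co.uk/",
        [r"analytics"],  # assumed pattern, borrowed from the example above
    )
    print("Missing links: %s" % (missing or "none"))
```

Calling `run_daily_check()` from a cron job or similar scheduler would report any expected link that has disappeared. Keeping the check itself separate from the browser setup means it can be exercised with a stand-in browser object, without touching the network.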