Epiphany Search

So, keyword density... a slightly dated metric but an interesting challenge nonetheless. Here is my response to Drew’s first challenge: we had to take a random URL and parse its text content to check for keyword density.

To make the tool a little more useful I added two parameters: a ‘minimum word length’ and a ‘minimum word occurrences’. These help separate the chaff from the real density results. As usual, these scripts are a bit of fun and a proof of concept rather than a robust, usable utility, so be ready to tweak it to meet your needs! I ran the solution using Ruby 1.9.2, but the library includes are minimal, so you should be able to get it going on most earlier versions. Enjoy!

[code lang="ruby"]
require 'open-uri'

# Usage: ruby keyword_density.rb URL MINIMUM_OCCURRENCES MINIMUM_WORD_LENGTH
URL_TO_PARSE = ARGV[0]
MINIMUM_OCCURRENCES = ARGV[1].to_i
MINIMUM_WORD_LENGTH = ARGV[2].to_i

# list of stopwords
STOP_WORDS = ["a","able","about","above","abroad","according"] # etc

# read the page in
# (Kernel#open handles URLs on Ruby 1.9.2; newer Rubies need URI.open instead)
puts "\n\nOpening #{URL_TO_PARSE}..."
page = open(URL_TO_PARSE).read

# First remove the script tags...
puts "Removing script content..."
page.gsub!(/<script.*?>[\s\S]*?<\/script>/i, "")

# then remove the markup...
puts "Removing mark-up..."
page.gsub!(/<\/?[^>]*>/, " ")

# trim the whitespace off the start and end of the lines...
puts "Tidying text..."
page.gsub!(/^[ \t]+|[ \t]+$/, " ")

# and the excess newlines...
page.gsub!(/\n{2,}/, "\n")

# then pull it all onto one line...
page.gsub!("\n", " ")

# now, strip out all punctuation: . , @ ! ? - ' | [ ]
puts "Removing punctuation..."
page.gsub!(/[.,@!?\-'|\[\]]/, '')

# remove the excess spaces...
puts "Further tidying..."
page.gsub!(/ {2,}/, " ")

# drop everything to the same case
page.downcase!

# now split the string using spaces...
page_words = page.split(" ")

# how many words do we have?
puts "Found #{page_words.length} words, removing all words less than #{MINIMUM_WORD_LENGTH} characters in length."

# remove entries which have fewer letters than our parameter
page_words.delete_if { |word| word.length < MINIMUM_WORD_LENGTH }

# how many did we end up with?
puts "Ended up with #{page_words.length} words."

# remove the stop words from the list
puts "Removing stop words from list..."
cleaned_page_words = page_words - STOP_WORDS
puts "Ended up with #{cleaned_page_words.length} words."

# count the occurrences of each word; Hash.new(0) starts every key at zero
words_and_occurrences = Hash.new(0)
cleaned_page_words.each { |word| words_and_occurrences[word] += 1 }

# sort the list by number of occurrences, most frequent first
puts "Sorting by frequency..."
sorted_words_and_occurrences = words_and_occurrences.sort_by { |_word, count| -count }

# output the end result...
puts "Filtering words that occur less than #{MINIMUM_OCCURRENCES} times.\n\n"
puts "Here are your most frequently occurring words...\n\n"
sorted_words_and_occurrences.each do |word, occurrences|
  # >= so that words occurring exactly the minimum number of times are kept
  puts "#{occurrences} occurrences of #{word}" if occurrences >= MINIMUM_OCCURRENCES
end
[/code]

And when we run the script on our homepage, looking for words that appear at least 3 times and have a minimum length of 4 letters...

[code lang="text"]
C:\path\to\script>ruby keyword_density.rb http://www.epiphanysearch.co.uk 3 4

Opening http://www.epiphanysearch.co.uk...
Removing script content...
Removing mark-up...
Tidying text...
Removing punctuation...
Further tidying...
Found 421 words, removing all words less than 4 characters in length.
Ended up with 289 words.
Removing stop words from list...
Ended up with 240 words.
Sorting by frequency...
Filtering words that occur less than 3 times.

Here are your most frequently occurring words...

14 occurrences of search
5 occurrences of google
5 occurrences of conversion
5 occurrences of media
5 occurrences of agency
5 occurrences of social
4 occurrences of marketing
4 occurrences of optimisation
3 occurrences of partner
3 occurrences of december
3 occurrences of london
3 occurrences of certified
3 occurrences of development
3 occurrences of leeds
3 occurrences of 2011
3 occurrences of paid
3 occurrences of organic
3 occurrences of dont
3 occurrences of analytics
3 occurrences of contact
3 occurrences of clients
[/code]
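
The script reports raw counts, whereas keyword density is usually quoted as a percentage of the words on the page. Here is a minimal sketch of that last step. `density_report` is a hypothetical helper of my own naming, not part of the script, and it assumes the counts are already sitting in a hash. Which total to divide by (the 421 words found, or the 240 left after filtering) is a judgment call, so the sketch leaves it to the caller.

```ruby
# Hypothetical helper: convert raw word counts into density percentages.
# total_words is whichever denominator you choose (every word on the page,
# or only the words left after the length and stop-word filters).
def density_report(counts, total_words)
  return {} if total_words.zero?

  counts.each_with_object({}) do |(word, count), report|
    report[word] = (count * 100.0 / total_words).round(2)
  end
end

# A few counts taken from the sample run on the homepage.
sample_counts = { "search" => 14, "google" => 5, "partner" => 3 }

density_report(sample_counts, 240).each do |word, percent|
  puts "#{word}: #{percent}%"
end
```

Dividing by the 240 filtered words puts ‘search’ at 5.83%; dividing by all 421 words would bring it down to 3.33%, so it is worth stating which total you used when comparing pages.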
