hpricot | JOEZACK.COM

A family members was having a problem with some mixed up image names on a static html site. I could have fixed it manually in a few shakes, but that’s no fun. Instead I used hpricot to scrape, open-uri to test for broken-ness, Find to search and some good old fashion regex to correct.

This was my first time messing around with hpricot and I found it to be powerful and easy to use, two thumbs up. I foresee some scraping and spidering posts in the near future.

On to the code:

My final script was a bit hairy so I broke out the bit I used to find the broken images.

If you run the script it’ll print the offending paths to screen:

ruby image_scanner.rb http://site.com/busted.html

Or you can call the get_broken_images method to get an array back:

require 'image_scanner'
scanner = Image_Scanner.new
broken_images = scanner.get_broken_images "http://site.com/busted.html"

In case you’re interested, I’ve also uploaded the full code that I used to search for and correct the images although it’s implementation specific, riddled with lazy and is poorly tested. Read the disclaimer!

Just run it and be amazed!

ruby image_scanner.rb http://site.com/busted.html /media_folder /busted.html /fixed.html

Download only the broken image scanner
Download the full script

JOEZACK.COM

Code Musings and Such

Tag Archives: hpricot

Scraping and Saving Flickr Images with Ruby

Finding and Fixing Broken Images with Ruby