A family member was having a problem with some mixed-up image names on a static HTML site. I could have fixed it manually in a few shakes, but that’s no fun. Instead I used hpricot to scrape the page, open-uri to test each image for broken-ness, Find to search the media folder, and some good old-fashioned regex to correct the paths.
This was my first time messing around with hpricot and I found it to be powerful and easy to use, two thumbs up. I foresee some scraping and spidering posts in the near future.
On to the code:
My final script was a bit hairy so I broke out the bit I used to find the broken images.
If you run the script it’ll print the offending paths to screen:
ruby image_scanner.rb http://site.com/busted.html
Or you can call the get_broken_images method to get an array back:
require 'image_scanner'
scanner = Image_Scanner.new
broken_images = scanner.get_broken_images "http://site.com/busted.html"
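If you just want a feel for what a scanner like this does, here's a rough sketch. It is not the contents of image_scanner.rb: to keep it dependency-free, a regex stands in for hpricot and a Net::HTTP HEAD request stands in for open-uri. The names image_urls, broken? and broken_images are made up for this illustration, and the sketch assumes every src value is a well-formed URL or path.

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper: pull every <img src="..."> value out of an HTML
# string and resolve it against the page URL. (Regex instead of hpricot.)
def image_urls(html, page_url)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten.map do |src|
    URI.join(page_url, src).to_s
  end
end

# Hypothetical helper: an image counts as broken when a HEAD request
# fails or comes back with anything other than a 2xx response.
def broken?(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  !response.is_a?(Net::HTTPSuccess)
rescue StandardError
  true
end

# Returns the absolute URLs of the images that the checker flags as broken.
# The checker is injectable so the logic can be tried without a network.
def broken_images(html, page_url, checker = method(:broken?))
  image_urls(html, page_url).select { |url| checker.call(url) }
end
```

Pass it the page's HTML and URL and it hands back an array of the offending image URLs, much like get_broken_images above.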
In case you’re interested, I’ve also uploaded the full code that I used to search for and correct the images, although it’s implementation-specific, riddled with lazy, and poorly tested. Read the disclaimer!
Just run it and be amazed!
ruby image_scanner.rb http://site.com/busted.html /media_folder /busted.html /fixed.html
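The search-and-correct step can be sketched along these lines. This is only an illustration, not the full script: it assumes the mix-up is a difference in letter case between the file name in the HTML and the file on disk, which may not match the real problem. The helper names index_media and fix_image_names are hypothetical.

```ruby
require 'find'

# Hypothetical helper: walk media_dir with Find and map each
# lowercased file name to its real path on disk.
def index_media(media_dir)
  index = {}
  Find.find(media_dir) do |path|
    next unless File.file?(path)
    index[File.basename(path).downcase] = path
  end
  index
end

# Hypothetical helper: for every <img src="...">, swap the file name for
# the real one from the index when they match ignoring case.
# Images with no match in the index are left untouched.
def fix_image_names(html, index)
  html.gsub(/(<img\b[^>]*?\bsrc\s*=\s*["'])([^"']+)(["'])/i) do
    prefix, src, quote = $1, $2, $3
    real = index[File.basename(src).downcase]
    fixed = real ? src.sub(/[^\/]+\z/) { File.basename(real) } : src
    "#{prefix}#{fixed}#{quote}"
  end
end
```

Reading the busted HTML into a string, running it through fix_image_names and writing the result to a new file gives roughly the busted-in, fixed-out flow of the command above.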
Download only the broken image scanner
Download the full script