I’ve been playing around a bit more with the black arts of spidering and scraping in Ruby, and I’m still amazed by how easy it is. For fun, I whipped up a little script that will spider a Flickr photostream and download all the images.
Flickr provides a wonderful API, and there’s even a great Ruby interface for it, so this script is entirely futile. But it was fun and educational.
ruby init.rb yourusername /location/to/save
A family member was having a problem with some mixed-up image names on a static HTML site. I could have fixed it manually in a few shakes, but that’s no fun. Instead I used hpricot to scrape, open-uri to test for broken-ness, Find to search and some good old-fashioned regex to correct.
This was my first time messing around with hpricot, and I found it to be powerful and easy to use. Two thumbs up. I foresee some scraping and spidering posts in the near future.
On to the code:
My final script was a bit hairy, so I broke out the bit I used to find the broken images.
If you run the script, it’ll print the offending paths to the screen:
ruby image_scanner.rb http://site.com/busted.html
Or you can call the get_broken_images method to get an array back:
scanner = Image_Scanner.new
broken_images = scanner.get_broken_images "http://site.com/busted.html"
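For readers who just want the gist, here is a minimal sketch of how a scanner like this could work. It is not the downloadable script: `BrokenImageFinder`, `image_urls` and `broken?` are hypothetical names, and it swaps hpricot and open-uri for a crude regex and Net::HTTP so that it only needs the standard library. It also assumes that anything other than a 2xx response means the image is broken, so redirected images get flagged too.

```ruby
require "net/http"
require "uri"

# Hypothetical stand-in for the post's Image_Scanner; not the author's code.
class BrokenImageFinder
  # Crude pattern for <img ... src="...">. A real HTML parser is more robust.
  IMG_SRC = /<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/i

  # Returns the unique image URLs found in html, resolved against page_url.
  def image_urls(html, page_url)
    html.scan(IMG_SRC).flatten.map { |src| URI.join(page_url, src).to_s }.uniq
  end

  # True when a HEAD request for url fails or does not come back as 2xx.
  # Assumption: redirects and network errors both count as broken.
  def broken?(url)
    uri = URI.parse(url)
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.head(uri.request_uri)
    end
    !response.is_a?(Net::HTTPSuccess)
  rescue StandardError
    true
  end

  # Fetches the page and returns an array of the image URLs that look broken.
  def get_broken_images(page_url)
    html = Net::HTTP.get(URI.parse(page_url))
    image_urls(html, page_url).select { |url| broken?(url) }
  end
end
```

Calling `BrokenImageFinder.new.get_broken_images("http://site.com/busted.html")` would hand back an array much like the one above. Keeping `image_urls` separate from the network check makes the extraction step easy to try on a plain string of HTML.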
In case you’re interested, I’ve also uploaded the full code that I used to search for and correct the images, although it’s implementation-specific, riddled with laziness and poorly tested. Read the disclaimer!
Just run it and be amazed!
ruby image_scanner.rb http://site.com/busted.html /media_folder /busted.html /fixed.html
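To give a rough idea of the search-and-correct step, here is a small sketch that uses Find and a regex. It rests on a big assumption: that "mixed up" means the page references a file name whose letter case differs from the file on disk. The real mix-up may have been something else entirely, and `index_media` and `fix_image_names` are made-up helper names, not functions from the full script. `broken_srcs` is expected to hold the `src` values exactly as they appear in the page.

```ruby
require "find"

# Assumption: the broken names differ from the real files only by letter case.

# Walks media_dir and maps each downcased file name to the name on disk.
def index_media(media_dir)
  index = {}
  Find.find(media_dir) do |path|
    next unless File.file?(path)
    name = File.basename(path)
    index[name.downcase] ||= name
  end
  index
end

# Returns a copy of html where each broken src that has a match in the index
# gets its file name swapped for the one on disk. Unmatched srcs are left alone.
def fix_image_names(html, broken_srcs, index)
  broken_srcs.reduce(html) do |fixed, src|
    actual = index[File.basename(src).downcase]
    next fixed unless actual
    corrected = src.sub(/[^\/]+\z/) { actual }
    # Only touch the value when it sits inside matching quotes.
    fixed.gsub(/(["'])#{Regexp.escape(src)}\1/) { "#{$1}#{corrected}#{$1}" }
  end
end
```

Reading `/busted.html`, passing it through `fix_image_names` with an index built from `/media_folder`, and writing the result to `/fixed.html` would mirror the arguments in the command above.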
Download only the broken image scanner
Download the full script