Tag Archives: hpricot

Scraping and Saving Flickr Images with Ruby

I’ve been playing around a bit more with the black arts of spidering and scraping in Ruby and I’m still amazed by how easy it is to do. For fun I whipped up a little script that will spider a Flickr photostream and download all the images.

Flickr provides a wonderful API and there’s even a great Ruby interface for it, so this script is entirely redundant. But it was fun and educational.

Usage

ruby init.rb yourusername /location/to/save

Download it!

Finding and Fixing Broken Images with Ruby

A family member was having a problem with some mixed-up image names on a static HTML site. I could have fixed it manually in a few shakes, but that’s no fun. Instead I used hpricot to scrape the page, open-uri to test each image for broken-ness, Find to search the media folder, and some good old-fashioned regex to correct the paths.

This was my first time messing around with hpricot and I found it to be powerful and easy to use, two thumbs up. I foresee some scraping and spidering posts in the near future.

On to the code:

My final script was a bit hairy so I broke out the bit I used to find the broken images.

If you run the script it’ll print the offending paths to screen:

ruby image_scanner.rb http://site.com/busted.html

Or you can call the get_broken_images method to get an array back:

require './image_scanner'  # Ruby 1.9.2+ no longer searches the current directory for a bare require
scanner = Image_Scanner.new
broken_images = scanner.get_broken_images "http://site.com/busted.html"
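
For a rough idea of what a scanner like this does, here is a minimal sketch. It is not the downloadable script: it swaps hpricot for a plain regex and open-uri for a Net::HTTP HEAD request so that it runs on the standard library alone, and the helper names (image_urls, broken?, broken_images) are made up for illustration.

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper: collect every <img src="..."> value from an HTML
# string and resolve it against the URL of the page it came from.
# (A regex stands in for hpricot here; a real parser is more robust.)
def image_urls(html, page_url)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).flatten.map do |src|
    URI.join(page_url, src).to_s
  end
end

# Hypothetical helper: treat an image as broken when a HEAD request does
# not return a 2xx response, or when the request fails outright.
def broken?(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  !response.is_a?(Net::HTTPSuccess)
rescue StandardError
  true
end

# Return the image URLs that the checker flags as broken. The checker is
# swappable (pass a block) so the logic can be tried without a network.
def broken_images(html, page_url, &checker)
  checker ||= method(:broken?)
  image_urls(html, page_url).select { |url| checker.call(url) }
end
```

Called without a block, broken_images(html, "http://site.com/busted.html") checks each image over HTTP. Passing a block replaces that check, which is handy for trying it out offline.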

In case you’re interested, I’ve also uploaded the full code that I used to search for and correct the images, although it’s implementation-specific, riddled with lazy, and poorly tested. Read the disclaimer!

Just run it and be amazed!

ruby image_scanner.rb http://site.com/busted.html /media_folder /busted.html /fixed.html
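
The search-and-correct step might look something like the sketch below. This is a guess at the general shape, not the uploaded script: it assumes the "mixed up" names differ from the real files only by folder or letter case, and the helper names (find_replacement, fix_image_paths) are invented for illustration.

```ruby
require 'find'

# Hypothetical helper: walk media_dir looking for a file whose name matches
# the broken path's basename, ignoring case. Returns the first match or nil.
def find_replacement(broken_src, media_dir)
  wanted = File.basename(broken_src).downcase
  Find.find(media_dir) do |path|
    return path if File.file?(path) && File.basename(path).downcase == wanted
  end
  nil
end

# Hypothetical helper: given a hash of { broken_src => good_src }, rewrite
# each matching src attribute in the HTML and return the corrected string.
def fix_image_paths(html, replacements)
  replacements.reduce(html) do |fixed, (broken, good)|
    pattern = /(src\s*=\s*["'])#{Regexp.escape(broken)}(["'])/i
    fixed.gsub(pattern) { "#{$1}#{good}#{$2}" }
  end
end
```

Regexp.escape keeps dots and slashes in the broken path from being read as regex syntax, and the captured quotes are written back unchanged so the rest of the markup is left alone.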

Download only the broken image scanner
Download the full script