The Accidental Rubyist

invalid byte sequence in UTF-8

Extracting data with Hpricot – Night 1

I saw the tutorial on scraping Gmail with Hpricot and Mechanize and thought I’d try it. Strangely, it works from irb but gives errors when run as a Ruby program. Mainly, page does not have a method named uri, and page.class gives String.
Also, the URL of the actual email has two “?” in it. I had to remove one.

The example did not state that the “email” variable it extracts only gives the links present in the emails. It took me hours of poring over the resultant (huge) HTML to get this. (Dumb me!) So I should have used open-uri to read the email. By then I remembered that I already have fetchmail reading my new mail using IMAP, and I really don’t need this program.

So I moved my attention to a larger problem at hand. I read more tutorials such as this, which referred me to the priceless Firebug addon, which really helps in navigating complex HTML. My problem is: each morning, fetch the latest tennis scores in a simple format and email them to me. I always forget to open that page and miss out on news. The page is awfully complex. Each line has umpteen tables and td’s, and most have no class name.

After much struggle, I went to plan B. I found another page on Yahoo that had all the results in a simple, concise form (a single line for each match). I used links to dump the page as text, sed to filter out the results, and mail to send them to me.
Aah, I had a script running in minutes, and I could even raise an alarm if my fav players had played, using good ole grep.
The script (source) boils down to these 3 lines:
links -dump "http://sports.yahoo.com/ten/matches" > match.txt
sed -n '/^Matches:/,/Sports Home/p' match.txt > trimmed.txt
cat trimmed.txt | mail -s " tennis results for: $yest" [username]

Substitute [username] with your local user name or your email id. The $yest variable used in the subject line is set elsewhere in the full script.
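
Here is a minimal sketch of the sed filtering and the grep alarm mentioned above, written so it can be tried without hitting the network. The function names, the sample page dump, and the player names are all hypothetical stand-ins, not taken from the original script; the real dump from links will look different.

```shell
#!/bin/sh
# Sketch only: function names, sample text, and player names are
# hypothetical; they do not come from the original script.

# Keep only the block from the "Matches:" heading to the "Sports Home" footer.
trim_results() {
    sed -n '/^Matches:/,/Sports Home/p'
}

# Print the result lines that mention a favourite player.
# $1 is an extended regex such as 'Player A|Player C'.
# Exits non-zero when no favourite appears in the input.
favourites_played() {
    grep -E -i "$1"
}

# Stand-in for the output of:
#   links -dump "http://sports.yahoo.com/ten/matches"
sample_dump() {
    cat <<'EOF'
Yahoo! Sports
Matches:
Player A d. Player B 6-4 6-3
Player C d. Player D 7-6 6-2
Sports Home
Copyright notice
EOF
}

trimmed=$(sample_dump | trim_results)
printf '%s\n' "$trimmed"

# The "alarm": only speak up when a favourite shows up in the results.
if printf '%s\n' "$trimmed" | favourites_played 'Player C' >/dev/null; then
    echo "ALERT: a favourite player was in action"
fi
```

In a real run, the sample_dump call would be replaced by the links -dump command, and the echo could become a second mail with a louder subject line.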

Then I kept struggling with the complex HTML and fell asleep. To be continued.

there is no place like ~
— seen on http://tty1.net

Written by totalrecall

August 30, 2008 at 6:36 pm

Posted in ruby
