Extracting data with Hpricot – Night 1
I saw the tutorial on scraping Gmail with Hpricot and Mechanize and thought I’d try it. Strangely, it works from irb but gives errors when run as a Ruby program. Mainly, page does not have a method named uri, and page.class gives String.
Also, the URL of the actual email has two “?” in it. I had to remove one.
The example did not state that the extracted “email” variable only gives the links present in the emails. It took me hours of poring over the resultant (huge) HTML to get this. (Dumb me!) So I should have used open-uri to read the email. By then I remembered that I already have fetchmail reading my new mail using IMAP, and I really don’t need this program.
So I moved my attention to a larger problem at hand. I read more tutorials such as this, which referred me to the priceless Firebug add-on, which really helps navigate complex HTML. My problem is this: each morning, fetch the latest tennis scores in a simple format and email them to me. I always forget to open that page and miss out on news. The page is awfully complex. Each line has umpteen tables and td’s, and most have no class name.
After much struggle, I went to plan B. I found another page on Yahoo that had all the results in a simple, concise form (a single line for each match). I used links to dump the output and sed to filter it, then mailed myself the result.
Aah, I had a script running in minutes, and I could even raise an alarm if my fav players had played, using good ole grep.
The script (source) boils down to these 3 lines:
links -dump "http://sports.yahoo.com/ten/matches" > match.txt
sed -n '/^Matches:/,/Sports Home/p' match.txt > trimmed.txt
mail -s "tennis results for: $yest" [username] < trimmed.txt
Substitute [username] with your local user name or your email id.
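Here is a rough sketch of how the filtering and the grep alarm might fit together. It is not the actual script. The sample text, the player names and the helper names trim_results and has_favourite are all made up for illustration, and the way $yest is computed is only my guess (GNU date, with a BSD fallback). The sample stands in for the output of links -dump so the sketch runs without a network.

```shell
#!/bin/sh
# Sketch only: helper names, sample text and player names are hypothetical.

# Yesterday's date for the mail subject (GNU date first, BSD date as fallback).
yest=$(date -d yesterday +%Y-%m-%d 2>/dev/null || date -v-1d +%Y-%m-%d 2>/dev/null)

# Keep only the lines from "Matches:" through "Sports Home", inclusive.
trim_results() {
    sed -n '/^Matches:/,/Sports Home/p'
}

# Exit 0 if any favourite player appears on stdin.
# $1 is a case-insensitive extended regex, e.g. 'player|someone'.
has_favourite() {
    grep -E -i -q "$1"
}

# Made-up stand-in for the dumped page.
sample='Some page header
Matches:
A. Player def. B. Other 6-3 6-4
Sports Home
footer links'

trimmed=$(printf '%s\n' "$sample" | trim_results)
printf 'Subject would be: tennis results for: %s\n' "$yest"
printf '%s\n' "$trimmed"

# Raise the "alarm" when a favourite shows up in the trimmed results.
if printf '%s\n' "$trimmed" | has_favourite 'player|someone'; then
    echo "ALERT: a favourite player was in action"
fi
```

In the real thing, the sample would be replaced by the links -dump output and the final printf by the mail command.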
Then I kept struggling with the complex HTML and fell asleep. To be continued.