The Accidental Rubyist

invalid byte sequence in UTF-8

Extracting data with Hpricot – Night 2

with 4 comments

To a newbie following examples who has not pored over the docs and is not a Ruby expert (me), Hpricot does spring some surprises.
I spent a lot of time figuring this one out.

check = trrow.search("//td[@width='30']//img[@alt='Winner']")

I need to see whether the HTML row contains this image or not.
On some rows check is blank; on others it holds the entire matched HTML, as expected.

However, if I test it with if check != "" then, the condition always evaluates to true.

I looked everywhere else before I found this out. I had no way to differentiate between a check that was blank and a check that contained the td.
For a blank check, print " #{check}" always printed nothing.

Finally I had to do this, which I don’t like: if "#{check}" != "" then. It reminds me of Unix shell scripting.
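
As the comments below point out, search returns an array-like collection, which is why comparing it to "" never does what I wanted. Here is a minimal sketch of the difference. It uses a plain Ruby Array as a stand-in for the search result (an assumption made so the snippet runs without Hpricot), and winner? is a made-up helper name:

```ruby
# A plain Array stands in for the collection that search returns.
# This is an assumption for illustration; the real object is an Hpricot collection.
no_match  = []
one_match = ['<img alt="Winner">']

# A collection is never equal to a String, so this test is true even when
# nothing matched.
puts no_match != ""    # prints true
puts one_match != ""   # prints true

# empty? asks the real question: did the search match anything?
# (winner? is a hypothetical helper, not part of any library.)
def winner?(matches)
  !matches.empty?
end

puts winner?(no_match)    # prints false
puts winner?(one_match)   # prints true
```

With that, the check reads as a plain yes/no test and there is no need to interpolate the result into a string first.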

I also had problems cleanly separating the text inside nested HTML such as this (see source, search Rafael):

<td width="268" align="left" valign="middle">&nbsp;&nbsp;
<a href="/en_US/bios/overview/atpn409.html" class="alt2"><b>Rafael Nadal</b></a>
&nbsp;ESP&nbsp;(1)</td>


Calling inner_text on the entire td gives me both Rafael Nadal and ESP, with “?” characters in between.
Calling inner_text on the a element gives me the name, but then there is no way to extract just ESP.

A lot of “??” sequences turn up in the text, presumably where the source has &nbsp; entities. So in some cases I just had to take the inner_text and split on the “?” characters.
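
The splitting can be sketched roughly as below. This assumes the “?” characters really are non-breaking spaces (U+00A0) decoded from &nbsp;, and that the string is UTF-8 on a current Ruby. The sample string is hand-written to mimic the cell above, and split_cell is a hypothetical helper, not what tennsc.rb actually does:

```ruby
# Hand-written stand-in for the inner_text of the <td> above, with each
# &nbsp; represented as U+00A0 (an assumption about what the "?" really are).
cell_text = "\u00A0\u00A0\nRafael Nadal\n\u00A0ESP\u00A0(1)"

# Split on runs of non-breaking spaces and newlines, then drop empty pieces.
# split_cell is a hypothetical helper name.
def split_cell(text)
  text.split(/[\u00A0\n]+/).map(&:strip).reject(&:empty?)
end

name, country, seed = split_cell(cell_text)
puts name      # prints Rafael Nadal
puts country   # prints ESP
puts seed      # prints (1)
```

Ordinary spaces are left alone on purpose, so a two-word name stays in one piece.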

Finally, I did get my program running. It is extremely dependent on the HTML; the slightest change to the page will break it. However, I was able to transform a format that is difficult to process visually into an easy one.
My output comes out like this:

Rafael Nadal           ESP  (1) def. Ryler DeHeart          USA      6-1 6-2 6-4
James Blake            USA  (9) def. Steve Darcis           BEL      4-6 6-3 1-0 (Retired)
Mardy Fish             USA      def. Paul-Henri Mathieu     FRA (24) 6-2 3-6 6-3 6-4
Gael Monfils           FRA (32) def. Evgeny Korolev         RUS      6-2 6-3 3-6 6-4
Stanislas Wawrinka     SUI (10) def. Wayne Odesnik          USA      6-4 7-6 (8-6) 6-2
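
Aligned columns like these can be produced with Ruby's format. The sketch below is only an illustration: the field widths are guessed from the sample lines, and format_match and its parameters are made-up names, not taken from tennsc.rb:

```ruby
# Hypothetical helper: lays out one result line with the winner on the left.
# Widths (22 for names, 3 for country, 4 for seed) are guesses from the sample.
def format_match(winner, w_country, w_seed, loser, l_country, l_seed, score)
  format("%-22s %3s %4s def. %-22s %3s %4s %s",
         winner, w_country, w_seed, loser, l_country, l_seed, score)
end

# An unseeded player gets an empty seed string, which pads to blanks.
line = format_match("Rafael Nadal", "ESP", "(1)",
                    "Ryler DeHeart", "USA", "", "6-1 6-2 6-4")
puts line
```

Left-justifying the names and right-justifying the seeds keeps the "def." column in the same place on every line.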

The original page is here; see how different it is. I have always put the winner on the left side. The program, tennsc.rb, lies here. Sample usage:

./tennsc.rb http://www.usopen.org/en_US/scores/cmatch/10ms.html


Tennis scores in an easy to read format:
http://sports.yahoo.com/ten/matches

Written by totalrecall

August 30, 2008 at 6:58 pm

Posted in ruby

4 Responses

  1. search returns an array, so you should be writing:
    if !(trrow/'img[@alt=Winner]').empty?
    # equivalent to if !trrow.search('img[@alt=Winner]').empty?

    You mention that if the HTML format changed slightly your script would break, but that sure seems to be because you’re way too specific (see my change to the selector in the above example, too).

    And as far as picking out ESP, a little regexp would solve that right quick. /[A-Z]{3}/, perhaps.

    Joshua Paine

    August 31, 2008 at 5:16 am

  2. Thanks a lot Joshua for your immediate reply.

    I did look at the empty() method earlier, but the desc had said:
    “Empty the elements in this list, by removing their insides. “.
    Usually the ruby doc does specify methods that query ( empty? ).

    As far as picking out country as /[A-Z]{3}/, I was not really sure what format it could come as, whether other chars could come in.

    Thanks for pointing out “way too specific” — will put some thought to that.

    I really must first read http://code.whytheluckystiff.net/doc/hpricot/ and the Hpricot showcases before doing further work. Will save a lot of time. Also, my ruby needs desperate brushing up 🙂

    totalrecall

    August 31, 2008 at 9:35 am

  3. Hpricot::Elements extends Ruby's Array. empty? is an Array method. All the Array methods still work on Hpricot::Elements lists, but since they belong to the superclass they’re not in the ruby doc for Elements.

    I stumbled across this post via Reddit. My only exposure to Hpricot has been using it to work with pdftohtml’s XML output and generate a much nicer HTML output. I found the Hpricot showcase was all I needed to do that, but I’m also right in my first project using Ruby and learning that, so a lot of Ruby stuff (like array methods) is fresh in my mind.

    Joshua Paine

    August 31, 2008 at 9:52 am

  4. It’s important to note that search() can take a &block which will be passed each matched element. Also don’t forget at(), which only returns the first element of an XPath query.

    postmodern

    September 1, 2008 at 11:31 pm

