The Accidental Rubyist

invalid byte sequence in UTF-8

Inconsistencies in Unix programs

I use the command line all the time, along with Ruby and Vim, and I use regular expressions constantly.
With all the brains that have gone into the various Unixes, one would expect regexes to be handled consistently across the programs on a single system.
Vim does accept “\d” in place of [0-9], but it requires the “+” to be escaped and not the “*”.
So I can say “\d*” but I must say “\d\+”.
Most other programs (I am using OS X; perhaps the GNU versions are better) do not recognize “\d” and other such shorthand forms.
The other day I found that expr does not understand “+” at all, even when escaped!
None of the standard Unix programs such as grep, sed and expr understand minimal (non-greedy) matching, which from my perspective should have been the default.
The escaping of round brackets also differs between Vim and the Unix programs on one hand, and Perl/Ruby on the other: the former write a group as “\(” and “\)” by default, the latter as plain “(” and “)”.
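
To make the greedy versus minimal distinction concrete, here is a small Ruby sketch. The sample string is made up for illustration; it only shows how “.*” and “.*?” behave differently on the same input, and that “\d” and an unescaped “+” work in Ruby.

```ruby
# Hypothetical sample line with two title elements.
line = "<title>First</title> and <title>Second</title>"

# Greedy: ".*" takes as much as it can, so the capture runs up to the LAST </title>.
greedy = line[/<title>(.*)<\/title>/, 1]

# Minimal: ".*?" takes as little as it can, so the capture stops at the FIRST </title>.
minimal = line[/<title>(.*?)<\/title>/, 1]

# "\d" and a bare "+" need no escaping in Ruby.
digits = "build 2007 ok"[/\d+/]

puts greedy
puts minimal
puts digits
```

The greedy capture swallows everything between the first opening tag and the last closing tag, which is rarely what one wants when pulling fields out of markup.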

For those needing a quick way of doing minimal regexp matching, here’s something in ruby:
ruby -ne 'if /<title.*?>(.*?)<\/title>/ then puts $1;end'
The “.*?” right after title is there because the string being matched contains single quotes at that point, and I cannot put a single quote inside the single-quoted command sent to ruby. If I use double quotes around the command, the shell interprets the “$1”.
So then I tried putting this in a script to which I could pass a regexp and filenames. Since the regexp passed in has to be substituted by the shell (if /$regexp/), the command has to be in double quotes, but then the “$1” gets substituted by the shell too! A little delving into the Pickaxe got me an answer: Regexp.last_match(1), which returns the same thing as $1 without a dollar sign for the shell to expand.

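#!/bin/bash
# First argument is the regexp; everything after it is a file name.
regexp="$1"
shift
# Double quotes let the shell substitute $regexp; Regexp.last_match(1) stands in for $1.
ruby -ne "if /$regexp/ then puts Regexp.last_match(1);end" "$@"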

Save the above as rugrep.sh, make it executable, and call it as follows:
./rugrep.sh '<title.*?>(.*?)<\/title>' *.html
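
Another way around the quoting problem, sketched here as an assumption rather than something from the original script, is to skip the shell substitution altogether and let Ruby read the pattern from ARGV. The file name rugrep.rb and the helper first_captures are hypothetical names chosen for this example.

```ruby
#!/usr/bin/env ruby
# rugrep.rb (hypothetical name): print the first capture group of each matching line.
# The pattern arrives through ARGV, so the shell never gets a chance to expand "$1".

# Returns the first capture of every line that matches regexp.
# Lines without a match (or without a first group) are skipped.
def first_captures(regexp, lines)
  lines.filter_map do |line|
    m = regexp.match(line)
    m && m[1]
  end
end

# Only run as a command when a pattern was actually given.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  regexp = Regexp.new(ARGV.shift) # remaining arguments are treated as file names
  first_captures(regexp, ARGF.each_line).each { |capture| puts capture }
end
```

It would be called much like the shell version, for example ruby rugrep.rb '<title.*?>(.*?)<\/title>' *.html. Note that filter_map needs Ruby 2.7 or later.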
~~~
“Who is General Failure and why is he reading my hard disk ?”

Written by totalrecall

March 19, 2007 at 12:54 pm

Posted in unix
