Sunday 9 December 2018

Regular Expressions (or how I learnt to stop worrying and love SED)

Let me start from the outset and explain that this is old skool stuff. If you don't like old skool and simply MUST have a single-purpose app for everything, then you may as well stop reading now. However, if you regularly find you need to alter the content of a stream of data in some way and you've never used regular expressions, then this is the ticket for you.

Regular Expressions

The human mind is particularly good at recognising patterns - even patterns that are kinda rubbery - however we tend not to be very fast at it. Computers, on the other hand, are fast at recognising patterns, but they fall down at the rubbery bit. The key is to tie both the fast and the rubbery bits together with a "grammar" that is clear, concise and effective. Regular expressions are an effective way to do this. You can get very detailed information on regular expressions here. I will only be providing an intro on this blog.

A regular expression is usually written between forward slashes /like this/. Pretty much anything can be a regular expression, however the real strength comes from using special characters and operators. For example:

/[Nn]ame/

will match "name" and "Name", as any one of the characters listed within the [] brackets will match at that position.

/gr[ae]y/

will match "grey" or "gray". Similarly:

/Section [A-C]/

will match "Section A", "Section B" and "Section C". An email address can be matched with:

/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/
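
These patterns can be tried out by feeding a few sample strings to grep -E (extended regular expressions). This is only a sketch: the sample text is invented for illustration, the -o switch is a common extension rather than strict POSIX, and the email pattern is a simplified one that assumes a top-level domain of 2 to 4 letters, so it will miss some valid addresses.

```shell
# Sample input, invented for this illustration.
sample='My name is Bob
Name: Alice
The sky is grey
A gray cat
Section D
Section B
mail me at bob@example.com'

# Lines containing "name" or "Name".
names=$(printf '%s\n' "$sample" | grep -E '[Nn]ame')

# Lines containing "grey" or "gray".
greys=$(printf '%s\n' "$sample" | grep -E 'gr[ae]y')

# Only "Section A", "Section B" or "Section C" match; "Section D" does not.
sections=$(printf '%s\n' "$sample" | grep -E 'Section [A-C]')

# -o prints just the matched text rather than the whole line.
email=$(printf '%s\n' "$sample" | grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}')

printf '%s\n' "$names" "$greys" "$sections" "$email"
```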

Another example is matching an IPv4 address. If I see 192.168.100.100 - I know immediately it is an IP address, but how to find one using regular expressions?

/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/

Normally, a '.' means match any character, however putting a backslash before a special character makes it match literally. So, if we want to search for a dot, then we need to use '\.'. The curly braces give the range of how many times the preceding item may be repeated - in this case {1,3} means it can be repeated from 1 to 3 times. Note that this pattern only checks the shape of the address, so it would also accept 999.999.999.999. If we were searching for an Australian-style phone number, our grammar would be:

/\(0[1-47-9]\) [0-9]{4}-[0-9]{4}/
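
A quick way to see the quantifiers at work is to run both patterns over a few lines with grep -E. This is a sketch with invented sample lines. The phone pattern is one reading of the intent: an area code of 01 to 04 or 07 to 09 in brackets, followed by two groups of exactly four digits.

```shell
# Sample lines, invented for this illustration.
lines='gateway 192.168.100.100 is up
version 1.2.3 released
call (02) 9555-1234 today
call (05) 9555-1234 today'

# Four groups of 1 to 3 digits separated by literal dots.
ipv4='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

# Bracketed area code, then two groups of exactly four digits.
phone='\(0[1-47-9]\) [0-9]{4}-[0-9]{4}'

# "1.2.3" has only three groups, so only the address on the first line matches.
ip_hits=$(printf '%s\n' "$lines" | grep -Eo "$ipv4")

# 5 is not in the character class, so only the (02) number matches.
phone_hits=$(printf '%s\n' "$lines" | grep -Eo "$phone")

printf '%s\n' "$ip_hits" "$phone_hits"
```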

The special characters used in regular expressions include:
  • \ the backslash escape character.
  • ^ the caret is the anchor for the start of the string.
  • $ the dollar sign is the anchor for the end of the string.
  • { } the opening and closing curly brackets are used as range quantifiers.
  • [ ] the opening and closing square brackets define a character class to match a single character.
  • ( ) the opening and closing parentheses are used for grouping characters (or other regexes).
  • . the dot matches any character except the newline symbol.
  • * the asterisk is the match-zero-or-more quantifier.
  • + the plus sign is the match-one-or-more quantifier.
  • ? the question mark is the match-zero-or-one quantifier. 
  • | the vertical pipe separates a series of alternatives.
  • \< \> the smaller and greater signs, each escaped with a backslash, are anchors that specify a left or right word boundary.
  • - the minus sign indicates a range in a character class.
  • & the ampersand is the "substitute complete match" symbol.
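
A few of these special characters can be seen in action with grep -E and sed. This is a small sketch: the word list is invented for illustration and it only covers the anchors, the optional quantifier, alternation with grouping, and the ampersand.

```shell
# Sample words, invented for this illustration.
words='cat
concat
catalog
color
colour
dog'

# ^ and $ anchor the match to the whole line, so only "cat" itself matches.
exact=$(printf '%s\n' "$words" | grep -E '^cat$')

# ? makes the preceding "u" optional, so both spellings match.
spellings=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# | separates alternatives and ( ) groups them.
pets=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

# In a sed replacement, & stands for the complete match.
# -n with the p flag prints only the lines where a substitution happened.
wrapped=$(printf '%s\n' "$words" | sed -n 's/^cat.*/[&]/p')

printf '%s\n' "$exact" "$spellings" "$pets" "$wrapped"
```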

GREP

Grep is a utility I use almost as much as ls. Its name comes from the ed command g/re/p: globally search for a regular expression and print. Grep will print only the lines of a file (or a data stream) that satisfy the supplied regular expression. The switch '-v' will print the lines that don't match. For example:

grep -i dog file.txt

will print the lines of file.txt containing 'dog' regardless of case. It is even more practical with a stream or a configuration file, e.g.:

dmesg | grep eth0
grep -i documentroot /etc/httpd/conf/httpd.conf
mount | grep 'dev/sd'
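
The switches mentioned above can be compared side by side on a small stream. This is a sketch using an invented log in place of real dmesg output.

```shell
# Sample log, invented for this illustration.
log='eth0: link up
ETH0: link down
wlan0: associated
lo: loopback ready'

# -i ignores case, so both eth0 lines match.
eth=$(printf '%s\n' "$log" | grep -i 'eth0')

# -v inverts the match: everything that does not mention eth0.
not_eth=$(printf '%s\n' "$log" | grep -iv 'eth0')

# -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "$log" | grep -ic 'eth0')

printf '%s\n' "$eth" "$not_eth" "$count"
```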

SED

I was first introduced to sed by a sysadmin at the University of Wollongong when I was an undergrad student there. It was like the missing arm I never knew about. Since then I've used sed almost daily. Short for Stream EDitor, sed takes a stream of data and performs the operations you specify on it, usually driven by regular expressions. The operator I use most is the 's' or substitution operator. This operator looks for a regular expression in a data stream and replaces each match with a replacement string (which can be empty). The following sed example is one I use to automate the creation of a WordPress configuration file:

cat wp-config-sample.php | sed 's/database_name_here/wordpress/g' | sed 's/username_here/wordpressuser/g' | sed 's/password_here/password/g' > wp-config.php
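
The same substitutions can also be handed to a single sed process, one -e per expression. The sketch below assumes a cut-down, invented stand-in for the sample file so that it can be run anywhere. It also shows that the s operator accepts a delimiter other than the forward slash, which is handy when the pattern itself contains slashes.

```shell
# A cut-down stand-in for wp-config-sample.php, invented for this illustration.
template="define( 'DB_NAME', 'database_name_here' );
define( 'DB_USER', 'username_here' );
define( 'DB_PASSWORD', 'password_here' );"

# One sed process can apply several substitutions, each given with -e.
config=$(printf '%s\n' "$template" | sed \
    -e 's/database_name_here/wordpress/g' \
    -e 's/username_here/wordpressuser/g' \
    -e 's/password_here/password/g')

printf '%s\n' "$config"

# Using | as the delimiter avoids having to escape the slashes in a path.
moved=$(printf '%s\n' '/var/www/html/index.php' | sed 's|/var/www/html|/srv/www|')

printf '%s\n' "$moved"
```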
 
Of particular use is the sed one-liners page, which gives lots of very useful, but simple sed scripts.

AWK

The final tool in the old skool trinity is awk. Almost a programming language in itself, awk is a powerful text processing engine that provides the glue between sed and grep. Its utility lies in its ability to work both from the command line with streams of data and to execute as a script from a file. An example of a simple use of awk in conjunction with grep and sed is:

top -n 1 -b | grep "load average:" | sed -e 's/,//g' | awk '{printf "%s\t%s\t%s\n", $12,  $13, $14}'

In fact, using the awk print and printf commands to isolate fields in a data stream and format them is one of its most useful features.
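
Field isolation can be tried without running top at all. The sketch below uses an invented header line in the style of top's output, and counts fields from the end of the line with NF, which is one way to avoid depending on how the uptime part happens to be formatted. The second command assumes passwd-style input, also invented, to show print with a custom field separator.

```shell
# A line in the style of top's header, invented for this illustration.
header='top - 10:23:45 up 5 days,  3:12,  2 users,  load average: 0.10, 0.20, 0.30'

# Strip the commas, then pick the last three fields by counting back from NF.
loads=$(printf '%s\n' "$header" | sed 's/,//g' |
    awk '{ printf "%s\t%s\t%s\n", $(NF-2), $(NF-1), $NF }')

printf '%s\n' "$loads"

# -F sets the field separator; print $1 emits the first field of each line.
users=$(printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' 'alice:x:1000:1000::/home/alice:/bin/sh' |
    awk -F: '{ print $1 }')

printf '%s\n' "$users"
```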

PERL

Tying all of this together is the programming language PERL, but by embracing PERL you are moving out of old skool. PERL is a complete programming language in its own right. Anything you can do with awk, sed and grep can be done in PERL. In addition, it is a rich, full featured language.

Actually it's two languages. 

PERL 6 is not simply an upgrade to PERL 5. PERL 6 is more a sister language to PERL 5. It was developed with the intention of beefing up PERL with some of the features of more modern scripting languages like Ruby and Python. It was recognised early that many of the grammar structures of PERL 5 were incompatible with these desired features - hence a rewrite of the language was needed.

In other words - PERL 6 was written because various people in the Perl community felt that they should rewrite the language, but without any definite goal of where they wanted to go or why. The vast majority of PERL users and programmers have stuck with PERL 5.

Another problem with PERL is its use of libraries, although this could also be counted as a strength. Keeping the libraries up to date requires juggling, so CPAN (Comprehensive Perl Archive Network) was developed. The idea of CPAN is to provide a network and a set of utilities that give access to compatible repositories. The cpan utility, however, is quite complex, and another utility (cpanm, short for cpanminus) was developed to simplify it. Yet another utility, cpan-outdated, was developed to list the installed modules that have a newer version available on CPAN. After a great deal of experimentation, I developed the following method of ensuring a satisfactory base PERL installation on CentOS 7:

yum -y install perl perl-Net-SSLeay perl-IO-Zlib openssl perl-IO-Tty cpan
cpan App::cpanminus
cpanm Net::FTPSSL
cpanm App::cpanoutdated
cpan-outdated -p | cpanm
cpan-outdated -p | cpanm

I like PERL, but it's an unwieldy tool. Sometimes, a shovel has more utility than a backhoe - but that's a metaphor for another blog entry.

For some time I forced myself to use PERL whenever I needed to do something mildly complex involving regular expressions. It took lots of time and involved debugging scripts. After a while I realised that the time spent disciplining myself with PERL was not worth it. What took me 30+ minutes in PERL could be done in 10 minutes in sed. For very complex scripts, Ruby or Python were better choices, but they all lacked the true grit of the old skool trifecta.

So, I learnt to stop worrying that nobody could read or understand these cryptic regular expressions and enjoy the simplicity, beauty and power of awk, grep and sed.

2 comments:

  1. well car manufacture make car safe as possible, and if a defect is there there is a recall, when did you hear of a vax recall. Why don't they just keep trying until vax are safe before releasing them...even better work out a way to find out if it safe for that individual.

    Replies
    1. You've commented in the wrong area here. But since you asked, there have been three vaccine recalls in the past ten years. Vaccines go through a much more intense level of testing than motor vehicles, so one would expect a consequently higher safety profile. In addition, much of the work on vaccines is built upon pre-existing work that already has a good track record of success.

      Your second question was "Why don't they just keep trying until vax are safe before releasing them...even better work out a way to find out if it safe for that individual". If you've read the article, it's called Risk Management. At some point you need to decide where it is unethical to delay releasing a medication because of the lives it will save.

      Think of it like autonomous vehicles. If they were legalised and mandated right now, the road toll would drop - but by how much? The reality is that autonomous vehicles will need to prove themselves substantially safer (orders of magnitude safer) than human controlled vehicles. But no matter how well they are made, they can never be 100% safe. Regulators know, however, that there are people who will regard even one casualty as proof that autonomous vehicles are not safe - regardless of the situation. This is called the Nirvana Fallacy - the assumption that anything not 100% safe is 0% safe, when in reality it is risk management.

      https://www.cdc.gov/vaccinesafety/concerns/recalls.html
