This is an experiment in making regular expressions (aka regexp, aka regex) easier to remember, by anthropomorphising them.
The basic syntax of regexps is being incorporated into many other applications-- word processors, email filtering, Perl and JavaScript programming, etc-- so it's becoming an important lingua franca for the Internet.
The intent of regular expressions is to extend the 'wildcard' idea when you search for something in a longer piece of text. For example, in most search engines if you search for "text*" (using the asterisk-wildcard) you'll get pages with 'text' or 'texts' or 'textual', etc. And in a few if you search for "te?t" (using the questionmark wildcard) you'll get 'text' and 'tent' but not 'tet' or 'tenet'.
(NB: Asterisk and questionmark have different meanings as regexps-- but the principle is the same.)
Regular expressions allow you to specify enormously complex patterns along the lines of 'find me a place where the first word after the whitespace at the start of a new paragraph ends with a series of numerical digits'. And when a 'Replace' function is added you can use the same patterns to make complex, regular changes even when the patterns you want to change vary slightly, in different ways.
To accomplish this, regexps employ special 'metacharacters' that match common patterns like 'any digit' or 'any lowercase letter'. But these are hard to remember, and harder to decipher!
(Mac users should check out NisusWriter, because its 'PowerFind' utility includes menu-driven regexps that give you all their power without having to memorize anything. A free version can be downloaded here. But you'll still want to learn the universal symbols described below.)
We'll be picturing the piece of text we want to search as a long line of people, one person for each character.
Normally the texts we'll be searching will include only the 95 characters defined by ASCII values 20 thru 7E (hexadecimal), which include upper and lowercase alphabets, numerals, and assorted common punctuation, plus a few odds and ends like carriage returns and tabs (control characters that sit below 20):
    0 1 2 3 4 5 6 7 8 9 A B C D E F
  =--------------------------------
2 |   ! " # $ % & ' ( ) * + , - . /   <- 20 hex is the blankspace
3 | 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 | @ A B C D E F G H I J K L M N O
5 | P Q R S T U V W X Y Z [ \ ] ^ _
6 | ` a b c d e f g h i j k l m n o
7 | p q r s t u v w x y z { | } ~     <- 7F non-printing in US ("rubout")
So the capital "A" in ASCII is 41 hexadecimal, or 65 decimal (4 times 16 plus 1). You should vaguely know which punctuation marks are included in basic ASCII and which require a character set that's been extended beyond 7F (especially 80 to FF). Foreign characters with accents, for example, are not included in basic ASCII.
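The hex-to-decimal arithmetic can be checked with a minimal sketch, assuming Python 3. The helper name `is_basic_ascii` is made up for illustration and is not part of any library:

```python
# Look up the ASCII code of a few characters from the table.
for ch in ["A", " ", "~"]:
    code = ord(ch)
    print(repr(ch), "hex", format(code, "02X"), "decimal", code)

# Hex 41 is 4 times 16 plus 1.
capital_a = 4 * 16 + 1


def is_basic_ascii(ch):
    """Hypothetical helper: True if ch falls in basic ASCII (00 thru 7F)."""
    return ord(ch) <= 0x7F


print(is_basic_ascii("e"))       # a plain letter
print(is_basic_ascii("\u00e9"))  # e with an acute accent, beyond 7F
```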
Where our texts are pictured as lines of people, we'll picture our regular expressions as a company of actors who can play different roles. Their job is to stroll (as an orderly group we've specified) down the longer line of people (the text that we've supplied), looking for the first point where each specified actor can play a suitable role.
For example, if our text is:
"Exuberance is Beauty." --William Blake
And our regular expression consists of just two rather-inflexible actors-- "be"-- they'd stroll down the line of text and come to a halt like this:
"Exuberance is Beauty." --William Blake
be
In most word-processors, this match would be indicated by scrolling the page so the characters 'be' are visible, and highlighting them. (I'll be highlighting the matched characters here in red.)
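Since there is no highlighting in plain text, here is a rough sketch of the same stroll, assuming Python 3 and its standard `re` module (the variable names are only for illustration). It reports where the two actors come to a halt:

```python
import re

text = '"Exuberance is Beauty." --William Blake'

# re.search walks down the line and stops at the first place "be" fits.
match = re.search("be", text)

# Positions count from 0, so the opening double-quote is position 0.
print(match.start(), match.end(), match.group())
```

The search is case-sensitive, so the halt comes inside "Exuberance" rather than at the capital "B" of "Beauty".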
In this example, our two actors were dullards who could only play a single role each. But at the other end of the spectrum is our company's star, the everyman who can play any ASCII role (and more):
. (the period)
The single period matches any letter, number, or punctuation (and a few more odds and ends, besides). If you sent it down the same example, it would stop right away:
"Exuberance is Beauty." --William Blake
.
(No, that's not a speck of dust on your monitor-- that's our star, playing the role of 'double-quote'. I'm reminded somehow of Dustin Hoffman... so maybe it's a speck of Dustin.)
So "." corresponds approximately to the wildcard "?" that some search engines use-- our example "te?t" would become "te.t" as a regular expression.
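A small sketch of the period at work, again assuming Python's `re` module. The word list is just the one from the search-engine example earlier:

```python
import re

text = '"Exuberance is Beauty." --William Blake'

# The period is happy with the very first character it meets.
first = re.search(".", text)
print(repr(first.group()))

# "te.t" needs exactly one character, any character, between "te" and "t".
words = ["text", "tent", "tet", "tenet"]
matched = [w for w in words if re.fullmatch("te.t", w)]
print(matched)
```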
+ and * and ? (quantifiers)
Along with this Everyman, we have two 'cloning wizards' who play no direct role, but instead follow another actor, and allow that actor to play two or more different roles, side by side.
The Cloning Wizards are "+" and "*". Because Dustin is so versatile, the pairing of Dustin with a Cloning Wizard is equivalent to "Select All"-- he happily plays every role in the entire text (cf Peter Sellers in Dr Strangelove, practically).
"Exuberance is Beauty." --William Blake
.+
or
"Exuberance is Beauty." --William Blake
.*
These two patterns are so overwhelming they're not much use-- the Cloning Wizards are normally reserved for less versatile members of the company. Your Basic Cloning Wizard is the plus sign, "+":
"Exuberance is Beauty." --William Blake
l+
"Exuberance is Beauty." --William Blake
b+
"Exuberance is Beauty!!!!!!!!!!!!" --William Blake!!!!!!!!
!+
Your Special Cloning Wizard-- the asterisk-- also supports a Cloak of Invisibility, that can match no character at all. This is useless applied to a single actor because an invisible actor immediately matches the 'invisible role':
"Exuberance is Beauty." --William Blake
b*
(Nothing is highlighted red because the match occurs even before the opening doublequote.)
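The two Cloning Wizards can be compared side by side in a short sketch, assuming Python's `re` module. One caveat that is specific to this sketch: in Python, as in many engines, the period does not match a line break by default, so ".+" selects a whole line rather than a whole multi-line document.

```python
import re

text = '"Exuberance is Beauty." --William Blake'

everything = re.search(".+", text)  # Dustin plus the Basic Cloning Wizard
double_l = re.search("l+", text)    # the "ll" in William
single_b = re.search("b+", text)    # the lone "b" in Exuberance
invisible = re.search("b*", text)   # zero b's, right at the start

print(everything.group() == text)
print(double_l.span(), double_l.group())
print(single_b.span(), single_b.group())
print(invisible.span(), repr(invisible.group()))
```

The last search succeeds at position 0 with an empty match, which is the 'invisible role' in action.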
Where this is useful, though, is in longer patterns:
"Exuberance is Beauty." --William Blake
ut+
"Exuberance is Beauty." --William Blake
ut*
If you want to match a series of more than one of the same letter, you could use:
"Exuberance is Beauty." --William Blake
ll+
or
"Exuberance is Beauty." --William Blake
lll*
(Just "l+" would match a single "l" so we can't do it that simply.)
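These longer patterns can be tried out in a sketch like the following, assuming Python's `re` module; `re.findall` is used here only to list every match rather than just the first:

```python
import re

text = '"Exuberance is Beauty." --William Blake'

ut_plus = re.search("ut+", text)  # needs at least one "t": found in Beauty
ut_star = re.search("ut*", text)  # the "t" may be invisible: the "u" in Exuberance

# "l+" is content with the single "l" in Blake; "ll+" and "lll*" are not.
one_or_more = re.findall("l+", text)
two_or_more = re.findall("ll+", text)
also_two_or_more = re.findall("lll*", text)

print(ut_plus.span(), ut_plus.group())
print(ut_star.span(), ut_star.group())
print(one_or_more, two_or_more, also_two_or_more)
```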
The Cloak of Invisibility is also available without the cloning option as "?":
"Exuberance is Beauty." --William Blake
br?e
"Exuberance is Beauty." --William Blake
be?r
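A minimal sketch of the questionmark, assuming Python's `re` module. The extra word "abrupt" is an illustration added here, not one of the original examples:

```python
import re

text = '"Exuberance is Beauty." --William Blake'

# The "r" is optional and absent here, so "br?e" settles on "be".
optional_r = re.search("br?e", text)

# The "e" is optional and present here, so "be?r" takes all of "ber".
optional_e = re.search("be?r", text)

# With no "e" available, the same pattern still matches "br".
without_e = re.search("be?r", "abrupt")

print(optional_r.group(), optional_e.group(), without_e.group())
```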
\d and \l and \u
Intermediate between the lowly actors like "b" (who can only match a "b") and the semi-divine Dustin who matches anything, are the real utility players like "\d"-- who can play any numerical digit.
("\d" may appear to be composed of two actors, but the backslash "\" is special and never allowed alone.)
So \d is mathematically inclined, but her much-less-useful evil twin \D (with an uppercase D, not a lowercase d) matches anything but a digit (any letter, punctuation mark, etc). So to remember that the useful form needs to be lowercase, we might visualize a mathematician-poetess, 'd d digits', known to her friends as 'deedee', writing exquisite lowercase poetry on squares of graphpaper in her neighborhood coffeehouse.
The cloning wizards work well with deedee, for any series of digits:
George Lucas directed "THX1138".
\d+
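Here is deedee and her evil twin in a short sketch, assuming Python's `re` module. The `r` prefix on the pattern strings is Python's way of keeping the backslash intact:

```python
import re

text = 'George Lucas directed "THX1138".'

# deedee plus a Cloning Wizard: the first unbroken run of digits.
digits = re.search(r"\d+", text)

# The evil twin \D: the first unbroken run of anything but digits.
non_digits = re.search(r"\D+", text)

print(digits.group())
print(non_digits.group())
```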
deedee's unmathematical girlfriend is lowercase lulu, or \l. She matches any lowercase letter, while her sister ursula writes her name in lowercase but her poetry in ALL CAPS-- \u (not \U) matches uppercase letters.
[ - ] (ranges)
deedee, lulu, and ursula were part of the Ranger family before they took their current stage names. When they play in towns that knew them before the change, they go back to their old names:
\d = [0-9] or [0123456789] or [01-9] or even [650-47-9]
\l = [a-z] or equivalent
\u = [A-Z] or equivalent
So new members of the Ranger family can be defined to match any combination of characters. The hyphen implies a range that includes all the characters in the ASCII table that fall between the start and endpoint, so [$-&] would match the percent sign as well as $ and &:
    0 1 2 3 4 5 6 7 8 9 A B C D E F
  =--------------------------------
2 |   ! " # $ % & ' ( ) * + , - . /
3 | 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 | @ A B C D E F G H I J K L M N O
5 | P Q R S T U V W X Y Z [ \ ] ^ _
6 | ` a b c d e f g h i j k l m n o
7 | p q r s t u v w x y z { | } ~
"te[snx]t" would match test or tent or text.
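The range rules can be sketched as follows, assuming Python's `re` module. The candidate characters and the extra word "teat" are chosen here only for illustration:

```python
import re

# [$-&] covers everything from $ (hex 24) through & (hex 26) in the table.
candidates = "#$%&'"
in_range = [ch for ch in candidates if re.fullmatch("[$-&]", ch)]

# A bracketed list without hyphens matches any one of the listed characters.
words = ["test", "tent", "text", "teat"]
te_t = [w for w in words if re.fullmatch("te[snx]t", w)]

# Different spellings of the same range accept the same ten digits.
all_digits = "0123456789"
same_as_d = all(
    re.fullmatch("[0-9]", d) and re.fullmatch("[01-9]", d) and re.fullmatch(r"\d", d)
    for d in all_digits
)

print(in_range, te_t, same_as_d)
```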
And ranges work with cloning wizards:
"Exuberance is Beauty." --William Blake
\u\l
"Exuberance is Beauty." --William Blake
\u\l+
"Exuberance is Beauty." --William Blake
u[a-e]+