Grep Regular Expressions

 

Grep regular expressions allow you to formulate complex searches that are not possible using a basic text search.  GExperts implements a subset of the Perl regular expression syntax, as described below.

 

Simple Matches

 

Any single character matches itself, unless it is a metacharacter with a special meaning, as described below.

 

A series of characters matches that series of characters in the target string.  So, the regular expression search pattern "foo" would match the characters "foo" in the target string.

 

You can cause characters that normally function as metacharacters to be interpreted as literal characters by 'escaping' them, that is, by preceding them with a single backslash ("\").  For instance, the metacharacter "^" matches the beginning of the string, but "\^" matches the literal character "^", "\\" matches "\", and so on.

 

Examples:

 

foobar

Matches string 'foobar'

\^FooBarPtr

Matches '^FooBarPtr'
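
The effect of escaping can be sketched with Python's re module, used here only as a stand-in for the GExperts engine (an assumption: the two are different implementations, although they agree on these simple patterns). The sample lines are made up for illustration.

```python
import re

# Hypothetical source lines, searched one at a time as a grep would.
lines = ["PFooBarPtr = ^FooBarPtr;", "foobar := 1;"]

# A plain series of characters matches itself (case-sensitively here).
plain_hits = [ln for ln in lines if re.search(r"foobar", ln)]

# "\^" matches a literal caret instead of anchoring at the start of the line.
caret_hits = [ln for ln in lines if re.search(r"\^FooBarPtr", ln)]

print(plain_hits)  # ['foobar := 1;']
print(caret_hits)  # ['PFooBarPtr = ^FooBarPtr;']
```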

 

 

Escape Sequences

 

Some special characters may be specified using escape sequences similar to those used in languages like C and Perl: "\n" matches a newline, "\t" a tab, etc. More generally, \xnn, where nn is a string of hexadecimal digits, matches the character whose ASCII value is nn. If you need to specify a Unicode character, you can use '\x{nnnn}', where 'nnnn' are the hexadecimal digits for the Unicode code point.

 

\xnn

ASCII Char with hex code nn

\x{nnnn}

Character with hex code nnnn (use nn for ASCII or nnnn for Unicode)

\t

Tab (same as \x09)

\n

Newline (same as \x0a)

\r

Carriage return (same as \x0d)

\f

Form feed (same as \x0c)

\a

Alarm/bell (same as \x07)

\e

Escape (same as \x1b)

 

Examples:

 

foo\x20bar

Matches 'foo bar' (note space in the middle)

\tfoobar

Matches 'foobar' preceded by a tab
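
A minimal sketch of the two examples above, using Python's re module as a stand-in for the GExperts engine. This is an assumption for illustration only: Python accepts \xnn and \t in the same way, but it does not support every sequence in the table (for instance \e and \x{nnnn}).

```python
import re

# \x20 is the hex escape for a space; \t is a tab.
space_match = re.search(r"foo\x20bar", "see foo bar here")
tab_match = re.search(r"\tfoobar", "\tfoobar := 0;")
no_tab = re.search(r"\tfoobar", "    foobar := 0;")  # indented with spaces

print(space_match.group())    # foo bar
print(tab_match is not None)  # True
print(no_tab is None)         # True
```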

 

 

Character Classes

 

You can specify a character class (to denote any one of a set of characters), by enclosing a list of characters in square brackets [], and this will match any single character from the list.

 

If the first character after the "[" is "^", the class matches any character not in the list.

 

Examples:

 

foob[aeiou]r

Finds strings 'foobar', 'foober' etc. but not 'foobbr', 'foobcr' etc.

foob[^aeiou]r

Finds strings 'foobbr', 'foobcr' etc. but not 'foobar', 'foober' etc.

 

Within a list, the "-" character can be used to specify a range, so that [a-z] represents any ASCII character between "a" and "z", inclusive.

 

If you want "-" itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. If you want "]" to be a member, place it at the start of the list or escape it with a backslash.

 

Examples:

 

[-az]

Matches 'a', 'z' and '-'

[az-]

Matches 'a', 'z' and '-'

[a\-z]

Matches 'a', 'z' and '-'

[a-z]

Matches all 26 lower case ASCII characters from 'a' to 'z'

[\n-\x0D]

Matches any of the characters #10, #11, #12, and #13

[\d-t]

Matches any numeric digit, '-' or 't' (see "Predefined Classes" below)

[]-a]

Matches any character in the range of ']'..'a'.
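
The class and range rules above can be tried out in a short sketch. Python's re module is used as a stand-in for the GExperts engine (an assumption; the engines differ in some corner cases, for example Python rejects a class such as [\d-t]), and the word list is invented for the example.

```python
import re

words = ["foobar", "foober", "foobbr", "foobcr"]

# A class matches one character from the list; a leading ^ negates it.
vowel = [w for w in words if re.fullmatch(r"foob[aeiou]r", w)]
non_vowel = [w for w in words if re.fullmatch(r"foob[^aeiou]r", w)]

# A hyphen at the start of the list, or escaped, is a literal member.
hyphen_first = re.findall(r"[-az]", "a-m-z")
hyphen_escaped = re.findall(r"[a\-z]", "a-m-z")

# Between two characters it denotes a range.
in_range = re.findall(r"[a-z]", "a-m-z")

print(vowel)         # ['foobar', 'foober']
print(non_vowel)     # ['foobbr', 'foobcr']
print(hyphen_first)  # ['a', '-', '-', 'z']
print(in_range)      # ['a', 'm', 'z']
```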

 

 

Metacharacters - Line Separators

 

^

Start of line

$

End of line

\A

Start of text (equivalent to ^ since all searches are single-line)

\Z

End of text (equivalent to $ since all searches are single-line)

.

Any character (period)

 

Examples:

 

^foobar

Matches 'foobar' only if it is at the beginning of line

foobar$

Matches 'foobar' only if it is at the end of line

^foobar$

Matches 'foobar' only if those are the only characters in the line

foob.r

Matches strings like 'foobar', 'foobbr', 'foob1r' and so on
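
As a rough illustration of the anchors, the sketch below applies each pattern to one line at a time, which is how the grep search is described as working. Python's re module stands in for the GExperts engine (an assumption), and the helper name hits and the sample lines are hypothetical.

```python
import re

lines = ["foobar", "foobar baz", "baz foobar", "foob1r"]

def hits(pattern):
    # Search each line on its own, as a line-based grep would.
    return [ln for ln in lines if re.search(pattern, ln)]

starts = hits(r"^foobar")
ends = hits(r"foobar$")
whole = hits(r"^foobar$")
any_char = hits(r"foob.r")

print(starts)    # ['foobar', 'foobar baz']
print(ends)      # ['foobar', 'baz foobar']
print(whole)     # ['foobar']
print(any_char)  # all four lines
```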

 

The "." metacharacter matches any character by default, but you can switch off the /s modifier to stop "." from matching embedded line separators.

 

Unicode line separators are supported, based on http://www.unicode.org/unicode/reports/tr18/: \x{2028}, \x{2029}, \x0B, \x0C, and \x85.

 

 

Metacharacters - Predefined Classes

 

\w

Matches any alphanumeric character (including "_")

\W

Matches any non-alphanumeric character

\d

Matches any numeric digit

\D

Matches any non-numeric character

\s

Matches any whitespace character (equivalent to [ \t\n\r\f])

\S

Matches any non-whitespace character

 

Note that you may use \w, \d, and \s within custom character classes.

 

Examples:

 

foob\dr

Matches strings like 'foob1r', 'foob6r' and so on but not 'foobar', 'foobbr' and so on

foob[\w\s]r

Matches strings like 'foobar', 'foob r', 'foobbr', 'foob1r' and so on (digits are included in \w) but not 'foob=r' and so on

 

Alphanumeric characters include only the basic low-ASCII upper and lower case Latin alphabet, the digits 0-9, and '_'.
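
A small sketch of the predefined classes, with Python's re module as a stand-in for the GExperts engine (an assumption). The re.ASCII flag is passed so that Python's \w, \d and \s are limited to ASCII, approximating the low-ASCII behaviour described above; the candidate strings are invented.

```python
import re

digit_cases = ["foob1r", "foob6r", "foobar", "foobbr"]
class_cases = ["foobar", "foob r", "foob_r", "foob=r"]

# \d matches a single numeric digit.
digit_hits = [c for c in digit_cases if re.fullmatch(r"foob\dr", c, re.ASCII)]

# \w and \s may be combined inside a custom character class.
class_hits = [c for c in class_cases if re.fullmatch(r"foob[\w\s]r", c, re.ASCII)]

print(digit_hits)  # ['foob1r', 'foob6r']
print(class_hits)  # ['foobar', 'foob r', 'foob_r']
```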

 

 

Metacharacters - Word Boundaries

 

\b

Match a word boundary

\B

Match a non-(word boundary)

 

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
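
The boundary definition can be checked with a short sketch. Python's re module is used as a stand-in for the GExperts engine (an assumption), and the sample text is made up; the lists hold the starting offsets of each match.

```python
import re

text = "foo foobar xfoox (foo)"

# \bfoo\b: 'foo' with a word boundary on both sides (whole word only).
whole_words = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# \Bfoo\B: 'foo' with word characters on both sides (inside a longer word).
inside = [m.start() for m in re.finditer(r"\Bfoo\B", text)]

print(whole_words)  # [0, 18]
print(inside)       # [12]
```

The 'foo' at the start of 'foobar' is found by neither pattern, because it has a boundary on its left but not on its right.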

 

 

Metacharacters - Iterators

 

Any item of a regular expression may be followed by an iterator metacharacter. Using these metacharacters, you can specify the number of occurrences of the previous character, metacharacter, or subexpression.

 

*

Zero or more ("greedy"), similar to {0,}

+

One or more ("greedy"), similar to {1,}

?

Zero or one ("greedy"), similar to {0,1}

{n}

Exactly n times ("greedy")

{n,}

At least n times ("greedy")

{n,m}

At least n but not more than m times ("greedy")

*?

Zero or more ("non-greedy"), similar to {0,}?

+?

One or more ("non-greedy"), similar to {1,}?

??

Zero or one ("non-greedy"), similar to {0,1}?

{n}?

Exactly n times ("non-greedy")

{n,}?

At least n times ("non-greedy")

{n,m}?

At least n but not more than m times ("non-greedy")

 

The digits in curly brackets of the form {n,m} specify the minimum number of times to match the item as "n" and the maximum as "m".  The form {n} is equivalent to {n,n} and matches exactly n times.  The form {n,} matches n or more times.  There is no limit to the size of n or m, but large numbers will use up more memory and be slightly slower.  If a curly bracket occurs in any other context, it is treated as a regular character.

 

"Greedy" means the iterator will match as many characters as possible, whereas "non-greedy" means it takes as few characters as possible.  For example, starting at the first 'b' of the string 'abbbbc', 'b+' and 'b*' match 'bbbb', 'b+?' matches 'b', 'b*?' matches the empty string, 'b{2,3}?' matches 'bb', 'b{2,3}' matches 'bbb', etc.  You can switch all iterators into a "non-greedy" mode using the modifier /g.

 

Examples:

 

foob.*r

Matches strings like 'foobar', 'foobalkjdflkj9r' and 'foobr'

foob.+r

Matches strings like 'foobar', 'foobalkjdflkj9r' but not 'foobr'

foob.?r

Matches strings like 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'

fooba{2}r

Matches the string 'foobaar'

fooba{2,}r

Matches strings like 'foobaar', 'foobaaar', 'foobaaaar' etc.

fooba{2,3}r

Matches strings like 'foobaar', or 'foobaaar' but not 'foobaaaar'
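
The difference between greedy and non-greedy iterators on 'abbbbc' can be sketched as follows. Python's re module is a stand-in for the GExperts engine (an assumption), and the helper name first is hypothetical.

```python
import re

s = "abbbbc"

def first(pattern):
    # Return the text of the first match of pattern in s.
    return re.search(pattern, s).group()

greedy_plus = first(r"b+")       # as many as possible
lazy_plus = first(r"b+?")        # as few as possible
greedy_range = first(r"b{2,3}")  # up to the maximum of 3
lazy_range = first(r"b{2,3}?")   # stops at the minimum of 2

print(greedy_plus, lazy_plus, greedy_range, lazy_range)  # bbbb b bbb bb

# Bounded counts: two or three 'a's are accepted, four are not.
too_many = re.fullmatch(r"fooba{2,3}r", "foobaaaar")
print(too_many is None)  # True
```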

 

 

Metacharacters - Alternatives

 

You can specify a series of alternatives for a pattern using a pipe "|" to separate them, so that fee|fie|foe will match any of "fee", "fie", or "foe" in the target string (as would f(e|i|o)e).  The first alternative includes everything from the last pattern delimiter ("(", "[", or the beginning of the pattern) up to the first "|", and the last alternative contains everything from the last "|" to the next pattern delimiter.  For this reason, it is common practice to include alternatives in parentheses, to minimize confusion about where they start and end.

 

Alternatives are tried from left to right, so the first alternative for which the entire expression matches is the one that is chosen.  This means that alternatives are not necessarily greedy.  For example, when matching foo|foot against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string. This distinction is important when you are capturing matched text using parentheses.  Also remember that "|" is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].

 

Examples:

 

foo(bar|foo)

Matches strings 'foobar' or 'foofoo'
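
The left-to-right ordering of alternatives, and the literal meaning of "|" inside square brackets, can be seen in a short sketch. Python's re module is used as a stand-in for the GExperts engine (an assumption), and the test strings other than "barefoot" are invented.

```python
import re

# The first alternative that lets the whole pattern succeed wins.
first_alt = re.search(r"foo|foot", "barefoot").group()
longer_first = re.search(r"foot|foo", "barefoot").group()

# Inside a class, "|" is just another character in the list.
class_members = re.findall(r"[fee|fie|foe]", "fix|no")

# Parentheses make the extent of the alternatives explicit.
grouped = [w for w in ["foobar", "foofoo", "foobaz"]
           if re.fullmatch(r"foo(bar|foo)", w)]

print(first_alt)      # foo
print(longer_first)   # foot
print(class_members)  # ['f', 'i', '|', 'o']
print(grouped)        # ['foobar', 'foofoo']
```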

 

 

Metacharacters - Subexpressions

 

The bracketing construct using parentheses ( ... ) may also be used to define subexpressions so you can separate the matched text into several subparts.  Subexpressions are numbered based on the left-to-right order of their opening parentheses.  The first subexpression has the number '1', etc.  The whole expression match is given the number '0'.  You can also use subexpressions to perform search and replace ($0 or $& refers to the whole expression match, $1 to the first subexpression, $2 to the second, etc.).

 

Examples:

 

(foobar){8,10}

Matches strings which contain 8, 9 or 10 consecutive instances of 'foobar'

foob([0-9]|a+)r

Matches 'foob0r', 'foob1r', 'foobar', 'foobaar', 'foobaaar' etc.

(John) (Doe)

Matches the literal name "John Doe" with each word being a subexpression (John = $1, Doe = $2). You can then use the replacement string of "$2, $1" to replace that with "Doe, John".
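
A sketch of subexpression numbering and replacement, using Python's re module as a stand-in for the GExperts engine. This is an assumption with one visible difference: Python writes references in the replacement string as \1 and \2, where the text above uses $1 and $2. The input string is made up.

```python
import re

m = re.search(r"(John) (Doe)", "Name: John Doe")

whole = m.group(0)   # the whole match, number 0
first = m.group(1)   # first opening parenthesis
second = m.group(2)  # second opening parenthesis

# Swap the two subexpressions; \2 and \1 play the role of $2 and $1.
swapped = re.sub(r"(John) (Doe)", r"\2, \1", "Name: John Doe")

print(whole, "|", first, "|", second)  # John Doe | John | Doe
print(swapped)                         # Name: Doe, John
```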

 

 

Metacharacters - Backreferences

 

Metacharacters \1 through \9 are interpreted as backreferences, so in general, \<n> matches previously matched subexpression #<n>.

 

Examples:

 

(.)\1+

Matches 'aaaa' and 'cc' (a single character followed by itself, one or more times)

(.+)\1+

Also matches 'abab' and '123123'

(['"]?)(\d+)\1

Matches "13" (in double quotes), '4' (in single quotes), 77 (without quotes), etc.
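
The backreference examples above behave as follows in a quick sketch. Python's re module is a stand-in for the GExperts engine (an assumption), and the test strings are illustrative.

```python
import re

# A character followed by one or more copies of itself.
repeat_char = re.search(r"(.)\1+", "xaaaay").group()

# A sequence followed by one or more copies of itself.
repeat_seq = re.fullmatch(r"(.+)\1+", "abab") is not None

# \1 must repeat whatever opening quote (or nothing) group 1 captured.
quoted = re.compile(r"""(['"]?)(\d+)\1""")
balanced = [s for s in ['"13"', "'4'", "77", "\"13'"] if quoted.fullmatch(s)]

print(repeat_char)  # aaaa
print(repeat_seq)   # True
print(balanced)     # ['"13"', "'4'", '77']
```

The last candidate is rejected because its closing quote differs from the opening quote captured by the first subexpression.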

 

 

Modifiers

 

Modifiers are used to change the behaviour of the regular expression engine.  Any of the modifiers may be embedded within the regular expression itself using the (?...) construct, as described below.  The case sensitivity modifier also has an equivalent option on the grep search screen.

 

i

Do case-insensitive pattern matching (based on your system locale settings).  In this mode, a lower case character can also match the equivalent upper case character, and vice versa.

m

Treat the string to search as multiple lines. That is, change "^" and "$" from matching at only the very start or end of the string to the start or end of any line anywhere within the string.  This modifier won't generally be useful, because the grep search operates a single line at a time.

s

Treat the string to search as a single line. This changes "." to match any character (even a line separator) which it normally would not match.

g

Non-standard modifier used to switch all of the subsequent operators into non-greedy mode. So, if this modifier /g is specified, '+' works like '+?', '*' works like '*?' and so on.  Greedy mode is enabled by default.

x

Allow whitespace and comments (see explanation below), to enhance the expression's legibility.

r

Non-standard modifier used to simplify Russian character range entry. If this is turned on, the range а-я automatically also includes the Russian letter 'ё', А-Я includes 'Ё', and а-Я includes all Russian symbols.  This modifier is off by default.

 

The modifier /x needs a little more explanation. It allows the regular expression to contain whitespace that is neither escaped nor within a character class. You can use this to break up your regular expression into more readable parts. The # character is also treated as a metacharacter introducing a line comment, for example:

 

 (  

 (arc) # comment 1

   |   # You can use spaces to format your search expression, and they are ignored

 (efg) # comment 2

 )

 

In this mode, if you want literal whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), you will either have to escape them or encode them using hex references.
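
The extended layout can be sketched with Python's re module as a stand-in for the GExperts engine. This is an assumption: Python enables the equivalent mode with the re.VERBOSE flag, and the test strings are made up.

```python
import re

# Whitespace and # comments in the pattern are ignored in this mode.
pattern = re.compile(r"""
    (
      (arc)   # comment 1
        |     # spaces used only for layout
      (efg)   # comment 2
    )
""", re.VERBOSE)

# A literal space must be escaped (or written as \x20) to survive.
literal_space = re.compile(r"foo\ bar  # the escaped space is kept", re.VERBOSE)

print(pattern.search("xx efg").group())                # efg
print(literal_space.search("foo bar") is not None)     # True
print(literal_space.search("foobar") is None)          # True
```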

 

 

Perl Extensions To Modifiers

 

(?imsxr-imsxr)

You can use some Perl extensions to change the modifiers on the fly.  If this construction is inlined into a subexpression, it only affects that subexpression.

 

Examples:

 

(?i)Frog-Soup

Matches 'Frog-soup' and 'Frog-Soup'

(?i)Frog-(?-i)Soup

Matches 'Frog-Soup' but not 'Frog-soup'

(?i)(Frog-)?Soup

Matches 'Frog-soup' and 'frog-soup'

((?i)Frog-)?Soup

Matches 'frog-Soup', but not 'frog-soup'
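
For comparison only, here is a rough Python analogue of the scoping shown above. Python's re module is not the GExperts engine and spells scoped modifiers differently: recent Python versions reject a modifier toggled in the middle of a pattern, so the scoped-group form (?i:...) is used as an approximation of (?i)...(?-i) and of a modifier inlined into a subexpression.

```python
import re

# Case-insensitive for the whole pattern.
everywhere = re.compile(r"(?i)Frog-Soup")

# Case-insensitive for 'Frog-' only; roughly (?i)Frog-(?-i)Soup.
prefix_only = re.compile(r"(?i:Frog-)Soup")

# Optional case-insensitive prefix; roughly ((?i)Frog-)?Soup.
optional_prefix = re.compile(r"(?i:Frog-)?Soup")

print(everywhere.search("frog-soup") is not None)    # True
print(prefix_only.search("frog-Soup") is not None)   # True
print(prefix_only.search("Frog-soup") is None)       # True
print(optional_prefix.search("frog-Soup").group())   # frog-Soup
print(optional_prefix.search("frog-soup") is None)   # True
```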

 

 

Perl Extensions for Comments

 

(?#text)

Text inside a comment like the above is ignored. The comment ends at the first ")", so there is no way to put a literal ")" inside the comment.

 

Source Code

 

Note: The free TRegExpr Delphi library is used to implement regular expression searching.