Silicon Soul

Regular Expressions and the grep Command

Table of Contents

Regular Expressions
grep
- Options
Documentation

Regular expressions are a powerful way to create customized patterns to match specific strings of text. The learning curve for regular expressions is steep, but taking the time to understand regular expression (regex) fundamentals will make you a more productive GNU/Linux user.

Note: If you are not familiar with the GNU/Linux command line interface, or you intend to use a script obtained from this site, review the Conventions page before proceeding.

Regular Expressions

Often, regular expressions are recursively constructed from primitives that are themselves regular expressions (i.e., a regular expression can contain regular expressions). The simplest regular expressions are letters, digits, and many other typical characters, which usually stand for themselves.

While shell search patterns must always match beginning at the start of a filename, in programs selecting lines based on regular expressions, it is usually sufficient if the regular expression matches anywhere in a line.

As with filename expansion, regular expressions use specific characters to denote special meanings. By default, regular expressions are greedy, i.e., they try to match as much of the input string as possible. In Perl-compatible syntax (e.g., grep -P), you can append the ? character to a repetition operator (e.g., *? or +?) to make it non-greedy.

Anchors

Anchors are meta-characters that match the empty strings at the beginning or ending of a line or word, i.e., they force matches to be anchored.

^: Matches the beginning of the input (e.g., ^C.* matches Cat, but not LUCK).
$: Matches the end of the input (e.g., .*C$ matches ELLIPTIC, but not COOL).
\<: Matches the beginning of a word (a place where a non-letter precedes a letter).
\>: Matches the end of a word (where a letter is followed by a non-letter).

Word brackets (i.e., \< and \>) are a GNU extension. They keep a search from matching longer words that merely contain the string you are searching for (e.g., \<super\> matches super, but not the super in supergirl).
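The anchor examples above can be tried out with a short script. This is a minimal sketch that assumes Python's re module, which the original text does not use. Python does not support \< and \>, so the sketch substitutes \b for both.

```python
import re

words = ["Cat", "LUCK", "ELLIPTIC", "COOL"]

# ^C.* only matches when the C sits at the beginning of the string.
starts_with_c = [w for w in words if re.search(r"^C.*", w)]

# .*C$ only matches when the C sits at the end of the string.
ends_with_c = [w for w in words if re.search(r".*C$", w)]

# \bsuper\b requires a word boundary on both sides of "super".
whole_words = re.findall(r"\bsuper\b", "super supergirl superman")

print(starts_with_c)  # ['Cat', 'COOL']
print(ends_with_c)    # ['ELLIPTIC']
print(whole_words)    # ['super']
```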

Character Classes

A character class is a list of characters enclosed by square brackets ([ and ]). It matches any single character in that list.

[]: Character class. Matches any single character contained within the square brackets. Order does not matter and special characters lose their meaning. A hyphen can be used to denote a range (e.g. [0-9]).
[^]: Negated character class. Matches any single character not contained between the square brackets (e.g., [^0-9]).

If you ever create a regex and want to match literals for regex special characters, you can either escape them with a backslash (\) or enclose them in a character class ([]).
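As a rough illustration of character classes and of the two ways to match a special character literally, here is a small sketch. It assumes Python's re module, and the sample strings are made up for the example.

```python
import re

text = "Room 42 costs $3.50"

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", text)

# An unescaped dot matches any character, so "3x50" also matches.
loose = re.findall(r"3.50", "3x50 3.50")

# Escaping with a backslash or enclosing in a character class
# both make the dot literal.
escaped = re.findall(r"3\.50", "3x50 3.50")
bracketed = re.findall(r"3[.]50", "3x50 3.50")

# The same applies to the dollar sign, which is otherwise an anchor.
price = re.findall(r"[$]3\.50", text)

print(digits)     # ['4', '2', '3', '5', '0']
print(loose)      # ['3x50', '3.50']
print(escaped)    # ['3.50']
print(bracketed)  # ['3.50']
print(price)      # ['$3.50']
```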

Shorthand Character Classes

Shorthand character classes are named classes of characters that are predefined within bracket expressions.

[[:alnum:]]: Matches alphanumeric characters ([A-Za-z0-9]).
[[:alpha:]]: Matches alphabetic characters ([A-Za-z]).
[[:blank:]]: Matches a space or tab, but not a line break ([ \t]).
[[:cntrl:]]: Matches control characters ([\x00-\x1F\x7F]).
[[:digit:]]: Matches decimal digits ([0-9], \d).
[[:graph:]]: Matches visible characters ([\x21-\x7E]).
[[:lower:]]: Matches lowercase alphabetic characters ([a-z]).
[[:print:]]: Matches visible characters or a space ([\x20-\x7E]).
[[:space:]]: Matches space characters, including a line break ([ \t\r\n\v\f], \s).
[[:upper:]]: Matches uppercase alphabetic characters ([A-Z]).
\w: Matches any word character ([A-Za-z0-9_]). This is a backslash shorthand, not a named class, and is written without square brackets.
[[:xdigit:]]: Matches hexadecimal digits ([0-9A-Fa-f]).

Also, there are special expressions that have a specific meaning in terms of the characters they represent:

.: Matches any single character (except a newline) and enforces that the character must exist (this is similar to how the ? character is used in filename expansion).
\b: Matches the empty string at the edge of a word.
\B: Matches the empty string provided it is not at the edge of a word.
\d: Matches decimal digits. grep supports this only with Perl-compatible syntax (-P); otherwise, use [[:digit:]].
\s: Matches whitespace; it is a synonym for [[:space:]].
\S: Matches non-whitespace; it is a synonym for [^[:space:]].
\w: Matches a word character; it is a synonym for [_[:alnum:]].
\W: Matches a non-word character; it is a synonym for [^_[:alnum:]].

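Keep in mind that in common character encodings, uppercase and lowercase letters are not contiguous. For example, in ASCII the range [A-z] also matches [, \, ], ^, _, and `, which sit between Z and a. Use [A-Za-z] to match letters only. Run the man ascii command for a visual demonstration.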
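The gap between the uppercase and lowercase letters is easy to check. The sketch below assumes Python's re module and ASCII input. It compares one range that spans both cases against two separate ranges.

```python
import re

sample = "a_Z[^"

# A single range from A to z also covers the punctuation between Z and a.
spanning_range = re.findall(r"[A-z]", sample)

# Two separate ranges match letters only.
letters_only = re.findall(r"[A-Za-z]", sample)

print(spanning_range)  # ['a', '_', 'Z', '[', '^']
print(letters_only)    # ['a', 'Z']
```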

Character Groups

A character group is denoted by parentheses (( and )) and matches the contained characters in their exact order (e.g., (xyz) matches xyz, but not zxy). Often, groups are used to dissect strings.

This is done by writing a regular expression that is divided into several subgroups that match different components of the string. Groups can be nested, as well.

For example, you can segment the string abcd into two different capturing groups with the following regular expression:

(a(b)c)d

The regular expression matches the entire string abcd, the first capturing group includes abc, and the second capturing group includes just b.

Groups are numbered starting at 0. Group 0 is always present and represents what the whole regular expression matches. Subgroups are numbered from left to right, from 1 upward. In the example above, abcd represents group 0, abc represents group 1, and b represents group 2.
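The group numbering for (a(b)c)d can be inspected directly. This sketch assumes Python's re module, whose match objects expose groups by number.

```python
import re

match = re.fullmatch(r"(a(b)c)d", "abcd")

group_0 = match.group(0)  # the whole match
group_1 = match.group(1)  # the outer parentheses
group_2 = match.group(2)  # the nested parentheses

print(group_0, group_1, group_2)  # abcd abc b
```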

Back References

Back references in a regular expression let you specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.

In other words, a back reference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression (e.g., (ab)\1 matches abab; (ab*a)x\1 matches abbaxabba). The back reference repeats the exact text that the group matched, not the group's pattern, so (ab*a)x\1 does not match abbaxaba.

For example, the regular expression \b(\w+)\s+\1\b can be used to detect double words in a string (e.g., the the would be matched for the string I remember the the way it could have been.).

The components of the \b(\w+)\s+\1\b regular expression are:

\b: Asserts position at a word boundary.
(\w+): The first capturing group. This group matches any word character (\w) one or more times (+).
\s+: Matches any whitespace character (\s) one or more times (+).
\1: The backreference that matches the same text as most recently matched by the first capturing group (i.e., the double word).
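The double-word pattern can be sketched as follows. The sketch assumes Python's re module. The last check shows that a back reference must repeat the same text, not just match the same pattern.

```python
import re

double_word = re.compile(r"\b(\w+)\s+\1\b")
sentence = "I remember the the way it could have been."

match = double_word.search(sentence)
repeated = match.group(0)  # the full double word
word = match.group(1)      # the text captured by group 1

# (ab*a)x\1 needs the identical text on both sides of the x.
same_text = re.fullmatch(r"(ab*a)x\1", "abbaxabba") is not None
different_text = re.fullmatch(r"(ab*a)x\1", "abbaxaba") is not None

print(repeated)        # the the
print(word)            # the
print(same_text)       # True
print(different_text)  # False
```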

Non-Capturing Groups

Sometimes, you may want to use a group to denote a part of a regular expression, but you are not interested in retrieving that group's contents via a captured group number. A non-capturing group can accomplish this.

Non-capturing groups are denoted as follows:

(?:ex_regex)

For example, in regular expressions the dot (.) is used to match any character, except for newlines. A non-capturing group can be used to add support for newline characters.

/* This is a
   multi-line comment. */

The \/\*(?:.|\n)*?\*\/ regular expression matches the entire multi-line comment:

\/: Matches the character /.
\*: Matches the character *.
(?:.|\n): The non-capturing group. The regular expression in the non-capturing group (.|\n) matches either any character except a newline character (.) or (|) a newline character (\n). The characters matched by this group are not separately captured or numbered.
*?: Matches the non-capturing group zero to many times, as few times as possible, expanding as needed.
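Here is a minimal sketch of the multi-line comment pattern, assuming Python's re module. The surrounding declarations in the sample string are invented for the example.

```python
import re

source = "int x; /* This is a\n   multi-line comment. */ int y;"

pattern = re.compile(r"\/\*(?:.|\n)*?\*\/")
match = pattern.search(source)

comment = match.group(0)

# The group is non-capturing, so no numbered groups are recorded.
captured_groups = match.groups()

print(repr(comment))    # '/* This is a\n   multi-line comment. */'
print(captured_groups)  # ()
```

Because the repetition is non-greedy, the match stops at the first closing */ instead of running on to a later comment.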

Concatenation

Two regular expressions can be concatenated. The resulting regex matches any string formed by concatenating two substrings that respectively match the concatenated regular expressions.

Alternation

Two regular expressions can be joined by the infix operator (|), e.g., super(man|woman|girl). As the target string is scanned, expressions separated by | are tried from left to right (e.g., superman, superwoman, supergirl). When one expression completely matches, that branch is accepted.
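The left-to-right behavior can be observed in a backtracking engine. This sketch assumes Python's re module. Other engines may choose differently (POSIX-style matching in grep prefers the longest match), so treat the second half as a property of Python rather than of every tool.

```python
import re

pattern = re.compile(r"super(man|woman|girl)")
names = ["superman", "superwoman", "supergirl", "superdog"]

matched = [name for name in names if pattern.fullmatch(name)]

# Branches are tried left to right, so the first branch that
# succeeds wins even when a later branch would match more text.
first_branch = re.match(r"super|superman", "superman").group(0)

print(matched)       # ['superman', 'superwoman', 'supergirl']
print(first_branch)  # super
```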

Repetitions

Repetition operators follow a regular expression and describe how many times the matching string may occur.

?: Preceding item is matched from 0 to 1 repetitions (?? for non-greedy form), e.g., Jon? matches both Jo and Jon, but not Jonn.
*: Preceding item is matched from 0 to many repetitions (*? for non-greedy form), e.g., Jon* matches Jo, Jon, and Jonn.
+: Preceding item is matched from 1 to many repetitions (+? for non-greedy form), e.g., Jon+ matches both Jon and Jonn, but not Jo.
{n}: Preceding item is matched n times (e.g., a{3}).
{n,}: Preceding item is matched n or more times (e.g., a{3,}).
{,m}: Preceding item is matched 0 to m times (e.g., a{,3}). This is specific to GNU grep.
{n,m}: Preceding item is matched from n to m times (e.g., a{1,3}).

As previously mentioned, regular expressions are greedy by default. For example, ^a.*a applied to the input string abacada matches the entire abacada string, not just aba or abaca. Non-greedy versions try to match as little of the input as possible, e.g., ^a.*?a would match aba.
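The repetition operators and the greedy versus non-greedy distinction can be checked with a few lines. This is a sketch that assumes Python's re module. The helper full_matches is a hypothetical name introduced only for this example.

```python
import re

names = ["Jo", "Jon", "Jonn"]

def full_matches(pattern):
    """Return the names that the pattern matches in their entirety."""
    return [name for name in names if re.fullmatch(pattern, name)]

optional = full_matches(r"Jon?")     # zero or one n
any_number = full_matches(r"Jon*")   # zero or more n
at_least_one = full_matches(r"Jon+") # one or more n

# Greedy repetition takes as much as possible; the ? suffix takes as little.
greedy = re.match(r"^a.*a", "abacada").group(0)
non_greedy = re.match(r"^a.*?a", "abacada").group(0)

print(optional)      # ['Jo', 'Jon']
print(any_number)    # ['Jo', 'Jon', 'Jonn']
print(at_least_one)  # ['Jon', 'Jonn']
print(greedy)        # abacada
print(non_greedy)    # aba
```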

Precedence

The precedence for regular expression concatenation, alternation, and repetition is as follows:

Repetition
Concatenation
Alternation

For example, ab* is a single a followed by arbitrarily many bs (including none at all), not an arbitrary number of repetitions of ab.
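A short sketch, assuming Python's re module, shows how precedence changes what a pattern means and how parentheses override it.

```python
import re

def matches(pattern, text):
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# Repetition binds tighter than concatenation: the * applies to b only.
star_on_b = matches(r"ab*", "abbb")
star_not_on_ab = matches(r"ab*", "abab")
grouped_star = matches(r"(ab)*", "abab")

# Alternation binds loosest: ab|cd means "ab or cd", not a(b|c)d.
either_side = matches(r"ab|cd", "cd")
not_middle = matches(r"ab|cd", "acd")
grouped_middle = matches(r"a(b|c)d", "acd")

print(star_on_b, star_not_on_ab, grouped_star)   # True False True
print(either_side, not_middle, grouped_middle)   # True False True
```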

Lookarounds

Lookarounds look around your regular expression matches, i.e., they look at the elements before or after your regex match.

(?=) Positive Lookahead: Matches the preceding expression only if it is followed by the lookahead expression, which is not included in the match (e.g., q(?=u) matches a q followed by a u).
(?!) Negative Lookahead: Matches the preceding expression only if it is not followed by the lookahead expression (e.g., q(?!u) matches a q not followed by a u).
(?<=) Positive Lookbehind: Matches the following expression only if it is preceded by the lookbehind expression, which is not included in the match (e.g., (?<=a)b matches a b that is preceded by an a).
(?<!) Negative Lookbehind: Matches the following expression only if it is not preceded by the lookbehind expression (e.g., (?<!a)b matches a b that is not preceded by an a).
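The four lookarounds can be compared by recording where each one matches. The sketch assumes Python's re module, and the sample strings are invented for the example.

```python
import re

def match_starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# "quit qatar": the q at offset 0 is followed by u, the q at offset 5 is not.
lookahead = match_starts(r"q(?=u)", "quit qatar")
negative_lookahead = match_starts(r"q(?!u)", "quit qatar")

# "ab cb": the b at offset 1 follows an a, the b at offset 4 does not.
lookbehind = match_starts(r"(?<=a)b", "ab cb")
negative_lookbehind = match_starts(r"(?<!a)b", "ab cb")

# The lookaround text itself is not part of the match.
matched_text = re.search(r"q(?=u)", "quit").group(0)

print(lookahead, negative_lookahead)    # [0] [5]
print(lookbehind, negative_lookbehind)  # [1] [4]
print(matched_text)                     # q
```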

Flags

Flags are modifiers that redefine regular expression behavior.

i: Case insensitive matching.
g: Global search.
m: Multiline (i.e., anchor metacharacters work on each line).

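In languages that write regular expressions between slashes (e.g., JavaScript and Perl), flags are placed after the closing delimiter, e.g., /.+/g. grep uses command options instead (e.g., -i for case insensitive matching).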
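How flags are spelled depends on the tool. As a rough sketch, assuming Python's re module: the i and m flags correspond to re.IGNORECASE and re.MULTILINE, and there is no g flag, because re.findall already returns every match while re.search returns only the first.

```python
import re

text = "Apple\napple pie\nAPPLE"

case_sensitive = re.findall(r"apple", text)
case_insensitive = re.findall(r"apple", text, flags=re.IGNORECASE)

# Without MULTILINE, ^ only anchors at the start of the whole string.
string_start = re.findall(r"^apple", text, flags=re.IGNORECASE)
line_starts = re.findall(r"^apple", text, flags=re.IGNORECASE | re.MULTILINE)

# search stops at the first match; findall behaves like a global search.
first_only = re.search(r"apple", text, flags=re.IGNORECASE).group(0)

print(case_sensitive)    # ['apple']
print(case_insensitive)  # ['Apple', 'apple', 'APPLE']
print(string_start)      # ['Apple']
print(line_starts)       # ['Apple', 'apple', 'APPLE']
print(first_only)        # Apple
```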

grep

The grep (Global Regular Expression Print) command prints lines that match patterns (i.e., regular expressions). The basic syntax for grep is:

grep 'ex_pattern' ex_file...

grep understands three different versions of regular expression syntax:

Basic (BRE, Basic Regular Expression)
Extended (ERE, Extended Regular Expression)
Perl (PCRE, Perl-compatible Regular Expression)

By default, grep uses BREs. In a BRE, the characters +, ?, {, |, (, and ) are literal; to use them as operators, escape each with a backslash (e.g., \+ or \(). In an ERE (grep -E), they act as operators without the backslash.
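The difference between basic and extended syntax can be seen by running the same pattern both ways. This is a sketch that assumes a grep executable is on the PATH and drives it from Python's subprocess module. The helper grep_lines is a hypothetical name, and the escaped form reflects GNU grep behavior.

```python
import subprocess

def grep_lines(args, text):
    """Run grep with args on text supplied via standard input; return matching lines."""
    result = subprocess.run(
        ["grep", *args], input=text, capture_output=True, text=True
    )
    return result.stdout.splitlines()

sample = "a+b\naab\n"

# BRE (default): the unescaped + is a literal plus sign.
bre_literal = grep_lines(["a+b"], sample)

# ERE (-E): the unescaped + means "one or more of the preceding item".
ere_operator = grep_lines(["-E", "a+b"], sample)

# BRE with a backslash: \+ acts as the repetition operator in GNU grep.
bre_escaped = grep_lines([r"a\+b"], sample)

print(bre_literal)   # ['a+b']
print(ere_operator)  # ['aab']
print(bre_escaped)   # ['aab'] with GNU grep
```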

Multiple files can be supplied to grep as arguments. If no file is given as an argument, grep uses the standard input for its input.

For grep, an exit status of 0 means a line was selected (i.e., there was a hit), an exit status of 1 means no lines were selected (i.e., there were no hits), and an exit status of 2 means an error occurred.
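The exit statuses are what make grep useful in scripts. The sketch below assumes a grep executable is on the PATH and reads its return code through Python's subprocess module. The helper grep_status and the path /nonexistent/ex_file are hypothetical and exist only to trigger each case.

```python
import subprocess

def grep_status(args, text=""):
    """Run grep with args, feeding text on standard input, and return its exit status."""
    result = subprocess.run(
        ["grep", *args], input=text, capture_output=True, text=True
    )
    return result.returncode

hit = grep_status(["soul"], "silicon soul\n")     # a line was selected
miss = grep_status(["carbon"], "silicon soul\n")  # no lines were selected
error = grep_status(["soul", "/nonexistent/ex_file"])  # the file cannot be read

print(hit, miss, error)  # 0 1 2
```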

To highlight matches in grep results, set a color in the GREP_COLORS variable (e.g., GREP_COLORS='mt=0;31') and use the --color=always option with the grep command.

Options

Helpful grep options include:

-A ex_number, --after-context=ex_number: Print ex_number lines of trailing context after matching lines.
-B ex_number, --before-context=ex_number: Print ex_number lines of leading context before matching lines.
-c, --count: Output just the number of matching lines.
-C ex_number, --context=ex_number: Print ex_number lines of context (i.e., additional lines surrounding the line with the hit) before and after matching lines.
-e ex_expression, --regexp=ex_expression: Introduce a regular expression to be searched. Can be used multiple times with the same grep command (e.g., grep -e ex_expression_1 -e ex_expression_2).
-E, --extended-regexp: Allow use of extended regular expressions.
-f ex_file, --file=ex_file: Read in search patterns (one per line) from a file. Searches are performed simultaneously.
-H, --with-filename: Print the filename for each match alongside each line match. This is the default behavior for grep when there is more than one file to search.
-i, --ignore-case: Perform case insensitive search.
-I: Process a binary file as if it did not contain matching data. This is equivalent to the --binary-files=without-match option.
-l, --files-with-matches: List just the names of matching files, not the actual line matches.
-L, --files-without-match: List just the filenames of non-matching files, not the actual line matches.
--line-buffered: Causes grep to write its output line by line, instead of buffering larger amounts of output, as it usually does, to make writing more efficient. This can cause a performance penalty.
-m ex_number, --max-count=ex_number: Stop reading a file after ex_number matching lines.
-n, --line-number: Include the line number of matching lines in the output.
-r, --recursive: Search recursively. If a directory is not specified as an argument, grep will use the current directory as its argument and search recursively. For binary files, grep typically reports only that the file matches instead of printing the matching lines.
-v, --invert-match: Output lines that do not match the regular expression.
-x, --line-regexp: Output only exact whole line matches.
-w, --word-regexp: Output only exact whole word matches.
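A few of these options can be exercised on a small sample. This sketch assumes a grep executable is on the PATH and calls it through Python's subprocess module. The helper grep_lines and the log text are made up for the example.

```python
import subprocess

def grep_lines(args, text):
    """Run grep with args on text supplied via standard input; return output lines."""
    result = subprocess.run(
        ["grep", *args], input=text, capture_output=True, text=True
    )
    return result.stdout.splitlines()

log = "error: disk full\nwarning: disk at 90%\nError: network down\ninfo: ok\n"

count = grep_lines(["-c", "-i", "error"], log)     # number of matching lines
numbered = grep_lines(["-n", "-i", "error"], log)  # line numbers as prefixes
inverted = grep_lines(["-v", "-i", "error"], log)  # lines that do not match
either = grep_lines(["-e", "warning", "-e", "info"], log)  # two patterns

whole_word = grep_lines(["-w", "disk"], "disk\ndiskette\n")
whole_line = grep_lines(["-x", "disk"], "disk\ndisk full\n")

print(count)       # ['2']
print(numbered)    # ['1:error: disk full', '3:Error: network down']
print(inverted)    # ['warning: disk at 90%', 'info: ok']
print(either)      # ['warning: disk at 90%', 'info: ok']
print(whole_word)  # ['disk']
print(whole_line)  # ['disk']
```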

Documentation

You can learn more about regular expressions via the GNU Grep manual and the Python Software Foundation documentation. For more on grep, run man 1 grep or check out its man page online.

The best way to learn regular expressions is to start looking at examples and using them. Create a list of different text strings that you want to match and see what kinds of regular expressions you can create to target them.

For testing these expressions, an online tool like regular expressions 101 is invaluable. There, you can enter a test string and a regular expression. Then, you get a step-by-step breakdown of what each part of the regex is doing and what it matched (if anything) in your test string.
