A selection of short notes I’ve made about Regex and Globbing to help me remember various features. Both techniques allow for matching patterns in strings.

egrep – extended regex search function

egrep allows use of extended regex operators. It is essentially identical to using grep with the -E option, which recent GNU grep releases recommend instead (they flag egrep and fgrep as obsolescent).

fgrep – literal search function

fgrep performs the same function as grep, but every character in the pattern is treated as a literal, so there is no need to escape special characters. It is essentially identical to using grep with the -F option. Therefore:

fgrep '**'
#will match lines containing the literal string **; the asterisks are not treated as operators.
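
A minimal sketch of the literal matching described above, using grep -F (the option form of fgrep). The sample lines are made up for illustration.

```shell
# Hypothetical sample data: only the first line contains a literal "**".
sample=$(printf '%s\n' 'a ** b' 'a * b' 'plain text')

# -F treats the whole pattern as a fixed string, so * is not an operator.
fixed_hits=$(printf '%s\n' "$sample" | grep -F '**')

echo "$fixed_hits"
```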

grep – search function

Arguably the most common tool for using regex, grep searches files for lines matching a provided pattern. It can only search files that the user has permission to read.

grep 'noobot' noobfile
#searches noobfile for all lines containing the string noobot

grep 'regex' noobfile
#searches noobfile for all lines containing text that matches the expression regex (quote the expression so the shell does not interpret it first)

grep -E
#allows use of extended regex

OPTIONS
-i #case insensitive
-v #inverts the match, returning lines that DO NOT match the pattern
-l #list the names of files that contain a match rather than the matching lines
-r #search recursively, including sub directories
-w #match whole words only
-q #quiet mode (no output; the exit status shows whether a match was found)
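
A rough sketch of how these options behave in practice. The scratch directory, file names and file contents are all made up for the example; nothing here comes from a real system.

```shell
# Scratch directory with hypothetical sample files (names are made up).
tmp=$(mktemp -d)
mkdir "$tmp/sub"
printf '%s\n' 'Noobot says hi' 'noobots everywhere' 'nothing here' > "$tmp/a.txt"
printf '%s\n' 'a lone noobot' > "$tmp/sub/b.txt"

# -i: ignore case, so "Noobot" and "noobots" both match.
ci=$(grep -i 'noobot' "$tmp/a.txt")

# -v: print only the lines that do NOT match.
inv=$(grep -v -i 'noobot' "$tmp/a.txt")

# -w: whole words only, so "noobots" is rejected.
word=$(grep -i -w 'noobot' "$tmp/a.txt")

# -r with -l: recurse into sub directories and list the matching file names.
files=$(grep -r -l 'noobot' "$tmp" | sort)

# -q: print nothing; the exit status says whether a match exists.
if grep -q 'noobot' "$tmp/sub/b.txt"; then found=yes; else found=no; fi

echo "$ci"; echo "$inv"; echo "$word"; echo "$files"; echo "$found"
rm -rf "$tmp"
```

The -q form is the one I would reach for inside scripts, since only the exit status is needed there.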

* – Asterisk – Globbing

The * in globbing is a wildcard that matches any string of characters, including none at all. Therefore:

echo *

will display a list of every file and sub directory in the current working directory, except hidden ones whose names start with a full stop.

The scope can be restricted by adding defined characters. For example:

echo D*

will display a list of every file/directory in the current directory that starts with capital D (perhaps Desktop, Documents and Downloads)

echo D*n* 

will display a list of all files/directories in current directory that start with capital D but ALSO contain the lower case n (in previous example Documents and Downloads will be displayed, but not Desktop)
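
The three globs above can be tried safely in a scratch directory. This is only a sketch: the directory is created with mktemp, and notes.txt and .hidden are made-up names added to show what is and is not picked up.

```shell
# Scratch directory mirroring the Desktop/Documents/Downloads example.
tmp=$(mktemp -d)
mkdir "$tmp/Desktop" "$tmp/Documents" "$tmp/Downloads"
touch "$tmp/notes.txt" "$tmp/.hidden"

# Each glob runs in a subshell so the caller's working directory is untouched.
all=$(cd "$tmp" && echo *)       # everything except the hidden dotfile
d_star=$(cd "$tmp" && echo D*)   # names starting with capital D
d_n=$(cd "$tmp" && echo D*n*)    # ...that also contain an n after the D

echo "$all"; echo "$d_star"; echo "$d_n"
rm -rf "$tmp"
```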

? – Question Mark – Globbing

The ? in globbing is a wildcard that matches exactly ONE character, whatever it is, in a specific position. Therefore:

echo ?

will display all files in the current directory whose names are exactly one character long.

echo D??

will display all files in the current directory that start with capital D and are exactly three characters long.

echo D??*

will display all files in the current directory that start with capital D and are AT LEAST three characters long.
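
A small sketch of the three ? patterns, again in a scratch directory. The file names (D, Da, Doc, Docs, Dog, x) are invented purely to give each pattern something to match or reject.

```shell
tmp=$(mktemp -d)
touch "$tmp/D" "$tmp/Da" "$tmp/Doc" "$tmp/Docs" "$tmp/Dog" "$tmp/x"

one=$(cd "$tmp" && echo ?)          # names exactly one character long
three=$(cd "$tmp" && echo D??)      # capital D plus exactly two more characters
at_least=$(cd "$tmp" && echo D??*)  # capital D plus two or more characters

echo "$one"; echo "$three"; echo "$at_least"
rm -rf "$tmp"
```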

[] – Square Brackets – Globbing

The [] in globbing are used to enclose a set of characters, any ONE of which may appear in that specific position. Therefore:

echo [DEW]*

will display all files/directories in the current directory that begin with capital D, E or W ONLY.

The [] are also used to make special characters literal, for example:

echo [?]*

will display all files that begin with a literal ?, rather than with any character. Special characters include ?, * and [.

– – Hyphen – Globbing

The - in globbing can be used within [] to specify a range of characters. Therefore:

echo [D-G]*

will display all files/directories in the current directory that begin with capital D, E, F or G ONLY. The range runs consecutively according to the ASCII character table, although in some locales the collation order differs and a range may take in other characters.

! or ^ – Exclamation Mark or Caret – Globbing

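The ! or ^ is used as the first character within [] to denote an INVERSE set of characters, ie characters that must NOT appear in that position. The ! form is the portable one; ^ is accepted by bash and some other shells. Therefore: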

echo [!a-z]*

will display all files that DO NOT start with a lower case letter. Files starting with upper case letters, numbers or other characters will be displayed.
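
The set, literal, range and inverse forms of [] can be compared side by side. This is a sketch under two assumptions: the file names are made up, and LC_ALL=C is set inside each subshell so that ranges and sort order follow plain ASCII regardless of the system locale.

```shell
tmp=$(mktemp -d)
touch "$tmp/Desktop" "$tmp/Eggs" "$tmp/Fun" "$tmp/Water" \
      "$tmp/apple" "$tmp/9lives" "$tmp/?odd"

# LC_ALL=C pins ranges and sort order to ASCII for a predictable result.
set_hits=$(cd "$tmp" && LC_ALL=C && echo [DEW]*)     # starts with D, E or W
range_hits=$(cd "$tmp" && LC_ALL=C && echo [D-G]*)   # starts with D, E, F or G
not_lower=$(cd "$tmp" && LC_ALL=C && echo [!a-z]*)   # does not start a-z
literal_q=$(cd "$tmp" && LC_ALL=C && echo [?]*)      # starts with a literal ?

echo "$set_hits"; echo "$range_hits"; echo "$not_lower"; echo "$literal_q"
rm -rf "$tmp"
```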

. – Full Stop/Period Regex Operator

The . operator matches any ONE single character (except newline)

grep 'mrn..bot' noobfile
#matches any string that starts mrn, followed by any two characters, followed by bot
#mrneebot, mrnaabot, mrnuubot, mrn11bot would all be found, for example, but mrnibot will not.
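
A quick sketch of the . operator, feeding made-up words to grep on standard input instead of using noobfile.

```shell
# Hypothetical test words, one per line.
words=$(printf '%s\n' mrneebot mrn11bot mrnibot mrn-_bot)

# Each . must consume exactly one character, so mrnibot (one character
# between mrn and bot) cannot match.
dot_hits=$(printf '%s\n' "$words" | grep 'mrn..bot')

echo "$dot_hits"
```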

[] – Square Brackets – Regex Operator

Contains a set or range of characters to match against. If the set or range is preceded by a caret (^), matching occurs against characters NOT in the set.

grep '[0-9]' noobfile
#returns all lines that contain a digit

grep '[^0-9]' noobfile
#returns all lines that contain at least one character that is NOT a digit; only empty lines and lines made up SOLELY of digits are excluded

grep '[.]' noobfile
#makes the . a literal, returning all lines that contain a full stop
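
The inverse set is the easiest of these to misread, so here is a small sketch with invented sample lines (including an empty one) showing what each bracket expression selects.

```shell
# Hypothetical sample lines; the third line is deliberately empty.
lines=$(printf '%s\n' 'abc' '123' '' 'a1' 'v1.0')

has_digit=$(printf '%s\n' "$lines" | grep '[0-9]')       # contains a digit
has_non_digit=$(printf '%s\n' "$lines" | grep '[^0-9]')  # contains a non-digit
has_dot=$(printf '%s\n' "$lines" | grep '[.]')           # contains a literal .

echo "$has_digit"; echo "$has_non_digit"; echo "$has_dot"
```

Note that a1 appears in both of the first two results: [^0-9] asks for a non-digit somewhere on the line, which is not the same as asking for a line without digits (grep -v '[0-9]' does that).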

* – Asterisk Regex Operator

Matches 0 or more occurrences of the preceding character or bracketed set

grep 'mrno*bot' noobfile
#returns lines containing mrnbot, mrnobot, mrnoobot, mrnooobot and so on.

grep 'mrn[ou]*bot' noobfile
#as above but will also return mrnubot, mrnuubot and even mrnuoubot

Because asterisk also matches zero occurrences, when looking for AT LEAST one occurrence the following syntax may be better:

grep 'mrnoo*bot' noobfile
#will return mrnobot, mrnoobot and so on BUT NOT mrnbot.
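
The three asterisk patterns above, run against a made-up word list on standard input so the differences are visible in one place.

```shell
words=$(printf '%s\n' mrnbot mrnobot mrnoobot mrnuoubot)

zero_or_more=$(printf '%s\n' "$words" | grep 'mrno*bot')     # o repeated 0+ times
set_repeat=$(printf '%s\n' "$words" | grep 'mrn[ou]*bot')    # o or u, 0+ times
one_or_more=$(printf '%s\n' "$words" | grep 'mrnoo*bot')     # one o, then 0+ more

echo "$zero_or_more"; echo "$set_repeat"; echo "$one_or_more"
```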

^ – Caret Regex Operator – Front Anchor

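In instances where the caret operator is the FIRST character in the expression, the match must occur at the START of the line, beginning with the characters that follow the caret.

If the caret appears in the middle of a BASIC expression then it is treated as a LITERAL character to match. In an extended expression it remains an anchor wherever it appears, so it must be escaped to match literally.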

If the caret appears as the first character inside [] square brackets, then it is treated as an INVERSE or NOT operator for matching.

grep '^root' /etc/passwd
#search the /etc/passwd file for a line that STARTS with root

Complete line matching can be achieved by using the ^ and $ operators together, with the desired line match between the anchors.

$ – Dollar Regex Operator – Back Anchor

Where $ appears as the FINAL character in the expression, the characters preceding it must appear at the END of the line

grep 'bot$' noobfile
#search noobfile for all lines that END with bot

Complete line matching can be achieved by using the ^ and $ operators together, with the desired line match between the anchors.
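
A short sketch of the two anchors on their own and then combined for complete line matching. The sample lines are invented and piped in on standard input.

```shell
lines=$(printf '%s\n' 'root' 'root:x:0:0' 'chroot' 'robot' 'bot')

starts=$(printf '%s\n' "$lines" | grep '^root')    # line begins with root
ends=$(printf '%s\n' "$lines" | grep 'bot$')       # line ends with bot
whole=$(printf '%s\n' "$lines" | grep '^root$')    # line is exactly root

echo "$starts"; echo "$ends"; echo "$whole"
```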

() – Brackets – Extended Regex Operator

The brackets are used to group characters together so that other operators apply to the whole group

echo mrnoonoobot | grep -E 'mr(noo)*bot'
#will match mrnoonoobot as the noo pattern is repeated 0 or more times
#Would also match mrnoobot AND mrbot

The contents of brackets can be referred to in the order in which they appear in the expression by using \1, \2, \3 and so on. This allows for formatting of the result in certain circumstances.

For example:

#if
head /etc/passwd
#returns
mrnoobot:x:5:5:mrnoobot:/mrnoobot:/bin/bash

#the following
head /etc/passwd | sed -r 's/([a-z/]+):([a-z/]+)$/\2:\1/'
#returns
mrnoobot:x:5:5:mrnoobot:/bin/bash:/mrnoobot

#sed has looked for a pattern at the end of a line using two bracketed sets in the FIND part of the function. In the REPLACE side of the function, these patterns are referenced using \2 and \1 to switch the order.
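
The same swap can be tried without touching /etc/passwd by piping the sample line in directly. This sketch assumes a sed that accepts -E for extended regex (GNU and BSD sed both do; -r is the older GNU spelling). The second command, with a made-up name, is only there to show the idea on something simpler.

```shell
# The sample passwd-style line from the notes above, fed in on standard input.
swapped=$(printf '%s\n' 'mrnoobot:x:5:5:mrnoobot:/mrnoobot:/bin/bash' |
  sed -E 's/([a-z/]+):([a-z/]+)$/\2:\1/')

# Hypothetical second example: turn "first last" into "last, first".
renamed=$(printf '%s\n' 'Ada Lovelace' |
  sed -E 's/^([A-Za-z]+) ([A-Za-z]+)$/\2, \1/')

echo "$swapped"; echo "$renamed"
```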

+ – Plus – Extended Regex Operator

+ matches ONE OR MORE of the preceding set/characters, making o+ equivalent to oo*, for example.

echo mrnoobot | grep -E 'mrno+bot'
#will match mrnoobot, mrnobot, and also mrnooobot BUT NOT mrnbot.

echo mrnoobot | grep -E 'mr(noo)+bot'
#will match mrnoobot and also mrnoonoobot BUT NOT mrbot

? – Question Mark – Extended Regex Operator

Makes the preceding character or group optional, matching it ZERO OR ONE times

echo mrnobot | grep -E 'mrnoo?bot'
#will match mrnobot and mrnoobot BUT NOT mrnbot or mrnooobot

echo mrnoobot | grep -E 'mr(noo)?bot'
#will match mrbot and mrnoobot BUT NOT mrnoonoobot
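
A sketch comparing + and ? on single characters and on groups, using one made-up word list so the differences line up.

```shell
words=$(printf '%s\n' mrbot mrnbot mrnobot mrnoobot mrnooobot mrnoonoobot)

plus_char=$(printf '%s\n' "$words" | grep -E 'mrno+bot')      # one or more o
opt_char=$(printf '%s\n' "$words" | grep -E 'mrnoo?bot')      # one o, second optional
plus_group=$(printf '%s\n' "$words" | grep -E 'mr(noo)+bot')  # one or more noo
opt_group=$(printf '%s\n' "$words" | grep -E 'mr(noo)?bot')   # noo is optional

echo "$plus_char"; echo "$opt_char"; echo "$plus_group"; echo "$opt_group"
```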

{} – Curly Braces – Extended Regex Operator

Specifies the number of times the preceding character or group should be matched

grep -E 'mrno{0,}bot'
#will match mrnbot, mrnobot, mrnoobot and so on; it is trying to match the relevant o ZERO OR MORE times

grep -E 'mrno{1,}bot'
#looks for the relevant o ONE OR MORE times

grep -E 'mrno{2}bot'
#looks for the relevant o EXACTLY TWICE

grep -E 'mrno{,4}bot'
#looks for the relevant o FOUR OR FEWER times (leaving out the lower bound is a GNU extension; {0,4} is the portable form)

grep -E 'mrno{2,4}bot'
#looks for the relevant o TWO TO FOUR times (inclusive)

Therefore:

mrno*bot = mrno{0,}bot
mr(noo)*bot = mr(noo){0,}bot
mrno+bot = mrno{1,}bot
mr(noo)+bot = mr(noo){1,}bot
mrnoo?bot = mrnoo{0,1}bot
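
One way to sanity-check these equivalences is to run both forms over the same input and compare the results. In this sketch the word list and the helper function same are made up for the purpose; agreeing on one sample list is a spot check, not a proof.

```shell
words=$(printf '%s\n' mrbot mrnbot mrnobot mrnoobot mrnooobot mrnoonoobot)

# Hypothetical helper: succeeds if both patterns select the same lines.
same() {
  a=$(printf '%s\n' "$words" | grep -E "$1")
  b=$(printf '%s\n' "$words" | grep -E "$2")
  [ "$a" = "$b" ]
}

# Count how many of the five pairs agree on the sample list.
equal=0
same 'mrno*bot'    'mrno{0,}bot'    && equal=$((equal + 1))
same 'mr(noo)*bot' 'mr(noo){0,}bot' && equal=$((equal + 1))
same 'mrno+bot'    'mrno{1,}bot'    && equal=$((equal + 1))
same 'mr(noo)+bot' 'mr(noo){1,}bot' && equal=$((equal + 1))
same 'mrnoo?bot'   'mrnoo{0,1}bot'  && equal=$((equal + 1))

# A bounded range for comparison: two to four o characters.
bounded=$(printf '%s\n' "$words" | grep -E 'mrno{2,4}bot')

echo "$equal"; echo "$bounded"
```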

| – Pipe – Extended Regex Operator

Acts as a logical OR operator

grep -E 'mrnoobot|mrnuubot'
#will match mrnoobot OR mrnuubot

grep -E 'mrn(oo|uu)bot'
#will also match mrnoobot OR mrnuubot

Therefore:

mrn[ou][ou]bot = mrn(o|u)(o|u)bot
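
A sketch showing that the grouped alternation and the per-character forms are not interchangeable: (oo|uu) only accepts matching pairs, while [ou][ou] and (o|u)(o|u) also accept mixed pairs such as ou. The word list is made up.

```shell
words=$(printf '%s\n' mrnoobot mrnuubot mrnoubot mrnaabot)

pairs=$(printf '%s\n' "$words" | grep -E 'mrn(oo|uu)bot')       # oo or uu only
by_set=$(printf '%s\n' "$words" | grep -E 'mrn[ou][ou]bot')     # any mix of o and u
by_alt=$(printf '%s\n' "$words" | grep -E 'mrn(o|u)(o|u)bot')   # same as the set form

echo "$pairs"; echo "$by_set"; echo "$by_alt"
```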

\ – Backslash – Regex Operator

\ acts as an escape character that turns an operator into a literal character. In a BASIC expression, however, the extended operators ?, +, {}, | and () are literal by default, and preceding them with \ turns them into operators (\?, \+ and \| are GNU extensions). It can also be used to introduce a designated sequence.

The sequences are:

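\b #word boundary operator (the zero-width position between a word character and a non-word character such as whitespace or punctuation, or the start/end of the line)
\B #NOT a word boundary operator
\w #word character class [a-zA-Z0-9_]
\W #NOT a word character class [^a-zA-Z0-9_]
\d #digit class [0-9] (Perl-style, so it needs grep -P; with grep -E use [0-9] or [[:digit:]])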
\s #whitespace character class
\S #NOT whitespace character class
\\ #literal backslash

grep -E '\w{9}'
#searches for lines containing a run of at least 9 consecutive word characters [a-zA-Z0-9_]; use '\b\w{9}\b' to match a word precisely 9 characters long
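
A sketch of \w and \b in use, assuming GNU grep (these sequences are extensions and may not be available in every grep). The sample lines are made up; note that the underscore counts as a word character.

```shell
lines=$(printf '%s\n' 'abcdefghi' 'abcdefghijkl' 'abc_def_1' 'too short')

# Any line containing nine word characters in a row, including longer runs.
any9=$(printf '%s\n' "$lines" | grep -E '\w{9}')

# Word boundaries on both sides restrict the match to words exactly nine long.
exact9=$(printf '%s\n' "$lines" | grep -E '\b\w{9}\b')

echo "$any9"; echo "$exact9"
```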