A selection of short notes I’ve made about Regex and Globbing to help me remember various features. Both techniques allow for matching patterns in strings.
egrep – extended regex search function
egrep allows use of extended regex operators. It is essentially identical to using grep with the -E option.
fgrep – literal search function
fgrep performs the same function as grep, but every character in the pattern is treated as a literal, so there is no need to escape special characters. It is equivalent to using grep with the -F option. Therefore:
fgrep '**'
#will match any line containing the literal string **
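A minimal sketch of the difference, assuming a throwaway file made with mktemp (the sample lines are invented for illustration) and using grep -F as the equivalent of fgrep:

```shell
# Hypothetical sample data written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'a ** b' 'plain text' 'x*y' > "$tmp"

# -F treats the pattern as a fixed string, so ** means two literal asterisks.
fixed=$(grep -F '**' "$tmp")

# Without -F each asterisk has to be escaped to get the same result.
escaped=$(grep '\*\*' "$tmp")

printf '%s\n' "$fixed" "$escaped"
rm -f "$tmp"
```

Both commands should print the same single line, a ** b.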
grep – search function
Arguably the most common tool for using regex, grep searches files for lines matching a provided pattern. It can only search files that the user has permission to read.
grep 'noobot' noobfile
#searches noobfile for all lines containing the word noobot
grep 'regex' noobfile
#searches noobfile for all lines containing text that matches the expression regex (quote the expression so the shell does not interpret it)
grep -E
#allows use of extended regex
OPTIONS
-i #case insensitive
-v #inverts the search, returning lines that DO NOT match
-l #list the names of files that contain a match rather than the matching lines
-r #search recursively including sub directories
-w #match whole word only
-q #quiet mode (no output; the exit status shows whether a match was found)
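The options above can be tried out with a quick sketch like the one below. The file contents and variable names are made up for illustration.

```shell
# Hypothetical sample data written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'noobot is here' 'NOOBOT shouting' 'noobots plural' 'nothing relevant' > "$tmp"

insensitive=$(grep -i 'noobot' "$tmp")   # first three lines
inverted=$(grep -v 'noobot' "$tmp")      # lines without lowercase noobot
whole=$(grep -w 'noobot' "$tmp")         # noobot as a whole word only

# -q prints nothing; the exit status says whether a match was found.
if grep -q 'noobot' "$tmp"; then found=yes; else found=no; fi

printf '%s\n' "$whole" "$found"
rm -f "$tmp"
```

Note that -w rejects noobots, because the s directly after the match is still part of the same word.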
* – Asterisk – Globbing
The * in globbing is a wildcard that matches any string of characters, including no characters at all. Therefore:
echo *
will display a list of every file and sub directory in the current working directory (hidden names beginning with a . are not included by default).
The scope can be restricted by adding defined characters. For example:
echo D*
will display a list of every file/directory in the current directory that starts with capital D (perhaps Desktop, Documents and Downloads)
echo D*n*
will display a list of all files/directories in current directory that start with capital D but ALSO contain the lower case n (in previous example Documents and Downloads will be displayed, but not Desktop)
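To see the three patterns side by side, here is a small sketch run in a scratch directory. The directory and file names are assumptions chosen to mirror the example above.

```shell
# Work in a throwaway directory so the globs only see known names.
dir=$(mktemp -d)
cd "$dir" || exit 1
mkdir Desktop Documents Downloads
touch notes.txt .hidden

all=$(echo *)          # Desktop Documents Downloads notes.txt (no .hidden)
d_names=$(echo D*)     # Desktop Documents Downloads
dn_names=$(echo D*n*)  # Documents Downloads

printf '%s\n' "$all" "$d_names" "$dn_names"
cd - >/dev/null || exit 1
rm -rf "$dir"
```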
? – Question Mark – Globbing
The ? in globbing is a wildcard that represents any character in a specific position, one time only. Therefore:
echo ?
will display all files in the current directory that only have one character in their name.
echo D??
will display all files in the current directory that start with capital D and are exactly three characters long.
echo D??*
will display all files in the current directory that start with capital D and are AT LEAST three characters long.
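A short sketch of the ? patterns, again in a scratch directory with invented file names:

```shell
# Work in a throwaway directory so the globs only see known names.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch D Do Dog Dogs Door x

one=$(echo ?)          # D x
three=$(echo D??)      # Dog
at_least=$(echo D??*)  # Dog Dogs Door

printf '%s\n' "$one" "$three" "$at_least"
cd - >/dev/null || exit 1
rm -rf "$dir"
```

Do is skipped by both D?? and D??* because each ? must consume exactly one character.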
[] – Square Brackets – Globbing
The [] in globbing are used to enclose a set of characters, any ONE of which may appear in that specific position. Therefore:
echo [DEW]*
will display all files/directories in the current directory that begin with capital D, E or W ONLY.
The [] are also used to make special characters literal, for example:
echo [?]*
will display all files that begin with a ?, rather than any character. Special characters include ?, * and [].
- – Hyphen – Globbing
The - in globbing can be used within [] to specify a range of characters. Therefore:
echo [D-G]*
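will display all files/directories in the current directory that begin with capital D, E, F or G ONLY. The range runs consecutively according to the ASCII character table in the C locale; in other locales the collation order may differ, so a range can pick up unexpected characters.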
! or ^ – Exclamation Mark or Caret – Globbing
The ! or ^ is used as the first character within [] to denote an INVERSE set of characters, ie characters that must NOT appear in that position. Therefore:
echo [!a-z]*
will display all files that DO NOT start with a lower case letter. Files starting with upper case letters, numbers or other characters will be displayed.
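The bracket forms can be compared in one sketch. It assumes LC_ALL=C is set so that ranges follow ASCII order, and the file names are invented for illustration.

```shell
# Assumption: force the C locale so that ranges such as D-G follow ASCII order.
export LC_ALL=C
dir=$(mktemp -d)
cd "$dir" || exit 1
touch Desktop Ebooks Films Work notes 1st '?odd'

set_match=$(echo [DEW]*)    # Desktop Ebooks Work
range_match=$(echo [D-G]*)  # Desktop Ebooks Films
literal_q=$(echo [?]*)      # ?odd
negated=$(echo [!a-z]*)     # every name except notes

printf '%s\n' "$set_match" "$range_match" "$literal_q" "$negated"
cd - >/dev/null || exit 1
rm -rf "$dir"
```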
. – Full Stop/Period Regex Operator
The . operator matches any ONE single character (except newline)
grep 'mrn..bot' noobfile
#matches any pattern that starts mrn, followed by any two characters, followed by bot
#mrneebot, mrnaabot, mrnuubot, mrn11bot would all be found, for example, but mrnibot will not.
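A quick way to check this without a file is to pipe candidate strings into grep. The strings below are just test inputs:

```shell
# Feed candidate strings to grep on standard input, one per line.
matches=$(printf '%s\n' mrneebot mrn11bot mrnibot mrnbot | grep 'mrn..bot')

# Prints mrneebot and mrn11bot; the other two have too few characters between mrn and bot.
printf '%s\n' "$matches"
```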
[] – Square Brackets – Regex Operator
Contains a set or range of characters to match against. If the set or range is preceded by a caret (^), the match is made against characters NOT in the set. The expression should be quoted so that the shell does not treat the brackets as a glob.
grep '[0-9]' noobfile
#returns all lines that contain a number
grep '[^0-9]' noobfile
#returns all lines that contain at least one character that is not a number; lines made up SOLELY of numbers (and empty lines) are not returned
grep '[.]' noobfile
#converts the . to a literal returning all lines that contain a full stop
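The behaviour of the negated set is easy to misread, so here is a small sketch with made-up lines, including an empty one:

```shell
# Hypothetical sample data; the final '' produces an empty line.
tmp=$(mktemp)
printf '%s\n' 'abc' '123' 'abc123' 'end.' '' > "$tmp"

digits=$(grep '[0-9]' "$tmp")       # 123 and abc123
non_digit=$(grep '[^0-9]' "$tmp")   # abc, abc123 and end.
dots=$(grep '[.]' "$tmp")           # end.

printf '%s\n' "$digits" "$non_digit" "$dots"
rm -f "$tmp"
```

The line 123 and the empty line are both missing from the [^0-9] result, because neither contains a character that is not a digit.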
* – Asterisk Regex Operator
Matches 0 or more occurrences of the preceding character or set
grep 'mrno*bot' noobfile
#returns lines containing mrnbot, mrnobot, mrnoobot, mrnooobot and so on.
grep 'mrn[ou]*bot' noobfile
#as above but will also return mrnubot, mrnuubot and even mrnuoubot
Because asterisk also matches zero occurrences, when looking for AT LEAST one occurrence the following syntax may be better:
grep 'mrnoo*bot' noobfile
#will return mrnobot, mrnoobot and so on BUT NOT mrnbot.
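The three expressions above can be compared against the same input. This is a sketch using an invented word list:

```shell
# Hypothetical word list, one candidate per line.
tmp=$(mktemp)
printf '%s\n' mrnbot mrnobot mrnoobot mrnubot mrnuoubot > "$tmp"

star=$(grep 'mrno*bot' "$tmp")          # mrnbot mrnobot mrnoobot
set_star=$(grep 'mrn[ou]*bot' "$tmp")   # all five words
one_plus=$(grep 'mrnoo*bot' "$tmp")     # mrnobot mrnoobot

printf '%s\n' "$star" "$set_star" "$one_plus"
rm -f "$tmp"
```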
^ – Caret Regex Operator – Front Anchor
In instances where the caret operator is the FIRST character in the expression, then the match MUST START with the characters that follow it.
If the caret appears in the middle of a BASIC expression then it is treated as a LITERAL character to match (in an extended expression it remains an anchor unless escaped).
If the caret appears as the first character inside [] square brackets, then it is treated as an INVERSE or NOT operator for matching.
grep '^root' /etc/passwd
#search the /etc/passwd file for a line that STARTS with root
Complete line matching can be achieved by using the ^ and $ operators together, with the desired line match between the anchors.
$ – Dollar Regex Operator – Back Anchor
Where $ appears as the FINAL character in the expression, then the characters preceding it must appear at the END of the line
grep 'bot$' noobfile
#search noobfile for all lines that END with bot
Complete line matching can be achieved by using the ^ and $ operators together, with the desired line match between the anchors.
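A sketch of both anchors and of complete line matching, using made-up lines rather than a real system file:

```shell
# Hypothetical sample data written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'root:x:0:0' 'chroot jail' 'mrnoobot' 'mrnoobot rules' 'robot' > "$tmp"

starts=$(grep '^root' "$tmp")          # root:x:0:0
ends=$(grep 'bot$' "$tmp")             # mrnoobot and robot
whole_line=$(grep '^mrnoobot$' "$tmp") # mrnoobot only

printf '%s\n' "$starts" "$ends" "$whole_line"
rm -f "$tmp"
```

grep -x 'mrnoobot' is another way to ask for a whole line match without writing the anchors.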
() – Brackets – Extended Regex Operator
The brackets are used to group characters together so that other operators apply to the whole group
echo mrnoonoobot | grep -E 'mr(noo)*bot'
#will match mrnoonoobot as the noo pattern is repeated 0 or more times
#Would also match mrnoobot AND mrbot
The contents of brackets can be referred to in the order in which they appear in the expression by using \1, \2, \3 and so on. This allows for formatting of the result in certain circumstances.
For example:
#if
head /etc/passwd
#returns
mrnoobot:x:5:5:mrnoobot:/mrnoobot:/bin/bash
#the following
head /etc/passwd | sed -r 's/([a-z/]+):([a-z/]+)$/\2:\1/'
#returns
mrnoobot:x:5:5:mrnoobot:/bin/bash:/mrnoobot
#sed has looked for a pattern at the end of a line using two bracketed sets in the FIND part of the function. In the REPLACE side of the function, these patterns are referenced using \2 and \1 to switch the order.
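The same swap can be tried on a simpler string. This sketch uses sed -E, which is the more portable spelling of -r, and an invented input:

```shell
# Two bracketed groups are captured, then emitted in reverse order.
swapped=$(echo 'first:second' | sed -E 's/([a-z]+):([a-z]+)/\2:\1/')

# Prints second:first
printf '%s\n' "$swapped"
```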
+ – Plus – Extended Regex Operator
+ matches ONE OR MORE of the preceding set/characters, making o+ equivalent to oo*, for example.
echo mrnoobot | grep -E 'mrno+bot'
#will match mrnoobot, mrnobot, and also mrnooobot BUT NOT mrnbot.
echo mrnoobot | grep -E 'mr(noo)+bot'
#will match mrnoobot and also mrnoonoobot BUT NOT mrbot
? – Question Mark – Extended Regex Operator
Makes the preceding character or group optional, ie matches it ZERO OR ONE times
echo mrnobot | grep -E 'mrnoo?bot'
#will match mrnobot and mrnoobot BUT NOT mrnbot or mrnooobot
echo mrnoobot | grep -E 'mr(noo)?bot'
#will match mrbot and mrnoobot BUT NOT mrnoonoobot
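To compare + and ? on a bracketed group, here is a short sketch with an invented word list:

```shell
# Hypothetical word list, one candidate per line.
tmp=$(mktemp)
printf '%s\n' mrbot mrnoobot mrnoonoobot > "$tmp"

plus=$(grep -E 'mr(noo)+bot' "$tmp")      # mrnoobot mrnoonoobot
optional=$(grep -E 'mr(noo)?bot' "$tmp")  # mrbot mrnoobot

printf '%s\n' "$plus" "$optional"
rm -f "$tmp"
```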
{} – Curly Braces – Extended Regex Operator
Specifies the number of times the preceding character or group should be matched
grep -E 'mrno{0,}bot'
#will match mrnbot, mrnobot, mrnoobot and so on; it is trying to match the relevant o ZERO OR MORE times
grep -E 'mrno{1,}bot'
#looks for the relevant o ONE OR MORE times
grep -E 'mrno{2}bot'
#looks for the relevant o EXACTLY TWICE
grep -E 'mrno{,4}bot'
#looks for the relevant o FOUR OR FEWER times ({,4} is a GNU extension; {0,4} is the portable form)
grep -E 'mrno{2,4}bot'
#looks for the relevant o TWO TO FOUR times (inclusive)
Therefore:
mrno*bot = mrno{0,}bot
mr(noo)*bot = mr(noo){0,}bot
mrno+bot = mrno{1,}bot
mr(noo)+bot = mr(noo){1,}bot
mrnoo?bot = mrnoo{0,1}bot
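The equivalences can be checked by running both spellings over the same input and comparing the results. This is a sketch with an invented word list:

```shell
# Hypothetical word list with zero to five o characters.
tmp=$(mktemp)
printf '%s\n' mrnbot mrnobot mrnoobot mrnooobot mrnooooobot > "$tmp"

exact=$(grep -E 'mrno{2}bot' "$tmp")      # mrnoobot
range=$(grep -E 'mrno{2,4}bot' "$tmp")    # mrnoobot mrnooobot
star_a=$(grep -E 'mrno*bot' "$tmp")
star_b=$(grep -E 'mrno{0,}bot' "$tmp")    # same lines as star_a
plus_a=$(grep -E 'mrno+bot' "$tmp")
plus_b=$(grep -E 'mrno{1,}bot' "$tmp")    # same lines as plus_a

printf '%s\n' "$exact" "$range"
rm -f "$tmp"
```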
| – Pipe – Extended Regex Operator
Acts as a logical OR operator
grep -E 'mrnoobot|mrnuubot'
#will match mrnoobot OR mrnuubot
grep -E 'mrn(oo|uu)bot'
#will also match mrnoobot OR mrnuubot
Therefore:
mrn[ou][ou]bot = mrn(o|u)(o|u)bot
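It is worth noting that (oo|uu) is stricter than [ou][ou], since the latter also accepts mixed pairs. A sketch with an invented word list:

```shell
# Hypothetical word list, one candidate per line.
tmp=$(mktemp)
printf '%s\n' mrnoobot mrnuubot mrnoubot mrnaabot > "$tmp"

alt=$(grep -E 'mrn(oo|uu)bot' "$tmp")        # mrnoobot mrnuubot
sets=$(grep -E 'mrn[ou][ou]bot' "$tmp")      # also mrnoubot
pairs=$(grep -E 'mrn(o|u)(o|u)bot' "$tmp")   # same lines as sets

printf '%s\n' "$alt" "$sets"
rm -f "$tmp"
```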
\ – Backslash – Regex Operator
\ acts as an escape character that turns an operator into a literal character. In a BASIC expression it also works the other way round: preceding (, ), { or } with \ (and, in GNU grep, +, ? or |) turns the literal character into the corresponding EXTENDED operator. It can also be used to introduce a designated sequence.
The sequences are:
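\b #word boundary operator (the edge of a word, for example next to whitespace or punctuation)
\B #NOT a word boundary operator
\w #word character class [a-zA-Z0-9_]
\W #NOT a word character class [^a-zA-Z0-9_]
\d #digit class [0-9] (Perl-style; not available with grep -E, use grep -P, [0-9] or [[:digit:]] instead)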
\s #whitespace character class
\S #NOT whitespace character class
\\ #literal backslash
grep -E '\w{9}'
#searches for lines containing a run of at least 9 consecutive word characters [a-zA-Z0-9_]; use '\b\w{9}\b' to match a word precisely 9 characters long
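A sketch of the difference between a run of nine word characters and a word of exactly nine characters. It assumes GNU grep, where \w and \b are available with -E, and the sample words are invented:

```shell
# Hypothetical word list: 9, 10, 9 (with underscores) and 5 characters long.
tmp=$(mktemp)
printf '%s\n' abcdefghi abcdefghij abc_def_1 short > "$tmp"

run_of_nine=$(grep -E '\w{9}' "$tmp")        # abcdefghi abcdefghij abc_def_1
exactly_nine=$(grep -E '\b\w{9}\b' "$tmp")   # abcdefghi abc_def_1

printf '%s\n' "$run_of_nine" "$exactly_nine"
rm -f "$tmp"
```

The ten character word still matches \w{9} because its first nine characters satisfy the expression; the word boundaries rule it out.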