Regular Expressions: (Regex|Regexp) Part 2

This post is a continuation from Regular Expressions: (Regex|Regexp) Part 1.

Regex Ninja

As I continue to learn and grow with regular expressions (regex), I am really enjoying my newfound knowledge. Regex isn’t exactly something that springs to mind as exciting, or as obviously worth learning, but I’ve come to realise that studying it has improved my methodology and my understanding of how computers work.

In my daily job I can easily tell the difference between a technically savvy individual and one who just thinks computers speak English. The latter make some shocking assumptions; the former are normally people who possess logical and systematic thinking, which is just what you need for wielding regex skills.

In the last guide we covered quite a bit of the basics of understanding regular expressions. We’ll continue here by expanding on those basics.

We’ll examine more meta-characters, starting with quantifiers.
Quantifiers that we’ll be learning about are the plus symbol (+), the question mark (?) and the star (*).

\$ egrep -w 10 numbers.txt

Above is the example we can work from. We are using the egrep command on Linux to find any lines in numbers.txt that contain the number 10.

>> Questionable circumstances
What if you wanted to widen that search? We’ve used the ‘-w’ parameter with egrep, which means it’ll only find 10 as a whole word, and not 901020 or 10ton. But what if we wanted to find either ‘1’ or ‘10’? This is where the question mark comes in.

\$ egrep -w '10?' numbers.txt

The question mark makes the 0 in ‘10’ optional, so that egrep can successfully return results for both ‘1’ and ‘10’. The question mark only works against the character placed directly to its left. We can, however, add some complexity by using parentheses within the regular expression, immediately followed by the question mark.
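
If you’d like to try this without creating numbers.txt first, here is a small sketch that pipes some sample numbers straight into grep. I’m assuming GNU grep, where ‘grep -E’ is the equivalent of egrep, and the sample values are made up for illustration.

```shell
# Made-up sample data standing in for numbers.txt
numbers=$(printf '%s\n' 1 10 100 901020 10ton)

# -E turns on extended regex syntax, -w keeps only whole-word matches.
# The pattern is quoted so the shell cannot treat ? as a filename wildcard.
matches=$(printf '%s\n' "$numbers" | grep -Ew '10?')

printf '%s\n' "$matches"
# prints:
# 1
# 10
```

The lines 100, 901020 and 10ton are left out because ‘-w’ refuses a match that is glued to further digits or letters.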

\$ egrep -w '(/etc/)?passwd' FileList.txt

Remember that on the command line we need to enclose the regex in single quotes, to make sure the parentheses don’t confuse your shell (bash, for example).
This time we’ve searched a text file that contains a list of files and directories on the system, looking for either passwd or /etc/passwd. The practical usefulness of my example is minimal (maybe if you have a list of the last files accessed or modified), but the skill itself can be carried forward into regular expressions inside programming languages.

The example above demonstrates that although a ‘?’ only works against the preceding character, it treats everything between the ‘(’ and ‘)’ as that single preceding character, allowing you to expand on what is optional.
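
To see exactly what the optional group picks up, a rough sketch like the one below can help. It assumes a grep that supports the ‘-o’ parameter (GNU and BSD grep both do), which prints only the matched text instead of the whole line. The file list is invented for the example.

```shell
# Made-up list of paths standing in for FileList.txt
files=$(printf '%s\n' passwd /etc/passwd /etc/shadow /usr/bin/passwd)

# -o prints just the part of each line that the regex matched
matched=$(printf '%s\n' "$files" | grep -Eo '(/etc/)?passwd')

printf '%s\n' "$matched"
# prints:
# passwd
# /etc/passwd
# passwd
```

The last line of output comes from /usr/bin/passwd: the ‘/etc/’ part is optional, so only ‘passwd’ is matched there.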

>> Shining Star
We’ll use the same example as above, and gain an understanding of how the star (*) can be used as an optional quantifier.

\$ egrep -w '10*' numbers.txt

This does much the same thing as the ‘?’ before, with one big difference. The star is not limited to just switching the zero on or off for successful results. It will find ‘1’ and ‘10’ like the question mark, but it’ll also find ‘100’ and ‘10000’, with as many zeros as you add to the number.
Where the question mark brings back successful results with either no zeros or one, the star brings back results with no zeros or more, with no limit.

>> Plus effect

\$ egrep -w '10+' numbers.txt

The plus (+) sign acts very similarly to the star. The only difference is that it will not be successful in finding ‘1’, but it will be successful in finding ‘10’, ‘100’ and so on. It has to find a minimum of one zero, with no upper limit on how many more follow.

Table showing the minimum and maximum limit for a positive search result.

 Quantifier   Previous character found at least   and at most
 ?            0 times                             1 time
 *            0 times                             no limit
 +            1 time                              no limit
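
The three quantifiers can be compared side by side with a short sketch like this one. The sample numbers and the matches_for helper are my own inventions for illustration, and I’m again assuming ‘grep -E’ as the equivalent of egrep.

```shell
# Made-up sample data standing in for numbers.txt
numbers=$(printf '%s\n' 1 10 100 10000 2)

# Hypothetical helper: list the whole-word matches for a pattern on one line
matches_for() {
    printf '%s\n' "$numbers" | grep -Ew "$1" | paste -sd ' ' -
}

printf '%s -> %s\n' '10?' "$(matches_for '10?')"   # 10? -> 1 10
printf '%s -> %s\n' '10*' "$(matches_for '10*')"   # 10* -> 1 10 100 10000
printf '%s -> %s\n' '10+' "$(matches_for '10+')"   # 10+ -> 10 100 10000
```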

Imagine for one minute that you are checking through the files of a website that you’re creating. You might want to check for every usage of attributes like ‘size=10’, only you want to find all sizes, ranging from size=0 to size=99 or even size=9876 if needed.
We’ll make use of a character class and a quantifier.

\$ egrep -i 'size=[0-9]+' *.html

This command will search through all files ending with .html and find every ‘size=’ that is followed by one or more digits, so any number from 0 upwards, however many digits it has.
This example is a bit limited, since the tag could have spaces between the word ‘size’ and the ‘=’, but it does demonstrate the usage of character classes and quantifiers.
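
Here is a small sketch of that search, and of the limitation with spaces, run against a few made-up lines of HTML rather than real files. The second pattern is only one possible way to allow for the spaces, using the star from earlier. I’m assuming a grep that supports ‘-o’ so that only the matched text is printed.

```shell
# Made-up HTML lines standing in for the *.html files
html=$(printf '%s\n' \
    '<font size=2>small</font>' \
    '<FONT SIZE=12>big</FONT>' \
    '<font size = 3>spaced</font>' \
    '<font size=large>no digits</font>')

# One or more digits straight after size= (any case, thanks to -i)
sizes=$(printf '%s\n' "$html" | grep -Eio 'size=[0-9]+')
printf '%s\n' "$sizes"
# prints:
# size=2
# SIZE=12

# Allowing zero or more spaces on either side of the equals sign
spaced=$(printf '%s\n' "$html" | grep -Eio 'size *= *[0-9]+')
printf '%s\n' "$spaced"
# prints:
# size=2
# SIZE=12
# size = 3
```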

>> Setting a brace
We could expand the small table above by including one more feature, the braces. Braces give you more flexibility than the other quantifiers, because you get to define both the minimum and the maximum number of times the previous character must be found for a successful regex search.

\$ egrep '10{1,9}' numbers.txt

The above example finds a ‘1’ followed by at least one and at most nine zeros, so it matches 10, 100, 1000 and so on, up to 1000000000. If we changed the braces from {1,9} to {0,9}, the regex would then say “find a ‘1’, optionally followed by up to nine zeros”, so a ‘1’ on its own would also be a successful result. Note that without ‘-w’, a line containing a longer run of zeros will still be found, because its first nine zeros are enough to satisfy the regex.
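
A quick sketch of the two brace ranges, using made-up numbers. Unlike the command above I’ve added ‘-w’ here, purely so that the upper limit becomes visible; this is my own choice for the demonstration.

```shell
# Made-up sample data: the last two values have nine and ten zeros
numbers=$(printf '%s\n' 1 10 100 1000000000 10000000000)

# {1,9}: a 1 followed by between one and nine zeros
one_to_nine=$(printf '%s\n' "$numbers" | grep -Ew '10{1,9}')
printf '%s\n' "$one_to_nine"
# prints:
# 10
# 100
# 1000000000

# {0,9}: the zeros become optional, so a lone 1 is accepted too
zero_to_nine=$(printf '%s\n' "$numbers" | grep -Ew '10{0,9}')
printf '%s\n' "$zero_to_nine"
# prints:
# 1
# 10
# 100
# 1000000000
```

The pattern is quoted because bash would otherwise expand the unquoted braces itself before grep ever sees them.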

>> Backreferencing
The best example I’ve seen for explaining this comes from the book I’ve mentioned so many times already, Mastering Regular Expressions.

\$ egrep -wi '([a-zA-Z]+) +\1' filename.txt

First of all we must clear up the understanding of the single quotes and the egrep parameters, ‘-wi’. The ‘w’ parameter tells egrep to only accept matches that start and end at the “beginning of word” and “end of word” positions described in part 1 of this guide. The ‘i’ parameter turns on egrep’s case-insensitive function, so if you do a search for ‘the’ it will also find ‘The’ and ‘tHe’.
The single quotes are for your shell’s benefit, so that everything inside them is passed to egrep as a single argument. Without the single quotes the shell will either think that there are multiple arguments or come back with complaints about the parentheses.

By doing a search for ([a-zA-Z]+) we are looking for one or more letters, and the parentheses remember whatever was matched. The ‘-w’ parameter means the overall match has to start and end on word boundaries, so we only find whole words (this could be a single-character word like ‘a’ or a longer sequence of characters).
This is followed by a space and then a plus ‘+’ symbol. As you’ll know from above, the plus symbol makes the regex find one or more of the previous character, which in this case is a space.

Lastly we have the ‘\1’ meta-combo, known as a backreference. It refers back to whatever text was actually matched by the first set of parentheses, and it will only match that same text again. Let’s create an example line of text to run our regex against.

This is is our example line.

When you run egrep against this line, ([a-zA-Z]+) first matches the word ‘This’, and that text is remembered as ‘\1’. The ‘ +’ then matches the space that follows, and ‘\1’ now has to find ‘This’ again. The next word is ‘is’, so this attempt fails.
The regex is then retried from later starting points in the line. At one point ([a-zA-Z]+) matches the ‘is’ at the end of ‘This’, and the ‘is’ that follows would satisfy ‘\1’, but the ‘-w’ parameter rejects this because the match starts in the middle of a word.
When the attempt starts at the second word, ([a-zA-Z]+) matches ‘is’, ‘ +’ matches the space, and ‘\1’ finds ‘is’ again as the third word. They match! We get back a positive result and have successfully found a doubled word, where one word is directly followed by an identical one. egrep prints the line, and the search continues with the remaining lines of the file.
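
The doubled-word search can be tried out with a sketch like this one. The sample lines (apart from the example line above) are made up, and I’m assuming GNU grep, where ‘grep -E’ accepts backreferences such as ‘\1’.

```shell
# Sample lines: two contain a doubled word, one does not
text=$(printf '%s\n' \
    'This is is our example line.' \
    'This is a test.' \
    'Paris in the the spring.')

# ([a-zA-Z]+) remembers a word, " +" matches the spaces, \1 needs that word again
doubled=$(printf '%s\n' "$text" | grep -Ewi '([a-zA-Z]+) +\1')

printf '%s\n' "$doubled"
# prints:
# This is is our example line.
# Paris in the the spring.
```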

If your program doesn’t support a feature similar to ‘-w’, try adding ‘\<’ and ‘\>’ (without the quotes) on the ends of the regex.

\$ egrep -i '\<([a-zA-Z]+) +\1\>' filename.txt
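
To show why the word markers matter, here is a rough sketch comparing the regex with and without them on a line that has no doubled word at all. It assumes GNU grep, since ‘\<’ and ‘\>’ are not supported by every regex flavour, and it uses ‘-o’ to print only the matched text.

```shell
# A line with no doubled word in it
line='This is a test.'

# Without word markers, the "is" inside "This" pairs up with the next word
loose=$(printf '%s\n' "$line" | grep -Eio '([a-zA-Z]+) +\1' || true)

# With \< and \>, the match must start and end on word boundaries
strict=$(printf '%s\n' "$line" | grep -Eio '\<([a-zA-Z]+) +\1\>' || true)

printf 'without markers: [%s]\n' "$loose"    # without markers: [is is]
printf 'with markers:    [%s]\n' "$strict"   # with markers:    []
```

The ‘|| true’ is only there because grep reports a failure status when nothing matches, which would stop a script that exits on errors.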

You can use multiple backreferences if required. The example below shows two character classes, each in its own set of parentheses, with two backreferences.

\$ egrep -wi '([a-zA-Z]) +([0-9]) +\1 \2' filename.txt
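
As a quick check of how the two backreferences behave, here is a sketch with some made-up lines. ‘\1’ has to repeat the letter and ‘\2’ has to repeat the digit, so a line only matches when both come back unchanged. As before, I’m assuming GNU grep.

```shell
# Made-up lines: a letter, a digit, then (maybe) the same letter and digit again
rows=$(printf '%s\n' 'a 1 a 1' 'a 1 b 1' 'a 1 a 2' 'x 7  x 7')

# \1 refers to the first parentheses (the letter), \2 to the second (the digit)
repeated=$(printf '%s\n' "$rows" | grep -Ewi '([a-zA-Z]) +([0-9]) +\1 \2')

printf '%s\n' "$repeated"
# prints:
# a 1 a 1
# x 7  x 7
```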

>> The Great Escape
What do you do when you want to search for a character that is normally treated as a meta-character, a dot or a star for example?
We use the escape character ‘\’ in front of it:

\$ egrep -i 'domain\.com' test.txt

Imagine you had a file with a line listing the website “domain.com”, while elsewhere in the file you might find the words “domain commander”. Without the ‘\’ in the above example you will get back positive results for both “domain.com” and “domain commander”. That’s because without the ‘\’ the dot is treated as a meta-character that matches any single character, including a space. Putting the backslash before the dot makes your regex treat the dot as a standard character rather than a meta-character.
The escape character can be used when searching for parentheses, stars etc as well.
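
The difference the backslash makes can be seen with a small sketch like this, using made-up lines in place of test.txt and ‘grep -E’ as the equivalent of egrep.

```shell
# Made-up lines standing in for test.txt
lines=$(printf '%s\n' 'visit domain.com today' 'the domain commander' 'domainxcom')

# Unescaped: the dot matches any single character, including a space
unescaped=$(printf '%s\n' "$lines" | grep -Ei 'domain.com')
printf '%s\n' "$unescaped"
# prints:
# visit domain.com today
# the domain commander
# domainxcom

# Escaped: the dot only matches a literal dot
escaped=$(printf '%s\n' "$lines" | grep -Ei 'domain\.com')
printf '%s\n' "$escaped"
# prints:
# visit domain.com today
```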

>> Wrapping it up
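Just as a side note, it appears that [a-Z] is accepted in place of [a-zA-Z] on my system. However, this depends on the locale and won’t be the case for every program or language (in plain ASCII order ‘a’ comes after ‘Z’, so the range can be rejected as invalid). I believe [a-zA-Z] is the more common practice and what others will be more used to reading.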

Thank you for reading this two-part guide on regular expressions. I’m really enjoying learning about this subject right now, and if you’ve enjoyed this two-part guide I cannot recommend strongly enough that you buy the book Mastering Regular Expressions. Almost everything you’ve read in this guide is based on what I have read and learnt from this book so far (with additional internet resources).
I’m not on any form of commission, nor have I ever met the author Jeffrey E.F. Friedl, but I enjoyed the way his book presents the subject and the examples to learn from and test yourself against, and the depth the book goes into far exceeds this guide.

Good luck progressing with regex, and I hope you find it useful.