Part 9 — A complete beginner’s guide to Computer Programming with Clojure: Regular Expressions (REGEX).

Photo by Jamie Haughton on Unsplash

Regular expressions, otherwise known as REGEX, are a way of describing patterns in text. REGEX allows you to search and filter large amounts of text and extract only the parts you want. It can also be used to test whether a character string has specific features. For example, is it a MAC address, a mobile phone number, or an ISBN? REGEX is a hugely powerful tool and is often applied to forms and databases, as it can help control both input and output.

In Part 8, we used a REGEX to control input for an ISBN number.

The REGEX followed the form #"filter". The filter usually starts with a ^ to indicate the beginning of the line and ends with a $ to indicate the end of the line. Square brackets [ ] contain the set of characters that are allowed at that position. For our ISBN, we wanted all digits and the letter X in both upper and lower case. In addition, we required a hyphen -. Inside square brackets a hyphen normally marks a range (as in 0-9), so to ensure we got a literal hyphen we used an escape character. The escape character is a backslash \.
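As a rough sketch of the pieces described above, the snippet below builds a pattern from ^, a square-bracket set, an escaped hyphen, + and $. The exact pattern from Part 8 is not reproduced here, so both the name isbn-chars and the pattern itself are assumptions made for illustration.

```clojure
;; Hypothetical pattern (an assumption, not the Part 8 original):
;; ^        start of the input
;; [0-9xX\-] one digit, x, X, or a literal hyphen
;; +        one or more of those characters
;; $        end of the input
(def isbn-chars #"^[0-9xX\-]+$")

(re-find isbn-chars "0-306-40615-2")   ; => "0-306-40615-2"
(re-find isbn-chars "0 306 40615 2")   ; => nil, a space is not in the set
```

Note that this only checks which characters are present. It does not confirm that the string is a genuine ISBN.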

An escape character is used to tell the computer to use the following character in a non-standard way. For example, the letter n is simply interpreted as the character n. However, if we escape the letter n with a backslash \, we now get a newline \n.
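The effect of the backslash can be seen in an ordinary Clojure string. This is a small illustrative sketch, and the sample text is made up for the purpose.

```clojure
;; Inside an ordinary string, \n is one character: a newline.
(println "first line\nsecond line")
;; prints:
;; first line
;; second line

(count "n")    ; => 1, the plain letter n
(count "\n")   ; => 1, a single newline character
(= "n" "\n")   ; => false
```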

Returning to our REGEX explanation, we see a + just before the dollar sign $. The + symbol means one or more of the preceding pattern. Curly brackets, by contrast, specify an exact number of repetitions of the preceding pattern. For example, when we needed to create a year filter, we used {3}.

For our Year pattern above, [0-9] is any single digit. Instead of 0-9, we could also use \d, which means any decimal digit. So, [0-9]{3} matches any three digits in a row, e.g. 496. It can also be written as \d{3} or [\d]{3}.
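The equivalence of these spellings can be checked at the REPL. The input strings below are made-up examples.

```clojure
;; Three ways of asking for exactly three digits in a row.
(re-find #"[0-9]{3}" "496")   ; => "496"
(re-find #"\d{3}" "496")      ; => "496"
(re-find #"[\d]{3}" "496")    ; => "496"

;; Too few digits: no match.
(re-find #"\d{3}" "49")       ; => nil

;; More than three digits: only the first three are returned.
(re-find #"\d{3}" "1984")     ; => "198"
```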

Consider the following:

So what happened to “678”?

The re-find function looks for the first instance of a REGEX pattern. Hence, as soon as it found a run of digits (as indicated by #"\d+", which is the same as #"[0-9]+"), it stopped and returned the result.
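A minimal sketch of this behaviour follows. The input string is an assumption, chosen so that it contains the two groups of digits mentioned in the text.

```clojure
;; re-find stops at the first run of digits it meets.
(re-find #"\d+" "abc12345def678")      ; => "12345"

;; The same result using the longer spelling of the pattern.
(re-find #"[0-9]+" "abc12345def678")   ; => "12345"
```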

What if we want to find all the numbers, “12345” and “678”?

We can use the re-seq function. It returns a lazy sequence containing every successive match. In Clojure, the term lazy means the elements of a sequence are not computed up front, but only when something asks for them. Computing elements on demand in this way is called realization. Some sequences are infinite, like the digits of PI or of a third (0.3333 recurring), and can never be realized in full. A finite sequence whose elements have all been computed is referred to as fully realized.

As you can see, we have found all groups of numbers. However, what if we wanted to find all groups of numbers and collect them together as a single string? For example, “12345678”.
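The sketch below shows re-seq on an assumed input string, followed by a separate illustration of laziness using iterate, which is a core function that produces an infinite sequence.

```clojure
;; re-seq returns every match, not just the first.
(re-seq #"\d+" "abc12345def678")   ; => ("12345" "678")

;; No match at all gives nil rather than an empty list.
(re-seq #"\d+" "no digits here")   ; => nil

;; Laziness in action: (iterate inc 1) is the endless sequence 1, 2, 3, ...
;; Only the three elements we ask for are ever computed.
(take 3 (iterate inc 1))           ; => (1 2 3)
```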

As already discussed in the last Post on Functions, we have used apply to gather our lazy list into a single string.
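One way to do this, assuming str is the function handed to apply and reusing an assumed input string, is sketched here.

```clojure
;; re-seq gives ("12345" "678"); apply passes those as arguments to str,
;; which joins them into one string.
(apply str (re-seq #"\d+" "abc12345def678"))   ; => "12345678"

;; Equivalent to writing the call out by hand:
(str "12345" "678")                            ; => "12345678"
```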

Let’s look at another REGEX command, re-matches.

Recall, re-find will find and match the first instance of the requested pattern. But re-seq will carry on and find them all.

Our new function, re-matches, returns a match only when the entire string fits the pattern. Hence the nil when a non-number appeared in the string.
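The difference between re-matches and re-find can be sketched with two made-up input strings.

```clojure
;; The whole string is digits, so re-matches succeeds.
(re-matches #"\d+" "12345")      ; => "12345"

;; The trailing letters mean the whole string no longer fits.
(re-matches #"\d+" "12345abc")   ; => nil

;; re-find is happy with a partial match in the same string.
(re-find #"\d+" "12345abc")      ; => "12345"
```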

For those occasions where you require an exact match, consider re-matches.

We can also use wildcards when we only wish to search with part of a string, or when we wish to return more than one result, e.g. when doing a database search.

Here we used a full stop . to indicate any character and an asterisk * for zero or more occurrences of it.
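A small sketch of the . and * wildcards follows. The words being searched are assumptions chosen for illustration, standing in for rows returned from a database.

```clojure
;; "Clo" followed by any characters, zero or more times.
(re-matches #"Clo.*" "Clojure")   ; => "Clojure"
(re-matches #"Clo.*" "Clo")       ; => "Clo", zero extra characters is fine
(re-matches #"Clo.*" "Java")      ; => nil

;; Using the same pattern to pick several results out of a list.
(filter #(re-matches #"Clo.*" %) ["Clojure" "Closure" "Java"])
;; => ("Clojure" "Closure")
```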

But what if we are not sure about the case? Consider the following.

By placing (?i) in front of our REGEX search string, we start a search that is not case sensitive. Can you see how useful that is? You can now use an abbreviated search term that ignores the case.
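The effect of (?i) can be sketched as follows, again with made-up search terms.

```clojure
;; Without (?i), a lower-case pattern does not match a capitalised word.
(re-matches #"clo.*" "Clojure")       ; => nil

;; With (?i), the case of the letters is ignored.
(re-matches #"(?i)clo.*" "Clojure")   ; => "Clojure"
(re-matches #"(?i)clo.*" "CLOJURE")   ; => "CLOJURE"
```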

Applying REGEX

Recall from Part 8 the code we used to filter for an ISBN number. Notice, this code follows a pattern:

This if statement pattern can be used to solve a number of programming problems. In programming, this is known as Selection. Recall, a computer program can be created from only three elements: Sequence, Selection, and Repetition. Our pattern above is an example of Selection. The if statement checks a condition. If the condition is true, it executes the first branch (the green line); if it is false, it executes the second branch (the red line). Again, by replicating the pattern above, you have a powerful tool to control form input and output, as well as database and file searches.
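The shape of that Selection pattern can be sketched as below. This is not the Part 8 code: the function name check-isbn, the pattern, and the messages are all hypothetical, and the check only looks at which characters are present rather than validating a real ISBN.

```clojure
(defn check-isbn
  "Hypothetical helper: returns a message depending on whether
  input contains only digits, x, X and hyphens."
  [input]
  (if (re-matches #"[0-9xX\-]+" input)       ; the condition
    (str input " looks like an ISBN")        ; runs when the condition is true
    (str input " is not a valid ISBN")))     ; runs when the condition is false

(check-isbn "0-306-40615-2")   ; => "0-306-40615-2 looks like an ISBN"
(check-isbn "hello")           ; => "hello is not a valid ISBN"
```

The condition works because re-matches returns either a string, which if treats as true, or nil, which if treats as false.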

SUMMARY

Like all things related to computer programming, subjects such as REGEX command a whole army of literature, websites, and debate. A number of sites offer huge amounts of pre-written REGEX for just about any situation. Nevertheless, finding the ‘perfect’ REGEX can prove elusive. In reality, for the budding programmer, REGEX creation often involves much trial and error.
