Chapter 9 – Perl Regular Expressions

In the last tutorial, we went over the three regex operators. We also learned what regexes are and why we’d use them in our scripts. This tutorial goes deeper into the many ways we can use them to manipulate our data.

Regular expressions are a huge topic all on their own. There is no possible way to go over all of it here, so you may wish to purchase a book on the topic (yes, complete books are dedicated to it) for future reading and understanding.

We’ll use the same basic regex example as before to start with.

while (<STDIN>)
{
  if(m/exit/)
  {
    exit;
  }
}

Special Characters

Before we begin, there are many special characters we need to be aware of. Please note the backslash \ preceding the character(s) is not optional and everything is case-sensitive.

\077 octal character
\a alarm (bell) character
\c[ control character
\D matches a non-digit character
\d matches a digit character
\E end case modifier
\e escape character
\f form feed character
\L lowercase all characters until the end case modifier \E is found
\l lowercases the next character
\n newline character
\r carriage return
\S matches a non-whitespace character
\s matches a whitespace character
\t tab character
\U uppercase all characters until the end case modifier \E is found
\u uppercases the next character
\W matches a non-word character
\w matches a word character
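
To see a few of these in action, here is a small sketch. The variable names and the sample string are made up for illustration, and quantifiers are avoided on purpose since they come later in this chapter.

```perl
use strict;
use warnings;

# Sample data (made up for this sketch); note the tab before "to"
my $input = "order 66 shipped\tto bob";

# \d matches one digit, so \d\d matches two digits in a row
my ($number) = $input =~ m/(\d\d)/;

# \w matches one word character (letter, digit or underscore)
my ($first_word) = $input =~ m/(\w\w\w\w\w)/;

# \s matches whitespace, which includes the tab as well as the spaces
(my $squeezed = $input) =~ s/\s/_/g;

# \u uppercases the next character; \U ... \E uppercases a whole run
my $name  = "bob";
my $title = "\u$name";
my $shout = "\U$name\E says hi";

print "$number\n$first_word\n$squeezed\n$title\n$shout\n";
```

This should print 66, order, order_66_shipped_to_bob, Bob and BOB says hi, one per line.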


Match (nearly) any characters

One of the most useful special characters in regexes is the . (the dot). This matches any single character other than the new line \n. This means all letters a-z, numbers 0-9, dashes - and any weird @$()! character will be matched, except the new line feed.

my $sentence = "this is a sentence, woohoo!";
$sentence =~ s/./T/;
print $sentence;

This is a sentence, woohoo!

The . (dot) matches any non-new line character and since we're doing a substitution for "T", the first character it comes across will be replaced with this letter. The "t" was replaced by "T".

If . is a metacharacter, what if we really wanted to match a period in our regex? This comes up very frequently, especially with the period, and lucky for us there is a very quick solution. If you place a backslash in front of the metacharacter, it acts as a normal character instead of being special.

my $sentence = "We have ourselves a .";
$sentence =~ s/\./period/;
print $sentence;

We have ourselves a period

The only change from the first example and this one to match the period is adding \. to the regex. So instead of matching any character whatsoever, we're literally matching a period and nothing more.

It is easy for us to match all characters at one time as well. We already know the dot matches nearly all characters, so we use that along with the global modifier /g to substitute any and all matches.

my $sentence = "This is a sentence, woohoo!";
$sentence =~ s/./-/g;
print $sentence;
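
Here is a small sketch of what that global substitution produces, and of the one character the dot leaves alone. The second string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $sentence = "This is a sentence, woohoo!";

# Copy the string first, then replace every character in the copy
(my $dashes = $sentence) =~ s/./-/g;
print "$dashes\n";    # one dash per character of the sentence

# The dot does not match a newline, so the line break survives
my $two_lines = "ab\ncd";
(my $masked = $two_lines) =~ s/./-/g;
print "$masked\n";
```

The first print gives a row of dashes as long as the sentence; the second gives two dashes, a line break, and two more dashes.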


Character Classes

Instead of matching entire words or phrases, we can also match characters. You can set a character class within square brackets [] such as [abc123] and you can set up a range of characters such as [a-zA-Z]. The latter checks to see if any case of any letter appears in our string.

The range operator can work on parts of the alphabet such as c-f or numerics such as 0-9 or 2-5. Remember everything is case sensitive, that’s why we used [a-zA-Z] to match all cases of the letters.

my $string = "We're off to see the wizard, the most wonderful wizard of all!";
if ($string =~ m/!/)
{
  print "hey hey now, there's no reason to shout!";
}

In our above example, we are testing to see if we can match the exclamation point at the end of the line, and it does indeed match. Now let’s do a range test that fails. We’ll rewrite $string so it’s all lowercase and see if it contains any uppercase characters.

my $string = "we're off to see the wizard, the most wonderful wizard of all!";
if ($string =~ m/[A-Z]/)
{
  print "I wonder if this will print..";
}

The above example will not print because nothing in our string falls within the character range. Remember, you can check for any letter or number in a range or you can check for any character inside square brackets [].
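
As a quick sketch of both styles, the snippet below tests one string against two ranges and one plain list of characters. The sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $string = "We're off to see the wizard, room 42!";

# A range: any uppercase letter ("W" qualifies)
my $has_upper = ($string =~ m/[A-Z]/)  ? "yes" : "no";

# Another range: any digit ("4" qualifies)
my $has_digit = ($string =~ m/[0-9]/)  ? "yes" : "no";

# A plain list of characters: none of x, y, X or Y appear in the string
my $has_xy    = ($string =~ m/[xyXY]/) ? "yes" : "no";

print "upper: $has_upper, digit: $has_digit, x or y: $has_xy\n";
```

This should print upper: yes, digit: yes, x or y: no.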


Multiple chances with matching

You can do multiple tests to see if a variable or string contains this, that or another thing. If you wanted to see if your string had the word “blue” or the color “red”, you don’t have to write two separate regexes to see if it matches. We use the alternative match pattern instead.

Alternative matching matches one or the other, or in some cases another other :) It is not used to test whether both cases are found. It stops at the first match it finds, whether this is the first possible match or the sixteenth.

while(<STDIN>)
{
  if(m/red|blue/)
  {
    print "these are my fav colors";
    exit;
  }
}

This will loop endlessly until it finds something that matches either “red” or “blue” literally. “Red” and “bLue” will not match, neither will any of the alternative ways to write these.

You separate each possible match with a pipe |, and you can use a single word, a single character, a sentence, a number or anything else you want to stick inside. This idea is pretty straightforward, so we’ll assume the one example will suffice. Remember to not just follow the examples written in these tutorials but to make your own and TEST, TEST, TEST!
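
If we do want “Red” and “bLue” to count, one possible approach is to add the case-insensitive modifier /i from the previous chapter. The sketch below runs the pattern over a made-up list of inputs instead of reading from STDIN, so it can be run as is.

```perl
use strict;
use warnings;

# Made-up inputs standing in for what a user might type
my @inputs = ("I like red", "Blue skies", "green grass", "bLuE");
my @matched;

for my $line (@inputs) {
    # /i applies to both alternatives, so any mix of case is accepted
    push @matched, $line if $line =~ m/red|blue/i;
}

print join(", ", @matched), "\n";
```

This should print I like red, Blue skies, bLuE. Only “green grass” is left out.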


Negative Matching

Instead of matching a character class, all digits or any a-zA-Z character, we also have the ability to tell it what not to match.

my $string = "hello there, world";

if ($string =~ m/[^A-Z]/)
{
  print "Oops!";
}

The above is saying “If the $string contains at least one character that is not A-Z, print Oops!”. The ^ at the start of the character class negates it, so [^A-Z] matches any single character outside that range. Our all-lowercase string is full of them, so Oops! is printed. Note that this is not the same as testing that no A-Z characters exist at all. This is particularly useful if you want to test a string to see if it contains any non-numbers (if your script requires just numeric input): $string =~ m/[^0-9]/;
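
A minimal sketch of that numeric check might look like the following. The helper name is_all_digits is made up for this example and is not a Perl built-in.

```perl
use strict;
use warnings;

# is_all_digits is a hypothetical helper written for this sketch
sub is_all_digits {
    my ($input) = @_;
    return 0 if $input eq "";          # nothing was typed at all
    return 0 if $input =~ m/[^0-9]/;   # found a character that is not a digit
    return 1;                          # every character was 0-9
}

my @results = map { is_all_digits($_) } ("12345", "12a45", "");
print "@results\n";
```

This should print 1 0 0: only the first input passes.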


Quantifiers

Along with simple matching to see if one thing exists or not, we can check to see how many times it exists. Or rather, make sure it matches exactly the number of times we want.

We do this using the quantifiers from the list below:

* : matches zero or more times
+ : matches one or more times
? : matches either one or zero times
{#} : matches a precise number of times
{#,} : matches at least a certain number of times
{#,#} : matches between first number and second number times

In our first example, we will be using the + quantifier to match one or more instances of our test. If it does, we’ll do a simple substitution.

my $quantifier = "The frog goes rrribbit, rrribbit";
$quantifier =~ s/r+/r/g;

print $quantifier;

results: The frog goes ribbit, ribbit

Since the + quantifier matches characters that appear at least one time, maybe more, each run of consecutive “r”s was replaced by a single “r”. This would work if you had one “r” or 100,000 of them in a row. As long as there is at least one, Perl is happy.
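
The other quantifiers from the list work the same way. The sketch below tries ?, * and {2} against three made-up spellings so the differences are easy to see.

```perl
use strict;
use warnings;

# Made-up test words: zero, one and two "u"s
my @words = ("color", "colour", "colouur");

# ? makes the "u" optional: zero or one times
my @optional = grep { m/colou?r/ } @words;

# * allows zero or more "u"s
my @any = grep { m/colou*r/ } @words;

# {2} demands exactly two "u"s in a row
my @exactly = grep { m/colou{2}r/ } @words;

print "?: @optional\n*: @any\n{2}: @exactly\n";
```

The ? version should accept color and colour, the * version all three, and the {2} version only colouur.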

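In this next example, we are checking to see if the user typed in between 10 and 30 characters. The ^ and $ around the pattern (covered in the anchors section below) pin the match to the whole line. Without them, .{10,30} would also accept a 50-character line, because any 10 of those characters are enough to match.

while(<STDIN>)
{
  if(m/^.{10,30}$/)
  {
    print "good for you!\n";
  }
}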

Quantifiers are greedy by nature. This means they’ll slurp up the biggest match they can that still satisfies the rest of the pattern. So instead of stopping at the first applicable match, a quantifier takes the longest one possible, which occasionally gives unwanted results.

my $quote = "to be or not to be, that is the question";
$quote =~ s/.*be/To/;
print $quote;

Results: To, that is the question

What we tried to do was change the first word from “to” to “To” with a capital “T” by substituting any and all characters before “be”. Remember, .* means to match all characters and as we placed the characters “be” after it, we tried to match the first few characters of our sentence.

What really happened was that it found a bigger match. Instead of stopping at the first “be”, the greedy .* ran on to the last “be” in the string, so everything up to the comma was replaced.
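
One common way around this, not covered above, is to put a ? after the quantifier, which makes it non-greedy so it stops at the first place the rest of the pattern can match. The sketch below compares the two on the same quote; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $quote = "to be or not to be, that is the question";

# Greedy: .* runs on to the last "be" it can reach
(my $greedy = $quote) =~ s/.*be/To/;

# Non-greedy: .*? stops at the first "be"
(my $lazy = $quote) =~ s/.*?be/To/;

# If all we want is a capital T, matching just the word is simpler
(my $simple = $quote) =~ s/to/To/;

print "$greedy\n$lazy\n$simple\n";
```

Notice that even the non-greedy version swallows the first “be”, since it is part of the match. For this particular job the plain s/to/To/ is the better fit, because without /g it only touches the first “to”.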

Regex Anchors (assertions)

Often times we need total control over what to match and where. Rather than matching the first match or taking your chance using a quantifier, we can tell our regex to match specific conditions. These conditions include matching at the beginning or end of a string, match word or non-word boundaries, etc.

Here is a chart of the majority of the anchors we can use that will give us the power we need for accurate results and matching.

^ : matches the beginning of the line
$ : matches the end of the line
\A : matches the beginning of the string
\B : matches a non-word boundary
\b : matches a word boundary
\Z : matches the end of the string, or just before a newline at the end
\z : matches only the very end of the string

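The \A and \Z are pretty much the same as ^ and $. The main difference shows up with the /m (multi-line) modifier: under /m, ^ and $ can also match at internal line boundaries, right after or right before each newline inside the string. The \A and \Z ignore /m and match once and once only, at the beginning or the end of the string.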
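
A short sketch of that difference, using a made-up three-line string, could look like this:

```perl
use strict;
use warnings;

# One string holding three lines (sample data for this sketch)
my $text = "first line\nsecond line\nthird line";

# Without /m, ^ only matches at the very start of the string
my @plain = $text =~ m/^(\w+)/g;

# With /m, ^ also matches right after each embedded newline
my @multi = $text =~ m/^(\w+)/mg;

# \A ignores /m and still matches only at the start of the string
my @start = $text =~ m/\A(\w+)/mg;

print "plain: @plain\nmulti: @multi\nstart: @start\n";
```

Only the /m version with ^ should pick up all three of first, second and third. The other two stop after first.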

Let’s use an example we came across above. We’re trying to see if the user typed exit. This time, we’re making sure exit is the only thing the user typed. This will not match if the user typed “I need an exit” or “exit is that way”.

while (<STDIN>)
{
  if(m/^exit$/)
  {
    exit;
  }
}

We just used two of the anchors we learned from the list we read earlier. ^exit is explicitly telling it to match exit at the beginning of the string. exit$ is telling it to match at the end of the string. In short, by using both the ^ and $ anchors, you are asking it to match if it’s the only word or set of characters (or even a single character) in the string.

If we just wanted to match the beginning of the string, we would have just used the ^ caret. Likewise with the $ if all we wanted to do was match the end of the string.

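A word boundary is the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string. Whitespace creates a boundary, just like in our everyday text, but so does punctuation such as a comma or a period. The \b anchor matches that position itself rather than any character.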
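
To round off the exit example, here is a sketch that uses \b to find exit as a whole word anywhere in the line. The list of inputs is made up so the script can run without typing anything.

```perl
use strict;
use warnings;

# Made-up inputs standing in for lines typed by a user
my @lines = ("exit", "please exit now", "exit, stage left", "exits", "sexit");

# \bexit\b: "exit" must not be glued to other word characters
my @whole_word = grep { m/\bexit\b/ } @lines;

print join(" | ", @whole_word), "\n";
```

The first three inputs should match, including the one where a comma follows the word. “exits” and “sexit” should not, because there is no boundary on one side of exit.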


Chapter 8 – Perl Regex Operators

To put it simply, a regex unlocks the power to do complete string comparisons. That is, it gives us full control over how we view and manipulate any string (variable) we have. Regex is short for Regular Expression, and it takes even advanced programmers a while to understand these well enough to code them efficiently and accurately.

Regular expressions are one of the reasons Perl is such a powerful language. Mastering these will give you full control over the data you’re using in your scripts. Before we begin, here is a simple regex for us to look at.

$variable =~ s/text/TEXT/gi;


The m// operator

The m// operator is how we deal with matching. It works against the default variable $_ unless told otherwise, but using another variable is as easy as binding it with =~. This matching operator works well when you need to know if a string contains a certain character, group of characters, word or group of words. Saying if ($line eq "test") will not work if all we want to know is whether the word test exists in $line, so we would use m// instead.

The main difference between a simple eq or == and m// is that one tests for equality and the other tests for the existence of the value inside the string.

my $line;
while($line = <STDIN>)
{
if($line =~ m/exit/) { exit; }
}

This example acts as an infinite loop until it matches what we’re looking for. It’s asking for input, unless the line contains the word exit it’s not going to end for us. From this you can see where our search gets used; the characters or words we want to match are placed inside the //.

my $text = "a blue cow ate the cheese";
if ($text =~ m/cow/)
{
  print "mooooooo";
}

We are taking a predefined variable $text and seeing if we can match the word cow anywhere in it. As we can see, running this code gives us mooooooo back because it can find the word.

Remember, this matching operator doesn’t test for equality, it checks for the existence.
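
Here is a small side-by-side sketch of the two tests, followed by m// used on the default variable $_. The strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "this is only a test";

# eq: the whole string must be exactly "test"
my $is_equal = ($line eq "test")  ? "yes" : "no";

# m//: "test" only has to appear somewhere inside the string
my $contains = ($line =~ m/test/) ? "yes" : "no";

print "equal: $is_equal, contains: $contains\n";

# With no =~ in front, m// looks at the default variable $_
my @hits;
for ("testing", "nothing here", "a test") {
    push @hits, $_ if m/test/;
}
print join(", ", @hits), "\n";
```

This should print equal: no, contains: yes and then testing, a test.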


The s/// operator

The second most used operator is the substitution operator. This gives us the power and tools to manipulate our information in any way we wish. We could scan an entire text file and change all the words “red” to “blue” if that’s what we wanted.

This works hand-in-hand with the m// we just learned, in that our words either exist and we can do something with them, or they don’t. That is to say, we can’t substitute any part of our text unless the text we want to change already exists.

my $line;
while($line = <STDIN>)
{
  chomp($line);
  $line =~ s/exit/go/;
  print "Did you say $line?\n";
}

We’re doing a bit more work in this example because there is a lot more to a substitution than to matching words or phrases. This is nearly the same example we used before: if you type any phrase containing the word exit, something will happen. In this case, we are using s/exit/go/, which means if it finds the word exit, it will be replaced with the word go.

The best way to learn is to do, so run this script a few times and run a few tests. Type in words that don’t contain exit and some that do so you get familiar with what’s going on.

Unlike the match operator where we have m/word/, we have a new set: s/word/newword/. The second set of slashes holds the replacement words/characters for what you asked for in the first set.

s/this/that/; # change the word from this to that
s/apple/pear/; # change the word apple to pear
s/I have a red car/I have a red bike/; # change the entire sentence if it matches

A few things to note before we move on: our s/// will only work once by default and is case-sensitive. Put simply, if we tried to change the word this to that, by default it will only change the first occurrence of this and leave the rest untouched, and it will not match THIS.

my $text = "the rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/;
print $text;

This example substitutes the lowercase word the to the uppercase THE. By running this script you’ll notice that only the first the that’s found gets replaced giving us the result: THE rabbit jumped down the hole where the cow lived.

my $text = "the rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/g;
print $text;

Using /g at the end of our substitution means to substitute globally. Instead of just matching the first instance of the word or phrase, we’ll substitute it each time it appears in our data. Taking the same sentence we used before, simply adding the /g modifier to the end will replace every occurrence of the word the and end with the result: THE rabbit jumped down THE hole where THE cow lived.

my $text = "The rabbit jumped down the hole where the cow lived.";
$text =~ s/the/THE/gi;
print $text;

After making the small change to our sentence (we capitalized the T on The in the first word), our substitution would normally skip this word and replace only the, because its match is case sensitive. The /i modifier changes the default to a case-insensitive substitution. This will s/// (short for substitute) the words The, THe, tHe and so forth with THE, and since we’re still using the global modifier /g, it will change all instances of these words.

Sometimes we want to just remove certain words or phrases instead of just s/// them with another word or phrase. This can be done by leaving the second set of slashes empty. Doing so tells Perl that you want to substitute the first set of words for nothing (an empty substitution), therefore removing the words completely.

my $text = "The rabbit jumped down the hole where the cow lived.";
$text =~ s/the//gi;
print $text;

In this last example, we’re removing the word the in any case and as many times as it can be found in the string. This produces the result below. Notice that the spaces around each removed word stay behind:

 rabbit jumped down  hole where  cow lived.
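
One more thing worth knowing about s/// is that it reports back how many replacements it made, which ties in with the point that we can only substitute text that exists. A minimal sketch, with made-up variable names:

```perl
use strict;
use warnings;

my $text = "The rabbit jumped down the hole where the cow lived.";

# s/// returns how many replacements it made
my $count = ($text =~ s/the/THE/gi);
print "$count replacements: $text\n";

# When nothing matches, the string is untouched and the result is false
my $none = ($text =~ s/zebra/horse/) ? "changed" : "unchanged";
print "zebra test: $none\n";
```

This should report 3 replacements, followed by unchanged for the word that never appears.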


The tr/// operator

The translation operator also works on $_ by default, with this we can make a character-by-character translation. The s/// worked on words, numbers and phrases. This operator works on characters solely.

my $line;
while($line = <STDIN>)
{
  chomp($line);
  $line =~ tr/1/0/;
  print "Did you say $line?\n";
}

We are translating each occurrence of the character “1” to “0”. Similar to s///, the second set of slashes is what we’re converting our data into if it matches. For another simple example,

my $text = "bear";
$text =~ tr/b/t/;

The above gives us the result tear as we are replacing the character “b” with “t”.

We can remove characters we want from our string instead of swapping it for another. We do this using the /d (delete) modifier. We create the character group we want to translate, leave the second set of slashes empty and append d.

my $text = "This is a line of text";
$text =~ tr/a//d;
print "results: $text";

Take note that the second set of slashes // is to be left empty if you want to delete the characters instead of swapping them with another. In our example above, we removed all the “a”s from our text, which was just one however. A better example would have been to remove an “i” or an “e”, but I’ll leave that up to you to test.

We now have a fairly good understanding of swapping one character with another, and Perl allows us to swap more than one at a time. That is to say, we can tr/// as few or as many characters at a time as we want.

my $text = "This is the line that never ends. Yes it goes on and on my friend. Some people started writing it, not knowing what it was. And they’ll continue writing it forever just because…this is the line that never ends!";

$text =~ tr/th/ht/;
print "results: $text";

You will notice we are translating two different characters, the lowercase t and h, and swapping them with h and t. You can swap as many or as few as you want like we discussed earlier, but keep in mind it’s in a set order. The first character in the first set will swap with the first character in the second set (our “t” was swapped with “h”), and the second character in the first set will always swap with the second character in the second set (our “h” swapped with “t”).

This example let us switch the h’s and t’s around, making funny text :) These are case sensitive too: tr/A// will not be the same as tr/a//, and tr/// has no case-insensitive modifier to remedy this. So you’ll need to use tr/Aa// if you want to catch all of the same character.

For our last example, let’s have a little fun and remove all the vowels from our text! We would do that by adding each of the vowels to the first set of // and appending the delete modifier.

my $text = "This is the line that never ends. Yes it goes on and on my friend. Some people started writing it, not knowing what it was. And they’ll continue writing it forever just because…this is the line that never ends!";

$text =~ tr/aeiou//d;
print "results: $text";

We get the results:

Ths s th ln tht nvr nds. Ys t gs n nd n my frnd. Sm ppl strtd wrtng t,
nt knwng wht t ws. And thy’ll cntn wrtng t frvr jst bcs…ths s th ln tht nvr
nds!
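
Two more tr/// tricks that follow from the above, shown as a short sketch: it can count characters without changing anything, and it accepts ranges just like a character class. The shortened sample sentence and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "This is the line that never ends.";

# tr returns the number of characters it matched; with an empty second
# set and no /d modifier, the string itself is left unchanged
my $vowels = ($text =~ tr/aeiouAEIOU//);

# Ranges work too: map every lowercase letter to its uppercase partner
(my $upper = $text) =~ tr/a-z/A-Z/;

print "$vowels vowels\n$upper\n";
```

This should count 9 vowels and then print the sentence in capitals, with the original $text still intact.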

Challenges

1) Of the three regex operators we learned, which one(s) does not alter the data in any way?
————————————————————————
The m// match operator only matches segments of a string; s/// and tr/// are used to change the data.
————————————————————————

2) We are trying to remove all the “a”s from our variable $sentence using s/// but it’s not removing “A”. How can we remove all cases?
————————————————————————
We need to set up a case-insensitive substitution. We do this using the case-insensitive modifier, /i.

$sentence =~ s/a//gi;
————————————————————————

3) What is the difference between substitution and translation?
————————————————————————
Substitution, or s///, replaces words, numbers or phrases from a string. Translation, or tr///, only translates or swaps characters.

An example of s/// would be: s/word/this/gi, s/apple/pear/gi, s/moon is out/sun is out/gi.

An example of tr/// would be: tr/a/e/, tr/1/0/, tr/x/z/.
———————————————————————–
