Data Wrangling Basics — Why is Regex In Python Preceded By The Letter r?

 3 years ago
source link: https://towardsdatascience.com/data-wrangling-basics-why-regex-in-python-preceded-by-the-letter-r-a9fa93ab7dad

Explaining the working of regular expressions in Python

Photo by Cris DiNoto on Unsplash

When writing regular expressions (regex) in Python, we usually prefix the pattern string with the letter r. In this tutorial, we will understand the reason behind this convention by answering the following questions:

  1. What are escape sequences?
  2. How does the Python interpreter interpret escape sequences with or without the letter r?
  3. How do regular expressions work in the Python language?
  4. The importance of using the letter r in regular expressions

1. What are escape sequences?

An escape sequence is a sequence of characters that does not represent itself when used in a string definition. Instead, it is translated into some other character or sequence of characters that is otherwise difficult to write directly in a programming language. For example, in Python, \n represents a new line and \t represents a tab. Both \n and \t are escape sequences.

The list of standard escape sequences understood by the Python interpreter and their associated meanings are as follows:

[Image: table of standard Python escape sequences and their meanings]
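Since the table itself is an image, here is a minimal sketch that lists a handful of the standard escape sequences and confirms that each one collapses into a single character. The dictionary name escapes is only an illustrative choice, and the list is not exhaustive.

```python
# A few standard escape sequences, keyed by the two characters typed in source
# code and mapped to the single character the Python interpreter produces.
escapes = {
    "\\n": "\n",   # newline (line feed)
    "\\t": "\t",   # horizontal tab
    "\\\\": "\\",  # a single backslash
    "\\'": "\'",   # single quote
    "\\\"": "\"",  # double quote
}

for written, produced in escapes.items():
    # `written` is two characters long; `produced` is always one character.
    print(f"{written} -> code point {ord(produced)}, length {len(produced)}")
```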

2. How does the Python interpreter interpret escape sequences with or without the letter r?

To understand the impact of the letter r on escape sequences, let us have a look at the following example:

#### Sample Text Definition
text_1 = "My name is Ujjwal Dalmia.\nI love learning and teaching the Python language"
print(text_1)

#### Sample Output
My name is Ujjwal Dalmia.
I love learning and teaching the Python language

#### Sample Text Definition
text_2 = "My name is Ujjwal Dalmia.\sI love learning and teaching the Python language"
print(text_2)

#### Sample Output
My name is Ujjwal Dalmia.\sI love learning and teaching the Python language

In the example above, text_1 uses the character set \n, whereas text_2 uses \s. From the escape sequences table shared in section 1, we can see that \n is part of the standard escape sequence set in Python, whereas \s is not. Therefore, when we print both variables, the Python interpreter interprets \n as a new line character, whereas \s is left as it is. (Recent Python versions also emit a warning for an unrecognized escape sequence such as \s in a normal string: a DeprecationWarning since 3.6 and a SyntaxWarning since 3.12.) Note that the definitions of text_1 and text_2 do not include the letter r.

Let us take a step further and include the letter r in the text definition.

#### Sample Text Definition (with letter "r")
text_1 = r"My name is Ujjwal Dalmia.\nI love learning and teaching the Python language"
print(text_1)

#### Sample Output
My name is Ujjwal Dalmia.\nI love learning and teaching the Python language

#### Sample Text Definition (with letter "r")
text_2 = r"My name is Ujjwal Dalmia.\sI love learning and teaching the Python language"
print(text_2)

#### Sample Output
My name is Ujjwal Dalmia.\sI love learning and teaching the Python language

The inclusion of the letter r had no impact on the printed value of text_2 because \s is not part of the standard escape sequence set in Python. For text_1, however, the Python interpreter did not convert \n into the new line character. This is because the letter r has turned the text into a raw string. In simple terms, the letter r instructs the Python interpreter to leave every backslash, and therefore every escape sequence, exactly as it is written.
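The difference can also be checked by counting characters. The following is a small sketch with made-up sample strings (normal and raw are illustrative names, not part of the original examples):

```python
normal = "line1\nline2"   # \n is converted into one newline character
raw = r"line1\nline2"     # \n stays as two characters: backslash and n

print(len(normal))               # 11
print(len(raw))                  # 12
print(raw == "line1\\nline2")    # True: a raw \n equals an escaped backslash plus n
print(normal.splitlines())       # ['line1', 'line2']
```

A raw string is therefore not a different type of object; it is only a different way of writing the same str value in source code.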

3. How do regular expressions work in the Python language?

To understand how regular expressions work in Python, we will use the sub() function from the re package, which replaces the parts of the old text that match a regular expression pattern with the new text. Let us understand this with an example:

#### Importing the re package
import re

#### Using the sub function
re.sub("\ts","s", "\tsing")

#### Sample Output
'sing'

In this example, we are trying to replace the letter s preceded by a tab with the standalone letter s. One can see from the output that the text \tsing converts to sing. Let us refer to the flow chart below to understand how the sub() function produced the desired result. In the flow chart, we refer to \ts as the regex, the letter s as the new text, and \tsing as the old text.

[Flow chart: Substitution using Standard Escape Sequence (Image by User)]

Explanation

In the previous example, we used the character set \t, which is part of the standard escape list in Python. Therefore, in the first step, the Python interpreter replaced the escape sequence with a tab character in both the regex and the old text. Since the regex pattern matched the old text in the last step, the substitution took place.
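As a side note, the regex interpreter also understands \t on its own, so this particular pattern gives the same result whether or not the Python interpreter converts it first. The sketch below illustrates this; the variable names are illustrative only.

```python
import re

old_text = "\tsing"   # Python converts \t into a real tab character

# Normal string: the regex interpreter receives a real tab followed by s.
result_normal = re.sub("\ts", "s", old_text)

# Raw string: the regex interpreter receives backslash + t + s and
# interprets \t as a tab by itself.
result_raw = re.sub(r"\ts", "s", old_text)

print(result_normal)  # sing
print(result_raw)     # sing
```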

In the next example, we will use a different character set, \s, that is not a part of the standard escape list in Python language.

#### Importing the re package
import re

#### Using the sub function (this time with a non-standard escape sequence)
re.sub("\ss","s", "\ssing")

#### Sample Output (the interactive prompt shows the backslash doubled; print() displays \ssing)
'\\ssing'

In this example, we are trying to replace every occurrence of \ss (the characters \s followed by the letter s) with the standalone letter s. The input text did not change: the output is the same as the old text. In the flow chart, we again refer to \ss as the regex, s as the new text, and \ssing as the old text. Let us understand the reason behind this behavior from the flow chart below:

[Flow chart: Substitution using Non-Standard Escape Sequence (Image by Author)]

Explanation

In step 1, since \s is not a standard escape sequence, the Python interpreter modified neither the regular expression nor the old text and left both as they were. In step 2, the regex interpreter reads \s as a metacharacter that matches any whitespace character (a space, a tab, a new line, and so on), so the pattern \ss means "a whitespace character followed by the letter s". Because the old text contains no whitespace character followed by s, there was no match, and the old text remained the same.
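The behavior of \s as a regex metacharacter can be checked directly. The following is a minimal sketch; raw strings are used here only to avoid the invalid-escape warning, and the variable names are illustrative.

```python
import re

pattern = r"\ss"           # regex: one whitespace character followed by s
literal_text = r"\ssing"   # six characters: backslash, s, s, i, n, g

unchanged = re.sub(pattern, "s", literal_text)
from_space = re.sub(pattern, "s", " sing")
from_tab = re.sub(pattern, "s", "\tsing")

print(unchanged == literal_text)  # True: no whitespace in the text, so no match
print(from_space)                 # sing
print(from_tab)                   # sing
```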

The two learnings we can draw from this section are:

  • The old text and the new text are evaluated for escape sequences only by the Python interpreter, whereas the regular expression is evaluated first by the Python interpreter and then by the regex interpreter. Therefore, for the old and new text, the outcome of step 1 is the final version; for the regex, it is the outcome of step 2. (Strictly speaking, re.sub() also processes backslash escapes in the new text, but the new text in these examples contains none.)
  • When the texts and the regex pattern contain only standard escape sequences that are part of the Python language, we get the desired results. When the pattern also contains regex metacharacters, the results might not match our expectations.

4. The importance of using the letter r in regular expressions

From the 2nd example of the previous section, we saw that the regex failed to deliver the expected result. To find the right solution, let us work our way backward.

[Flow chart: Bottom-Up Approach to Solution (Image by Author)]

Explanation

To substitute \ss from the old text with the letter s, we expect the regex pattern at step 3 to match the text we want to replace.

To achieve this, we need the regex pattern entering step 2 to be \\ss. When the regex interpreter encounters this pattern, it treats the double backslash as a single literal backslash, so the pattern it actually searches for is \ss.

Finally, to ensure that the regex entering step 2 is \\ss, we pass \\\\ss at step 1. A double backslash is a standard escape sequence in Python and, as per the table in section 1, the Python interpreter converts each double backslash into a single one. The text pattern \\\\ss therefore becomes \\ss at the end of step 1.
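The two steps can be inspected one at a time. The sketch below shows what the Python interpreter hands over after step 1 and then checks that the regex interpreter matches the literal text \ss; the variable names are illustrative only.

```python
import re

pattern = "\\\\ss"   # four backslashes typed in the source code

# Step 1: the Python interpreter halves the backslashes.
print(pattern)        # \\ss
print(len(pattern))   # 4 characters: backslash, backslash, s, s

# Step 2: the regex interpreter reads \\ as one literal backslash,
# so the pattern matches the three characters backslash, s, s.
target = "\\ss"       # backslash, s, s
match = re.fullmatch(pattern, target)
print(match is not None)  # True
```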

Therefore, the solution code to the problem mentioned above is as follows:

#### Importing the re package
import re

#### Using the sub function with the modified regex
re.sub("\\\\ss","s", "\ssing")

#### Sample Output
'sing'

We now have a solution to our problem, but the question remains: where did we use the letter r? The regular expression arrived at in the previous discussion is one candidate solution. For simple regex requirements, the above approach works, but consider a scenario where the regular expression requires multiple metacharacters and standard escape sequences. It would require us:

  • To first differentiate between standard and non-standard escape sequences
  • Then, appropriately place the right number of backslashes every time we encounter an escape sequence or metacharacters.

In such cumbersome scenarios, taking the below approach helps:

[Flow chart: Replacing multiple escape characters by the letter r (Image by Author)]

The only change we have made here is to replace the four backslashes with two, preceded by the letter r. This ensures that in step 1, the Python interpreter treats the regular expression as a raw string and leaves it as it is. Converting the regex to a raw string ensures the following:

  • We are free from the worry of remembering the list of Python standard escape sequences.
  • We do not have to worry about the right number of backslashes for the presence of standard escape sequences or any metacharacters.
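A common case where this matters is \b, which is a standard Python escape sequence (backspace) and also a regex metacharacter (word boundary). The following sketch uses a made-up sample sentence to show how the raw string removes the ambiguity:

```python
import re

text = "cat concat category"

# Without r: Python converts \b into a backspace character (code point 8),
# so the regex interpreter searches for backspace + cat + backspace.
without_r = re.findall("\bcat\b", text)

# With r: the regex interpreter receives \b and treats it as a word boundary.
with_r = re.findall(r"\bcat\b", text)

print(without_r)  # []
print(with_r)     # ['cat']

# The raw form of the earlier solution is the same string as the four-backslash form.
print(r"\\ss" == "\\\\ss")  # True
```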

Given the above, our final and most appropriate solution is as follows:

#### Importing the re package
import re

#### Using the sub function with the modified regex
re.sub(r"\\ss","s", "\ssing")

#### Sample Output
'sing'

Closing Note

Watch out for the letter r whenever you write your next regular expression. I hope that this tutorial gave you a good insight into how regular expressions work in Python.

HAPPY LEARNING ! ! ! !

