Abolfazl Seyed Javadein

Abolfazl is a Data Scientist, Machine Learning Engineer, founder of CodeTipsAcademy, and technical consultant who has been developing software with technologies such as ASP.NET MVC/Core/Web Forms, Python, and C/C++ since 2002. Abolfazl has a Master's degree in Statistics and a Bachelor's degree in Software Engineering.

Regular Expression (Regex) in Python

A regular expression (regex) is a sequence of characters that defines a search pattern. A regex can be used to check whether a string contains the specified pattern or to find all occurrences of that pattern. The idea of regular expressions goes back to the American mathematician Stephen Cole Kleene, who described regular languages.

In this tutorial I try to explain regex in a simple way, by example.

Suppose we want to find all hashtags and callouts in the tweet below:

Ever wanted to sail the #SeaOfThieves? 🏴‍☠️

With custom @Xbox backgrounds for video conferences, now you can: https://t.co/DjKxP688RW pic.twitter.com/RYfjXJCjng
— Microsoft (@Microsoft) October 2, 2020

First example: we want to extract all the words from the tweet above. The text of the tweet is:

Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD

As you can see, words are separated by spaces, so we split the text on the space character.

But first, don't forget to import the package needed for regular expressions. We use the re package in all of the code on this page.

import re

After importing the re package, use the code below to split the text on spaces.

To avoid confusion with backslash escapes, we write regular expressions as raw strings, r'expression'.

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

allwords = re.split(r' ', text)
print(allwords)

This code splits the text at every space and returns the words as a list. Output:

['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?\nWith', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']

In the output above, one of the items is “#SeaOfThieves?\nWith”. Do you know what \n is in that item?

It is the newline character: whatever follows \n starts on a new line. In the text above, the word “With” begins the second line, and since we split only on spaces, it stays attached to the previous word.

If you want each line instead:

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

allwords = re.split(r'\n', text)
print(allwords)

This code splits the text at every newline and returns the lines as a list. Output:

['Ever wanted to sail the #SeaOfThieves?', 'With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD']

A program that splits words correctly must consider spaces, \n, and the other whitespace characters too. Below is a list of the whitespace escape sequences:

`\n`	New line
`\t`	Tab
`\r`	Carriage return
`\f`	Form feed
`\v`	Vertical tab

Now, if we want to split words on any of the whitespace characters above, we can use [ ].

[] – Square brackets

Square brackets specify a set of characters you wish to match; the pattern matches any single character from the set. Examples:

The re.findall() function is used to find all the matches for the pattern in the string.

# 'sentence' is used instead of 'str' so the built-in str type is not shadowed
sentence = 'Welcome to Code tips Academy Web Site'
matches = re.findall(r'[abc]', sentence)
print(matches)

#Output: ['c', 'c', 'a', 'b']

matches = re.findall(r'[abc A]', sentence)
print(matches)

#Output: ['c', ' ', ' ', ' ', ' ', 'A', 'c', 'a', ' ', 'b', ' ']
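Square brackets also accept ranges and negation. As a small sketch on the same sentence (the variable names are chosen only for illustration), `[A-Z]` picks out the uppercase letters, and `[^a-z ]` matches every character that is neither a lowercase letter nor a space:

```python
import re

sentence = 'Welcome to Code tips Academy Web Site'

# A range inside the brackets: any single uppercase letter
uppercase = re.findall(r'[A-Z]', sentence)
print(uppercase)  # ['W', 'C', 'A', 'W', 'S']

# A leading ^ negates the set: anything except a lowercase letter or a space
not_lower_or_space = re.findall(r'[^a-z ]', sentence)
print(not_lower_or_space)  # ['W', 'C', 'A', 'W', 'S']
```

Both calls return the same list here only because this sentence contains nothing but letters and spaces.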

So we use [ \t\n\r\f\v] to split the words in the tweet above on any whitespace character:

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

allwords = re.split(r'[ \t\n\r\f\v]', text)
print(allwords)

Output:

['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?', 'With', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']

As a matter of fact, we can use \s instead of [ \t\n\r\f\v] (for Python text strings, \s also matches other Unicode whitespace characters):

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

allwords = re.split(r'\s', text)
print(allwords)

As shown below, the result is the same as splitting by [ \t\n\r\f\v]:

['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?', 'With', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']
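One caveat when splitting on a single `\s`: two whitespace characters in a row leave an empty string in the result. A minimal sketch, using a made-up `sample` string that is not part of the tweet, shows how `\s+` (one or more whitespace characters) avoids that:

```python
import re

# Hypothetical sample with a double space and a tab followed by a newline
sample = 'Ever  wanted\t\nto sail'

# Splitting on a single whitespace character leaves empty strings behind
single = re.split(r'\s', sample)
print(single)  # ['Ever', '', 'wanted', '', 'to', 'sail']

# Splitting on one or more whitespace characters keeps only the words
multiple = re.split(r'\s+', sample)
print(multiple)  # ['Ever', 'wanted', 'to', 'sail']
```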

Popular Patterns in Regex

Symbol	Description
`.`	dot matches any character except newline
`\w`	matches any word character, i.e. a letter, digit or underscore (`_`)
`\W`	matches a single non-word character
`\d`	matches a single digit
`\D`	matches a single character that is not a digit
`\s`	matches any whitespace character such as `\n`, `\t` or a space
`\S`	matches a single non-whitespace character
`[abc]`	matches single character in the set i.e either match `a`, `b` or `c`
`[^abc]`	match a single character other than `a`, `b` and `c`
`[a-z]`	match a single character in the range `a` to `z`.
`[a-zA-Z]`	match a single character in the range a-z or A-Z
`[0-9]`	match a single character in the range `0`–`9`
`^`	matches at the start of the string
`$`	matches at the end of the string
`+`	matches one or more of the preceding expression (greedy match).
`*`	matches zero or more of the preceding expression (greedy match).
`a|b`	Matches either a or b.
`re{n,m}`	Matches at least n and at most m occurrences of preceding expression.
`re{n}`	Matches exactly n number of occurrences of preceding expression.
`re{n,}`	Matches n or more occurrences of preceding expression.
`re?`	Matches 0 or 1 occurrence of preceding expression.
`re+`	Matches 1 or more occurrence of preceding expression.
`re*`	Matches 0 or more occurrences of preceding expression.
`(re)`	Groups regular expressions and remembers the matched text. The whole pattern, including the parts before and after the parentheses, must match, but functions such as `re.findall()` return only the text matched inside the parentheses.
`(?imx)`	Turns on the `i`, `m`, or `x` options for the whole regular expression; in Python it must appear at the start of the pattern. To affect only part of the pattern, use `(?imx:re)`.
`(?:re)`	Groups regular expressions without remembering matched text.
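The difference between the two kinds of groups is easiest to see with `re.findall()`. The sketch below uses a made-up `sample` string (an assumption for illustration, not taken from the tweet):

```python
import re

# Hypothetical sample text, not part of the tweet
sample = 'user=alice id=42 user=bob id=7'

# Capturing group: findall returns only the text inside the parentheses
names = re.findall(r'user=(\w+)', sample)
print(names)  # ['alice', 'bob']

# Non-capturing group: findall returns the whole match
whole = re.findall(r'(?:user|id)=\w+', sample)
print(whole)  # ['user=alice', 'id=42', 'user=bob', 'id=7']
```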

Now we want to find all the hashtags in the tweet above:

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

hashTagResult = re.findall(r'#[a-zA-Z0-9_]+', text)
print(hashTagResult)

#[a-zA-Z0-9_]+ means: find all text that

Starts with #
Continues with a lowercase letter from a to z, an uppercase letter from A to Z, a digit from 0 to 9, or _
+ means the characters inside [ ] must occur at least once.

Output

['#SeaOfThieves']
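Since `\w` covers letters, digits and the underscore, the hashtag pattern can also be written more compactly. This is a sketch of an alternative, not a replacement for the pattern above: in Python 3, `\w` also matches non-English letters in text strings, unless the `re.ASCII` flag is passed.

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# \w stands in for [a-zA-Z0-9_] (and, by default, other Unicode word characters)
short_form = re.findall(r'#\w+', text)
print(short_form)  # ['#SeaOfThieves']

# With re.ASCII, \w is limited to exactly [a-zA-Z0-9_]
ascii_form = re.findall(r'#\w+', text, flags=re.ASCII)
print(ascii_form)  # ['#SeaOfThieves']
```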

If we want to extract all the callouts:

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

atSignResult = re.findall(r'@[a-zA-Z0-9_]+', text)
print(atSignResult)

@[a-zA-Z0-9_]+ means: find all text that

Starts with @
Continues with a lowercase letter from a to z, an uppercase letter from A to Z, a digit from 0 to 9, or _
+ means the characters inside [ ] must occur at least once.

Output

['@Xbox']

Example: get all words with at least 5 characters.

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

allwords = re.findall(r'\w{5,}', text)
print(allwords)

Output

['wanted', 'SeaOfThieves', 'custom', 'backgrounds', 'video', 'conferences', 'https', '6005TaGYD']
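To get words of exactly 5 characters, `\w{5}` alone is not enough, because it also matches the first 5 characters of longer words. One way to handle this, sketched below, is the word-boundary pattern `\b`, which is not listed in the table above: it matches the position between a word character and a non-word character.

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# Without boundaries, pieces of longer words are returned as well
chunks = re.findall(r'\w{5}', text)
print(chunks[:3])  # ['wante', 'SeaOf', 'Thiev']

# \b on both sides restricts the match to whole words of exactly 5 characters
exact = re.findall(r'\b\w{5}\b', text)
print(exact)  # ['video', 'https']
```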

In the next example we extract all the numbers from a text.

sampleText = 'there are 54 apples here. Temperature is -23. I have 2124 in my account.'
sampleTextResult = re.findall(r'[-+]?[0-9]+', sampleText)
print(sampleTextResult)

[-+]?[0-9]+ means the number may start with a - or + sign (? makes the sign optional and allows at most one of them), followed by at least one digit.

Output

['54', '-23', '2124']
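Note that `re.findall()` returns strings, not numbers. The sketch below converts the matches with `int()`, and then shows one possible extension of the pattern for decimal numbers; the `decimalText` sample and the extended pattern are assumptions added for illustration.

```python
import re

sampleText = 'there are 54 apples here. Temperature is -23. I have 2124 in my account.'

# Each match is a string; convert to int before doing arithmetic
matches = re.findall(r'[-+]?[0-9]+', sampleText)
numbers = [int(m) for m in matches]
print(numbers)       # [54, -23, 2124]
print(sum(numbers))  # 2155

# Hypothetical sample with decimals; (?:\.[0-9]+)? allows an optional fractional part
decimalText = 'It is -3.5 degrees and 12 items cost 7.25.'
decimals = re.findall(r'[-+]?[0-9]+(?:\.[0-9]+)?', decimalText)
print(decimals)      # ['-3.5', '12', '7.25']
```

The group for the fractional part is non-capturing so that `re.findall()` still returns the whole number rather than only the digits after the dot.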

To extract any URLs from the tweet above:

# Raw string, so that \( and \) are not treated as (invalid) string escapes
urlRegex = r'''http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'''
urlRegexResult = re.findall(urlRegex, text)
print(urlRegexResult)

(?:%[0-9a-fA-F][0-9a-fA-F]) matches percent-encoded characters in URLs, e.g. %2f for the ‘/’ character.

[s]? means the ‘s’ character is optional; that comes from the ?, not from the brackets, so s? would work the same way.

Output

['https://msft.it/6005TaGYD']

Now, suppose we want to extract all the hashtags and callouts from our sample tweet:

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

result = re.findall(r'(?:@[a-zA-Z0-9_]+|#[a-zA-Z0-9_]+)', text)
print(result)

Output

['#SeaOfThieves', '@Xbox']
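Because both alternatives differ only in their first character, the same result can be had with a character class. This is a sketch of an equivalent shorter form for this tweet, with `\w` standing in for `[a-zA-Z0-9_]`; the variable names are illustrative.

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# [@#] matches either prefix, \w+ matches the name that follows
tags = re.findall(r'[@#]\w+', text)
print(tags)  # ['#SeaOfThieves', '@Xbox']

# Separate the two kinds afterwards by looking at the first character
hashtags = [t for t in tags if t.startswith('#')]
callouts = [t for t in tags if t.startswith('@')]
print(hashtags, callouts)  # ['#SeaOfThieves'] ['@Xbox']
```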

You can also use a regular expression to check whether an input string has the correct format:

text = 'codetipsacademy@gmail.com'

result = re.match(r'^[\w\.\+\-]+\@[\w]+\.[a-z]{2,3}$',text)
print(result)

Result:

<re.Match object; span=(0, 25), match='codetipsacademy@gmail.com'>

As you can see, I used ^ and $ in the pattern. This means the text must begin at ^ and end at $ with the pattern in between, and nothing else may appear in the string. If the text contains anything in addition to the email address, matching fails.

text = 'My email is codetipsacademy@gmail.com'

result = re.match(r'^[\w\.\+\-]+\@[\w]+\.[a-z]{2,3}$',text)
print(result)

Result

None

So ^ and $ are usually used for validating a whole string, not for searching inside a longer text.
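To find an email address inside a longer text, one option is `re.search()` with the anchors removed, while `re.fullmatch()` gives the same whole-string check as `^` and `$`. The sketch below uses a slightly simplified form of the pattern above (the escapes inside the brackets are dropped, which does not change what it matches):

```python
import re

pattern = r'[\w.+-]+@\w+\.[a-z]{2,3}'
text = 'My email is codetipsacademy@gmail.com'

# re.search scans the whole string and returns the first match it finds
found = re.search(pattern, text)
print(found.group())  # codetipsacademy@gmail.com

# re.fullmatch succeeds only if the entire string matches the pattern
whole_text = re.fullmatch(pattern, text)
print(whole_text)  # None

only_email = re.fullmatch(pattern, 'codetipsacademy@gmail.com')
print(only_email is not None)  # True
```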
