A regex (regular expression) is a sequence of characters that defines a search pattern. Regex can be used to check whether a string contains the specified search pattern, or to find all occurrences of that pattern. The idea of regular expressions was introduced by the American mathematician Stephen Cole Kleene, who described regular languages.
In this tutorial I try to explain regex in a simple way, by example.
First example: we want to extract all the words from a tweet. The text of the tweet is:
Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD
As you can see, words are separated by spaces, so we split the text on the space character.
Don't forget to import the needed package first. All the code on this page uses Python's re package.
After importing re, use the code below to split the text on spaces.
To avoid confusion with backslashes in regular expressions, we write patterns as raw strings: r'expression'.
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
allwords = re.split(r' ', text)
print(allwords)
This code splits the text on spaces and prints the resulting list of words. Output:
['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?\nWith', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']
In the output above, one word is '#SeaOfThieves?\nWith'. Do you know what \n is in that word?
It is the newline character: whatever follows \n starts on a new line. In the text above, the word 'With' is on a new line.
If you want to split the text into lines:
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
allwords = re.split(r'\n', text)
print(allwords)
This code splits the text on newlines and prints the resulting list of lines. Output:
['Ever wanted to sail the #SeaOfThieves?', 'With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD']
A program that splits words correctly must consider spaces, \n, and other whitespace characters too. Below is a list of patterns that represent whitespace characters:
Pattern | Description |
---|
\n | New line |
\t | Tab |
\r | Carriage return |
\f | Form feed |
\v | Vertical tab |
Now, if we want to split words on any of the whitespace characters above, we can use [ ].
[] (square brackets)
Square brackets specify a set of characters you wish to match. Examples:
The re.findall() function finds all the matches for the pattern in the string.
# named "sentence" rather than "str" so the built-in str type is not shadowed
sentence = 'Welcome to Code tips Academy Web Site'
matches = re.findall(r'[abc]', sentence)
print(matches)
#Output: ['c', 'c', 'a', 'b']
matches = re.findall(r'[abc A]', sentence)
print(matches)
#Output: ['c', ' ', ' ', ' ', ' ', 'A', 'c', 'a', ' ', 'b', ' ']
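Square brackets also accept ranges and negation. The short sketch below reuses the same sample sentence; the variable names are only illustrative.

```python
import re

sentence = 'Welcome to Code tips Academy Web Site'

# A range inside the brackets: any single uppercase letter.
uppercase = re.findall(r'[A-Z]', sentence)
print(uppercase)  # ['W', 'C', 'A', 'W', 'S']

# A leading ^ negates the set: anything that is NOT a lowercase letter or a space.
not_lower_or_space = re.findall(r'[^a-z ]', sentence)
print(not_lower_or_space)  # ['W', 'C', 'A', 'W', 'S']
```

For this sentence both patterns return the same characters, because everything other than lowercase letters and spaces happens to be an uppercase letter.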
So we can use [ \t\n\r\f\v] to split the words in the tweet on any whitespace character:
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
allwords = re.split(r'[ \t\n\r\f\v]', text)
print(allwords)
Output:
['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?', 'With', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']
In fact, we can use \s instead of [ \t\n\r\f\v]. For ordinary (Unicode) strings, \s covers these characters and other Unicode whitespace characters as well.
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
allwords = re.split(r'\s', text)
print(allwords)
As shown below, the result is the same as splitting by [ \t\n\r\f\v]:
['Ever', 'wanted', 'to', 'sail', 'the', '#SeaOfThieves?', 'With', 'custom', '@Xbox', 'backgrounds', 'for', 'video', 'conferences,', 'now', 'you', 'can:', 'https://msft.it/6005TaGYD']
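One caveat, sketched here with a made-up string: \s matches a single whitespace character, so a run of several whitespace characters produces empty strings in the result. Adding the + quantifier treats each run as one separator.

```python
import re

# Hypothetical input with a double space and a blank line.
messy = "Ever  wanted\n\nto sail"

one_at_a_time = re.split(r'\s', messy)
print(one_at_a_time)  # ['Ever', '', 'wanted', '', 'to', 'sail']

runs_collapsed = re.split(r'\s+', messy)
print(runs_collapsed)  # ['Ever', 'wanted', 'to', 'sail']
```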
Popular Patterns in Regex
Symbol | Description |
---|
. | dot matches any character except newline |
\w | matches any word character, i.e. letters, digits and underscore (_) |
\W | matches non word characters |
\d | matches a single digit |
\D | matches a single character that is not a digit |
\s | matches any single whitespace character, such as \n, \t, or a space |
\S | matches any single non-whitespace character |
[abc] | matches single character in the set i.e either match a , b or c |
[^abc] | match a single character other than a , b and c |
[a-z] | match a single character in the range a to z . |
[a-zA-Z] | match a single character in the range a-z or A-Z |
[0-9] | match a single character in the range 0-9 |
^ | matches at the beginning of the string |
$ | matches at the end of the string (or just before a trailing newline) |
+ | matches one or more of the preceding character (greedy match). |
* | matches zero or more of the preceding character (greedy match). |
a|b | Matches either a or b. |
re{n,m} | Matches at least n and at most m occurrences of preceding expression. |
re{n} | Matches exactly n number of occurrences of preceding expression. |
re{n,} | Matches n or more occurrences of preceding expression. |
re? | Matches 0 or 1 occurrence of preceding expression. |
re+ | Matches 1 or more occurrence of preceding expression. |
re* | Matches 0 or more occurrences of preceding expression. |
(re) | Groups regular expressions and remembers matched text. The whole pattern must still match, but functions such as re.findall() return only the text matched inside the parentheses. |
(?imx) | Turns on the i, m, or x options from inside the pattern. In Python it must appear at the start of the pattern and applies to the whole pattern; use (?i:re) to limit an option to one group. |
(?:re) | Groups regular expressions without remembering matched text. |
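A few rows of the table can be tried out directly. The following is a small sketch with made-up sample strings, showing the ?, {n,m} quantifiers and the difference between capturing and non-capturing groups in re.findall().

```python
import re

# Hypothetical sample strings, chosen only to illustrate the table rows.
spellings = "color colour colouur"

optional_u = re.findall(r'colou?r', spellings)      # ? : zero or one 'u'
print(optional_u)  # ['color', 'colour']

one_or_two_u = re.findall(r'colou{1,2}r', spellings)  # {1,2} : one or two 'u'
print(one_or_two_u)  # ['colour', 'colouur']

dates = "2020-04-01 and 2021-12-25"

# Capturing group: findall returns only what the parentheses matched.
years = re.findall(r'(\d{4})-\d{2}-\d{2}', dates)
print(years)  # ['2020', '2021']

# Non-capturing group: findall returns the whole match.
full_dates = re.findall(r'(?:\d{4})-\d{2}-\d{2}', dates)
print(full_dates)  # ['2020-04-01', '2021-12-25']
```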
Now we want to find all the hashtags in the tweet above:
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
hashTagResult = re.findall(r'#[a-zA-Z0-9_]+', text)
print(hashTagResult)
#[a-zA-Z0-9_]+ means find all the text that
- starts with #
- continues with a lowercase letter from a to z, an uppercase letter from A to Z, a digit from 0 to 9, or _
- + means the characters inside [ ] must occur at least once.
Output:
['#SeaOfThieves']
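The set [a-zA-Z0-9_] can also be written with the \w shorthand from the table. The sketch below compares the two; the 'caf\u00e9' sample is a made-up string used to show where they differ, since \w also matches non-ASCII letters unless the re.ASCII flag is given.

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

explicit = re.findall(r'#[a-zA-Z0-9_]+', text)
shorthand = re.findall(r'#\w+', text)
print(explicit, shorthand)  # ['#SeaOfThieves'] ['#SeaOfThieves']

# Hypothetical hashtag with an accented letter.
accented = '#caf\u00e9'
unicode_words = re.findall(r'#\w+', accented)           # \w includes the accented letter
ascii_words = re.findall(r'#\w+', accented, re.ASCII)   # restricted to [a-zA-Z0-9_]
print(unicode_words, ascii_words)
```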
If we want to extract all the mentions (callouts):
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
atSignResult = re.findall(r'@[a-zA-Z0-9_]+', text)
print(atSignResult)
@[a-zA-Z0-9_]+ means find all the text that
- starts with @
- continues with a lowercase letter from a to z, an uppercase letter from A to Z, a digit from 0 to 9, or _
- + means the characters inside [ ] must occur at least once.
Output:
['@Xbox']
Example: get all words with at least 5 characters.
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
allwords = re.findall(r'\w{5,}', text)
print(allwords)
Output:
['wanted', 'SeaOfThieves', 'custom', 'backgrounds', 'video', 'conferences', 'https', '6005TaGYD']
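To match words of exactly five characters, {5} alone is not enough, because it also matches the first five characters of longer words. A minimal sketch, assuming the \b word-boundary anchor (not listed in the table above, but part of Python's re syntax):

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# Without boundaries, longer words are cut into 5-character chunks.
chunks = re.findall(r'\w{5}', text)
print(chunks[:3])  # ['wante', 'SeaOf', 'Thiev']

# \b requires a word boundary on both sides, so only whole 5-character words match.
exactly_five = re.findall(r'\b\w{5}\b', text)
print(exactly_five)  # ['video', 'https']
```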
In the next example we extract all the numbers from a text.
sampleText = 'there are 54 apples here. Temperature is -23. I have 2124 in my account.'
sampleTextResult = re.findall(r'[-+]?[0-9]+', sampleText)
print(sampleTextResult)
[-+]?[0-9]+ means the number may start with a - or + sign (the ? makes the sign optional and allows at most one of them), followed by at least one digit.
Output:
['54', '-23', '2124']
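re.findall() returns strings, so the matches still need converting before doing arithmetic. The sketch below also extends the pattern with an optional decimal part; the prices sentence is a made-up example, and the extended pattern is one possible variation rather than a complete number parser.

```python
import re

sample_text = 'there are 54 apples here. Temperature is -23. I have 2124 in my account.'

# Convert each matched string to an int.
integers = [int(n) for n in re.findall(r'[-+]?[0-9]+', sample_text)]
print(integers)  # [54, -23, 2124]

# Hypothetical text with decimal numbers.
prices = 'The price rose from 3.50 to +4.25, a change of -0.75.'

# (?:\.[0-9]+)? optionally matches a dot followed by at least one digit,
# so the sentence-ending dot after 0.75 is not included.
decimals = re.findall(r'[-+]?[0-9]+(?:\.[0-9]+)?', prices)
print(decimals)  # ['3.50', '+4.25', '-0.75']
```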
To extract any URLs from the tweet above:
# raw string, so \( and \) reach the regex engine unchanged
urlRegex = r'''http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'''
urlRegexResult = re.findall(urlRegex, text)
print(urlRegexResult)
(?:%[0-9a-fA-F][0-9a-fA-F]) matches hexadecimal character codes in URLs, e.g. %2f for the '/' character.
[s]? means the 's' character is optional, but that is because of the ?, not the brackets.
Output
['https://msft.it/6005TaGYD']
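For quick jobs, a much shorter pattern is often enough. This is a rough sketch, not a replacement for the pattern above: it takes everything up to the next whitespace character, so it can pick up trailing punctuation, as the second, made-up sentence shows.

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# https? : 'http' with an optional 's'; \S+ : one or more non-whitespace characters.
simple_urls = re.findall(r'https?://\S+', text)
print(simple_urls)  # ['https://msft.it/6005TaGYD']

# Hypothetical sentence where the URL is followed by a period.
with_period = re.findall(r'https?://\S+', 'Read https://msft.it/6005TaGYD.')
print(with_period)  # ['https://msft.it/6005TaGYD.']
```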
Now we extract all the hashtags and mentions from our sample tweet in one pass:
text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''
result = re.findall(r'(?:@[a-zA-Z0-9_]+|#[a-zA-Z0-9_]+)', text)
print(result)
Output
['#SeaOfThieves', '@Xbox']
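The same search can be written more compactly with a character set, and capturing groups make it possible to tell hashtags and mentions apart afterwards. A minimal sketch; the variable names are illustrative, and \w stands in for [a-zA-Z0-9_].

```python
import re

text = '''Ever wanted to sail the #SeaOfThieves?
With custom @Xbox backgrounds for video conferences, now you can: https://msft.it/6005TaGYD'''

# With two capturing groups, findall returns (symbol, name) tuples.
pairs = re.findall(r'([@#])(\w+)', text)
print(pairs)  # [('#', 'SeaOfThieves'), ('@', 'Xbox')]

hashtags = [name for symbol, name in pairs if symbol == '#']
mentions = [name for symbol, name in pairs if symbol == '@']
print(hashtags, mentions)  # ['SeaOfThieves'] ['Xbox']
```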
You can also use a regular expression to check whether an input string has the correct format.
text = 'codetipsacademy@gmail.com'
result = re.match(r'^[\w\.\+\-]+\@[\w]+\.[a-z]{2,3}$',text)
print(result)
Result :
<re.Match object; span=(0, 25), match='codetipsacademy@gmail.com'>
As you can see, I used ^ and $ in the pattern. This means the text must start at ^ and end at $ with the pattern in between, so nothing else may be in the string. If the text contains anything in addition to the email address, matching fails.
text = 'My email is codetipsacademy@gmail.com'
result = re.match(r'^[\w\.\+\-]+\@[\w]+\.[a-z]{2,3}$',text)
print(result)
Result
None
So, use ^ and $ when validating that a whole string has a given format, not when searching for a pattern inside a longer text.
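To find the email address inside a longer text, drop the anchors and use re.search(). For validation, re.fullmatch() has the same effect as wrapping the pattern in ^ and $. A short sketch using a slightly simplified version of the pattern above:

```python
import re

# Same idea as the pattern above, without ^ and $.
email_pattern = r'[\w.+-]+@\w+\.[a-z]{2,3}'
sentence = 'My email is codetipsacademy@gmail.com'

# search() scans the string and returns the first match anywhere in it.
found = re.search(email_pattern, sentence)
print(found.group())  # codetipsacademy@gmail.com

# fullmatch() succeeds only if the whole string fits the pattern.
whole_sentence = re.fullmatch(email_pattern, sentence)
print(whole_sentence)  # None

valid = re.fullmatch(email_pattern, 'codetipsacademy@gmail.com')
print(valid is not None)  # True
```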