Regular expressions

Class outline:

Declarative languages
Regular expression syntax
Regular expressions in Python

Declarative languages

Declarative programming

In imperative languages:

A "program" is a description of computational processes
The interpreter carries out execution/evaluation rules

In declarative languages:

A "program" is a description of the desired result
The interpreter figures out how to generate the result
Examples:
- Regular expressions: Good (?:morning|evening)
- Backus-Naur Form:
  ?calc_expr: NUMBER | calc_op
  calc_op: "(" OPERATOR calc_expr* ")"
  OPERATOR: "+" | "-" | "*" | "/"

Domain-specific languages

Many declarative languages are domain-specific: they are designed to tackle problems in a particular domain, rather than being general-purpose programming languages.

Language	Domain
Regular expressions	Pattern-matching strings
Backus-Naur Form	Parsing strings into parse trees
SQL	Querying and modifying database tables
HTML	Describing the semantic structure of webpage content
CSS	Styling webpages based on selectors
Prolog	Describing and querying logical relations

Regular expressions

Pattern matching

Pattern matching in strings is a common problem in computer programming.

An imperative approach:


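                    def is_email_address(text):
                        # Exactly one '@' must separate the name from the domain.
                        parts = text.split('@')
                        if len(parts) != 2:
                            return False
                        # The domain needs at least one '.', followed by a 3-character ending.
                        domain_parts = parts[1].split('.')
                        return len(domain_parts) >= 2 and len(domain_parts[-1]) == 3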

A roughly equivalent regular expression:


                    (.+)@(.+)\.(.{3})

With regular expressions, a programmer can just describe the pattern using a common syntax, and a regular expression engine figures out how to do the pattern matching for them.
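As a rough sketch, the pattern above could be applied from Python like this. The `re` module is covered later in these slides, and the names `EMAIL_PATTERN` and `looks_like_email` are made up for illustration:

```python
import re

# Hypothetical names; the pattern itself is the one shown on the slide.
EMAIL_PATTERN = r'(.+)@(.+)\.(.{3})'

def looks_like_email(text):
    """Return True if the whole string fits the simple email pattern."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None

print(looks_like_email('oski@berkeley.edu'))   # True
print(looks_like_email('oski.berkeley.edu'))   # False: no @
print(looks_like_email('a@b.co'))              # False: ending is not 3 characters
```

The pattern says what an acceptable string looks like; the splitting and length checks are left to the regular expression engine.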

Matching exact strings

The following are special characters in regular expressions: \ ( ) [ ] { } + * ? | $ ^ .

To match an exact string that has no special characters, just use the string:


                    Berkeley, CA 94720

But if the matched string contains special characters, they must be escaped using a backslash.


                    \(1\+3\)
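A small sketch of why the escaping matters, using Python's `re` module (introduced later in these slides). `re.escape` is a standard-library helper that adds the backslashes for you:

```python
import re

# Hand-escaped pattern from the slide: matches the literal text (1+3).
print(bool(re.fullmatch(r'\(1\+3\)', '(1+3)')))   # True

# Unescaped, the parentheses form a group and + is a quantifier,
# so the pattern matches strings like '13' or '113' instead.
print(bool(re.fullmatch(r'(1+3)', '(1+3)')))      # False
print(bool(re.fullmatch(r'(1+3)', '113')))        # True

# re.escape escapes the special characters automatically.
print(re.escape('(1+3)'))                         # \(1\+3\)
```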

The dot

The . character matches any single character except a newline.


                    .a.a.a
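To see which strings this pattern fully matches, here is a quick sketch in Python (the candidate strings are arbitrary examples):

```python
import re

pattern = r'.a.a.a'
candidates = ['banana', 'papaya', 'bananas', 'bana\na']

for candidate in candidates:
    print(repr(candidate), bool(re.fullmatch(pattern, candidate)))
# 'banana' True
# 'papaya' True
# 'bananas' False   (one character too many)
# 'bana\na' False   (the dot does not match a newline)
```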

It's typically better to match a more specific range of characters, however...

Character classes

Pattern	Description	Example
`[]`	Denotes a character class. Matches characters in a set (including ranges of characters like `0-9`). Use `[^]` to match characters outside a set.	`[top]` `[h-p]`
`.`	Matches any character other than the newline character.	`1.`
`\d`	Matches any digit character. Equivalent to `[0-9]`. `\D` matches the inverse (all non-digit characters).	`\d\d`
`\w`	Matches any word character. Equivalent to `[A-Za-z0-9_]`. `\W` matches the inverse.	`\d\w`
`\s`	Matches any whitespace character: spaces, tabs, or line breaks. `\S` matches the inverse.	`\d\s\w`
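A few of the character classes from the table, tried out with `re.findall` (described later in these slides). The sample strings are invented for illustration:

```python
import re

print(re.findall(r'[top]', 'laptop'))              # ['p', 't', 'o', 'p']
print(re.findall(r'[^top]', 'laptop'))             # ['l', 'a']
print(re.findall(r'\d\d', 'CS 61A, room 306'))     # ['61', '30']
print(re.findall(r'\d\s\w', '3 apples, 4 pears'))  # ['3 a', '4 p']
```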

Quantifiers

These indicate how many of a character/character class to match.

Pattern	Description	Example
`*`	Matches 0 or more of the previous pattern.	`a*`
`+`	Matches 1 or more of the previous pattern.	`lo+l`
`?`	Matches 0 or 1 of the previous pattern.	`lo?l`
`{}`	Used like `{Min,Max}` (no space after the comma). Matches between Min and Max repetitions of the previous pattern. `{N}` matches exactly N repetitions, and `{Min,}` matches Min or more.	`a{2}` `a{2,}` `a{2,4}`
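The quantifiers in the table can be compared side by side with a short Python sketch (the test words are arbitrary):

```python
import re

for word in ['ll', 'lol', 'lool']:
    print(word,
          bool(re.fullmatch(r'lo*l', word)),   # 0 or more o's
          bool(re.fullmatch(r'lo+l', word)),   # 1 or more o's
          bool(re.fullmatch(r'lo?l', word)))   # 0 or 1 o
# ll True False True
# lol True True True
# lool True True False

# Between 2 and 4 a's; a lone 'a' is skipped, and 'aaaaa' yields only 'aaaa'.
print(re.findall(r'a{2,4}', 'a aa aaa aaaaa'))  # ['aa', 'aaa', 'aaaa']
```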

Combining patterns

Patterns P₁ and P₂ can be combined in various ways.

Combination	Description	Example
`P₁P₂`	A match for P₁ followed immediately by one for P₂.	`ab[.,]`
`P₁|P₂`	Matches anything that either P₁ or P₂ does.	`\d+|Inf`
`(P₁)`	Matches whatever P₁ does. Parentheses group, just as in arithmetic expressions.	`(<3)+`
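A brief sketch of alternation and grouping in Python, using the example patterns from the table on made-up input:

```python
import re

# Alternation: either a run of digits or the literal text Inf.
print(re.findall(r'\d+|Inf', '3 Inf 42'))        # ['3', 'Inf', '42']

# With parentheses, + repeats the whole group <3.
print(bool(re.fullmatch(r'(<3)+', '<3<3<3')))    # True

# Without parentheses, + applies only to the 3.
print(bool(re.fullmatch(r'<3+', '<3<3<3')))      # False
print(bool(re.fullmatch(r'<3+', '<333')))        # True
```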

Anchors

These don't match an actual character; instead, they indicate the position where the surrounding pattern should be found.

Pattern	Description	Example	What parts match?
`^`	Matches the beginning of a string.	`^aw+`	aww aww (only the first "aww")
`$`	Matches the end of a string.	`\w+y$`	stay stay (only the last "stay")
`\b`	Matches a word boundary, the beginning or end of a word.	`\w+e\b`	broken bridge team (only "bridge")
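The anchor examples from the table can be checked in Python. This sketch uses `re.findall` and `re.search`, which are described later in these slides:

```python
import re

print(re.findall(r'^aw+', 'aww aww'))               # ['aww']
print(re.search(r'^aw+', 'aww aww').span())         # (0, 3): the first aww

print(re.findall(r'\w+y$', 'stay stay'))            # ['stay']
print(re.search(r'\w+y$', 'stay stay').span())      # (5, 9): the last stay

print(re.findall(r'\w+e\b', 'broken bridge team'))  # ['bridge']
```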

Regular expressions in Python

Support for regular expressions

Regular expressions are supported natively in many languages and tools.

Languages: Perl, ECMAScript, Java, Python, ...

Tools: Excel/Google Spreadsheets, SQL, BigQuery, VSCode, grep, ...

Raw strings

In normal Python strings, a backslash indicates an escape sequence, like \n for newline or \b for backspace.


                    >>> print("I have\na newline in me.")
                    I have
                    a newline in me.

But backslash has a special meaning in regular expressions. To make it easy to write regular expressions in Python strings, use raw strings by prefixing the string with an r:


                    pattern = r"\b[ab]+\b"
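A small sketch of what goes wrong without the `r` prefix (the sample text is invented):

```python
import re

print(len("\b"))    # 1: a single backspace character
print(len(r"\b"))   # 2: a backslash followed by the letter b

text = "abba cab baa"

# Raw string: \b reaches the regex engine and means "word boundary".
print(re.findall(r"\b[ab]+\b", text))   # ['abba', 'baa']

# Normal string: Python turns \b into a backspace first,
# so the pattern looks for literal backspace characters.
print(re.findall("\b[ab]+\b", text))    # []
```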

The re module

The re module provides many helpful functions.

Function	Description
`re.search(pattern, string)`	returns a `Match` object representing the first occurrence of pattern within string
`re.fullmatch(pattern, string)`	returns a `Match` object, requiring that pattern matches the entirety of string
`re.match(pattern, string)`	returns a `Match` object, requiring that string starts with a substring that matches pattern
`re.findall(pattern, string)`	returns a list of strings representing all matches of pattern within string, from left to right
`re.sub(pattern, repl, string)`	substitutes all matches of pattern within string with repl
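The functions in the table can be compared on a single string. The text below is an invented example:

```python
import re

text = 'CA 94720, NY 13078'

print(re.search(r'\d+', text).group())     # 94720
print(re.match(r'\d+', text))              # None: text starts with 'CA', not a digit
print(re.match(r'[A-Z]+', text).group())   # CA
print(re.fullmatch(r'[A-Z]+', text))       # None: pattern must cover the whole string
print(re.findall(r'\d+', text))            # ['94720', '13078']
print(re.sub(r'\d', '#', text))            # CA #####, NY #####
```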

Match objects

The functions re.search, re.match, and re.fullmatch all take a string containing a regular expression and a string of text. They return either a Match object or, if there is no match, None.

re.search requires that the pattern exists somewhere in the string:


                    import re

                    re.search(r'-?\d+', '123 peeps')           # <re.Match object>
                    re.search(r'-?\d+', 'So many peeps')       # None

Match objects are treated as true values, so you can use the result as a boolean:


                    bool(re.search(r'-?\d+', '123'))           # True
                    bool(re.search(r'-?\d+', 'So many peeps')) # False

Inspecting a match

re.search returns a Match object representing the first occurrence of pattern within string.


                    title = "I Know Why the Caged Bird Sings"
                    re.search(r'Bird', title)   # <re.Match object; span=(21, 25), match='Bird'>

Match objects carry information about what has been matched. The Match.group() method allows you to retrieve it.


                    x = "This string contains 35 characters."
                    mat = re.search(r'\d+', x)
                    mat.group(0)  # '35'
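Besides `group()`, Match objects also expose where the match was found. A short sketch using the standard `span()`, `start()`, and `end()` methods:

```python
import re

title = "I Know Why the Caged Bird Sings"
mat = re.search(r'Bird', title)

print(mat.group())                    # Bird
print(mat.span())                     # (21, 25)
print(title[mat.start():mat.end()])   # Bird
```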

Match groups

If there are parentheses in a pattern, each parenthesized group becomes a group in the Match object.


                    x = "There were 12 pence in a shilling and 20 shillings in a pound."
                    mat = re.search(r'(\d+)[a-z\s]+(\d+)', x)


                    mat.group(0)  # '12 pence in a shilling and 20'
                    mat.group(1)  # '12'
                    mat.group(2)  # '20'
                    mat.groups()  # ('12', '20')

Parentheses are also commonly used just to apply quantifiers or other modifiers to part of a pattern; those parentheses create match groups as well.
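When parentheses are needed only for grouping, Python's `(?:...)` syntax (seen earlier in `Good (?:morning|evening)`) groups without capturing. A minimal sketch:

```python
import re

greeting = 'Good morning'

# Plain parentheses group the alternatives and also capture a group.
capturing = re.fullmatch(r'Good (morning|evening)', greeting)
print(capturing.groups())        # ('morning',)

# (?:...) groups the alternatives without capturing anything.
non_capturing = re.fullmatch(r'Good (?:morning|evening)', greeting)
print(non_capturing.groups())    # ()

# A quantified capturing group keeps only its last repetition.
repeated = re.fullmatch(r'(<3)+', '<3<3<3')
print(repeated.groups())         # ('<3',)
```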

Finding multiple matches

re.findall() returns a list of strings representing all matches of pattern within string, from left to right.


                    locations = "CA 91105, NY 13078, CA 94702"
                    re.findall(r'\d\d\d\d\d', locations)
                    # ['91105', '13078', '94702']
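One detail worth knowing: when the pattern contains parenthesized groups, `re.findall` returns the groups rather than the full matches. A sketch using the same `locations` string:

```python
import re

locations = "CA 91105, NY 13078, CA 94702"

# No groups: each result is the full matched text.
print(re.findall(r'[A-Z]{2} \d{5}', locations))
# ['CA 91105', 'NY 13078', 'CA 94702']

# Two groups: each result is a tuple of the captured groups.
print(re.findall(r'([A-Z]{2}) (\d{5})', locations))
# [('CA', '91105'), ('NY', '13078'), ('CA', '94702')]
```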

Resolving ambiguity

Ambiguous matches

Regular expressions can match a given string in more than one way. Especially when there are parenthesized groups, this can lead to ambiguity:


                    mat = re.match(r'wind|window', 'window')
                    mat.group()  # 'wind'
                    
                    mat = re.match(r'window|wind', 'window')
                    mat.group() # 'window'
                    
                    mat = re.match(r'(wind|window)(.*)shade', 'window shade')
                    mat.groups() # ('wind', 'ow ')
                    
                    mat = re.match(r'(window|wind)(.*)shade', 'window shade')
                    mat.groups() # ('window', ' ')

Python resolves these particular ambiguities in favor of the first option.

Ambiguous quantifiers

Likewise, there is ambiguity with *, +, and ?.


                    mat = re.match(r'(x*)(.*)', 'xxx')
                    mat.groups()  # ('xxx', '')

                    mat = re.match(r'(x+)(.*)', 'xxx')
                    mat.groups()  # ('xxx', '')
                    
                    mat = re.match(r'(x?)(.*)', 'xxx')
                    mat.groups()  # ('x', 'xx')
                    
                    mat = re.match(r'(.*)/(.+)', '12/10/2020')
                    mat.groups()  # ('12/10', '2020')

Python chooses to match greedily, matching the pattern left-to-right and, when given a choice, matching as much as possible while still allowing the rest of the pattern to match.

Lazy operators

Sometimes, you don’t want to match as much as possible.

The lazy operators *?, +?, and ?? match only as much as necessary for the whole pattern to match.


                    mat = re.match(r'(.*)(\d*)', 'I have 5 dollars')
                    mat.groups() # ('I have 5 dollars', '')
                    
                    mat = re.match(r'(.*?)(\d+)', 'I have 5 dollars')
                    mat.groups() # ('I have ', '5')
                    
                    mat = re.match(r'(.*?)(\d*)', 'I have 5 dollars')
                    mat.groups() # ('', '')
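A common place where the greedy/lazy difference shows up is pulling out delimited pieces of text. This sketch uses an invented sentence with quoted words:

```python
import re

text = 'say "hi" and "bye"'

# Greedy: .+ runs to the last quote in the string.
print(re.findall(r'".+"', text))    # ['"hi" and "bye"']

# Lazy: .+? stops at the first quote that lets the pattern match.
print(re.findall(r'".+?"', text))   # ['"hi"', '"bye"']
```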

The ambiguities introduced by *, +, ?, and | don’t matter if all you care about is whether there is a match!

Exercises

Name That Pattern! #1


                    [A-Za-z]{3}

What's a valid input? AUS, aus
What's an invalid input? australia, au

Name That Pattern! #2


                    \d{4}-\d{2}-\d{2}

What's a valid input? 2020-03-13
What's an invalid input? 2020/03/13, 03-13-2020

Name That Pattern! #3


                    [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$

What's a valid input? someone@someplace.org
What's an invalid input? someone@mod%cloth.co

RegEx Makeover! #1

Let's make a regular expression to match 24-hour times of the format HH:MM.

First draft: [0-2]\d:\d\d

What invalid times would that match? 24:99
How do we fix minutes? [0-2]\d:[0-5]\d
How do we fix hours? ((2[0-3])|([0-1]\d)):[0-5]\d
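The final pattern can be checked against a few candidate times in Python. The name `TIME_PATTERN` and the test strings are chosen here for illustration:

```python
import re

TIME_PATTERN = r'((2[0-3])|([0-1]\d)):[0-5]\d'
candidates = ['00:00', '09:30', '23:59', '24:00', '12:60', '7:15']

for candidate in candidates:
    print(candidate, bool(re.fullmatch(TIME_PATTERN, candidate)))
# 00:00 True
# 09:30 True
# 23:59 True
# 24:00 False
# 12:60 False
# 7:15 False   (the pattern requires a two-digit hour)
```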

Try in regexr.com!

Exercise: Stocks

Make a regular expression to match any tweet talking about GME stock.


                    import re

                    def match_gme(tweet):
                        """
                        >>> match_gme('GME')
                        True
                        >>> match_gme('yooo buy GME right now!')
                        True
                        >>> match_gme('#HUGME')
                        False
                        >>> match_gme('#HUGMEHARDER')
                        False
                        """
                        return bool(re.search(______, tweet))

Tips

When learning, use sites like regexr.com
Get used to referencing a regular expressions cheat sheet

⚠️ A word of caution ⚠️

Regular expressions can be very useful. However:

Very long regular expressions can be difficult for other programmers to read and modify. 🤯
See also: "write-only" code, which is easier to write than to read.
Since regular expressions are declarative, it's not always clear how efficiently they'll be processed. 🐌 Some processing can be so time-consuming, it can take down a server.
Regular expressions can't parse everything! Don't write an HTML parser with regular expressions.
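One way to soften the readability problem, offered here as a suggestion rather than something from the slides, is Python's `re.VERBOSE` flag. It lets a pattern span several lines with comments. The sketch below rewrites the email pattern from Name That Pattern! #3, and the name `EMAIL` is made up for illustration:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
EMAIL = re.compile(r"""
    [a-z0-9._%+-]+    # local part, before the @
    @
    [a-z0-9.-]+       # domain name
    \.
    [a-z]{2,}         # top-level domain, at least two letters
    $
""", re.VERBOSE)

print(bool(EMAIL.fullmatch('someone@someplace.org')))   # True
print(bool(EMAIL.fullmatch('someone@mod%cloth.co')))    # False
```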