Regular Expressions

Searching for Digits

\d+ is a simpler way to search for one or more digits than [0-9]+ or [0-9][0-9]*. Note that [0-9][0-9] without a quantifier matches exactly two digits.
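As a quick illustration of the difference, the sketch below runs both patterns over a made-up sample string (the string is an assumption chosen for the demo):

```python
import re

text = "Order 66 shipped 1024 items in 3 boxes"

# \d+ matches each maximal run of one or more digits.
digit_runs = re.findall(r"\d+", text)

# [0-9][0-9] consumes exactly two digits per match, so it splits long
# runs and misses single digits.
two_digit_chunks = re.findall(r"[0-9][0-9]", text)

print(digit_runs)        # ['66', '1024', '3']
print(two_digit_chunks)  # ['66', '10', '24']
```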

Example: Find all hashtags and links in a tweet:

return re.findall(r'((?:#|http)\S+)', tweet)
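To try the pattern end to end, here is a minimal runnable sketch. The function name extract_tags_and_links and the sample tweet are hypothetical, chosen only for this example:

```python
import re

def extract_tags_and_links(tweet):
    """Return every hashtag and every http(s) link found in the tweet."""
    # (?:#|http) matches a leading "#" or "http"; \S+ takes the rest of the token.
    return re.findall(r"((?:#|http)\S+)", tweet)

sample = "Learning #regex with #python at https://example.com/re today"
found = extract_tags_and_links(sample)
print(found)  # ['#regex', '#python', 'https://example.com/re']
```

Because "https" starts with "http", the same pattern picks up both http and https links.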

Capturing Group

Example:

((?:#|http)\S+)

The innermost group is non-capturing:

(?:#|http)

The | acts as a boolean OR, so this group matches either “#” or “http”.

In its entirety, the above group means “#” or “http” followed by one or more non-whitespace characters.
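Why the inner group is marked non-capturing becomes clearer when comparing what re.findall() returns in each case. A small sketch, using an assumed sample tweet:

```python
import re

tweet = "See #pybites at http://pybit.es"

# Inner group is non-capturing: findall returns the single outer group.
non_capturing = re.findall(r"((?:#|http)\S+)", tweet)

# Inner group is capturing: findall returns one tuple per match,
# ordered by opening parenthesis (outer group, inner group).
capturing = re.findall(r"((#|http)\S+)", tweet)

print(non_capturing)  # ['#pybites', 'http://pybit.es']
print(capturing)      # [('#pybites', '#'), ('http://pybit.es', 'http')]
```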

re.sub()

Example:

import re

def match_first_paragraph():
    """Write a regular expression that returns 'pybites != greedy'."""
    html = ('<p>pybites != greedy</p>'
            '<p>not the same can be said REgarding ...</p>')
    # The lazy .*? stops at the first </p>; \1 keeps only the captured text.
    return re.sub(r'^<p>(.*?)</p>.*$', r'\1', html)
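The ? after .* is what makes this work. A short sketch comparing the lazy and greedy forms on the same input (variable names are illustrative):

```python
import re

html = ("<p>pybites != greedy</p>"
        "<p>not the same can be said REgarding ...</p>")

# Lazy .*? expands as little as possible, so group 1 ends at the first </p>.
lazy = re.sub(r"^<p>(.*?)</p>.*$", r"\1", html)

# Greedy .* expands as far as possible, so group 1 runs to the last </p>.
greedy = re.sub(r"^<p>(.*)</p>.*$", r"\1", html)

print(lazy)    # pybites != greedy
print(greedy)  # pybites != greedy</p><p>not the same can be said REgarding ...
```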

re.compile()

Signature:

re.compile(pattern, flags=0)

Compiles a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the re module's flag constants (for example re.IGNORECASE, re.MULTILINE or re.DOTALL), combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
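A minimal sketch of compiling once with combined flags and reusing the object. The pattern and sample text are assumptions made up for the demo:

```python
import re

# Compile once; flags are combined with the | operator.
# IGNORECASE ignores letter case, MULTILINE lets ^ match at every line start.
line_start_python = re.compile(r"^python\b", re.IGNORECASE | re.MULTILINE)

text = "Python is fun\nI like python\npython rocks"

# Reuse the same compiled object with different methods.
all_hits = line_start_python.findall(text)
has_hit = line_start_python.search("PYTHON 3") is not None
no_hit = line_start_python.match("I like python")

print(all_hits)  # ['Python', 'python']
print(has_hit)   # True
print(no_hit)    # None
```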

Regular Expressions Reference

Character set

Match any character in the set.

[aeiou]

Negated set

Match any character not in the set.

[^aeiou]

Range

Matches a character having a character code between the two specified characters inclusive.

[g-s]
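The three bracket forms above can be compared on one sample word (the word is an arbitrary choice for the demo):

```python
import re

word = "garlic"

# [aeiou] keeps only characters in the set.
vowels = re.findall(r"[aeiou]", word)

# [^aeiou] keeps only characters NOT in the set.
non_vowels = re.findall(r"[^aeiou]", word)

# [g-s] keeps characters whose code lies between "g" and "s" inclusive.
g_to_s = re.findall(r"[g-s]", word)

print(vowels)      # ['a', 'i']
print(non_vowels)  # ['g', 'r', 'l', 'c']
print(g_to_s)      # ['g', 'r', 'l', 'i']
```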

Dot

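Matches any character except a newline. In Python's re module this is equivalent to [^\n] (a carriage return \r is still matched); with the re.DOTALL flag the dot matches newlines as well.

.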
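A quick check of how the dot behaves in Python's re module specifically, where only \n is excluded by default and re.DOTALL removes even that exception. The sample string is an assumption for the demo:

```python
import re

text = "a\nb\rc"

# By default the dot skips the newline but still matches the carriage return.
default_dot = re.findall(r".", text)

# With re.DOTALL the dot matches every character, newline included.
dotall_dot = re.findall(r".", text, re.DOTALL)

print(default_dot)  # ['a', 'b', '\r', 'c']
print(dotall_dot)   # ['a', '\n', 'b', '\r', 'c']
```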

word

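Matches any word character (alphanumeric and underscore). For str patterns in Python 3 this includes Unicode letters and digits, such as accented or non-Roman characters; with the re.ASCII flag it matches only low-ASCII characters and is equivalent to [A-Za-z0-9_].

\w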
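In Python 3, whether \w accepts accented characters depends on the re.ASCII flag. A small sketch, using an assumed sample string with the accented letter written as an escape so the code point is unambiguous:

```python
import re

text = "caf\u00e9_1"  # "café_1" with a precomposed e-acute

# Default for str patterns: Unicode-aware, so the accented letter counts.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [A-Za-z0-9_], splitting the word at the accent.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['café_1']
print(ascii_words)    # ['caf', '_1']
```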

not word

\W

digit

\d

not digit

\D

whitespace

\s

not whitespace

\S

beginning

^

end

$

word boundary

\b

not word boundary

\B
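Word boundaries are easiest to see with a substitution. The sketch below replaces "cat" only where the boundary conditions hold; the sample sentence and the replacement text are made up for the demo:

```python
import re

text = "cat concat cats scatter"

# \bcat\b: "cat" must stand alone as a whole word.
whole_word = re.sub(r"\bcat\b", "DOG", text)

# \Bcat\B: "cat" must have word characters on both sides.
inside_word = re.sub(r"\Bcat\B", "DOG", text)

print(whole_word)   # DOG concat cats scatter
print(inside_word)  # cat concat cats sDOGter
```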

tab

\t

line feed

\n

null

\0

Capturing group

Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference.

(ha)+

numeric reference

Matches the results of a capture group. For example \1 matches the results of the first capture group and \3 matches the third.
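A common use of a numeric reference is spotting accidentally doubled words. A minimal sketch, with an assumed sample sentence:

```python
import re

# (\w+) captures a word; \1 then requires the very same text again.
doubled = re.compile(r"\b(\w+) \1\b")

text = "this is is a test of the the parser"

repeats = doubled.findall(text)
# In the replacement string, \1 refers to the same captured group.
cleaned = doubled.sub(r"\1", text)

print(repeats)  # ['is', 'the']
print(cleaned)  # this is a test of the parser
```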

Plus

Matches 1 or more of the preceding token.

b\w+

Star

Matches 0 or more of the preceding token.

b\w*

Quantifier

Matches the specified quantity of the previous token. {1,3} will match 1 to 3. {3} will match exactly 3. {3,} will match 3 or more.

b\w{2,3}
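The plus, star and brace quantifiers can be compared side by side on one string. The sample text is an assumption picked so each pattern gives a different result:

```python
import re

text = "b be bee beer beers"

# + needs at least one word character after "b", so the lone "b" is skipped.
plus = re.findall(r"b\w+", text)

# * allows zero word characters, so the lone "b" matches too.
star = re.findall(r"b\w*", text)

# {2,3} needs two or three word characters after "b" and takes no more.
braces = re.findall(r"b\w{2,3}", text)

print(plus)    # ['be', 'bee', 'beer', 'beers']
print(star)    # ['b', 'be', 'bee', 'beer', 'beers']
print(braces)  # ['bee', 'beer', 'beer']
```

The last "beer" comes from "beers": the quantifier stops after three word characters and leaves the trailing "s" unmatched.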

Alternation

Acts like a boolean OR. Matches the expression before or after the |.

It can operate within a group, or on a whole expression. The patterns will be tested in order.

b(a|e|i)d
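Both points, alternation inside a group and left-to-right testing, can be sketched as follows (sample strings are assumptions for the demo):

```python
import re

# Alternation inside a non-capturing group: only the middle letter varies.
in_group = re.findall(r"b(?:a|e|i)d", "bad bed bid bod bud")

# Alternatives are tried left to right and the first one that succeeds wins,
# so a shorter alternative listed first can shadow a longer one.
short_first = re.search(r"cat|catalog", "catalog").group()
long_first = re.search(r"catalog|cat", "catalog").group()

print(in_group)     # ['bad', 'bed', 'bid']
print(short_first)  # cat
print(long_first)   # catalog
```

When alternatives share a prefix, listing the longer one first is usually the safer ordering.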