python_regex.txt

\w - matches an Unicode word character. That's any letter, uppercase or lowercase, numbers, and the underscore character. In "new-releases-204", \w would match each of the letters in "new" and "releases" and the numbers 2, 0, and 4. It wouldn't match the hyphens. \W - is the opposite to \w and matches anything that isn't an Unicode word character. In "new-releases-204", \W would only match the hyphens. \s - matches whitespace, so spaces, tabs, newlines, etc. \S - matches everything that isn't whitespace. \d - is how we match any number from 0 to 9 \D - matches anything that isn't a number. \b - matches word boundaries. What's a word boundary? It's the edges of word, defined by white space or the edges of the string. \B - matches anything that isn't the edges of a word. \w{3} - matches any three word characters in a row. \w{,3} - matches 0, 1, 2, or 3 word characters in a row. \w{3,} - matches 3 or more word characters in a row. There's no upper limit. \w{3, 5} - matches 3, 4, or 5 word characters in a row. \w? - matches 0 or 1 word characters. \w* - matches 0 or more word characters. Since there is no upper limit, this is, effectively, infinite word characters. \w+ - matches 1 or more word characters. Like *, it has no upper limit, but it has to occur at least once. .findall(pattern, text, flags) - Finds all non-overlapping occurrences of the pattern in the text. [abc] - this is a set of the characters 'a', 'b', and 'c'. It'll match any of those characters, in any order, but only once each. [a-z], [A-Z], or [a-zA-Z] - ranges that'll match any/all letters in the English alphabet in lowercase, uppercase, or both upper and lowercases. [0-9] - range that'll match any number from 0 to 9. You can change the ends to restrict the set. [^abc] - a set that will not match, and, in fact, exclude, the letters 'a', 'b', and 'c'. re.IGNORECASE or re.I - flag to make a search case-insensitive. re.match('A', 'apple', re.I) would find the 'a' in 'apple'. re.VERBOSE or re.X - flag that allows regular expressions to span multiple lines and contain (ignored) ([abc]) - creates a group that contains a set for the letters 'a', 'b', and 'c'. This could be later accessed from the Match object as .group(1) (?P<name>[abc]) - creates a named group that contains a set for the letters 'a', 'b', and 'c'. This could later be accessed from the Match object as .group('name'). .groups() - method to show all of the groups on a Match object. re.MULTILINE or re.M - flag to make a pattern regard lines in your text as the beginning or end of a string. ^ - specifies, in a pattern, the beginning of the string. $ - specifies, in a pattern, the end of the string. re.compile(pattern, flags) - method to pre-compile and save a regular expression pattern, and any associated flags, for later use. .groupdict() - method to generate a dictionary from a Match object's groups. The keys will be the group names. The values will be the results of the patterns in the group. re.finditer() - method to generate an iterable from the non-overlapping matches of a regular expression. Very handy for for loops. .group() - method to access the content of a group. 0 or none is the entire match. 1 through how ever many groups you have will get that group. Or use a group's name to get it if you're using named groups.

python_regex.txt

Be the first to comment