
Regex & Text Processing

Regular expressions (regex or regexp) are patterns used to match character combinations in strings. They are one of the most powerful tools in a programmer’s toolkit — used for validation, search-and-replace, data extraction, log analysis, and much more. This section covers regex fundamentals through advanced patterns, plus essential Unix text processing tools.


What Are Regular Expressions?

A regular expression is a sequence of characters that defines a search pattern. At its simplest, a regex is just a literal string that matches itself. At its most complex, it can describe intricate patterns with repetition, alternation, grouping, and context-sensitive matching.

Pattern: /error\s+\d{3}/i

Matches:
  "Error 404 not found"   ✓  (error + space + 3 digits)
  "error 500 internal"    ✓  (case insensitive)
  "ERROR   503 gateway"   ✓  (multiple spaces)
  "no errors here"        ✗  ("error" is followed by "s", not whitespace and digits)
  "error code 42"         ✗  ("code" follows "error", and there are only 2 digits)

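The matches listed above can be checked with a short Python sketch. It assumes Python's built-in re module, where the i flag is written as re.IGNORECASE; the name ERROR_CODE is illustrative, and the sample list repeats the strings from the example.

```python
import re

# /error\s+\d{3}/i written for Python: the i flag becomes re.IGNORECASE.
ERROR_CODE = re.compile(r"error\s+\d{3}", re.IGNORECASE)

samples = [
    "Error 404 not found",
    "error 500 internal",
    "ERROR   503 gateway",
    "no errors here",
    "error code 42",
]

# search() scans the whole string, so the pattern may match anywhere in it.
results = {s: bool(ERROR_CODE.search(s)) for s in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {'match' if matched else 'no match'}")
```

The first three strings match and the last two do not, in line with the ✓ and ✗ marks above.
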
The Building Blocks

┌──────────────────────────────────────────────────────┐
│              Regular Expression Anatomy              │
├──────────────────────────────────────────────────────┤
│                                                      │
│  /^https?:\/\/[\w.-]+\.\w{2,}(\/\S*)?$/i             │
│  ││└────────── Pattern body ────────┘│ │             │
│  │└── Start anchor                   │ │             │
│  └─── Delimiter (in some languages)  │ │             │
│  End anchor ─────────────────────────┘ │             │
│  Flags ────────────────────────────────┘             │
│                                                      │
└──────────────────────────────────────────────────────┘
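
In Python the same pattern is written without delimiters, and the trailing flag becomes an argument. The following is a minimal sketch, assuming the built-in re module; the name URL_PATTERN and the sample URLs are illustrative only, and the pattern is a rough shape check, not a full URL validator.

```python
import re

# Python has no /.../ delimiters, so the slashes need no escaping;
# the i flag is passed as re.IGNORECASE.
URL_PATTERN = re.compile(r"^https?://[\w.-]+\.\w{2,}(/\S*)?$", re.IGNORECASE)

candidates = [
    "https://example.com",
    "HTTP://docs.example.org/guide/regex",
    "ftp://example.com",   # wrong scheme
    "https://example",     # no dot followed by a top-level name
]

checked = {url: URL_PATTERN.match(url) is not None for url in candidates}

for url, ok in checked.items():
    print(f"{url}: {'accepted' if ok else 'rejected'}")
```

The first two candidates are accepted and the last two are rejected.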

When to Use Regex (and When Not To)

Good Use Cases

| Use Case | Example Pattern | Notes |
|---|---|---|
| Input validation | ^\d{3}-\d{2}-\d{4}$ | SSN format check |
| Search and replace | s/colour/color/g | Text normalization |
| Log parsing | \[(\d{4}-\d{2}-\d{2})\]\s+(\w+) | Extract date and level |
| Data extraction | (\d+\.\d+\.\d+\.\d+) | IP addresses from text |
| URL routing | ^/users/(\d+)/posts$ | Web framework routing |
| Syntax highlighting | Token patterns for keywords, strings, comments | Text editors |
| CSV field splitting | ,(?=(?:[^"]*"[^"]*")*[^"]*$) | Comma not inside quotes |
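
Two of the rows above, log parsing and CSV field splitting, can be tried out in Python. This is a sketch under assumptions: the log lines, the CSV record, and the names LOG_LINE, CSV_COMMA, and parse_line are made up for illustration.

```python
import re

# Log parsing row: a bracketed ISO date, whitespace, then a level word.
LOG_LINE = re.compile(r"\[(\d{4}-\d{2}-\d{2})\]\s+(\w+)")

# CSV row: a comma followed by an even number of double quotes,
# which means the comma is not inside a quoted field.
CSV_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def parse_line(line):
    """Return (date, level) for a matching log line, or None otherwise."""
    m = LOG_LINE.search(line)
    return (m.group(1), m.group(2)) if m else None

log = [
    "[2024-01-15] ERROR disk quota exceeded",
    "[2024-01-15] INFO  backup finished",
    "malformed line without a date",
]

parsed = [parse_line(line) for line in log]
fields = CSV_COMMA.split('a,"b,c",d')

print(parsed)  # [('2024-01-15', 'ERROR'), ('2024-01-15', 'INFO'), None]
print(fields)  # ['a', '"b,c"', 'd']
```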

Bad Use Cases


Regex Engines: NFA vs DFA

Understanding regex engine types helps you predict performance and behavior.

NFA (Nondeterministic Finite Automaton)

Most programming languages (Python, JavaScript, Java, .NET, Ruby, Perl) use NFA engines:

  • Backtracking — tries one path, backtracks on failure
  • Supports all features: backreferences, lookahead, lookbehind, lazy quantifiers
  • Performance — can be exponential in pathological cases (catastrophic backtracking)
  • Left-to-right, leftmost match — returns the first match found
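
The exponential case can be observed directly. The sketch below times a nested-quantifier pattern against inputs that almost match. The pattern (a+)+$ and the input sizes are illustrative choices, and timings depend on the machine and Python version, so no specific numbers are shown.

```python
import re
import time

# Nested quantifiers: a run of 'a's can be divided between the inner and
# outer '+' in many ways, and a backtracking engine tries them before failing.
NESTED = re.compile(r"(a+)+$")

def time_failed_match(n):
    """Time a match attempt on n 'a's followed by a 'b', which cannot match."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = NESTED.match(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed

for n in (10, 14, 18):
    result, elapsed = time_failed_match(n)
    print(f"n={n:2d}  matched={result is not None}  {elapsed * 1000:.2f} ms")
```

Every attempt fails, but the time to fail typically grows sharply with each extra character. Rewriting the pattern as a+$ accepts the same strings without the ambiguity.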

DFA (Deterministic Finite Automaton)

Some tools (grep, awk, lex) use DFA or hybrid engines:

  • No backtracking — examines each character exactly once
  • Performance — guaranteed linear time O(n)
  • Limited features — no backreferences, no lookahead/lookbehind
  • Leftmost longest match — returns the longest possible match

NFA Matching of "ab|ac" against "ac":

  Step 1: Try "ab"
    a → matches 'a' ✓
    b → does not match 'c' ✗
    BACKTRACK to position 0

  Step 2: Try "ac"
    a → matches 'a' ✓
    c → matches 'c' ✓
    MATCH at position 0: "ac"

DFA Matching (parallel simulation):

  Step 1: Read 'a'
    Could be start of "ab" or "ac" → track both simultaneously

  Step 2: Read 'c'
    "ab" path fails, "ac" path succeeds
    MATCH: "ac"

  No backtracking needed.
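
The two traces can be mirrored in Python. The function below is not a real DFA. It is a simplified stand-in that handles only literal alternatives, keeps every still-viable alternative alive while reading the input once, and reports the longest one that completes. The name longest_alternative is hypothetical. For contrast, Python's own backtracking engine is run on the same inputs.

```python
import re

def longest_alternative(alternatives, text):
    """Match literal alternatives against the start of text in a single pass.

    Returns the longest alternative that is a prefix of text, or None.
    """
    alive = set(alternatives)  # alternatives consistent with the input so far
    longest = None
    for i, ch in enumerate(text):
        # Keep only the alternatives whose next character agrees with the input.
        alive = {alt for alt in alive if i < len(alt) and alt[i] == ch}
        if not alive:
            break
        for alt in alive:
            if len(alt) == i + 1:  # this alternative has been fully consumed
                longest = alt
    return longest

# Same input as the traces: both strategies agree.
print(longest_alternative(["ab", "ac"], "ac"))  # ac
print(re.match(r"ab|ac", "ac").group())         # ac

# When one alternative is a prefix of another, the strategies differ.
print(longest_alternative(["a", "ab"], "ab"))   # ab (leftmost longest)
print(re.match(r"a|ab", "ab").group())          # a  (first alternative that succeeds)
```

With a backtracking engine, the order of alternatives therefore matters: writing ab|a instead of a|ab makes Python return "ab" as well.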

Engine Comparison

| Feature | NFA | DFA |
|---|---|---|
| Backtracking | Yes | No |
| Backreferences | Yes | No |
| Lookahead/Lookbehind | Yes | No |
| Lazy quantifiers | Yes | No |
| Worst-case time | Exponential | Linear |
| Used by | Python, JS, Java, .NET | grep (traditional), awk, RE2 |
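
The NFA-only features in the table can be tried in Python, whose re module is a backtracking engine. The sketch below shows a backreference, a lookahead, and a lazy quantifier; the sample strings are made up for illustration.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured (a doubled word).
doubled = re.search(r"\b(\w+)\s+\1\b", "this is is a test")

# Lookahead: match digits only when "px" follows, without consuming "px".
pixels = re.findall(r"\d+(?=px)", "width: 100px; height: 50em; margin: 8px")

# Greedy vs lazy quantifier on the same input.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", html)   # one match spanning the whole string
lazy = re.findall(r"<.+?>", html)    # each tag separately

print(doubled.group(1))  # is
print(pixels)            # ['100', '8']
print(greedy)            # ['<b>bold</b> and <i>italic</i>']
print(lazy)              # ['<b>', '</b>', '<i>', '</i>']
```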

Regex in Different Languages

import re

text = "Contact us at support@example.com or sales@example.com"

# Search for first match
match = re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
if match:
    print(match.group())  # support@example.com
    print(match.start())  # 14
    print(match.end())    # 33

# Find all matches
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', text)
# ['support@example.com', 'sales@example.com']

# Find all with groups
pattern = r'([\w.+-]+)@([\w-]+\.[\w.]+)'
for match in re.finditer(pattern, text):
    print(f"User: {match.group(1)}, Domain: {match.group(2)}")

# Substitution
cleaned = re.sub(r'[\w.+-]+@[\w-]+\.[\w.]+', '[REDACTED]', text)
# "Contact us at [REDACTED] or [REDACTED]"

# Compile for reuse (more efficient in loops)
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+', re.IGNORECASE)
results = email_pattern.findall(text)

# Split
parts = re.split(r'[,;]\s*', "a, b; c, d")
# ['a', 'b', 'c', 'd']

Quick Reference Cheat Sheet

Anchors:       ^      start of line             $      end of line
               \b     word boundary             \B     non-word boundary

Character Classes:
               .      any char (except newline)
               \d     digit [0-9]               \D     non-digit
               \w     word char [a-zA-Z0-9_]    \W     non-word
               \s     whitespace                \S     non-whitespace

Quantifiers:
               *      0 or more                 +      1 or more
               ?      0 or 1                    {n}    exactly n
               {n,}   n or more                 {n,m}  n to m

Groups:        (abc)          capture group
               (?:abc)        non-capturing group
               (?<name>abc)   named group (Python: (?P<name>abc))

Alternation:   a|b            a or b

Lookaround:    (?=...)        positive lookahead
               (?!...)        negative lookahead
               (?<=...)       positive lookbehind
               (?<!...)       negative lookbehind

Flags:         g      global                    i      case-insensitive
               m      multiline                 s      dotAll (. matches \n)
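
Several cheat-sheet entries are spelled differently in Python, so the sketch below shows one way they could map onto the re module. The sample text is made up. Python has no g flag: findall, finditer, and sub already process every match.

```python
import re

text = "first line\nERROR: disk full\nsecond line"

# m and i flags: re.MULTILINE lets ^ and $ match at line breaks.
error_lines = re.findall(r"^error:.*$", text, re.MULTILINE | re.IGNORECASE)

# s flag: with re.DOTALL the dot also matches newlines.
across_lines = re.search(r"first.*second", text, re.DOTALL) is not None
within_line = re.search(r"first.*second", text) is not None

# Named group: Python's spelling is (?P<name>...).
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-03")
year, month = m.group("year"), m.group("month")

# Lookbehind: digits preceded by "$", without including the "$".
prices = re.findall(r"(?<=\$)\d+", "cost: $25, tax: $3, qty: 7")

print(error_lines)                # ['ERROR: disk full']
print(across_lines, within_line)  # True False
print(year, month)                # 2024 03
print(prices)                     # ['25', '3']
```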

What You Will Learn