# Text Processing Tools
Unix text processing tools are some of the most powerful utilities in a developer’s arsenal. Combined with regular expressions, they enable rapid data extraction, transformation, and analysis from the command line. This page covers the essential tools with practical, real-world examples.
## grep — Search for Patterns
grep (Global Regular Expression Print) searches files for lines matching a pattern.
### Basic Usage

```bash
# Search for a string in a file
grep "error" application.log

# Case-insensitive search
grep -i "error" application.log

# Search recursively in directories
grep -r "TODO" src/

# Show line numbers
grep -n "function" script.js

# Show count of matches
grep -c "ERROR" application.log

# Show filenames only
grep -l "import" src/*.py

# Invert match (lines NOT containing pattern)
grep -v "DEBUG" application.log

# Show context (lines before/after match)
grep -B 2 -A 5 "Exception" application.log   # 2 before, 5 after
grep -C 3 "Exception" application.log        # 3 before and after

# Extended regex (ERE) — enables +, ?, |, () without escaping
grep -E "error|warning|fatal" application.log

# Perl-compatible regex (PCRE) — enables lookahead, named groups, etc.
grep -P "(?<=level=)\w+" application.log

# Match whole words only
grep -w "error" application.log   # matches "error" but not "errors"

# Fixed strings (no regex interpretation — faster)
grep -F "192.168.1.1" access.log
```
```bash
# Quiet mode (exit code only — useful in scripts)
if grep -q "error" application.log; then
    echo "Errors found!"
fi
```

### Practical grep Examples
```bash
# Find all IP addresses in a log file
grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' access.log

# Find all TODO/FIXME comments in source code
grep -rnE "TODO|FIXME|HACK|XXX" src/

# Find function definitions in Python files
grep -rnE "def \w+" --include="*.py" src/

# Find files that do NOT contain a pattern
grep -rL "Copyright" src/   # files missing copyright header

# Count occurrences per file
grep -rc "import" src/*.py | sort -t: -k2 -rn

# Find lines matching multiple patterns (AND logic)
grep "error" application.log | grep "database" | grep -v "retry"
```
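The extraction recipes above can be tried without any real log files by piping sample text in. A minimal sketch, using a fabricated log line (the IP and request are invented for illustration):

```bash
# Feed grep a made-up access-log line on stdin and pull out just the IP.
# -o prints only the matched text; -E enables extended regex.
printf '203.0.113.7 - - [10/Oct/2024] "GET /index.html" 200\n' |
    grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'
# 203.0.113.7
```

The same trick (printf plus a pipe) works for testing any of the patterns in this section before pointing them at real data.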
```bash
# Extract email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' contacts.txt
```

## ripgrep (rg) — Modern grep Alternative
ripgrep is a faster, more user-friendly alternative to grep. It respects .gitignore, searches recursively by default, and uses Rust’s regex engine (guaranteed linear time).
```bash
# Basic search (recursive by default, respects .gitignore)
rg "TODO" src/

# Case-insensitive
rg -i "error" logs/

# Search specific file types
rg --type py "import os"
rg --type js "console\.log"
rg -t rust "unsafe"

# Exclude file types
rg -T test "database"

# Show only filenames
rg -l "deprecated" src/

# Replace text in output (preview only — does not modify files)
rg "old_function" --replace "new_function"

# Search with context
rg -C 3 "panic" src/

# Count matches
rg -c "TODO" src/

# Fixed strings (literal match)
rg -F "array[0]" src/

# Search hidden files and ignored files
rg --hidden --no-ignore "secret"

# JSON output (for scripting)
rg --json "pattern" src/

# Multiline matching
rg -U "fn.*\{[\s\S]*?return" src/

# Statistics
rg --stats "TODO" src/
```

### ripgrep vs grep
| Feature | grep | ripgrep |
|---|---|---|
| Speed | Moderate | Very fast (parallelized) |
| Recursive | Need -r flag | Default |
| .gitignore | Ignored | Respected by default |
| Unicode | Varies | Full support |
| Regex safety | Can hang on bad patterns | Guaranteed linear time |
| Output | Plain | Colored, formatted |
| File types | --include glob | --type system |
## sed — Stream Editor
sed processes text line by line, applying transformations. It is the standard tool for find-and-replace operations.
### Substitution
```bash
# Basic substitution (first occurrence per line)
sed 's/old/new/' file.txt

# Global substitution (all occurrences per line)
sed 's/old/new/g' file.txt

# Case-insensitive substitution (GNU sed)
sed 's/error/WARNING/gi' file.txt

# In-place editing (modifies the file)
sed -i 's/old/new/g' file.txt      # GNU sed
sed -i '' 's/old/new/g' file.txt   # macOS/BSD sed (needs explicit empty backup suffix)

# In-place with backup
sed -i.bak 's/old/new/g' file.txt  # creates file.txt.bak

# Using different delimiters (useful when the pattern contains /)
sed 's|/usr/local|/opt|g' config.txt
sed 's#http://#https://#g' urls.txt

# Substitute with capture groups
sed 's/\(\w\+\)=\(\w\+\)/\2=\1/g' pairs.txt   # BRE: swap key=value
sed -E 's/(\w+)=(\w+)/\2=\1/g' pairs.txt      # ERE (easier syntax)

# Substitute only on matching lines
sed '/ERROR/s/retry=true/retry=false/g' config.txt

# Substitute on specific line numbers
sed '5s/old/new/' file.txt       # line 5 only
sed '10,20s/old/new/g' file.txt  # lines 10-20
```

### Addressing
```bash
# Print specific lines
sed -n '5p' file.txt              # print line 5
sed -n '10,20p' file.txt          # print lines 10-20
sed -n '/START/,/END/p' file.txt  # print between patterns

# Delete lines
sed '5d' file.txt       # delete line 5
sed '/^$/d' file.txt    # delete empty lines
sed '/^#/d' config.txt  # delete comment lines
sed '1,5d' file.txt     # delete first 5 lines

# Insert and append
sed '3i\New line before line 3' file.txt  # insert before line 3
sed '3a\New line after line 3' file.txt   # append after line 3
sed '/pattern/a\Added after pattern' file.txt

# Multiple commands
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Using a sed script
sed -e '
  s/error/ERROR/g
  s/warning/WARNING/g
  /^$/d
  /^#/d
' input.txt
```

### Practical sed Examples
```bash
# Remove trailing whitespace
sed 's/[[:space:]]*$//' file.txt

# Add line numbers
sed = file.txt | sed 'N; s/\n/\t/'

# Extract text between two markers
sed -n '/BEGIN/,/END/p' file.txt

# Remove HTML tags
sed 's/<[^>]*>//g' page.html

# Convert Windows line endings to Unix
sed 's/\r$//' file.txt

# Double-space a file
sed 'G' file.txt

# Remove duplicate consecutive lines (like uniq)
sed '$!N; /^\(.*\)\n\1$/!P; D' file.txt

# Rename variables in source code (GNU sed: \b = word boundary)
sed -i -E 's/\boldName\b/newName/g' src/*.js
```
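As a quick sanity check, the capture-group swap from the Substitution section can be run on inline input. A sketch on made-up key=value pairs (no files needed; GNU sed assumed for `\w`):

```bash
# Swap key=value pairs to value=key using ERE capture groups.
# printf supplies the sample input on stdin.
printf 'host=alpha\nport=8080\n' | sed -E 's/(\w+)=(\w+)/\2=\1/'
# alpha=host
# 8080=port
```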
```bash
# Extract value from a config file
sed -n 's/^database_host=//p' config.ini
```

## awk — Pattern Scanning and Processing
awk is a full programming language for text processing. It excels at working with structured, column-oriented data.
### Basic Structure

```bash
awk 'pattern { action }' file
```

- pattern: when to apply the action (a regex, a condition, or BEGIN/END)
- action: what to do (print, calculate, transform)
- Fields: $0 = entire line, $1 = first field, $2 = second field, etc.
- NR = record (line) number, NF = number of fields

### Field Processing
```bash
# Print specific columns (default separator: whitespace)
awk '{print $1, $3}' data.txt

# Custom field separator
awk -F',' '{print $1, $2}' data.csv
awk -F':' '{print $1, $NF}' /etc/passwd  # $NF = last field

# Print with formatting
awk '{printf "%-20s %10s\n", $1, $2}' data.txt

# Print line numbers
awk '{print NR, $0}' file.txt

# Print number of fields per line
awk '{print NF, $0}' file.txt
```

### Pattern Matching
```bash
# Print lines matching a pattern
awk '/error/' application.log

# Print lines NOT matching a pattern
awk '!/debug/' application.log

# Print lines where a field matches
awk '$3 > 100' data.txt              # third field > 100
awk '$1 == "ERROR"' application.log  # first field is ERROR
awk '$NF ~ /\.py$/' files.txt        # last field ends in .py

# Range patterns
awk '/START/,/END/' file.txt  # print between patterns

# Multiple patterns
awk '/error/ {errors++} /warning/ {warnings++} END {print errors, warnings}' log.txt
```

### BEGIN and END Blocks
```bash
# BEGIN runs before processing, END runs after
awk 'BEGIN {print "Name", "Score"} {print $1, $2} END {print "---done---"}' data.txt

# Sum a column
awk '{sum += $3} END {print "Total:", sum}' sales.txt

# Calculate average
awk '{sum += $2; count++} END {print "Average:", sum/count}' scores.txt
```
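The average one-liner behaves as expected on a small inline dataset (names and scores invented for illustration):

```bash
# Average the second column of three sample rows piped in via printf.
printf 'alice 90\nbob 80\ncarol 70\n' |
    awk '{sum += $2; count++} END {print "Average:", sum/count}'
# Average: 80
```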
```bash
# Find maximum
awk 'BEGIN {max=0} $3 > max {max=$3; line=$0} END {print "Max:", max, line}' data.txt
```

### Practical awk Examples
```bash
# Process CSV with header
awk -F',' 'NR==1 {for(i=1;i<=NF;i++) header[i]=$i; next} {print header[1]"="$1, header[3]"="$3}' data.csv

# Log analysis: count HTTP status codes
awk '{count[$9]++} END {for (code in count) print code, count[code]}' access.log | sort -rn -k2

# Calculate disk usage by directory
du -s */ | awk '{total += $1; print $0} END {print "Total:", total}'

# Top 10 most frequent IP addresses
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# Process /etc/passwd for user info
awk -F':' '$3 >= 1000 {printf "%-15s UID:%-6s Home:%s\n", $1, $3, $6}' /etc/passwd

# Multi-file processing
awk 'FNR==1 {file++} {lines[file]++} END {for(f=1;f<=file;f++) print ARGV[f], lines[f], "lines"}' *.txt

# Transpose rows and columns (requires GNU awk: true multidimensional arrays)
awk '{for(i=1;i<=NF;i++) a[i][NR]=$i} END {for(i=1;i in a;i++) {for(j=1;j<=NR;j++) printf "%s ", a[i][j]; print ""}}' data.txt
```
```bash
# Group by and aggregate
awk -F',' '{sum[$1] += $3; count[$1]++} END {for (key in sum) printf "%s: total=%d avg=%.2f\n", key, sum[key], sum[key]/count[key]}' sales.csv
```

### awk Variables and Functions
```bash
# Built-in variables
awk '{
    print "Line:", NR            # record (line) number
    print "Fields:", NF          # number of fields
    print "Full line:", $0       # entire line
    print "File:", FILENAME     # current filename
    print "Separator:", FS       # field separator
    print "Record Sep:", RS      # record separator
}' data.txt

# String functions
awk '{
    print length($1)             # string length
    print toupper($1)            # uppercase
    print tolower($1)            # lowercase
    print substr($1, 1, 3)       # substring (pos, len)
    if (index($0, "error") > 0) print "Found error"
    gsub(/old/, "new", $0)       # global substitution
    split($3, arr, ".")          # split into array
}' data.txt
```
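Two of these string functions composed, on a single made-up record:

```bash
# Uppercase the first three characters of the first field.
echo "hello world" | awk '{print toupper(substr($1, 1, 3))}'
# HEL
```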
```bash
# Regular expression matching
awk '$0 ~ /pattern/ {print}' file.txt
awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}' file.txt
```

## Other Essential Tools
### tr — Translate Characters
```bash
# Convert to uppercase
echo "hello world" | tr 'a-z' 'A-Z'
# HELLO WORLD

# Convert to lowercase
echo "HELLO" | tr 'A-Z' 'a-z'

# Delete characters
echo "hello 123 world" | tr -d '0-9'
# hello  world

# Squeeze repeated characters
echo "hello    world" | tr -s ' '
# hello world

# Replace newlines with spaces
tr '\n' ' ' < file.txt

# Remove non-printable characters
tr -cd '[:print:]\n' < binary_file
```
```bash
# ROT13 encoding
echo "hello" | tr 'a-zA-Z' 'n-za-mN-ZA-M'
# uryyb
```

### cut — Extract Columns
```bash
# Cut by character position
echo "Hello World" | cut -c1-5
# Hello

# Cut by field (default delimiter: tab)
cut -f1,3 data.tsv

# Cut with a custom delimiter
cut -d',' -f1,3 data.csv

# Cut from field 3 to the end
cut -d':' -f3- /etc/passwd
```
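cut needs no files either; a colon-delimited record piped in behaves like an /etc/passwd line (the user here is fictional):

```bash
# Pull the login name and home directory out of a passwd-style record.
# The output keeps the input delimiter between the selected fields.
echo "ada:x:1001:1001:Ada:/home/ada:/bin/bash" | cut -d':' -f1,6
# ada:/home/ada
```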
```bash
# Combine with other tools
grep "ERROR" log.txt | cut -d' ' -f1,2  # extract date and time from error lines
```

### sort and uniq — Sort and Deduplicate
```bash
# Basic sort
sort file.txt

# Numeric sort
sort -n numbers.txt

# Reverse sort
sort -r file.txt

# Sort by specific column (field 2, numeric)
sort -t',' -k2 -n data.csv

# Sort by multiple keys
sort -t',' -k1,1 -k2,2n data.csv  # alphabetic by col1, then numeric by col2

# Remove duplicates
sort -u file.txt

# Human-readable numeric sort (1K, 2M, 3G)
du -sh */ | sort -h

# Count unique occurrences (uniq requires sorted input)
sort file.txt | uniq -c | sort -rn

# Show only duplicates
sort file.txt | uniq -d

# Show only unique lines (no duplicates)
sort file.txt | uniq -u
```
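A classic composition of these tools is a word-frequency counter: tr normalizes case and splits the text into one word per line, then sort and uniq -c do the counting (the sentence is arbitrary sample input):

```bash
# Count word frequency: lowercase, split on non-letters, sort, count, rank.
echo "the cat and the dog and the bird" |
    tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn
#       3 the
#       2 and
#       ...
```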
```bash
# Count unique values in a CSV column
cut -d',' -f3 data.csv | sort | uniq -c | sort -rn
```

## Practical Pipelines
### Log Analysis
```bash
# Top 10 error messages
grep "ERROR" app.log | awk '{$1=$2=$3=""; print}' | sort | uniq -c | sort -rn | head -10

# Requests per hour
awk '{print substr($4, 2, 14)}' access.log | cut -d: -f1,2 | uniq -c

# Slow requests (response time > 1000ms)
awk '$NF > 1000 {print $7, $NF"ms"}' access.log | sort -t' ' -k2 -rn | head -20
```
```bash
# Error rate per minute
awk '{minute=substr($4,2,17)} $9 >= 500 {errors[minute]++} {total[minute]++} END {for (m in total) printf "%s: %d/%d (%.1f%%)\n", m, errors[m]+0, total[m], (errors[m]+0)/total[m]*100}' access.log
```

### CSV Processing
```bash
# Skip the header and extract columns
tail -n +2 data.csv | cut -d',' -f1,3,5

# Filter rows where column 3 > 100 (keep the header)
awk -F',' 'NR==1 || $3 > 100' data.csv

# Convert CSV to TSV
sed 's/,/\t/g' data.csv > data.tsv

# Sum a column (skip the header)
tail -n +2 sales.csv | awk -F',' '{sum += $4} END {printf "Total: $%.2f\n", sum}'

# Join two CSV files on a common column (join requires input sorted on the key)
join -t',' -1 1 -2 1 <(sort -t',' -k1,1 file1.csv) <(sort -t',' -k1,1 file2.csv)
```
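The header-skipping sum can be verified end to end on a tiny inline CSV (the items and figures are invented):

```bash
# Sum the amount column of a two-row CSV, skipping the header line.
printf 'item,amount\nwidget,19.50\ngadget,30.25\n' |
    tail -n +2 | awk -F',' '{sum += $2} END {printf "Total: $%.2f\n", sum}'
# Total: $49.75
```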
```bash
# Pivot: count occurrences of each value in column 2
tail -n +2 data.csv | awk -F',' '{count[$2]++} END {for(k in count) print k, count[k]}' | sort -k2 -rn
```

### Data Extraction
```bash
# Extract all URLs from HTML
grep -oE 'https?://[^ "]+' page.html

# Extract all email addresses from text
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' document.txt

# Extract JSON values (simple cases only — use jq for real JSON)
grep -oP '"name":\s*"\K[^"]+' data.json

# Extract variable names from a .env file
grep -v '^#' .env | grep -v '^$' | cut -d'=' -f1

# Find largest files
find . -type f -exec du -h {} + | sort -rh | head -20
```
```bash
# Count lines of code by file type
find . -name "*.py" | xargs wc -l | sort -rn
find . -name "*.js" -not -path "*/node_modules/*" | xargs wc -l | sort -rn
```

## Tool Selection Guide
| Task | Best Tool | Alternative |
|---|---|---|
| Search for pattern | grep / rg | awk '/pattern/' |
| Find and replace | sed 's/old/new/g' | awk gsub() |
| Extract columns | cut or awk | sed with capture groups |
| Count occurrences | grep -c | sort \| uniq -c |
| Transform characters | tr | sed 'y/abc/xyz/' |
| Complex field processing | awk | Python/Perl one-liner |
| Sort data | sort | awk (for custom logic) |
| Remove duplicates | sort -u | sort \| uniq |
| JSON processing | jq | python -m json.tool |
| XML processing | xmlstarlet | python xml.etree |