Guardrails & Safety
Implement safety measures including input validation, output filtering, content moderation, and human-in-the-loop checkpoints.
Prerequisites
- 1Completed the Getting Started and Prompt Engineering guides
- 2Familiarity with at least one agent framework
- 3Understanding of agent tool use and the agent loop
What you will learn
- How to validate and sanitize agent inputs
- Output filtering and content moderation techniques
- Tool call validation to prevent dangerous actions
- Rate limiting and cost controls
- Human-in-the-loop checkpoint patterns
- PII detection and redaction strategies
Why Guardrails Matter
AI agents have more autonomy than chatbots. They call tools, access data, and make decisions. Without guardrails, an agent can:
- Execute harmful tool calls — Deleting data, sending unauthorized emails, or modifying production systems.
- Leak sensitive information — Exposing PII, API keys, or internal data in responses.
- Run up costs — An agent stuck in a loop can make thousands of API calls before anyone notices.
- Produce harmful content — Without moderation, agents can generate toxic, biased, or misleading output.
Guardrails are not optional. They are a core requirement for any agent that interacts with real users or real systems.
Input Validation
Validate all inputs before they reach the LLM. This prevents prompt injection and ensures data quality:
from pydantic import BaseModel, Field, field_validator
class UserQuery(BaseModel):
"""Validated user input for the agent."""
message: str = Field(..., min_length=1, max_length=10000)
user_id: str = Field(..., pattern=r"^[a-zA-Z0-9_-]+$")
@field_validator("message")
@classmethod
def check_message(cls, v: str) -> str:
# Block common prompt injection attempts
injection_patterns = [
"ignore previous instructions",
"ignore all instructions",
"system prompt",
"you are now",
"new instructions:",
]
lower = v.lower()
for pattern in injection_patterns:
if pattern in lower:
raise ValueError(f"Input contains blocked pattern: {pattern}")
return v
# Usage
try:
query = UserQuery(message=user_input, user_id=user_id)
result = agent.run(query.message)
except ValidationError as e:
return {"error": "Invalid input", "details": str(e)}
This is a basic first layer. For production systems, combine this with model-based classifiers that detect more sophisticated injection attempts.
Output Filtering
Always filter agent output before returning it to the user:
import re
class OutputFilter:
"""Filter agent output for safety."""
# Patterns that should never appear in output
BLOCKED_PATTERNS = [
r"sk-[a-zA-Z0-9]{20,}", # OpenAI API keys
r"sk-ant-[a-zA-Z0-9]{20,}", # Anthropic API keys
r"AKIA[0-9A-Z]{16}", # AWS access keys
r"-----BEGIN.*PRIVATE KEY-----", # Private keys
]
PII_PATTERNS = [
r"d{3}-d{2}-d{4}", # SSN
r"d{16}", # Credit card (basic)
r"[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,}", # Email
]
@classmethod
def filter(cls, text: str) -> str:
# Remove API keys and secrets
for pattern in cls.BLOCKED_PATTERNS:
text = re.sub(pattern, "[REDACTED]", text, flags=re.IGNORECASE)
# Redact PII
for pattern in cls.PII_PATTERNS:
text = re.sub(pattern, "[PII REDACTED]", text, flags=re.IGNORECASE)
return text
# Apply to every agent response
raw_output = agent.run(user_message)
safe_output = OutputFilter.filter(raw_output)
return safe_output
Tool Call Validation
Not all tool calls should be executed blindly. Implement a validation layer between the model's tool call decision and actual execution:
from enum import Enum
class RiskLevel(Enum):
LOW = "low" # Read-only operations
MEDIUM = "medium" # Reversible writes
HIGH = "high" # Irreversible or sensitive actions
# Define risk levels for each tool
TOOL_RISK_MAP = {
"search_database": RiskLevel.LOW,
"read_file": RiskLevel.LOW,
"send_email": RiskLevel.MEDIUM,
"update_record": RiskLevel.MEDIUM,
"delete_record": RiskLevel.HIGH,
"execute_code": RiskLevel.HIGH,
}
class ToolCallGuard:
def __init__(self, max_calls_per_minute: int = 30):
self.call_count = 0
self.max_calls = max_calls_per_minute
def validate(self, tool_name: str, args: dict) -> bool:
risk = TOOL_RISK_MAP.get(tool_name, RiskLevel.HIGH)
# Rate limit check
self.call_count += 1
if self.call_count > self.max_calls:
raise RateLimitError("Too many tool calls")
# High-risk tools require human approval
if risk == RiskLevel.HIGH:
approved = request_human_approval(
f"Agent wants to call {tool_name} with args: {args}"
)
if not approved:
return False
# Validate specific tool arguments
if tool_name == "execute_code":
if "import os" in args.get("code", ""):
raise SecurityError("Code contains blocked import")
return True
Rate Limiting and Cost Controls
Prevent runaway agents from draining your budget:
import time
from dataclasses import dataclass, field
@dataclass
class AgentBudget:
"""Track and limit agent resource usage."""
max_llm_calls: int = 20
max_tool_calls: int = 50
max_tokens: int = 100_000
max_execution_seconds: int = 300
# Tracking
llm_calls: int = field(default=0, init=False)
tool_calls: int = field(default=0, init=False)
tokens_used: int = field(default=0, init=False)
start_time: float = field(default_factory=time.time, init=False)
def check_llm_call(self, tokens: int) -> bool:
self.llm_calls += 1
self.tokens_used += tokens
if self.llm_calls > self.max_llm_calls:
raise BudgetExceededError("Max LLM calls exceeded")
if self.tokens_used > self.max_tokens:
raise BudgetExceededError("Token budget exceeded")
if time.time() - self.start_time > self.max_execution_seconds:
raise BudgetExceededError("Execution time exceeded")
return True
def check_tool_call(self) -> bool:
self.tool_calls += 1
if self.tool_calls > self.max_tool_calls:
raise BudgetExceededError("Max tool calls exceeded")
return True
# Usage
budget = AgentBudget(max_llm_calls=10, max_tokens=50_000)
# Pass budget to your agent loop and check before each call
Human-in-the-Loop Checkpoints
For high-stakes operations, add checkpoints where a human must approve before the agent continues:
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Literal
class AgentState(TypedDict):
task: str
plan: str
result: str
human_approved: bool
def plan_node(state: AgentState) -> dict:
# Agent creates a plan
plan = llm.invoke(f"Create a plan for: {state['task']}")
return {"plan": plan.content}
def human_review_node(state: AgentState) -> dict:
# This node pauses execution and waits for human input
# The human reviews the plan in the UI and approves or rejects
return {} # LangGraph interrupts here
def execute_node(state: AgentState) -> dict:
result = llm.invoke(f"Execute this plan: {state['plan']}")
return {"result": result.content}
def should_execute(state: AgentState) -> Literal["execute", "end"]:
if state.get("human_approved"):
return "execute"
return "end"
graph = StateGraph(AgentState)
graph.add_node("plan", plan_node)
graph.add_node("human_review", human_review_node)
graph.add_node("execute", execute_node)
graph.add_edge(START, "plan")
graph.add_edge("plan", "human_review")
graph.add_conditional_edges("human_review", should_execute)
graph.add_edge("execute", END)
# Compile with interrupt_before to pause at human_review
app = graph.compile(
checkpointer=MemorySaver(),
interrupt_before=["human_review"],
)
This pattern is essential for operations like sending emails, making payments, modifying databases, or any action that cannot easily be undone.
Common Mistakes to Avoid
- !Relying only on prompt-based guardrails — always add programmatic validation too
- !Not rate-limiting tool calls, allowing agents to run up costs or get stuck in loops
- !Applying guardrails to the final output but not to intermediate steps
- !Making guardrails so strict that the agent cannot accomplish its task
- !Not logging guardrail triggers — you need this data to improve your filters over time