advanced20 min readGuide 9 of 12Updated Jun 15, 2025

Guardrails & Safety

Implement safety measures including input validation, output filtering, content moderation, and human-in-the-loop checkpoints.

Prerequisites

  • 1Completed the Getting Started and Prompt Engineering guides
  • 2Familiarity with at least one agent framework
  • 3Understanding of agent tool use and the agent loop

What you will learn

  • How to validate and sanitize agent inputs
  • Output filtering and content moderation techniques
  • Tool call validation to prevent dangerous actions
  • Rate limiting and cost controls
  • Human-in-the-loop checkpoint patterns
  • PII detection and redaction strategies

Why Guardrails Matter

AI agents have more autonomy than chatbots. They call tools, access data, and make decisions. Without guardrails, an agent can:

  • Execute harmful tool calls — Deleting data, sending unauthorized emails, or modifying production systems.
  • Leak sensitive information — Exposing PII, API keys, or internal data in responses.
  • Run up costs — An agent stuck in a loop can make thousands of API calls before anyone notices.
  • Produce harmful content — Without moderation, agents can generate toxic, biased, or misleading output.

Guardrails are not optional. They are a core requirement for any agent that interacts with real users or real systems.

Input Validation

Validate all inputs before they reach the LLM. This prevents prompt injection and ensures data quality:

from pydantic import BaseModel, Field, field_validator

class UserQuery(BaseModel):
    """Validated user input for the agent."""
    message: str = Field(..., min_length=1, max_length=10000)
    user_id: str = Field(..., pattern=r"^[a-zA-Z0-9_-]+$")

    @field_validator("message")
    @classmethod
    def check_message(cls, v: str) -> str:
        # Block common prompt injection attempts
        injection_patterns = [
            "ignore previous instructions",
            "ignore all instructions",
            "system prompt",
            "you are now",
            "new instructions:",
        ]
        lower = v.lower()
        for pattern in injection_patterns:
            if pattern in lower:
                raise ValueError(f"Input contains blocked pattern: {pattern}")
        return v

# Usage
try:
    query = UserQuery(message=user_input, user_id=user_id)
    result = agent.run(query.message)
except ValidationError as e:
    return {"error": "Invalid input", "details": str(e)}

This is a basic first layer. For production systems, combine this with model-based classifiers that detect more sophisticated injection attempts.

Output Filtering

Always filter agent output before returning it to the user:

import re

class OutputFilter:
    """Filter agent output for safety."""

    # Patterns that should never appear in output
    BLOCKED_PATTERNS = [
        r"sk-[a-zA-Z0-9]{20,}",          # OpenAI API keys
        r"sk-ant-[a-zA-Z0-9]{20,}",       # Anthropic API keys
        r"AKIA[0-9A-Z]{16}",              # AWS access keys
        r"-----BEGIN.*PRIVATE KEY-----",   # Private keys
    ]

    PII_PATTERNS = [
        r"d{3}-d{2}-d{4}",         # SSN
        r"d{16}",                     # Credit card (basic)
        r"[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,}",  # Email
    ]

    @classmethod
    def filter(cls, text: str) -> str:
        # Remove API keys and secrets
        for pattern in cls.BLOCKED_PATTERNS:
            text = re.sub(pattern, "[REDACTED]", text, flags=re.IGNORECASE)

        # Redact PII
        for pattern in cls.PII_PATTERNS:
            text = re.sub(pattern, "[PII REDACTED]", text, flags=re.IGNORECASE)

        return text

# Apply to every agent response
raw_output = agent.run(user_message)
safe_output = OutputFilter.filter(raw_output)
return safe_output

Tool Call Validation

Not all tool calls should be executed blindly. Implement a validation layer between the model's tool call decision and actual execution:

from enum import Enum

class RiskLevel(Enum):
    LOW = "low"       # Read-only operations
    MEDIUM = "medium" # Reversible writes
    HIGH = "high"     # Irreversible or sensitive actions

# Define risk levels for each tool
TOOL_RISK_MAP = {
    "search_database": RiskLevel.LOW,
    "read_file": RiskLevel.LOW,
    "send_email": RiskLevel.MEDIUM,
    "update_record": RiskLevel.MEDIUM,
    "delete_record": RiskLevel.HIGH,
    "execute_code": RiskLevel.HIGH,
}

class ToolCallGuard:
    def __init__(self, max_calls_per_minute: int = 30):
        self.call_count = 0
        self.max_calls = max_calls_per_minute

    def validate(self, tool_name: str, args: dict) -> bool:
        risk = TOOL_RISK_MAP.get(tool_name, RiskLevel.HIGH)

        # Rate limit check
        self.call_count += 1
        if self.call_count > self.max_calls:
            raise RateLimitError("Too many tool calls")

        # High-risk tools require human approval
        if risk == RiskLevel.HIGH:
            approved = request_human_approval(
                f"Agent wants to call {tool_name} with args: {args}"
            )
            if not approved:
                return False

        # Validate specific tool arguments
        if tool_name == "execute_code":
            if "import os" in args.get("code", ""):
                raise SecurityError("Code contains blocked import")

        return True

Rate Limiting and Cost Controls

Prevent runaway agents from draining your budget:

import time
from dataclasses import dataclass, field

@dataclass
class AgentBudget:
    """Track and limit agent resource usage."""
    max_llm_calls: int = 20
    max_tool_calls: int = 50
    max_tokens: int = 100_000
    max_execution_seconds: int = 300

    # Tracking
    llm_calls: int = field(default=0, init=False)
    tool_calls: int = field(default=0, init=False)
    tokens_used: int = field(default=0, init=False)
    start_time: float = field(default_factory=time.time, init=False)

    def check_llm_call(self, tokens: int) -> bool:
        self.llm_calls += 1
        self.tokens_used += tokens

        if self.llm_calls > self.max_llm_calls:
            raise BudgetExceededError("Max LLM calls exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceededError("Token budget exceeded")
        if time.time() - self.start_time > self.max_execution_seconds:
            raise BudgetExceededError("Execution time exceeded")

        return True

    def check_tool_call(self) -> bool:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceededError("Max tool calls exceeded")
        return True

# Usage
budget = AgentBudget(max_llm_calls=10, max_tokens=50_000)
# Pass budget to your agent loop and check before each call

Human-in-the-Loop Checkpoints

For high-stakes operations, add checkpoints where a human must approve before the agent continues:

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Literal

class AgentState(TypedDict):
    task: str
    plan: str
    result: str
    human_approved: bool

def plan_node(state: AgentState) -> dict:
    # Agent creates a plan
    plan = llm.invoke(f"Create a plan for: {state['task']}")
    return {"plan": plan.content}

def human_review_node(state: AgentState) -> dict:
    # This node pauses execution and waits for human input
    # The human reviews the plan in the UI and approves or rejects
    return {}  # LangGraph interrupts here

def execute_node(state: AgentState) -> dict:
    result = llm.invoke(f"Execute this plan: {state['plan']}")
    return {"result": result.content}

def should_execute(state: AgentState) -> Literal["execute", "end"]:
    if state.get("human_approved"):
        return "execute"
    return "end"

graph = StateGraph(AgentState)
graph.add_node("plan", plan_node)
graph.add_node("human_review", human_review_node)
graph.add_node("execute", execute_node)

graph.add_edge(START, "plan")
graph.add_edge("plan", "human_review")
graph.add_conditional_edges("human_review", should_execute)
graph.add_edge("execute", END)

# Compile with interrupt_before to pause at human_review
app = graph.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["human_review"],
)

This pattern is essential for operations like sending emails, making payments, modifying databases, or any action that cannot easily be undone.

Common Mistakes to Avoid

  • !Relying only on prompt-based guardrails — always add programmatic validation too
  • !Not rate-limiting tool calls, allowing agents to run up costs or get stuck in loops
  • !Applying guardrails to the final output but not to intermediate steps
  • !Making guardrails so strict that the agent cannot accomplish its task
  • !Not logging guardrail triggers — you need this data to improve your filters over time

Explore Related Content