Observability & Monitoring
Monitor agent behavior, trace execution, log tool calls, and set up alerting for production agent systems.
Prerequisites
1. Built and deployed at least one agent
2. Familiarity with logging and monitoring concepts
3. Understanding of the agent loop and tool use
What you will learn
- Why traditional monitoring is not enough for agents
- How to trace end-to-end agent execution
- Key metrics to track for agent systems
- Setting up LangSmith or Langfuse for observability
- Alerting strategies for agent failures
Why Agent Observability Is Different
Traditional application monitoring tracks request/response latency, error rates, and resource usage. Agent systems need all of that plus:
- Reasoning traces — What did the model think at each step? Why did it choose that tool?
- Tool call chains — Which tools were called, in what order, with what arguments, and what did they return?
- Token usage per step — A single agent run might involve 5-10 LLM calls. You need per-call visibility.
- Branching and loops — Multi-agent systems have non-linear execution paths. You need graph-aware tracing.
- Quality metrics — Was the output correct? Did the user find it helpful? Traditional uptime metrics do not capture this.
Without observability, debugging a misbehaving agent is like debugging a program without logs — you are guessing.
Structured Logging
Start with structured, JSON-formatted logging for every agent event:
```python
import logging
import json
from datetime import datetime, timezone

class AgentLogger:
    def __init__(self, agent_name: str):
        self.logger = logging.getLogger(f"agent.{agent_name}")
        self.agent_name = agent_name

    def log_event(self, event_type: str, data: dict):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_name,
            "event": event_type,
            **data,
        }
        self.logger.info(json.dumps(event))

    def log_llm_call(self, model: str, tokens_in: int, tokens_out: int, latency_ms: float):
        self.log_event("llm_call", {
            "model": model,
            "tokens_input": tokens_in,
            "tokens_output": tokens_out,
            "latency_ms": latency_ms,
        })

    def log_tool_call(self, tool: str, args: dict, result: str, latency_ms: float):
        self.log_event("tool_call", {
            "tool": tool,
            "arguments": args,
            "result_length": len(result),
            "latency_ms": latency_ms,
        })

    def log_error(self, error: str, step: str):
        self.log_event("error", {
            "error": error,
            "step": step,
        })

# Usage
logger = AgentLogger("ResearchAgent")
logger.log_tool_call("search_web", {"query": "MCP protocol"}, "Found 10 results", 230.5)
```
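Because each event is a single JSON line, the logs can be aggregated with ordinary tooling. A minimal sketch of consuming them (the `summarize_tool_calls` helper and the sample log lines are illustrative, not part of `AgentLogger`):

```python
import json

def summarize_tool_calls(log_lines: list[str]) -> dict:
    """Aggregate per-tool call counts and latency from JSON-lines agent logs."""
    stats: dict[str, dict] = {}
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "tool_call":
            continue  # skip llm_call, error, and other event types
        s = stats.setdefault(event["tool"], {"calls": 0, "total_latency_ms": 0.0})
        s["calls"] += 1
        s["total_latency_ms"] += event["latency_ms"]
    return stats

# Two sample events in the shape AgentLogger emits
lines = [
    '{"timestamp": "2025-01-01T00:00:00+00:00", "agent": "ResearchAgent", "event": "tool_call", "tool": "search_web", "result_length": 16, "latency_ms": 230.5}',
    '{"timestamp": "2025-01-01T00:00:05+00:00", "agent": "ResearchAgent", "event": "tool_call", "tool": "search_web", "result_length": 12, "latency_ms": 180.0}',
]
print(summarize_tool_calls(lines))
```

In production you would run this kind of aggregation in your log pipeline (e.g. a scheduled query) rather than in the agent process itself.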
Tracing with LangSmith
LangSmith provides production-grade tracing for LangGraph and LangChain applications. Set it up in two steps:
```shell
# 1. Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls-..."
export LANGCHAIN_PROJECT="my-agent-project"

# 2. That is it. LangChain/LangGraph will automatically trace all calls.
```
For non-LangChain frameworks, use the LangSmith SDK directly:
```python
from langsmith import traceable
from langsmith.run_helpers import trace

@traceable(name="agent_step")
def research_step(query: str) -> str:
    """This function is automatically traced in LangSmith."""
    result = llm.invoke(query)
    return result.content

# Or use the context manager for more control
with trace("full_agent_run", project_name="my-agent") as rt:
    rt.add_metadata({"user_id": "user-123", "task_type": "research"})
    research = research_step("AI frameworks 2025")
    # All nested calls are captured in the trace
```
LangSmith gives you a visual trace tree showing every LLM call, tool invocation, and state update with full input/output data.
Tracing with Langfuse
Langfuse is an open-source alternative to LangSmith. It can be self-hosted or used as a managed service:
```python
from langfuse import observe, get_client

# Configure via environment variables:
# LANGFUSE_PUBLIC_KEY="pk-..."
# LANGFUSE_SECRET_KEY="sk-..."
# LANGFUSE_HOST="https://cloud.langfuse.com"

@observe()
def run_agent(user_message: str) -> str:
    # Update the current trace with metadata
    langfuse = get_client()
    langfuse.update_current_trace(
        user_id="user-123",
        metadata={"source": "api"},
    )
    # Each nested @observe() creates a span in the trace
    research = do_research(user_message)
    response = generate_response(research)
    return response

@observe()
def do_research(query: str) -> str:
    # This creates a child span
    result = llm.invoke(query)
    langfuse = get_client()
    langfuse.update_current_observation(
        metadata={"tokens": result.usage.total_tokens}
    )
    return result.content
```
Key Metrics to Track
Set up dashboards and alerts for these essential agent metrics:
| Metric | Description | Alert Threshold |
|---|---|---|
| Task completion rate | Percentage of tasks completed successfully | < 90% |
| Avg. LLM calls per task | How many model calls per task | > 15 (possible loop) |
| Avg. latency (end-to-end) | Total time from input to output | > 30s |
| Token usage per task | Total tokens consumed per task | > 50k (cost concern) |
| Tool error rate | Percentage of tool calls that fail | > 5% |
| Guardrail trigger rate | How often safety filters activate | > 10% (suspicious) |
| Cost per task | Dollar cost of each agent execution | > $0.50 per task |
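Most of these metrics can be computed directly from the structured events logged earlier. A hedged sketch of the aggregation (it assumes the event shapes from the `AgentLogger` example, and that tool failures are logged as `error` events with `step` set to `"tool_call"`; task counts would come from your own task tracking):

```python
def compute_metrics(events: list[dict], num_tasks: int, completed: int) -> dict:
    """Derive dashboard metrics from a batch of logged agent events."""
    llm_calls = [e for e in events if e["event"] == "llm_call"]
    tool_calls = [e for e in events if e["event"] == "tool_call"]
    # Assumption: tool failures are logged via log_error(step="tool_call")
    tool_errors = [e for e in events if e["event"] == "error" and e.get("step") == "tool_call"]
    return {
        "completion_rate": completed / num_tasks if num_tasks else 0.0,
        "llm_calls_per_task": len(llm_calls) / num_tasks if num_tasks else 0.0,
        "tool_error_rate": len(tool_errors) / len(tool_calls) if tool_calls else 0.0,
        "tokens_per_task": (
            sum(e["tokens_input"] + e["tokens_output"] for e in llm_calls) / num_tasks
            if num_tasks else 0.0
        ),
    }
```

The resulting dictionary maps directly onto the metric names in the table, so it can feed the alerting rules in the next section.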
Alerting Strategy
Set up tiered alerts for different severity levels:
```python
# Example: Alerting with a simple monitor
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str  # "info", "warning", "critical"

ALERT_RULES = [
    AlertRule("High Error Rate", "tool_error_rate", 0.05, "warning"),
    AlertRule("Agent Loop Detected", "llm_calls_per_task", 15, "critical"),
    AlertRule("High Cost", "cost_per_task_usd", 0.50, "warning"),
    AlertRule("Slow Response", "latency_p95_seconds", 30, "info"),
    AlertRule("Low Completion", "completion_rate", 0.90, "critical"),
]

def check_alerts(metrics: dict) -> list[AlertRule]:
    triggered = []
    for rule in ALERT_RULES:
        value = metrics.get(rule.metric, 0)
        # completion_rate alerts when it falls BELOW its threshold;
        # every other metric alerts when it rises above
        if rule.metric == "completion_rate":
            if value < rule.threshold:
                triggered.append(rule)
        else:
            if value > rule.threshold:
                triggered.append(rule)
    return triggered
```
Integrate these alerts with your existing monitoring stack — Datadog, PagerDuty, Slack, or whatever your team uses. The key is not to ignore agent-specific metrics just because your traditional infrastructure metrics look healthy.
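For the Slack route, a triggered rule can be posted to an incoming webhook, which accepts a JSON body with a `text` field. A minimal sketch (the helper names and webhook URL are illustrative):

```python
import json
import urllib.request

def build_alert_payload(rule_name: str, severity: str, value: float) -> dict:
    """Format a triggered alert rule as a Slack message payload."""
    return {"text": f"[{severity.upper()}] {rule_name} triggered (value={value})"}

def notify_slack(webhook_url: str, rule_name: str, severity: str, value: float) -> None:
    """POST the alert to a Slack incoming webhook."""
    payload = build_alert_payload(rule_name, severity, value)
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack replies with "ok" on success
```

Keeping payload construction separate from the HTTP call makes the formatting unit-testable and lets you swap Slack for PagerDuty or email without touching the alert logic.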
Common Mistakes to Avoid
- Only monitoring infrastructure metrics (CPU, memory) while ignoring agent-level metrics (completion rate, tool errors)
- Not tracing individual LLM calls — you need to see the reasoning at each step to debug issues
- Logging too little (no tool call args) or too much (full LLM responses including user data) without PII handling
- Setting alerts too loosely — by the time you notice, the agent has already made hundreds of bad decisions
- Not correlating agent metrics with business outcomes — task completion rate means nothing without quality measurement