Observability & Monitoring
Monitor agent behavior, trace execution, log tool calls, and set up alerting for production agent systems.
Prerequisites
1. Built and deployed at least one agent
2. Familiarity with logging and monitoring concepts
3. Understanding of the agent loop and tool use
What you will learn
- Why traditional monitoring is not enough for agents
- How to trace end-to-end agent execution
- Key metrics to track for agent systems
- Setting up LangSmith or Langfuse for observability
- Alerting strategies for agent failures
Why Agent Observability Is Different
Traditional application monitoring tracks request/response latency, error rates, and resource usage. Agent systems need all of that plus:
- Reasoning traces — What did the model think at each step? Why did it choose that tool?
- Tool call chains — Which tools were called, in what order, with what arguments, and what did they return?
- Token usage per step — A single agent run might involve 5-10 LLM calls. You need per-call visibility.
- Branching and loops — Multi-agent systems have non-linear execution paths. You need graph-aware tracing.
- Quality metrics — Was the output correct? Did the user find it helpful? Traditional uptime metrics do not capture this.
Without observability, debugging a misbehaving agent is like debugging a program without logs — you are guessing.
Structured Logging
Start with structured, JSON-formatted logging for every agent event:
```python
import logging
import json
from datetime import datetime, timezone

class AgentLogger:
    def __init__(self, agent_name: str):
        self.logger = logging.getLogger(f"agent.{agent_name}")
        self.agent_name = agent_name

    def log_event(self, event_type: str, data: dict):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_name,
            "event": event_type,
            **data,
        }
        self.logger.info(json.dumps(event))

    def log_llm_call(self, model: str, tokens_in: int, tokens_out: int, latency_ms: float):
        self.log_event("llm_call", {
            "model": model,
            "tokens_input": tokens_in,
            "tokens_output": tokens_out,
            "latency_ms": latency_ms,
        })

    def log_tool_call(self, tool: str, args: dict, result: str, latency_ms: float):
        self.log_event("tool_call", {
            "tool": tool,
            "arguments": args,
            "result_length": len(result),
            "latency_ms": latency_ms,
        })

    def log_error(self, error: str, step: str):
        self.log_event("error", {
            "error": error,
            "step": step,
        })

# Usage
logger = AgentLogger("ResearchAgent")
logger.log_tool_call("search_web", {"query": "MCP protocol"}, "Found 10 results", 230.5)
```
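Because each event is a single JSON line, the logs can be aggregated with ordinary tooling. A minimal sketch of consuming them (the `summarize_tool_calls` helper and the sample log lines are illustrative, not part of `AgentLogger`):

```python
import json

def summarize_tool_calls(log_lines: list[str]) -> dict:
    """Aggregate per-tool call counts and latency from JSON-lines agent logs."""
    stats: dict[str, dict] = {}
    for line in log_lines:
        event = json.loads(line)
        if event.get("event") != "tool_call":
            continue  # skip llm_call, error, and other event types
        s = stats.setdefault(event["tool"], {"calls": 0, "total_latency_ms": 0.0})
        s["calls"] += 1
        s["total_latency_ms"] += event["latency_ms"]
    return stats

# Two sample events in the shape AgentLogger emits
lines = [
    '{"timestamp": "2025-01-01T00:00:00+00:00", "agent": "ResearchAgent", "event": "tool_call", "tool": "search_web", "result_length": 16, "latency_ms": 230.5}',
    '{"timestamp": "2025-01-01T00:00:05+00:00", "agent": "ResearchAgent", "event": "tool_call", "tool": "search_web", "result_length": 12, "latency_ms": 180.0}',
]
print(summarize_tool_calls(lines))
```

In production you would run this kind of aggregation in your log pipeline (e.g. a scheduled query) rather than in the agent process itself.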
Tracing with LangSmith
LangSmith provides production-grade tracing for LangGraph and LangChain applications. Set it up in two steps:
```shell
# 1. Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls-..."
export LANGCHAIN_PROJECT="my-agent-project"

# 2. That is it. LangChain/LangGraph will automatically trace all calls.
```
For non-LangChain frameworks, use the LangSmith SDK directly:
```python
from langsmith import traceable
from langsmith.run_helpers import trace

@traceable(name="agent_step")
def research_step(query: str) -> str:
    """This function is automatically traced in LangSmith."""
    result = llm.invoke(query)
    return result.content

# Or use the context manager for more control
with trace("full_agent_run", project_name="my-agent") as rt:
    rt.add_metadata({"user_id": "user-123", "task_type": "research"})
    research = research_step("AI frameworks 2025")
    # All nested calls are captured in the trace
```
LangSmith gives you a visual trace tree showing every LLM call, tool invocation, and state update with full input/output data.
Tracing with Langfuse
Langfuse is an open-source alternative to LangSmith. It can be self-hosted or used as a managed service:
```python
from langfuse import observe, get_client

# Configure via environment variables:
# LANGFUSE_PUBLIC_KEY="pk-..."
# LANGFUSE_SECRET_KEY="sk-..."
# LANGFUSE_HOST="https://cloud.langfuse.com"

@observe()
def run_agent(user_message: str) -> str:
    # Update the current trace with metadata
    langfuse = get_client()
    langfuse.update_current_trace(
        user_id="user-123",
        metadata={"source": "api"},
    )
    # Each nested @observe() creates a span in the trace
    research = do_research(user_message)
    response = generate_response(research)
    return response

@observe()
def do_research(query: str) -> str:
    # This creates a child span
    result = llm.invoke(query)
    langfuse = get_client()
    langfuse.update_current_observation(
        metadata={"tokens": result.usage.total_tokens}
    )
    return result.content
```
Key Metrics to Track
Set up dashboards and alerts for these essential agent metrics:
| Metric | Description | Alert Threshold |
|---|---|---|
| Task completion rate | Percentage of tasks completed successfully | < 90% |
| Avg. LLM calls per task | How many model calls per task | > 15 (possible loop) |
| Avg. latency (end-to-end) | Total time from input to output | > 30s |
| Token usage per task | Total tokens consumed per task | > 50k (cost concern) |
| Tool error rate | Percentage of tool calls that fail | > 5% |
| Guardrail trigger rate | How often safety filters activate | > 10% (suspicious) |
| Cost per task | Dollar cost of each agent execution | > $0.50 per task |
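Most of these metrics can be computed directly from the structured events logged earlier. A hedged sketch of the aggregation (it assumes the event shapes from the `AgentLogger` example, and that tool failures are logged as `error` events with `step` set to `"tool_call"`; task counts would come from your own task tracking):

```python
def compute_metrics(events: list[dict], num_tasks: int, completed: int) -> dict:
    """Derive dashboard metrics from a batch of logged agent events."""
    llm_calls = [e for e in events if e["event"] == "llm_call"]
    tool_calls = [e for e in events if e["event"] == "tool_call"]
    # Assumption: tool failures are logged via log_error(step="tool_call")
    tool_errors = [e for e in events if e["event"] == "error" and e.get("step") == "tool_call"]
    return {
        "completion_rate": completed / num_tasks if num_tasks else 0.0,
        "llm_calls_per_task": len(llm_calls) / num_tasks if num_tasks else 0.0,
        "tool_error_rate": len(tool_errors) / len(tool_calls) if tool_calls else 0.0,
        "tokens_per_task": (
            sum(e["tokens_input"] + e["tokens_output"] for e in llm_calls) / num_tasks
            if num_tasks else 0.0
        ),
    }
```

The resulting dictionary maps directly onto the metric names in the table, so it can feed the alerting rules in the next section.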
Alerting Strategy
Set up tiered alerts for different severity levels:
```python
# Example: Alerting with a simple monitor
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str  # "info", "warning", "critical"

ALERT_RULES = [
    AlertRule("High Error Rate", "tool_error_rate", 0.05, "warning"),
    AlertRule("Agent Loop Detected", "llm_calls_per_task", 15, "critical"),
    AlertRule("High Cost", "cost_per_task_usd", 0.50, "warning"),
    AlertRule("Slow Response", "latency_p95_seconds", 30, "info"),
    AlertRule("Low Completion", "completion_rate", 0.90, "critical"),
]

def check_alerts(metrics: dict) -> list[AlertRule]:
    triggered = []
    for rule in ALERT_RULES:
        value = metrics.get(rule.metric, 0)
        # completion_rate alerts when it falls BELOW its threshold;
        # every other metric alerts when it rises above
        if rule.metric == "completion_rate":
            if value < rule.threshold:
                triggered.append(rule)
        else:
            if value > rule.threshold:
                triggered.append(rule)
    return triggered
```
Integrate these alerts with your existing monitoring stack — Datadog, PagerDuty, Slack, or whatever your team uses. The key is not to ignore agent-specific metrics just because your traditional infrastructure metrics look healthy.
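For the Slack route, a triggered rule can be posted to an incoming webhook, which accepts a JSON body with a `text` field. A minimal sketch (the helper names and webhook URL are illustrative):

```python
import json
import urllib.request

def build_alert_payload(rule_name: str, severity: str, value: float) -> dict:
    """Format a triggered alert rule as a Slack message payload."""
    return {"text": f"[{severity.upper()}] {rule_name} triggered (value={value})"}

def notify_slack(webhook_url: str, rule_name: str, severity: str, value: float) -> None:
    """POST the alert to a Slack incoming webhook."""
    payload = build_alert_payload(rule_name, severity, value)
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack replies with "ok" on success
```

Keeping payload construction separate from the HTTP call makes the formatting unit-testable and lets you swap Slack for PagerDuty or email without touching the alert logic.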
Common Mistakes to Avoid
- Only monitoring infrastructure metrics (CPU, memory) while ignoring agent-level metrics (completion rate, tool errors)
- Not tracing individual LLM calls — you need to see the reasoning at each step to debug issues
- Logging too little (no tool call args) or too much (full LLM responses including user data) without PII handling
- Setting alerts too loosely — by the time you notice, the agent has already made hundreds of bad decisions
- Not correlating agent metrics with business outcomes — task completion rate means nothing without quality measurement