Advanced · 20 min read · Guide 10 of 12 · Updated Jun 15, 2025

Observability & Monitoring

Monitor agent behavior, trace execution, log tool calls, and set up alerting for production agent systems.

Prerequisites

  • Built and deployed at least one agent
  • Familiarity with logging and monitoring concepts
  • Understanding of the agent loop and tool use

What you will learn

  • Why traditional monitoring is not enough for agents
  • How to trace end-to-end agent execution
  • Key metrics to track for agent systems
  • Setting up LangSmith or Langfuse for observability
  • Alerting strategies for agent failures

Why Agent Observability Is Different

Traditional application monitoring tracks request/response latency, error rates, and resource usage. Agent systems need all of that plus:

  • Reasoning traces — What did the model think at each step? Why did it choose that tool?
  • Tool call chains — Which tools were called, in what order, with what arguments, and what did they return?
  • Token usage per step — A single agent run might involve 5-10 LLM calls. You need per-call visibility.
  • Branching and loops — Multi-agent systems have non-linear execution paths. You need graph-aware tracing.
  • Quality metrics — Was the output correct? Did the user find it helpful? Traditional uptime metrics do not capture this.

Without observability, debugging a misbehaving agent is like debugging a program without logs — you are guessing.
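To make the dimensions above concrete, here is what a single structured trace record for one agent step might look like. The field names are illustrative, not a standard schema:

```python
# One illustrative trace record for a single agent step; every field
# name here is made up for the example, not a prescribed schema.
trace_event = {
    "run_id": "run-42",
    "step": 3,
    "reasoning": "Docs lookup failed; falling back to web search.",  # reasoning trace
    "tool_call": {                                                   # tool call chain entry
        "name": "search_web",
        "arguments": {"query": "MCP protocol"},
        "result_preview": "Found 10 results",
    },
    "tokens": {"input": 812, "output": 154},                         # per-step token usage
    "parent_step": 2,                                                # enables graph-aware tracing
}
```

Emitting one such record per step gives you the raw material for every technique in the rest of this guide.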

Structured Logging

Start with structured, JSON-formatted logging for every agent event:

import logging
import json
from datetime import datetime, timezone

class AgentLogger:
    def __init__(self, agent_name: str):
        self.logger = logging.getLogger(f"agent.{agent_name}")
        self.agent_name = agent_name

    def log_event(self, event_type: str, data: dict):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_name,
            "event": event_type,
            **data,
        }
        self.logger.info(json.dumps(event))

    def log_llm_call(self, model: str, tokens_in: int, tokens_out: int, latency_ms: float):
        self.log_event("llm_call", {
            "model": model,
            "tokens_input": tokens_in,
            "tokens_output": tokens_out,
            "latency_ms": latency_ms,
        })

    def log_tool_call(self, tool: str, args: dict, result: str, latency_ms: float):
        self.log_event("tool_call", {
            "tool": tool,
            "arguments": args,
            "result_length": len(result),
            "latency_ms": latency_ms,
        })

    def log_error(self, error: str, step: str):
        self.log_event("error", {
            "error": error,
            "step": step,
        })

# Usage
logging.basicConfig(level=logging.INFO)  # the root logger defaults to WARNING, which would drop these events
logger = AgentLogger("ResearchAgent")
logger.log_tool_call("search_web", {"query": "MCP protocol"}, "Found 10 results", 230.5)
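Timing every tool call by hand at each call site is error-prone. A small context manager (a sketch, not part of any library) can measure latency and feed a logging callback such as the `log_tool_call` method above:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_tool_call(log_fn, tool: str, args: dict):
    """Times a tool call and reports the latency to log_fn on exit.

    log_fn is any callable with the (tool, args, result, latency_ms)
    shape of AgentLogger.log_tool_call. The yielded dict is a slot for
    the caller to deposit the tool's result.
    """
    start = time.perf_counter()
    holder = {}
    try:
        yield holder
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        log_fn(tool, args, holder.get("result", ""), latency_ms)

# Hypothetical usage with the AgentLogger above:
# with timed_tool_call(logger.log_tool_call, "search_web", {"query": "MCP"}) as t:
#     t["result"] = search_web("MCP")
```

Because the logging happens in `finally`, the latency is recorded even when the tool raises.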

Tracing with LangSmith

LangSmith provides production-grade tracing for LangGraph and LangChain applications. Set it up in two steps:

# 1. Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls-..."
export LANGCHAIN_PROJECT="my-agent-project"

# 2. That is it. LangChain/LangGraph will automatically trace all calls.

For non-LangChain frameworks, use the LangSmith SDK directly:

from langsmith import traceable
from langsmith.run_helpers import trace

@traceable(name="agent_step")
def research_step(query: str) -> str:
    """This function is automatically traced in LangSmith."""
    result = llm.invoke(query)
    return result.content

# Or use the context manager for more control
with trace("full_agent_run", project_name="my-agent",
           metadata={"user_id": "user-123", "task_type": "research"}):
    research = research_step("AI frameworks 2025")
    # All nested calls are captured in the trace

LangSmith gives you a visual trace tree showing every LLM call, tool invocation, and state update with full input/output data.

Tracing with Langfuse

Langfuse is an open-source alternative to LangSmith. It can be self-hosted or used as a managed service:

from langfuse import observe, get_client

# Configure via environment variables:
# LANGFUSE_PUBLIC_KEY="pk-..."
# LANGFUSE_SECRET_KEY="sk-..."
# LANGFUSE_HOST="https://cloud.langfuse.com"

@observe()
def run_agent(user_message: str) -> str:
    # Update the current trace with metadata
    langfuse = get_client()
    langfuse.update_current_trace(
        user_id="user-123",
        metadata={"source": "api"},
    )

    # Each nested @observe() creates a span in the trace
    research = do_research(user_message)
    response = generate_response(research)
    return response

@observe()
def do_research(query: str) -> str:
    # This creates a child span
    result = llm.invoke(query)
    langfuse = get_client()
    langfuse.update_current_observation(
        metadata={"tokens": result.usage.total_tokens}
    )
    return result.content

Key Metrics to Track

Set up dashboards and alerts for these essential agent metrics:

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Task completion rate | Percentage of tasks completed successfully | < 90% |
| Avg. LLM calls per task | How many model calls per task | > 15 (possible loop) |
| Avg. latency (end-to-end) | Total time from input to output | > 30s |
| Token usage per task | Total tokens consumed per task | > 50k (cost concern) |
| Tool error rate | Percentage of tool calls that fail | > 5% |
| Guardrail trigger rate | How often safety filters activate | > 10% (suspicious) |
| Cost per task | Dollar cost of each agent execution | > $0.50 per task |
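Most of these metrics can be aggregated straight from the structured log events emitted earlier. The sketch below assumes a `task_end` event type with a `success` flag, which the AgentLogger above does not define; treat both as assumptions you would add for your own pipeline:

```python
def compute_metrics(events: list[dict]) -> dict:
    """Aggregates structured log events into dashboard metrics.

    Event shapes follow the AgentLogger sketch; the "task_end" event
    and its "success" flag are assumed additions, not part of it.
    """
    tasks = [e for e in events if e["event"] == "task_end"]
    llm_calls = [e for e in events if e["event"] == "llm_call"]
    tool_calls = [e for e in events if e["event"] == "tool_call"]
    errors = [e for e in events if e["event"] == "error"]
    n_tasks = max(len(tasks), 1)  # avoid division by zero on empty windows
    return {
        "completion_rate": sum(e.get("success", False) for e in tasks) / n_tasks,
        "llm_calls_per_task": len(llm_calls) / n_tasks,
        "tokens_per_task": sum(e["tokens_input"] + e["tokens_output"]
                               for e in llm_calls) / n_tasks,
        "tool_error_rate": len(errors) / max(len(tool_calls), 1),
    }
```

Run this over a sliding window (say, the last hour of events) and the output feeds directly into the alert rules below.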

Alerting Strategy

Set up tiered alerts for different severity levels:

# Example: Alerting with a simple monitor
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str  # "info", "warning", "critical"
    below: bool = False  # trigger when the metric falls below the threshold

ALERT_RULES = [
    AlertRule("High Error Rate", "tool_error_rate", 0.05, "warning"),
    AlertRule("Agent Loop Detected", "llm_calls_per_task", 15, "critical"),
    AlertRule("High Cost", "cost_per_task_usd", 0.50, "warning"),
    AlertRule("Slow Response", "latency_p95_seconds", 30, "info"),
    AlertRule("Low Completion", "completion_rate", 0.90, "critical", below=True),
]

def check_alerts(metrics: dict) -> list[AlertRule]:
    triggered = []
    for rule in ALERT_RULES:
        value = metrics.get(rule.metric, 0)
        breached = value < rule.threshold if rule.below else value > rule.threshold
        if breached:
            triggered.append(rule)
    return triggered

Integrate these alerts with your existing monitoring stack — Datadog, PagerDuty, Slack, or whatever your team uses. The key is not to ignore agent-specific metrics just because your traditional infrastructure metrics look healthy.
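The severity field maps naturally onto routing: informational alerts to a dashboard channel, criticals to a pager. A minimal sketch of that mapping; the route names are illustrative, not a prescribed setup:

```python
# Severity-based routing for triggered alerts. The channel and service
# names are hypothetical examples, not a recommended configuration.
SEVERITY_ROUTES = {
    "info": "slack:#agent-monitoring",
    "warning": "slack:#agent-alerts",
    "critical": "pagerduty:agent-oncall",
}

def format_alert(rule_name: str, metric: str, value: float, severity: str) -> tuple[str, str]:
    """Returns a (route, message) pair for a triggered alert rule."""
    route = SEVERITY_ROUTES[severity]
    message = f"[{severity.upper()}] {rule_name}: {metric}={value:g}"
    return route, message
```

The `(route, message)` pair is then handed to whatever delivery integration you already run — a Slack webhook, the PagerDuty Events API, and so on.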

Common Mistakes to Avoid

  • Only monitoring infrastructure metrics (CPU, memory) while ignoring agent-level metrics (completion rate, tool errors)
  • Not tracing individual LLM calls — you need to see the reasoning at each step to debug issues
  • Logging too little (no tool call args) or too much (full LLM responses including user data) without PII handling
  • Setting alerts too loosely — by the time you notice, the agent has already made hundreds of bad decisions
  • Not correlating agent metrics with business outcomes — task completion rate means nothing without quality measurement
