Production Deployment
Ship agents to production with proper architecture, containerization, scaling, cost optimization, and reliability.
Prerequisites
1. Completed Guardrails, Observability, and Evaluation guides
2. Familiarity with Docker and container orchestration
3. Experience deploying web services to production
What you will learn
- Architecture patterns for production agent systems
- How to containerize and deploy agent services
- Scaling strategies for variable workloads
- Cost optimization techniques
- Error handling and graceful degradation
- A production readiness checklist
Architecture Considerations
A production agent system is more than just the agent code. Here is a reference architecture:
         +------------------+
         |   API Gateway    |
         |  (Rate Limiting) |
         +--------+---------+
                  |
         +--------+---------+
         |  Agent Service   |
         |   (Stateless)    |
         +--+------+-----+--+
            |      |     |
 +----------+   +--+--+  +---------+
 | LLM API  |   |Tools|  | State   |
 | (OpenAI/ |   |(MCP)|  | Store   |
 |Anthropic)|   +-----+  | (Redis) |
 +----+-----+            +---------+
      |
+-----+-----------+
|  Observability  |
|  (LangSmith/    |
|   Langfuse)     |
+-----------------+
Key principles:
- Stateless agent service — Store state in Redis or a database, not in memory. This allows horizontal scaling.
- Async processing — Long-running agent tasks should be processed via a job queue, not synchronous API calls.
- Separate tool services — Run MCP servers as separate services for independent scaling and deployment.
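The stateless principle can be sketched with a small session store: conversation state lives in Redis keyed by session ID, so any replica can serve any request. SessionStore and the session:&lt;id&gt; key scheme here are illustrative, not from any framework; client is any Redis-compatible object with get/set:

```python
import json


class SessionStore:
    """Keeps conversation state in an external store (e.g. Redis) so the
    agent service itself stays stateless and horizontally scalable."""

    def __init__(self, client, ttl_seconds: int = 3600):
        self.client = client  # any Redis-like client with get/set
        self.ttl = ttl_seconds

    def load(self, session_id: str) -> list:
        """Fetch the message history for a session, or [] if none exists."""
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else []

    def save(self, session_id: str, messages: list) -> None:
        """Persist the message history with a TTL so stale sessions expire."""
        self.client.set(f"session:{session_id}", json.dumps(messages), ex=self.ttl)
```

Because every replica reads and writes through the same store, a load balancer can route any request to any instance without sticky sessions.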
Containerization
Package your agent as a Docker container for consistent deployments:
# Dockerfile
FROM python:3.12-slim
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/

# Non-root user for security
RUN useradd -m agent
USER agent

# Health check (python:3.12-slim does not ship curl, so use the stdlib)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
And the FastAPI service:
# src/main.py
from fastapi import FastAPI, BackgroundTasks, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()


class AgentRequest(BaseModel):
    message: str
    user_id: str
    session_id: str | None = None


class AgentResponse(BaseModel):
    task_id: str
    status: str


@app.post("/agent/run")
async def run_agent(request: AgentRequest, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())
    # Process asynchronously for long-running tasks.
    # execute_agent_task is your app's agent runner; it stores the result by task_id.
    background_tasks.add_task(
        execute_agent_task, task_id, request.message, request.user_id
    )
    return AgentResponse(task_id=task_id, status="processing")


@app.get("/agent/status/{task_id}")
async def get_status(task_id: str):
    # get_task_result looks up the stored result (e.g. in Redis)
    result = await get_task_result(task_id)
    if not result:
        raise HTTPException(status_code=404, detail="Task not found")
    return result


@app.get("/health")
async def health():
    return {"status": "healthy"}
TypeScript Alternative (Next.js + Vercel AI SDK)
For TypeScript agents, a common deployment pattern uses Next.js API routes with the Vercel AI SDK:
// app/api/agent/route.ts (Next.js App Router)
import { streamText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

export async function POST(req: Request) {
  const { message } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    system: "You are a helpful assistant.",
    prompt: message,
    tools: {
      search: tool({
        description: "Search for information",
        parameters: z.object({
          query: z.string(),
        }),
        execute: async ({ query }) => {
          // Call your search API
          return `Results for: ${query}`;
        },
      }),
    },
    maxSteps: 5,
  });

  return result.toDataStreamResponse();
}
TypeScript Dockerfile for non-Vercel deployments:
# Dockerfile (TypeScript/Node.js)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so the runtime image stays small
RUN npm prune --omit=dev

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER node
# Alpine's busybox provides wget, so no extra packages are needed
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD wget -qO- http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/index.js"]
Scaling Strategies
Agent workloads are bursty and variable. Here are scaling strategies:
Horizontal Scaling
Since the agent service is stateless, you can run multiple replicas behind a load balancer:
# docker-compose.yml
services:
  agent:
    build: .
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
    environment:
      - REDIS_URL=redis://redis:6379
      - OPENAI_API_KEY=${OPENAI_API_KEY}

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    depends_on:
      - agent

volumes:
  redis-data:
Queue-Based Processing
For variable workloads, use a job queue to decouple request intake from processing:
# Producer: API receives request and enqueues
import redis
import json

r = redis.Redis()


def enqueue_agent_task(task_id: str, message: str, user_id: str):
    task = {"task_id": task_id, "message": message, "user_id": user_id}
    r.lpush("agent:tasks", json.dumps(task))


# Consumer: Worker processes tasks from the queue
def worker_loop():
    while True:
        _, task_json = r.brpop("agent:tasks")  # blocks until a task arrives
        task = json.loads(task_json)
        result = run_agent(task["message"])  # your agent entry point
        r.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=3600)
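The worker loop above blocks forever and can die mid-shutdown when the orchestrator sends SIGTERM. One hedged refinement, assuming a POSIX environment: trap SIGTERM and give BRPOP a timeout so the loop can notice the shutdown flag between tasks. run_worker and handle_task are illustrative names, not part of any library:

```python
import json
import signal


def run_worker(r, handle_task, queue: str = "agent:tasks") -> None:
    """Pop tasks until SIGTERM arrives; the BRPOP timeout lets the loop
    re-check the shutdown flag every few seconds instead of blocking forever."""
    stopping = False

    def _request_stop(signum, frame):
        nonlocal stopping
        stopping = True

    signal.signal(signal.SIGTERM, _request_stop)

    while not stopping:
        popped = r.brpop(queue, timeout=5)  # returns None on timeout
        if popped is None:
            continue
        _, task_json = popped
        task = json.loads(task_json)
        result = handle_task(task)
        r.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=3600)
```

Pair this with a `stop_grace_period` in your compose file or a `terminationGracePeriodSeconds` in Kubernetes that is longer than your slowest task.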
Cost Optimization
LLM API costs can grow quickly. Apply these optimization techniques:
- Model routing — Use cheaper models for simple tasks and expensive models only when needed:
def select_model(task_complexity: str) -> str:
    """Route to the appropriate model based on task complexity."""
    model_map = {
        "simple": "gpt-4o-mini",  # $0.15/1M input tokens
        "moderate": "gpt-4o",     # $2.50/1M input tokens
        "complex": "claude-sonnet-4-20250514",  # For hardest tasks
    }
    return model_map.get(task_complexity, "gpt-4o-mini")
TypeScript equivalent using the Vercel AI SDK:
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

function selectModel(complexity: "simple" | "moderate" | "complex") {
  const models = {
    simple: openai("gpt-4o-mini"),
    moderate: openai("gpt-4o"),
    complex: anthropic("claude-sonnet-4-20250514"),
  };
  return models[complexity];
}
- Caching — Cache common queries and tool results:
import hashlib


def cached_llm_call(prompt: str, model: str) -> str:
    # redis_client and llm are initialized elsewhere in the service
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    # Check cache first
    cached = redis_client.get(f"llm_cache:{cache_key}")
    if cached:
        return cached.decode()
    # Make the API call
    result = llm.invoke(prompt)
    # Cache for 1 hour
    redis_client.set(f"llm_cache:{cache_key}", result.content, ex=3600)
    return result.content
- Prompt optimization — Shorter prompts cost less. Remove redundant instructions and use concise examples.
- Token limits — Set max_tokens on every API call to prevent runaway generation.
- Batch processing — For non-real-time tasks, use batch API endpoints that offer significant discounts.
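The token-limit idea extends naturally to a per-request budget: track cumulative usage across an agent's steps and cap each call's max_tokens at whatever remains. TokenBudget is an illustrative sketch, not a library class:

```python
class TokenBudget:
    """Tracks token spend across an agent loop and stops it before overrun."""

    def __init__(self, max_total_tokens: int):
        self.max_total = max_total_tokens
        self.used = 0

    def remaining(self) -> int:
        """Tokens still available for this request."""
        return max(self.max_total - self.used, 0)

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Call after each LLM response, using the usage stats the API returns."""
        self.used += prompt_tokens + completion_tokens

    def exhausted(self) -> bool:
        return self.used >= self.max_total


# In the agent loop, cap each call at min(remaining budget, per-call limit)
budget = TokenBudget(max_total_tokens=8000)
per_call_cap = 1024
max_tokens = min(budget.remaining(), per_call_cap)
```

When the budget is exhausted, end the agent loop with a best-effort answer rather than issuing another call.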
Error Handling and Graceful Degradation
Production agents must handle failures gracefully:
import time
from functools import wraps

# Exception types from the openai v1 SDK; substitute your provider's equivalents
from openai import APIConnectionError, APIError, APITimeoutError, RateLimitError


class MaxRetriesExceeded(Exception):
    """Raised when every retry attempt has failed."""


def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    # Always back off on rate limits
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
                except (APIConnectionError, APITimeoutError):
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
            raise MaxRetriesExceeded(f"Failed after {max_retries} attempts")
        return wrapper
    return decorator


class AgentWithFallback:
    """Agent with graceful degradation."""

    def __init__(self):
        self.primary_model = "gpt-4o"
        self.fallback_model = "gpt-4o-mini"

    @retry_with_backoff(max_retries=3)
    def run(self, message: str) -> str:
        # _run_with_model wraps your LLM client call for the given model
        try:
            return self._run_with_model(message, self.primary_model)
        except (RateLimitError, APIError):
            # Fall back to cheaper, more available model
            return self._run_with_model(message, self.fallback_model)
        except Exception:
            # Last resort: return a helpful error message
            return (
                "I apologize, but I am currently unable to process your request. "
                "Please try again in a few minutes."
            )
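Retries and fallbacks cover the LLM API; external tool calls benefit from a circuit breaker, which stops hammering a failing dependency and fails fast until a cooldown elapses. A minimal sketch (the threshold and cooldown values are illustrative defaults, not recommendations):

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then rejects
    calls immediately until `cooldown_seconds` have passed."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Fail fast instead of waiting on a broken dependency
                raise RuntimeError("circuit open: tool temporarily disabled")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrap each tool's network call in its own breaker so one degraded tool does not disable the rest, and surface the open-circuit error to the agent so it can explain the limitation to the user.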
Production Readiness Checklist
Before shipping your agent to production, verify every item on this checklist:
Security
- Input validation on all user-facing endpoints
- Output filtering for PII and secrets
- Tool call validation with risk-level classification
- API keys stored in secrets manager, not in code or environment files
- Non-root container user
Reliability
- Retry logic with exponential backoff for LLM API calls
- Model fallback chain (primary to secondary to error message)
- Circuit breaker for external tool calls
- Timeout limits on all agent executions
- Health check endpoint
Observability
- Structured logging for all agent events
- End-to-end tracing (LangSmith, Langfuse, or equivalent)
- Metrics dashboard with key agent KPIs
- Alerting configured for error rates, latency, and costs
Cost Control
- Token budget per request
- Rate limiting per user and globally
- Model routing based on task complexity
- Response caching for common queries
- Monitoring of daily/weekly spend with alerts
Testing
- Unit tests for tools and guardrails
- Integration tests with mock LLMs
- End-to-end evaluation benchmark suite
- Regression tests running in CI/CD
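For the "integration tests with mock LLMs" item, injecting a scripted fake client is often enough to exercise the agent's control flow without cost or flakiness. FakeLLM and the run_agent wiring here are illustrative, not from any test framework:

```python
class FakeLLM:
    """Returns scripted responses so tests are fast, free, and deterministic."""

    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = []  # record prompts for assertions

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.responses.pop(0)


def run_agent(llm, message: str) -> str:
    # Minimal stand-in for the real agent loop, which would accept
    # any client exposing invoke()
    return llm.invoke(f"User: {message}")


def test_agent_answers():
    llm = FakeLLM(["Paris"])
    assert run_agent(llm, "Capital of France?") == "Paris"
    # Also assert on what the agent actually sent to the model
    assert llm.calls == ["User: Capital of France?"]
```

Because the fake records every prompt, the same pattern also lets regression tests assert that prompt changes were intentional.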
Common Mistakes to Avoid
- Running agents as synchronous API calls instead of background tasks, leading to timeouts
- Storing agent state in memory instead of an external store, breaking horizontal scaling
- Not implementing retry logic for LLM API calls — rate limits and transient failures are common
- Deploying without cost controls and waking up to a surprise bill
- Skipping the production readiness checklist and fixing issues reactively instead of proactively
- Not testing the deployment pipeline itself — your agent works locally but fails in the container