Advanced · 25 min read · Guide 12 of 12 · Updated Jun 15, 2025

Production Deployment

Ship agents to production with proper architecture, containerization, scaling, cost optimization, and reliability.

Prerequisites

  • Completed the Guardrails, Observability, and Evaluation guides
  • Familiarity with Docker and container orchestration
  • Experience deploying web services to production

What you will learn

  • Architecture patterns for production agent systems
  • How to containerize and deploy agent services
  • Scaling strategies for variable workloads
  • Cost optimization techniques
  • Error handling and graceful degradation
  • A production readiness checklist

Architecture Considerations

A production agent system is more than just the agent code. Here is a reference architecture:

                    +------------------+
                    |   API Gateway    |
                    |  (Rate Limiting) |
                    +--------+---------+
                             |
                    +--------+---------+
                    |  Agent Service   |
                    |  (Stateless)     |
                    +---+----+----+----+
                        |    |    |
            +-----------+ +-----+ +---------+
            |  LLM API  | |Tools| |  State  |
            | (OpenAI/  | |(MCP)| |  Store  |
            | Anthropic)| +-----+ | (Redis) |
            +-----------+         +---------+
                                        |
                               +--------+--------+
                               |  Observability  |
                               |  (LangSmith/    |
                               |   Langfuse)     |
                               +-----------------+

Key principles:

  • Stateless agent service — Store state in Redis or a database, not in memory. This allows horizontal scaling.
  • Async processing — Long-running agent tasks should be processed via a job queue, not synchronous API calls.
  • Separate tool services — Run MCP servers as separate services for independent scaling and deployment.
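The stateless principle above can be sketched as a thin session-store layer. This is an illustrative shape, not a fixed API: `SessionStore` and its key scheme are assumptions, and the client can be any object with `get`/`set` (a `redis.Redis` instance in production).

```python
import json

class SessionStore:
    """Keeps conversation state in an external store so any replica can serve any request."""

    def __init__(self, client, ttl_seconds: int = 86400):
        # client: anything with get/set, e.g. redis.Redis.from_url(REDIS_URL)
        self.client = client
        self.ttl = ttl_seconds

    def load(self, session_id: str) -> list[dict]:
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else []

    def append(self, session_id: str, message: dict) -> None:
        history = self.load(session_id)
        history.append(message)
        # Expiry caps the storage used by abandoned sessions
        self.client.set(f"session:{session_id}", json.dumps(history), ex=self.ttl)
```

Because no replica holds state in memory, any instance behind the load balancer can pick up any session.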

Containerization

Package your agent as a Docker container for consistent deployments:

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/

# Non-root user for security
RUN useradd -m agent
USER agent

# Health check (curl is not included in the slim image, so use Python's stdlib)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

And the FastAPI service:

# src/main.py
from fastapi import FastAPI, BackgroundTasks, HTTPException
from pydantic import BaseModel
import uuid

# execute_agent_task and get_task_result are your task helpers: one runs the
# agent and persists the result, the other reads it back (see the queue-based
# processing pattern in the scaling section)

app = FastAPI()

class AgentRequest(BaseModel):
    message: str
    user_id: str
    session_id: str | None = None

class AgentResponse(BaseModel):
    task_id: str
    status: str

@app.post("/agent/run")
async def run_agent(request: AgentRequest, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())

    # Process asynchronously for long-running tasks
    background_tasks.add_task(
        execute_agent_task, task_id, request.message, request.user_id
    )

    return AgentResponse(task_id=task_id, status="processing")

@app.get("/agent/status/{task_id}")
async def get_status(task_id: str):
    result = await get_task_result(task_id)
    if not result:
        raise HTTPException(status_code=404, detail="Task not found")
    return result

@app.get("/health")
async def health():
    return {"status": "healthy"}
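The service above leaves `execute_agent_task` and `get_task_result` undefined. One hedged way to build them, with the result store injected so the same code works against Redis in production and a fake in tests; `make_task_helpers`, its key scheme, and the TTL are illustrative assumptions:

```python
import asyncio
import json

def make_task_helpers(store, run_agent, ttl_seconds: int = 3600):
    """Build the task helpers used by the API, bound to a result store.

    store: anything with get/set (a redis.Redis client in production).
    run_agent: callable taking the user message and returning the answer.
    """

    async def execute_agent_task(task_id: str, message: str, user_id: str) -> None:
        try:
            # Run the (blocking) agent off the event loop
            answer = await asyncio.to_thread(run_agent, message)
            payload = {"task_id": task_id, "status": "completed", "result": answer}
        except Exception as exc:
            # Surface failures to the status endpoint instead of losing them
            payload = {"task_id": task_id, "status": "failed", "error": str(exc)}
        store.set(f"agent:result:{task_id}", json.dumps(payload), ex=ttl_seconds)

    async def get_task_result(task_id: str):
        raw = store.get(f"agent:result:{task_id}")
        return json.loads(raw) if raw else None

    return execute_agent_task, get_task_result
```

Injecting the store keeps the helpers unit-testable without a running Redis.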

TypeScript Alternative (Next.js + Vercel AI SDK)

For TypeScript agents, a common deployment pattern uses Next.js API routes with the Vercel AI SDK:

// app/api/agent/route.ts (Next.js App Router)
import { streamText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

export async function POST(req: Request) {
  const { message } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    system: "You are a helpful assistant.",
    prompt: message,
    tools: {
      search: tool({
        description: "Search for information",
        parameters: z.object({
          query: z.string(),
        }),
        execute: async ({ query }) => {
          // Call your search API
          return `Results for: ${query}`;
        },
      }),
    },
    maxSteps: 5,
  });

  return result.toDataStreamResponse();
}

TypeScript Dockerfile for non-Vercel deployments:

# Dockerfile (TypeScript/Node.js)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./

USER node
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

EXPOSE 3000
CMD ["node", "dist/index.js"]

Scaling Strategies

Agent workloads are bursty and variable. Here are scaling strategies:

Horizontal Scaling

Since the agent service is stateless, you can run multiple replicas behind a load balancer:

# docker-compose.yml
services:
  agent:
    build: .
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
    environment:
      - REDIS_URL=redis://redis:6379
      - OPENAI_API_KEY=${OPENAI_API_KEY}

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    depends_on:
      - agent

Queue-Based Processing

For variable workloads, use a job queue to decouple request intake from processing:

# Producer: API receives request and enqueues
import redis
import json

r = redis.Redis()

def enqueue_agent_task(task_id: str, message: str, user_id: str):
    task = {"task_id": task_id, "message": message, "user_id": user_id}
    r.lpush("agent:tasks", json.dumps(task))

# Consumer: Worker processes tasks from the queue
def worker_loop():
    while True:
        # brpop blocks until a task is available
        _, task_json = r.brpop("agent:tasks")
        task = json.loads(task_json)
        result = run_agent(task["message"])  # run_agent: your agent entrypoint
        r.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=3600)
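The bare loop above dies on the first task that raises, taking the worker down with it. A hardened variant records per-task failures in the result store and keeps going; `process_one` and its payload shape are illustrative, not a fixed contract:

```python
import json
import logging

logger = logging.getLogger("agent.worker")

def process_one(store, run_agent, task_json: str, ttl_seconds: int = 3600) -> dict:
    """Run a single queued task, recording success or failure in the result store."""
    task = json.loads(task_json)
    try:
        result = {"status": "completed", "result": run_agent(task["message"])}
    except Exception as exc:
        # A bad task should fail loudly in the logs, not kill the worker
        logger.exception("task %s failed", task["task_id"])
        result = {"status": "failed", "error": str(exc)}
    store.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=ttl_seconds)
    return result

def worker_loop(store, run_agent):
    # store is a redis.Redis client; brpop blocks until a task arrives
    while True:
        _, task_json = store.brpop("agent:tasks")
        process_one(store, run_agent, task_json)
```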

Cost Optimization

LLM API costs can grow quickly. Apply these optimization techniques:

  • Model routing — Use cheaper models for simple tasks and expensive models only when needed:
def select_model(task_complexity: str) -> str:
    """Route to the appropriate model based on task complexity."""
    model_map = {
        "simple": "gpt-4o-mini",      # $0.15/1M input tokens
        "moderate": "gpt-4o",          # $2.50/1M input tokens
        "complex": "claude-sonnet-4-20250514",  # For hardest tasks
    }
    return model_map.get(task_complexity, "gpt-4o-mini")

TypeScript equivalent using the Vercel AI SDK:

import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

function selectModel(complexity: "simple" | "moderate" | "complex") {
  const models = {
    simple: openai("gpt-4o-mini"),
    moderate: openai("gpt-4o"),
    complex: anthropic("claude-sonnet-4-20250514"),
  };
  return models[complexity];
}
  • Caching — Cache common queries and tool results:
import hashlib

def cached_llm_call(prompt: str, model: str) -> str:
    # redis_client and llm are your shared Redis client and LLM client
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    # Check cache first
    cached = redis_client.get(f"llm_cache:{cache_key}")
    if cached:
        return cached.decode()

    # Make the API call
    result = llm.invoke(prompt)

    # Cache for 1 hour
    redis_client.set(f"llm_cache:{cache_key}", result.content, ex=3600)
    return result.content
  • Prompt optimization — Shorter prompts cost less. Remove redundant instructions and use concise examples.
  • Token limits — Set max_tokens on every API call to prevent runaway generation.
  • Batch processing — For non-real-time tasks, use batch API endpoints that offer significant discounts.
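The token-limit bullet can be turned into a small pre-flight check. The 4-characters-per-token heuristic below is a rough assumption (use tiktoken or your provider's tokenizer for exact counts), and `enforce_input_budget` is an illustrative helper name:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def enforce_input_budget(prompt: str, max_input_tokens: int = 4000) -> str:
    """Truncate oversized prompts before sending them to the API."""
    if estimate_tokens(prompt) <= max_input_tokens:
        return prompt
    # Keep the tail, which usually holds the most recent context
    return prompt[-(max_input_tokens * 4):]
```

Pair this input cap with an explicit `max_tokens` (output cap) on every completion call.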

Error Handling and Graceful Degradation

Production agents must handle failures gracefully:

import time
from functools import wraps

from openai import APIConnectionError, APIError, APITimeoutError, RateLimitError

class MaxRetriesExceeded(Exception):
    """Raised when all retry attempts are exhausted."""
def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
                except (APIConnectionError, APITimeoutError) as e:
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
            raise MaxRetriesExceeded(f"Failed after {max_retries} attempts")
        return wrapper
    return decorator

class AgentWithFallback:
    """Agent with graceful degradation."""

    def __init__(self):
        self.primary_model = "gpt-4o"
        self.fallback_model = "gpt-4o-mini"

    @retry_with_backoff(max_retries=3)
    def run(self, message: str) -> str:
        try:
            return self._run_with_model(message, self.primary_model)
        except (RateLimitError, APIError):
            # Fall back to cheaper, more available model
            return self._run_with_model(message, self.fallback_model)
        except Exception as e:
            # Last resort: return a helpful error message
            return (
                "I apologize, but I am currently unable to process your request. "
                "Please try again in a few minutes."
            )

Production Readiness Checklist

Before shipping your agent to production, verify every item on this checklist:

Security

  • Input validation on all user-facing endpoints
  • Output filtering for PII and secrets
  • Tool call validation with risk-level classification
  • API keys stored in secrets manager, not in code or environment files
  • Non-root container user

Reliability

  • Retry logic with exponential backoff for LLM API calls
  • Model fallback chain (primary to secondary to error message)
  • Circuit breaker for external tool calls
  • Timeout limits on all agent executions
  • Health check endpoint
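The circuit-breaker item in the checklist above can be sketched as a small wrapper around tool calls; the threshold and cool-down values are illustrative defaults, not recommendations:

```python
import time

class CircuitBreaker:
    """Stop calling a failing tool for a cool-down period after repeated errors."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            # Open: fail fast instead of hammering a broken dependency
            raise RuntimeError("circuit open: tool temporarily disabled")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # open (or re-open after a failed trial call)
            raise
        # Success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

Wrap each external tool in its own breaker so one failing integration cannot drag down the rest of the agent.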

Observability

  • Structured logging for all agent events
  • End-to-end tracing (LangSmith, Langfuse, or equivalent)
  • Metrics dashboard with key agent KPIs
  • Alerting configured for error rates, latency, and costs

Cost Control

  • Token budget per request
  • Rate limiting per user and globally
  • Model routing based on task complexity
  • Response caching for common queries
  • Monitoring of daily/weekly spend with alerts
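The per-user rate-limiting item above can be sketched as a fixed-window counter; `FixedWindowRateLimiter` and its key scheme are illustrative, and the store can be any object with `incr`/`expire` (a `redis.Redis` client in production):

```python
import time

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user per window; counters live in the store."""

    def __init__(self, store, limit: int = 20, window_seconds: int = 60):
        # store: anything with incr/expire, e.g. a redis.Redis client
        self.store = store
        self.limit = limit
        self.window = window_seconds

    def allow(self, user_id: str) -> bool:
        window_id = int(time.time()) // self.window
        key = f"ratelimit:{user_id}:{window_id}"
        count = self.store.incr(key)
        if count == 1:
            # First hit in this window: schedule cleanup
            self.store.expire(key, self.window)
        return count <= self.limit
```

Fixed windows allow brief bursts at window boundaries; a sliding-window or token-bucket variant smooths that out if it matters for your workload.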

Testing

  • Unit tests for tools and guardrails
  • Integration tests with mock LLMs
  • End-to-end evaluation benchmark suite
  • Regression tests running in CI/CD
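The mock-LLM item above can be sketched with a scripted stand-in model; `FakeLLM` and the toy `run_agent` loop are illustrative (a real agent would parse tool calls, manage history, and so on):

```python
class FakeLLM:
    """Deterministic stand-in for the real model in integration tests."""

    def __init__(self, scripted_replies: list[str]):
        self.replies = iter(scripted_replies)
        self.calls: list[str] = []

    def invoke(self, prompt: str) -> str:
        # Record the prompt so tests can assert on what the agent sent
        self.calls.append(prompt)
        return next(self.replies)

def run_agent(llm, message: str) -> str:
    # Toy agent loop for illustration only
    return llm.invoke(f"User: {message}")

def test_agent_uses_llm():
    llm = FakeLLM(["Paris is the capital of France."])
    answer = run_agent(llm, "capital of France?")
    assert "Paris" in answer
    assert llm.calls == ["User: capital of France?"]
```

Scripted replies make tests fast, free, and deterministic, and the recorded `calls` list lets you assert on prompts without hitting a live API.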

Common Mistakes to Avoid

  • Running agents as synchronous API calls instead of background tasks, leading to timeouts
  • Storing agent state in memory instead of an external store, breaking horizontal scaling
  • Not implementing retry logic for LLM API calls — rate limits and transient failures are common
  • Deploying without cost controls and waking up to a surprise bill
  • Skipping the production readiness checklist and fixing issues reactively instead of proactively
  • Not testing the deployment pipeline itself — your agent works locally but fails in the container
