Production Deployment
Ship agents to production with proper architecture, containerization, scaling, cost optimization, and reliability.
Prerequisites
1. Completed Guardrails, Observability, and Evaluation guides
2. Familiarity with Docker and container orchestration
3. Experience deploying web services to production
What you will learn
- Architecture patterns for production agent systems
- How to containerize and deploy agent services
- Scaling strategies for variable workloads
- Cost optimization techniques
- Error handling and graceful degradation
- A production readiness checklist
Architecture Considerations
A production agent system is more than just the agent code. Here is a reference architecture:
         +------------------+
         |   API Gateway    |
         |  (Rate Limiting) |
         +--------+---------+
                  |
         +--------+---------+
         |  Agent Service   |
         |   (Stateless)    |
         +--+------+-----+--+
            |      |     |
 +----------+   +--+--+  +---------+
 | LLM API  |   |Tools|  | State   |
 | (OpenAI/ |   |(MCP)|  | Store   |
 |Anthropic)|   +-----+  | (Redis) |
 +----+-----+            +---------+
      |
+-----+-----------+
|  Observability  |
|  (LangSmith/    |
|   Langfuse)     |
+-----------------+
Key principles:
- Stateless agent service — Store state in Redis or a database, not in memory. This allows horizontal scaling.
- Async processing — Long-running agent tasks should be processed via a job queue, not synchronous API calls.
- Separate tool services — Run MCP servers as separate services for independent scaling and deployment.
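The stateless principle can be sketched with a small session store: conversation state lives in Redis keyed by session ID, so any replica can serve any request. SessionStore and the session:&lt;id&gt; key scheme here are illustrative, not from any framework; client is any Redis-compatible object with get/set:

```python
import json


class SessionStore:
    """Keeps conversation state in an external store (e.g. Redis) so the
    agent service itself stays stateless and horizontally scalable."""

    def __init__(self, client, ttl_seconds: int = 3600):
        self.client = client  # any Redis-like client with get/set
        self.ttl = ttl_seconds

    def load(self, session_id: str) -> list:
        """Fetch the message history for a session, or [] if none exists."""
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else []

    def save(self, session_id: str, messages: list) -> None:
        """Persist the message history with a TTL so stale sessions expire."""
        self.client.set(f"session:{session_id}", json.dumps(messages), ex=self.ttl)
```

Because every replica reads and writes through the same store, a load balancer can route any request to any instance without sticky sessions.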
Containerization
Package your agent as a Docker container for consistent deployments:
# Dockerfile
FROM python:3.12-slim
WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY src/ ./src/
COPY prompts/ ./prompts/

# Non-root user for security
RUN useradd -m agent
USER agent

# Health check (python:3.12-slim does not ship curl, so use the stdlib)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
And the FastAPI service:
# src/main.py
from fastapi import FastAPI, BackgroundTasks, HTTPException
from pydantic import BaseModel
import uuid

app = FastAPI()


class AgentRequest(BaseModel):
    message: str
    user_id: str
    session_id: str | None = None


class AgentResponse(BaseModel):
    task_id: str
    status: str


@app.post("/agent/run")
async def run_agent(request: AgentRequest, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())
    # Process asynchronously for long-running tasks.
    # execute_agent_task is your app's agent runner; it stores the result by task_id.
    background_tasks.add_task(
        execute_agent_task, task_id, request.message, request.user_id
    )
    return AgentResponse(task_id=task_id, status="processing")


@app.get("/agent/status/{task_id}")
async def get_status(task_id: str):
    # get_task_result looks up the stored result (e.g. in Redis)
    result = await get_task_result(task_id)
    if not result:
        raise HTTPException(status_code=404, detail="Task not found")
    return result


@app.get("/health")
async def health():
    return {"status": "healthy"}
TypeScript Alternative (Next.js + Vercel AI SDK)
For TypeScript agents, a common deployment pattern uses Next.js API routes with the Vercel AI SDK:
// app/api/agent/route.ts (Next.js App Router)
import { streamText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

export async function POST(req: Request) {
  const { message } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    system: "You are a helpful assistant.",
    prompt: message,
    tools: {
      search: tool({
        description: "Search for information",
        parameters: z.object({
          query: z.string(),
        }),
        execute: async ({ query }) => {
          // Call your search API
          return `Results for: ${query}`;
        },
      }),
    },
    maxSteps: 5,
  });

  return result.toDataStreamResponse();
}
TypeScript Dockerfile for non-Vercel deployments:
# Dockerfile (TypeScript/Node.js)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so the runtime image stays small
RUN npm prune --omit=dev

FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER node
# Alpine's busybox provides wget, so no extra packages are needed
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD wget -qO- http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/index.js"]
Scaling Strategies
Agent workloads are bursty and variable. Here are scaling strategies:
Horizontal Scaling
Since the agent service is stateless, you can run multiple replicas behind a load balancer:
# docker-compose.yml
services:
  agent:
    build: .
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
    environment:
      - REDIS_URL=redis://redis:6379
      - OPENAI_API_KEY=${OPENAI_API_KEY}

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    depends_on:
      - agent

volumes:
  redis-data:
Queue-Based Processing
For variable workloads, use a job queue to decouple request intake from processing:
# Producer: API receives request and enqueues
import redis
import json

r = redis.Redis()


def enqueue_agent_task(task_id: str, message: str, user_id: str):
    task = {"task_id": task_id, "message": message, "user_id": user_id}
    r.lpush("agent:tasks", json.dumps(task))


# Consumer: Worker processes tasks from the queue
def worker_loop():
    while True:
        _, task_json = r.brpop("agent:tasks")  # blocks until a task arrives
        task = json.loads(task_json)
        result = run_agent(task["message"])  # your agent entry point
        r.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=3600)
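The worker loop above blocks forever and can die mid-shutdown when the orchestrator sends SIGTERM. One hedged refinement, assuming a POSIX environment: trap SIGTERM and give BRPOP a timeout so the loop can notice the shutdown flag between tasks. run_worker and handle_task are illustrative names, not part of any library:

```python
import json
import signal


def run_worker(r, handle_task, queue: str = "agent:tasks") -> None:
    """Pop tasks until SIGTERM arrives; the BRPOP timeout lets the loop
    re-check the shutdown flag every few seconds instead of blocking forever."""
    stopping = False

    def _request_stop(signum, frame):
        nonlocal stopping
        stopping = True

    signal.signal(signal.SIGTERM, _request_stop)

    while not stopping:
        popped = r.brpop(queue, timeout=5)  # returns None on timeout
        if popped is None:
            continue
        _, task_json = popped
        task = json.loads(task_json)
        result = handle_task(task)
        r.set(f"agent:result:{task['task_id']}", json.dumps(result), ex=3600)
```

Pair this with a `stop_grace_period` in your compose file or a `terminationGracePeriodSeconds` in Kubernetes that is longer than your slowest task.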
Cost Optimization
LLM API costs can grow quickly. Apply these optimization techniques:
- Model routing — Use cheaper models for simple tasks and expensive models only when needed:
def select_model(task_complexity: str) -> str:
    """Route to the appropriate model based on task complexity."""
    model_map = {
        "simple": "gpt-4o-mini",  # $0.15/1M input tokens
        "moderate": "gpt-4o",     # $2.50/1M input tokens
        "complex": "claude-sonnet-4-20250514",  # For hardest tasks
    }
    return model_map.get(task_complexity, "gpt-4o-mini")
TypeScript equivalent using the Vercel AI SDK:
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

function selectModel(complexity: "simple" | "moderate" | "complex") {
  const models = {
    simple: openai("gpt-4o-mini"),
    moderate: openai("gpt-4o"),
    complex: anthropic("claude-sonnet-4-20250514"),
  };
  return models[complexity];
}
- Caching — Cache common queries and tool results:
import hashlib


def cached_llm_call(prompt: str, model: str) -> str:
    # redis_client and llm are initialized elsewhere in the service
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    # Check cache first
    cached = redis_client.get(f"llm_cache:{cache_key}")
    if cached:
        return cached.decode()
    # Make the API call
    result = llm.invoke(prompt)
    # Cache for 1 hour
    redis_client.set(f"llm_cache:{cache_key}", result.content, ex=3600)
    return result.content
- Prompt optimization — Shorter prompts cost less. Remove redundant instructions and use concise examples.
- Token limits — Set max_tokens on every API call to prevent runaway generation.
- Batch processing — For non-real-time tasks, use batch API endpoints that offer significant discounts.
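The token-limit idea extends naturally to a per-request budget: track cumulative usage across an agent's steps and cap each call's max_tokens at whatever remains. TokenBudget is an illustrative sketch, not a library class:

```python
class TokenBudget:
    """Tracks token spend across an agent loop and stops it before overrun."""

    def __init__(self, max_total_tokens: int):
        self.max_total = max_total_tokens
        self.used = 0

    def remaining(self) -> int:
        """Tokens still available for this request."""
        return max(self.max_total - self.used, 0)

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Call after each LLM response, using the usage stats the API returns."""
        self.used += prompt_tokens + completion_tokens

    def exhausted(self) -> bool:
        return self.used >= self.max_total


# In the agent loop, cap each call at min(remaining budget, per-call limit)
budget = TokenBudget(max_total_tokens=8000)
per_call_cap = 1024
max_tokens = min(budget.remaining(), per_call_cap)
```

When the budget is exhausted, end the agent loop with a best-effort answer rather than issuing another call.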
Error Handling and Graceful Degradation
Production agents must handle failures gracefully:
import time
from functools import wraps

# Exception types from the openai v1 SDK; substitute your provider's equivalents
from openai import APIConnectionError, APIError, APITimeoutError, RateLimitError


class MaxRetriesExceeded(Exception):
    """Raised when every retry attempt has failed."""


def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    # Always back off on rate limits
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
                except (APIConnectionError, APITimeoutError):
                    if attempt == max_retries - 1:
                        raise
                    delay = base_delay * (2 ** attempt)
                    time.sleep(delay)
            raise MaxRetriesExceeded(f"Failed after {max_retries} attempts")
        return wrapper
    return decorator


class AgentWithFallback:
    """Agent with graceful degradation."""

    def __init__(self):
        self.primary_model = "gpt-4o"
        self.fallback_model = "gpt-4o-mini"

    @retry_with_backoff(max_retries=3)
    def run(self, message: str) -> str:
        # _run_with_model wraps your LLM client call for the given model
        try:
            return self._run_with_model(message, self.primary_model)
        except (RateLimitError, APIError):
            # Fall back to cheaper, more available model
            return self._run_with_model(message, self.fallback_model)
        except Exception:
            # Last resort: return a helpful error message
            return (
                "I apologize, but I am currently unable to process your request. "
                "Please try again in a few minutes."
            )
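Retries and fallbacks cover the LLM API; external tool calls benefit from a circuit breaker, which stops hammering a failing dependency and fails fast until a cooldown elapses. A minimal sketch (the threshold and cooldown values are illustrative defaults, not recommendations):

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then rejects
    calls immediately until `cooldown_seconds` have passed."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Fail fast instead of waiting on a broken dependency
                raise RuntimeError("circuit open: tool temporarily disabled")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrap each tool's network call in its own breaker so one degraded tool does not disable the rest, and surface the open-circuit error to the agent so it can explain the limitation to the user.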
Production Readiness Checklist
Before shipping your agent to production, verify every item on this checklist:
Security
- Input validation on all user-facing endpoints
- Output filtering for PII and secrets
- Tool call validation with risk-level classification
- API keys stored in secrets manager, not in code or environment files
- Non-root container user
Reliability
- Retry logic with exponential backoff for LLM API calls
- Model fallback chain (primary to secondary to error message)
- Circuit breaker for external tool calls
- Timeout limits on all agent executions
- Health check endpoint
Observability
- Structured logging for all agent events
- End-to-end tracing (LangSmith, Langfuse, or equivalent)
- Metrics dashboard with key agent KPIs
- Alerting configured for error rates, latency, and costs
Cost Control
- Token budget per request
- Rate limiting per user and globally
- Model routing based on task complexity
- Response caching for common queries
- Monitoring of daily/weekly spend with alerts
Testing
- Unit tests for tools and guardrails
- Integration tests with mock LLMs
- End-to-end evaluation benchmark suite
- Regression tests running in CI/CD
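For the "integration tests with mock LLMs" item, injecting a scripted fake client is often enough to exercise the agent's control flow without cost or flakiness. FakeLLM and the run_agent wiring here are illustrative, not from any test framework:

```python
class FakeLLM:
    """Returns scripted responses so tests are fast, free, and deterministic."""

    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = []  # record prompts for assertions

    def invoke(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.responses.pop(0)


def run_agent(llm, message: str) -> str:
    # Minimal stand-in for the real agent loop, which would accept
    # any client exposing invoke()
    return llm.invoke(f"User: {message}")


def test_agent_answers():
    llm = FakeLLM(["Paris"])
    assert run_agent(llm, "Capital of France?") == "Paris"
    # Also assert on what the agent actually sent to the model
    assert llm.calls == ["User: Capital of France?"]
```

Because the fake records every prompt, the same pattern also lets regression tests assert that prompt changes were intentional.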
Common Mistakes to Avoid
- Running agents as synchronous API calls instead of background tasks, leading to timeouts
- Storing agent state in memory instead of an external store, breaking horizontal scaling
- Not implementing retry logic for LLM API calls — rate limits and transient failures are common
- Deploying without cost controls and waking up to a surprise bill
- Skipping the production readiness checklist and fixing issues reactively instead of proactively
- Not testing the deployment pipeline itself — your agent works locally but fails in the container