Production AI Systems
Shipping Claude to real users: caching, rate limits, error handling, monitoring, and cost control.
The Gap Between Demo and Production
Your Claude prototype works great on your laptop. Production is different: real users send unexpected inputs, the API has rate limits, costs multiply, and when something breaks at 2am you need to know about it. This lesson covers everything between "it works" and "it ships."
- Prompt caching — cut costs by up to 90% on repeated context
- Rate limit handling — exponential backoff, queuing
- Error handling — retry logic, graceful degradation
- Observability — logging, latency tracking, error alerts
- Cost control — per-user limits, budget alerts
Prompt Caching
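If your system prompt or RAG context is long and repeated across requests, prompt caching can cut the cost of those input tokens by up to 90%. The first request writes the cache and is billed slightly above the normal input rate (1.25x for the default 5-minute cache); later requests that share the same prefix read from the cache at about a tenth of the normal rate.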
import anthropic

client = anthropic.Anthropic()

LARGE_SYSTEM = """You are an expert customer support agent for Acme Corp.
[... 5,000 words of product documentation, FAQ, policies ...]
"""

def support_reply(user_message: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM,
                "cache_control": {"type": "ephemeral"}  # cache this!
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    # Check cache hit in response
    usage = response.usage
    print(f"Cache read: {usage.cache_read_input_tokens} tokens")
    print(f"Cache write: {usage.cache_creation_input_tokens} tokens")
    return response.content[0].text

# First call: cache WRITE (system prompt billed at a small premium over the base input price)
# Subsequent calls: cache READ (90% discount on system prompt tokens)

- System prompt over 1,024 tokens (the minimum cacheable size for many models; check the limit for the model you use)
- RAG context that's the same across multiple user turns
- Few-shot examples that don't change per request
- Cache TTL is 5 minutes — keep requests coming to maintain the cache
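To make the savings concrete, here is a rough cost calculation. It is a sketch under stated assumptions: the 1.25x write and 0.1x read multipliers reflect the default 5-minute cache pricing as described above, while the function name, the token counts, and the $5 per million tokens base price are placeholders chosen for illustration.

```python
def input_cost_usd(uncached_tokens: int,
                   cache_write_tokens: int,
                   cache_read_tokens: int,
                   base_price_per_mtok: float) -> float:
    """Estimate the input-side cost of one request.

    Assumes cache writes are billed at 1.25x and cache reads at 0.1x
    the base input price; verify against current pricing before use.
    """
    write_multiplier = 1.25
    read_multiplier = 0.10
    per_token = base_price_per_mtok / 1_000_000
    return per_token * (uncached_tokens
                        + cache_write_tokens * write_multiplier
                        + cache_read_tokens * read_multiplier)


# Hypothetical workload: a 7,000-token system prompt, a 50-token user
# message, and a placeholder base price of $5 per million input tokens.
PRICE = 5.0
first_call = input_cost_usd(50, 7_000, 0, PRICE)   # writes the cache
later_call = input_cost_usd(50, 0, 7_000, PRICE)   # reads the cache
no_cache = input_cost_usd(7_050, 0, 0, PRICE)      # caching disabled
```

Under these assumptions the first call costs about 25% more than an uncached call, and every cache hit afterwards costs roughly a tenth as much, so caching pays for itself from the second request onward.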
Rate Limits & Retry Logic
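The retry wrapper in this section waits exactly 2^attempt seconds between attempts. A common refinement, offered here as an assumption rather than something the lesson prescribes, is to cap the delay and add random jitter so that many clients do not retry in lockstep. The function name and its `base`, `cap`, and `rng` parameters are illustrative.

```python
import random
from typing import Optional


def backoff_delay(attempt: int,
                  base: float = 1.0,
                  cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a "full jitter" delay in seconds for a zero-based attempt.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

To use it, replace a fixed `time.sleep(2 ** attempt)` with `time.sleep(backoff_delay(attempt))`. Passing a seeded `random.Random` makes the delays reproducible in tests.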
import time

import anthropic
from anthropic import APIStatusError, RateLimitError

client = anthropic.Anthropic()

def call_with_retry(max_retries: int = 5, **kwargs) -> str:
    """Call the Messages API, retrying rate limits and 5xx errors with backoff."""
    for attempt in range(max_retries):
        try:
            r = client.messages.create(**kwargs)
            return r.content[0].text
        except RateLimitError:
            # Must be caught before APIStatusError, which is its parent class.
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1, 2, 4, 8 seconds between the 5 attempts
            print(f"Rate limited. Waiting {wait}s (attempt {attempt + 1})")
            time.sleep(wait)
        except APIStatusError as e:
            # 4xx: don't retry (bad request, auth, etc.). Also give up on the
            # last attempt so the caller sees the real error.
            if e.status_code < 500 or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 5xx server error: back off, then retry
    raise RuntimeError("Max retries exceeded")

Observability: What to Log
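A minimal sketch of how the log record and the per-request alert checks described in this section might be assembled in Python. The function names (`build_log_record`, `alerts_for`) and the alert labels are illustrative assumptions, not part of any SDK. `cost_usd` is left out because it depends on current pricing, and the error-rate and daily-cost alerts are omitted because they need aggregation across many requests.

```python
import uuid
from datetime import datetime, timezone
from typing import List, Optional


def build_log_record(user_id: str, feature: str, model: str,
                     input_tokens: int, output_tokens: int,
                     cache_read_tokens: int, latency_ms: int,
                     stop_reason: Optional[str],
                     error: Optional[str] = None) -> dict:
    """Assemble one structured log entry for a single model call."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "feature": feature,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": cache_read_tokens,
        "latency_ms": latency_ms,
        "stop_reason": stop_reason,
        "error": error,
    }


def alerts_for(record: dict, latency_threshold_ms: int = 10_000) -> List[str]:
    """Return the per-request alert labels that this record triggers."""
    alerts = []
    if record["stop_reason"] == "max_tokens":
        alerts.append("truncated_output")
    if record["latency_ms"] > latency_threshold_ms:
        alerts.append("slow_request")
    if record["error"] is not None:
        alerts.append("error")
    return alerts
```

In practice the record would be serialized with `json.dumps` and sent to whatever logging pipeline is already in place; the token counts come from `response.usage` and the latency from a timer wrapped around the API call.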
# Every production Claude call should log:
log_record = {
    "request_id": "uuid",
    "timestamp": "2025-01-15T10:23:11Z",
    "user_id": "user_abc",
    "feature": "support_chat",
    "model": "claude-opus-4-5",
    "input_tokens": 1240,
    "output_tokens": 312,
    "cache_read_tokens": 980,    # how much was cached
    "latency_ms": 1842,
    "cost_usd": 0.000823,
    "stop_reason": "end_turn",   # or "max_tokens" (bad!)
    "error": None                # or error message
}

# Alert on:
# - stop_reason == "max_tokens" (increase max_tokens or truncate input)
# - latency_ms > 10000 (p99 spike: investigate)
# - error rate > 1% (API issues or prompt problems)
# - daily cost > threshold (budget breach)

Per-User Rate Limiting
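The Redis limiter in this section counts requests per user per time window. The same logic can be sketched in memory with an injectable clock, which is handy for unit tests or a single-process app. The class name and its parameters are illustrative assumptions, not part of the lesson's code.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory fixed-window counter: at most `limit` requests per window."""

    def __init__(self, limit: int = 20, window_seconds: int = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # user_id -> (window start time, requests seen in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(user_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has ended: start a fresh one.
            start, count = now, 0
        count += 1
        self._windows[user_id] = (start, count)
        return count <= self.limit
```

This version is not shared across processes or safe under concurrent threads, which is why a production deployment with multiple workers would still want Redis or a similar shared store.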
# Simple Redis-based rate limiter (or use Upstash free tier)
import redis

r = redis.Redis(host="localhost", port=6379)

def check_rate_limit(user_id: str,
                     limit: int = 20,
                     window_seconds: int = 60) -> bool:
    """Return True if request is allowed, False if rate limited."""
    key = f"ratelimit:{user_id}"
    count = r.incr(key)
    if count == 1:
        # First request in this window: start the timer. Calling EXPIRE on
        # every request would keep pushing the reset time back, so a user
        # who keeps sending would never be unblocked.
        r.expire(key, window_seconds)
    return count <= limit

# In your API handler:
def handle_chat(user_id: str, message: str):
    if not check_rate_limit(user_id):
        return {"error": "Too many requests. Try again in a minute."}, 429
    return {"reply": call_with_retry(model="claude-opus-4-5",
                                     max_tokens=512,
                                     messages=[{"role": "user",
                                                "content": message}])}

Graceful Degradation
- Cached responses — for high-traffic, repeated queries, cache Claude's output
- Haiku fallback — if Opus is slow, retry with Haiku at lower quality
- Queue + async — for non-realtime tasks, queue the job and notify when done
- Partial responses — stream partial output so users see progress even if it's slow
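The first two strategies above (cached responses and a cheaper-model fallback) can be combined in one small helper. This is a sketch under assumptions: `answer_with_degradation`, the `providers` list, and the fallback message are hypothetical names, and each provider stands in for a function that calls one model (for example, one wrapping an Opus request and one wrapping a Haiku request).

```python
from typing import Callable, Dict, Sequence

FALLBACK_MESSAGE = "Sorry, we're having trouble right now. Please try again shortly."


def answer_with_degradation(prompt: str,
                            cache: Dict[str, str],
                            providers: Sequence[Callable[[str], str]]) -> str:
    """Serve from cache if possible, else try each provider in order."""
    if prompt in cache:
        return cache[prompt]
    for provider in providers:
        try:
            reply = provider(prompt)
        except Exception:
            # Illustration only: real code should catch specific API or
            # timeout errors and log the failure before moving on.
            continue
        cache[prompt] = reply
        return reply
    # Every provider failed: degrade to a static message, not a stack trace.
    return FALLBACK_MESSAGE
```

The plain dict works for a demo; a shared store with expiry would be the more realistic choice when several workers serve the same traffic. Exact-match caching like this only helps for repeated, identical queries.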
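- Prompt caching: add cache_control: {type: "ephemeral"} to the repeated prefix for up to a 90% discount on cached input tokens
- Cache TTL: 5 minutes; keep requests flowing to maintain cache hits
- Retries: exponential backoff of 2^attempt seconds, up to 5 attempts
- Retryable errors: anthropic.RateLimitError (back off) and APIStatusError with status_code >= 500
- stop_reason == "max_tokens": a bad sign, the response was cut off. Increase max_tokens or shorten the input
- Per-user rate limiting: Redis INCR + EXPIRE gives a fixed-window counter (not a sliding window)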