Level 4 · Lesson 35 · ⏱️ 90 min

Production AI Systems

Shipping Claude to real users: caching, rate limits, error handling, monitoring, and cost control.

The Gap Between Demo and Production

Your Claude prototype works great on your laptop. Production is different: real users send unexpected inputs, the API has rate limits, costs multiply, and when something breaks at 2am you need to know about it. This lesson covers everything between "it works" and "it ships."

Production checklist overview:

Prompt caching - cut costs by up to 90% on repeated context
Rate limit handling - exponential backoff, queuing
Error handling - retry logic, graceful degradation
Observability - logging, latency tracking, error alerts
Cost control - per-user limits, budget alerts

Prompt Caching

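If your system prompt or RAG context is long and repeated across requests, prompt caching can cut the cost of those input tokens by up to 90%. The first request writes the prefix to the cache, billed at a premium over the normal input price; later requests that start with the identical prefix read it from the cache at a steep discount.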

import anthropic
client = anthropic.Anthropic()

LARGE_SYSTEM = """You are an expert customer support agent for Acme Corp.
[... 5,000 words of product documentation, FAQ, policies ...]
"""

def support_reply(user_message: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM,
                "cache_control": {"type": "ephemeral"}  # cache this!
            }
        ],
        messages=[{"role": "user", "content": user_message}]
    )
    # Check cache hit in response
    usage = response.usage
    print(f"Cache read: {usage.cache_read_input_tokens} tokens")
    print(f"Cache write: {usage.cache_creation_input_tokens} tokens")
    return response.content[0].text

# First call: cache WRITE (system prompt billed at the cache-write rate,
#   slightly above the normal input price)
# Subsequent calls: cache READ (up to 90% discount on system prompt tokens)

When to use prompt caching:

System prompt over 1,024 tokens (a common minimum cacheable size; the exact minimum varies by model)
RAG context that's the same across multiple user turns
Few-shot examples that don't change per request
Cache TTL is 5 minutes by default and is refreshed on every cache hit - keep requests coming to maintain the cache
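To see whether caching pays off for a given traffic pattern, the arithmetic can be sketched as below. This is a rough estimate, not a billing calculator: `estimate_input_cost` is a hypothetical helper, the 1.25x write and 0.10x read multipliers are assumptions to check against current pricing, and the $10 per million tokens base price is a placeholder.

```python
def estimate_input_cost(prefix_tokens: int,
                        requests: int,
                        base_price_per_mtok: float,
                        write_multiplier: float = 1.25,  # assumed cache-write premium
                        read_multiplier: float = 0.10):  # assumed cache-read rate
    """Return (uncached_usd, cached_usd) for the shared prefix only.

    Assumes requests >= 1 and that every request after the first
    arrives before the cache entry expires.
    """
    per_token = base_price_per_mtok / 1_000_000
    uncached = prefix_tokens * requests * per_token
    # One cache write on the first request, cache reads on all the others.
    cached = prefix_tokens * per_token * (
        write_multiplier + read_multiplier * (requests - 1)
    )
    return uncached, cached


# Placeholder numbers: a 7,000-token prefix sent 100 times.
uncached, cached = estimate_input_cost(
    prefix_tokens=7_000, requests=100, base_price_per_mtok=10.0
)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

With a single request the cached path costs more than the uncached one, because the write premium is never recovered. Caching only helps when the prefix is actually reused within the TTL.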

Rate Limits & Retry Logic

import time
import anthropic
from anthropic import RateLimitError, APIStatusError

client = anthropic.Anthropic()

def call_with_retry(max_retries: int = 5, **kwargs) -> str:
    """Call messages.create, retrying rate limits and 5xx errors with backoff."""
    for attempt in range(max_retries):
        try:
            r = client.messages.create(**kwargs)
            return r.content[0].text

        # RateLimitError subclasses APIStatusError, so catch it first.
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt   # 1, 2, 4, 8 seconds with the default 5 attempts
            print(f"Rate limited. Waiting {wait}s (attempt {attempt+1})")
            time.sleep(wait)

        except APIStatusError as e:
            # 4xx - don't retry (bad request, auth, etc.).
            # Also re-raise on the final attempt so the real error isn't lost.
            if e.status_code < 500 or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # server error - retry

    raise RuntimeError("Max retries exceeded")  # only reached if max_retries <= 0
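Plain powers of two make every client that was rate limited at the same moment retry at the same moment too. A common refinement is to add random jitter and cap the delay. The sketch below shows one way to compute such a delay; `backoff_delay`, its parameters and the 30-second cap are illustrative choices, not part of the Anthropic SDK.

```python
import random


def backoff_delay(attempt: int,
                  base: float = 1.0,
                  cap: float = 30.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)].

    `rng` is injectable so the function can be tested deterministically.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


# Print one sample delay for each of the first five attempts.
for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(30.0, 2 ** attempt)}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

A retry loop would call `time.sleep(backoff_delay(attempt))` in place of `time.sleep(2 ** attempt)`. The official Python SDK also retries some failures on its own (configurable through the client's `max_retries` option), so it is worth checking how the two layers of retries combine before shipping.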

Observability: What to Log

# Every production Claude call should log:
{
  "request_id": "uuid",
  "timestamp": "2025-01-15T10:23:11Z",
  "user_id": "user_abc",
  "feature": "support_chat",
  "model": "claude-opus-4-8",
  "input_tokens": 1240,
  "output_tokens": 312,
  "cache_read_tokens": 980,       # how much was cached
  "latency_ms": 1842,
  "cost_usd": 0.000823,
  "stop_reason": "end_turn",      # or "max_tokens" (bad!)
  "error": null                   # or error message
}

# Alert on:
# - stop_reason == "max_tokens" (increase max_tokens or truncate input)
# - latency_ms > 10000 (p99 spike - investigate)
# - error rate > 1% (API issues or prompt problems)
# - daily cost > threshold (budget breach)
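The alert rules above can be turned into a small check that runs over each log record. This is a sketch under assumptions: `alerts_for` and `error_rate` are hypothetical helpers, the record is a plain dict shaped like the example log entry, and the alert names are made up for illustration.

```python
def alerts_for(record: dict, latency_threshold_ms: int = 10_000) -> list:
    """Return the names of the per-request alerts that a log record triggers."""
    alerts = []
    if record.get("stop_reason") == "max_tokens":
        alerts.append("truncated_response")  # output hit the max_tokens cap
    if record.get("latency_ms", 0) > latency_threshold_ms:
        alerts.append("high_latency")
    if record.get("error") is not None:
        alerts.append("error")
    return alerts


def error_rate(records: list) -> float:
    """Fraction of records with a non-null error (0.0 for an empty batch)."""
    if not records:
        return 0.0
    failed = sum(1 for r in records if r.get("error") is not None)
    return failed / len(records)


# Example records (made-up values).
batch = [
    {"stop_reason": "end_turn", "latency_ms": 1842, "error": None},
    {"stop_reason": "max_tokens", "latency_ms": 12000, "error": None},
]
for rec in batch:
    print(alerts_for(rec))
print(f"error rate: {error_rate(batch):.1%}")
```

Per-request checks like these cover truncation and slow calls. The error-rate and daily-cost thresholds are aggregates, so they need a window of records rather than a single one.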

Per-User Rate Limiting

# Simple Redis-based fixed-window rate limiter (or use Upstash free tier)
import redis

r = redis.Redis(host="localhost", port=6379)

def check_rate_limit(user_id: str,
                     limit: int = 20,
                     window_seconds: int = 60) -> bool:
    """Return True if request is allowed, False if rate limited."""
    key = f"ratelimit:{user_id}"
    pipe = r.pipeline()
    # Create the counter with a TTL only if it doesn't exist yet, so the
    # window starts at the first request and later requests don't extend it.
    pipe.set(key, 0, ex=window_seconds, nx=True)
    pipe.incr(key)  # INCR keeps the existing TTL
    _, count = pipe.execute()
    return count <= limit

# In your API handler:
def handle_chat(user_id: str, message: str):
    # Always return (body, status) so callers get a consistent shape
    if not check_rate_limit(user_id):
        return {"error": "Too many requests. Try again in a minute."}, 429
    reply = call_with_retry(model="claude-opus-4-8",
                            max_tokens=512,
                            messages=[{"role": "user",
                                       "content": message}])
    return {"reply": reply}, 200

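The checklist also names cost control (per-user limits, budget alerts), which a request counter alone does not cover. Below is a minimal in-memory sketch of a per-user daily spend cap. `DailyBudget`, its methods and the $1.00 limit are hypothetical, and a real deployment would likely keep the totals in a shared store such as Redis rather than in process memory.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class DailyBudget:
    """Hypothetical per-user daily spend cap, kept in process memory."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self._spent = defaultdict(float)  # (user_id, day) -> USD spent

    def record(self, user_id: str, cost_usd: float,
               day: Optional[date] = None) -> None:
        """Add the cost of a finished request to the user's total for the day."""
        self._spent[(user_id, day or date.today())] += cost_usd

    def allowed(self, user_id: str, day: Optional[date] = None) -> bool:
        """True while the user's spend for the day is below the limit."""
        key = (user_id, day or date.today())
        return self._spent.get(key, 0.0) < self.limit_usd


budget = DailyBudget(limit_usd=1.00)
budget.record("user_abc", 0.40)
print(budget.allowed("user_abc"))
```

In a handler, `allowed` would be checked before calling the model and `record` called afterwards with the cost computed from the response's token usage.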
Graceful Degradation

When Claude is slow or unavailable, have a fallback plan:

Cached responses - for high-traffic, repeated queries, cache Claude's output
Haiku fallback - if Opus is slow, retry with Haiku at lower quality
Queue + async - for non-realtime tasks, queue the job and notify when done
Partial responses - stream partial output so users see progress even if it's slow
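The first two fallbacks can be combined into one small routine: serve a cached reply if there is one, otherwise try each backend in order and fall back to a canned message. This is a sketch with stand-in pieces: `reply_with_fallback` is a hypothetical helper, the backends are plain callables (in practice they might wrap calls to different models), and the dict stands in for a real cache.

```python
from typing import Callable, Optional, Sequence


def reply_with_fallback(message: str,
                        backends: Sequence[Callable[[str], str]],
                        cache: Optional[dict] = None,
                        default: str = "Sorry, please try again shortly.") -> str:
    """Return a cached reply, else the first backend reply that succeeds."""
    if cache is not None and message in cache:
        return cache[message]
    for backend in backends:
        try:
            reply = backend(message)
        except Exception:
            continue  # this backend failed; try the next one
        if cache is not None:
            cache[message] = reply
        return reply
    return default  # every backend failed


# Stub backends for illustration only.
def primary(message: str) -> str:
    raise TimeoutError("primary model timed out")

def cheaper(message: str) -> str:
    return f"(fallback model) echo: {message}"

print(reply_with_fallback("hello", [primary, cheaper], cache={}))
```

Catching every exception keeps the sketch short. Production code would more likely catch only timeouts and server errors, and log each failure so a silent fallback still shows up in monitoring.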

Lesson 35 Quick Reference

Prompt caching

cache_control: {"type": "ephemeral"} - up to 90% discount on repeated context

Cache TTL

5 minutes by default, refreshed on each cache hit - keep requests flowing to maintain cache hits

Retry logic

Exponential backoff: 2^attempt seconds; up to 5 attempts

RateLimitError

anthropic.RateLimitError - back off and retry; also retry APIStatusError with status_code >= 500

stop_reason max_tokens

Bad sign - response was cut off. Increase max_tokens or shorten input

Per-user limits

Redis INCR + EXPIRE for fixed-window rate limiting (set the expiry only when the window starts)
