Level 4 · Lesson 37 · ⏱️ 90 min

Advanced Prompt Evaluation

Stop guessing whether your prompts are good. Build a test suite and measure it - like real engineering.

Why Evaluation Matters

Most teams tune prompts by feel - tweak, test manually with 3 examples, ship. This works until it doesn't: a prompt change that improves one case breaks five others. Evaluation (evals) catches regressions, measures improvement, and gives you confidence to iterate fast.

What evals give you:

Quantitative score for your prompt before and after changes
Regression detection - did prompt v2 break anything v1 handled?
Model comparison - is Haiku good enough, or do you need Sonnet?
Coverage - are edge cases handled, not just happy path?

Building an Eval Dataset

An eval dataset is a list of (input, expected_output) pairs. Start small - 20 good examples beat 200 mediocre ones.

# evals/dataset.json
[
  {
    "id": "triage_001",
    "input": "My invoice from last month is wrong - I was charged twice",
    "expected": {"category": "billing", "priority": "high"},
    "tags": ["billing", "duplicate_charge"]
  },
  {
    "id": "triage_002",
    "input": "How do I export my data to CSV?",
    "expected": {"category": "technical", "priority": "low"},
    "tags": ["technical", "data_export"]
  },
  {
    "id": "triage_003",
    "input": "Your app crashes every time I open it on iOS 17",
    "expected": {"category": "technical", "priority": "high"},
    "tags": ["technical", "crash", "mobile"]
  }
  // ... 17 more examples covering edge cases
]
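Before running anything against the dataset, it can help to check that the file itself is well-formed. The following is a minimal sketch, assuming the `id` / `input` / `expected` keys shown above; `validate_dataset` and `REQUIRED_KEYS` are hypothetical names, not part of any library. Note that strict JSON has no comment syntax, so the `// ...` placeholder above would have to be removed from a real file before `json.load` accepts it.

```python
import json

REQUIRED_KEYS = {"id", "input", "expected"}  # assumed schema, taken from the examples above

def validate_dataset(examples: list) -> list:
    """Return a list of problems; an empty list means no issues were found."""
    problems = []
    seen_ids = set()
    for i, ex in enumerate(examples):
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            problems.append(f"example {i}: missing {sorted(missing)}")
        ex_id = ex.get("id")
        if ex_id is not None:
            if ex_id in seen_ids:
                problems.append(f"example {i}: duplicate id {ex_id!r}")
            seen_ids.add(ex_id)
    return problems

# Illustrative input only: the second entry reuses an id and lacks "expected"
sample = json.loads("""[
  {"id": "triage_001", "input": "I was charged twice",
   "expected": {"category": "billing", "priority": "high"}},
  {"id": "triage_001", "input": "How do I export my data to CSV?"}
]""")
problems = validate_dataset(sample)
for p in problems:
    print(p)
```

Running a check like this before the eval keeps schema mistakes from showing up later as confusing eval failures.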

What makes a good eval dataset:

Cover your real distribution - sample from actual production data
Include edge cases explicitly (ambiguous, short, foreign language inputs)
Include "hard negatives" - inputs that look like one category but are another
Tag examples by type so you can see failure patterns per category
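The tags pay off once you break accuracy down per tag. Here is a minimal sketch, assuming each example has a `tags` list as in the dataset above and that you already know which example ids passed; `accuracy_by_tag` and `passed_ids` are hypothetical names, and the sample data is illustrative only.

```python
from collections import defaultdict

def accuracy_by_tag(dataset: list, passed_ids: set) -> dict:
    """Return {tag: accuracy} so weak categories stand out."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for example in dataset:
        for tag in example.get("tags", []):
            totals[tag] += 1
            if example["id"] in passed_ids:
                correct[tag] += 1
    return {tag: correct[tag] / totals[tag] for tag in totals}

# Illustrative data only: same shape as evals/dataset.json
sample = [
    {"id": "triage_001", "tags": ["billing", "duplicate_charge"]},
    {"id": "triage_002", "tags": ["technical", "data_export"]},
    {"id": "triage_003", "tags": ["technical", "crash", "mobile"]},
]
by_tag = accuracy_by_tag(sample, passed_ids={"triage_001", "triage_002"})
for tag, acc in sorted(by_tag.items()):
    print(f"{tag}: {acc:.0%}")
```

A tag with noticeably lower accuracy than the rest is usually the first place to add examples or adjust the prompt.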

Running Evals: Exact Match + LLM-as-Judge

import json

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

def run_prompt(user_input: str) -> dict:
    """The function under test"""
    r = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=128,
        system="""Classify the support ticket. Return JSON only:
{"category": "billing|technical|feature_request|other",
 "priority": "high|medium|low"}""",
        messages=[{"role": "user", "content": user_input}]
    )
    return json.loads(r.content[0].text)

def exact_match_eval(dataset: list) -> dict:
    """For structured outputs with known ground truth"""
    results = {"total": len(dataset), "correct": 0, "failures": []}
    for example in dataset:
        try:
            predicted = run_prompt(example["input"])
        except json.JSONDecodeError as err:
            # Unparseable output is recorded as a failure, not a crash
            predicted = {"error": f"invalid JSON: {err}"}
        expected = example["expected"]
        # .get() so a missing key counts as a mismatch rather than a KeyError
        if (predicted.get("category") == expected["category"] and
            predicted.get("priority") == expected["priority"]):
            results["correct"] += 1
        else:
            results["failures"].append({
                "id": example["id"],
                "input": example["input"],
                "expected": expected,
                "got": predicted
            })
    # Guard against an empty dataset
    results["accuracy"] = (results["correct"] / results["total"]
                           if results["total"] else 0.0)
    return results
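Calling `json.loads` on the raw reply raises an error if the model wraps its JSON in a Markdown code fence or adds a sentence around it. Below is a minimal sketch of a more forgiving parser. `parse_json_reply` is a hypothetical helper, and taking the span from the first `{` to the last `}` is a simple heuristic under the assumption that the reply contains exactly one JSON object, not a full parser.

```python
import json

def parse_json_reply(text: str) -> dict:
    """Parse a single JSON object out of a model reply, tolerating surrounding text."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object found in reply: {text!r}")
    # json.JSONDecodeError is a subclass of ValueError, so callers can catch ValueError
    return json.loads(text[start:end + 1])

# Simulated reply wrapped in a code fence (built this way to avoid a literal
# fence inside this example)
fence = "`" * 3
wrapped = f'{fence}json\n{{"category": "billing", "priority": "high"}}\n{fence}'
print(parse_json_reply(wrapped))
```

Whether to be this lenient is a design choice: if the prompt says "JSON only", you may prefer to count any wrapped reply as a failure so the eval also measures format compliance.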

LLM-as-Judge (for Open-Ended Outputs)

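When outputs are prose (summaries, answers, emails), there's no single correct answer. Use Claude itself as the judge - with a well-structured rubric its scores can be consistent enough to track changes between prompt versions, though it is worth spot-checking a sample against your own ratings.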

def llm_judge(question: str, ideal_answer: str,
              actual_answer: str) -> dict:
    """Score actual_answer against ideal on a 1-5 scale."""
    prompt = f"""You are an expert evaluator. Score the ACTUAL answer
against the IDEAL answer on these dimensions (1-5 each):
- Accuracy: Does it contain correct information?
- Completeness: Does it cover the key points?
- Clarity: Is it clear and well-written?
- Conciseness: Does it avoid unnecessary filler?

Question: {question}
Ideal answer: {ideal_answer}
Actual answer: {actual_answer}

Return JSON only:
{{"accuracy": N, "completeness": N, "clarity": N,
  "conciseness": N, "overall": N, "reasoning": "one sentence"}}"""

    r = client.messages.create(
        model="claude-opus-4-8",   # use a strong model as judge
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(r.content[0].text)

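# Score an entire dataset.
# Note: run_prompt here stands for the open-ended prompt under test and must
# return the answer text (a str), not the triage dict from the earlier example.
scores = [
    llm_judge(ex["question"], ex["ideal"], run_prompt(ex["question"]))
    for ex in open_ended_dataset
]
avg_overall = sum(s["overall"] for s in scores) / len(scores)
print(f"Average score: {avg_overall:.2f}/5")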
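Averaging only `overall` hides which rubric dimension is dragging the score down. The sketch below averages each dimension separately, assuming every score dict carries the keys requested in the judge prompt; `mean_by_dimension` is a hypothetical helper and the sample scores are made up for illustration, not real judge output.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "clarity", "conciseness", "overall")

def mean_by_dimension(scores: list) -> dict:
    """Average each rubric dimension across all judged examples."""
    if not scores:
        raise ValueError("scores is empty")
    return {dim: mean(s[dim] for s in scores) for dim in DIMENSIONS}

# Illustrative judge outputs, made up for this example
sample_scores = [
    {"accuracy": 5, "completeness": 4, "clarity": 5, "conciseness": 4, "overall": 4},
    {"accuracy": 4, "completeness": 2, "clarity": 5, "conciseness": 4, "overall": 3},
]
averages = mean_by_dimension(sample_scores)
weakest = min(averages, key=averages.get)
print(averages)
print("Weakest dimension:", weakest)
```

In this made-up sample the weakest dimension is completeness, which would point you toward asking the prompt for fuller coverage rather than rewording for clarity.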

Prompt Versioning & CI

# prompts/v2_support_triage.py
SYSTEM_PROMPT = """..."""  # version controlled in Git

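# Run evals on every PR
# .github/workflows/evals.yml
# - python -m pytest evals/test_support_triage.py
#   (the 0.85 threshold is asserted in the test itself; pytest has no built-in
#   --threshold flag, so passing one needs a custom option in conftest.py)

# evals/test_support_triage.py
import pytest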
from evals.runner import exact_match_eval
from evals.dataset import load_dataset

def test_support_triage_accuracy():
    dataset = load_dataset("support_triage")
    results = exact_match_eval(dataset)
    assert results["accuracy"] >= 0.85, (
        f"Accuracy {results['accuracy']:.0%} below 85% threshold. "
        f"Failures: {results['failures']}"
    )

Eval workflow for prompt changes:

Run evals on current prompt (baseline score)
Make prompt changes in a branch
Run evals again - compare to baseline
Only merge if new score is equal or better
Add new failing cases to the dataset before fixing them
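The "compare to baseline" step is more useful when it looks at individual cases, not just the headline number. Here is a minimal sketch, assuming both runs produced dicts in the shape returned by `exact_match_eval` (an `accuracy` value and a `failures` list with `id` fields); `find_regressions` is a hypothetical helper and the two result dicts are illustrative placeholders.

```python
def find_regressions(baseline: dict, candidate: dict) -> dict:
    """Compare two exact_match_eval-style result dicts case by case."""
    base_failed = {f["id"] for f in baseline["failures"]}
    cand_failed = {f["id"] for f in candidate["failures"]}
    return {
        "accuracy_delta": candidate["accuracy"] - baseline["accuracy"],
        "regressed": sorted(cand_failed - base_failed),  # passed before, fail now
        "fixed": sorted(base_failed - cand_failed),      # failed before, pass now
    }

# Illustrative results only (ids and scores are made up)
baseline = {
    "accuracy": 0.85,
    "failures": [{"id": "triage_004"}, {"id": "triage_009"}, {"id": "triage_015"}],
}
candidate = {
    "accuracy": 0.90,
    "failures": [{"id": "triage_009"}, {"id": "triage_018"}],
}
report = find_regressions(baseline, candidate)
print(f"Accuracy change: {report['accuracy_delta']:+.0%}")
print("Regressed:", report["regressed"])
print("Fixed:", report["fixed"])
```

In this example the overall score improves, yet one case that used to pass now fails. Whether that blocks the merge is a judgment call, but it should at least be visible in the PR.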

Hands-on: Build Your First Eval Suite

Challenge: Build a 20-example eval suite for any prompt you've built in this course.

Pick a prompt (support triage, summariser, extractor, etc.)
Create 20 test cases: 10 typical, 5 edge cases, 5 hard negatives
Write the exact_match_eval or llm_judge runner
Run it - what's your baseline accuracy?
Change one thing in the prompt and run again - did it improve?

Stretch: Run the same eval against claude-haiku vs claude-sonnet vs claude-opus. Plot the accuracy vs cost tradeoff to find the optimal model for your use case.
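Once you have an accuracy and a cost figure per model, one simple way to read the tradeoff is "the cheapest model that clears the threshold". The sketch below shows that rule; `ModelResult` and `cheapest_passing` are hypothetical names, and every number is a placeholder to be replaced with your own eval results and usage data, not a measured score or a real price.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelResult:
    name: str
    accuracy: float          # from your own eval run
    cost_per_1k_runs: float  # from your own usage data, any currency

def cheapest_passing(results: list, threshold: float) -> Optional[ModelResult]:
    """Return the lowest-cost model whose accuracy meets the threshold, if any."""
    passing = [r for r in results if r.accuracy >= threshold]
    return min(passing, key=lambda r: r.cost_per_1k_runs) if passing else None

# Placeholder numbers only, not real measurements or prices
results = [
    ModelResult("model-small", accuracy=0.80, cost_per_1k_runs=1.0),
    ModelResult("model-medium", accuracy=0.90, cost_per_1k_runs=3.0),
    ModelResult("model-large", accuracy=0.95, cost_per_1k_runs=15.0),
]
choice = cheapest_passing(results, threshold=0.85)
print(choice.name if choice else "No model meets the threshold")
```

A threshold rule is only one option; if errors are expensive in your use case, you might instead weigh each extra point of accuracy against its added cost.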

Lesson 37 Quick Reference

Eval dataset: 20+ (input, expected) pairs - cover edge cases, not just happy path
Exact match: for structured outputs - compare predicted to expected dict/JSON
LLM-as-judge: use a strong Claude model to score prose outputs on a rubric (1-5 scale)
Eval in CI: pytest + threshold check on every PR - catch regressions before they ship
Prompt versioning: store prompts in Git - every change is reviewable and reversible
Model comparison eval: run the same dataset against Haiku/Sonnet/Opus to find the cost/quality sweet spot
