Level 4 · Lesson 34 · ⏱️ 80 min

Multi-Modal: Vision & Docs

Claude can see. Send images, screenshots, PDFs, and diagrams — and get intelligent answers back.

What Claude Can See

Claude 3+ is natively multi-modal. You can send images directly in the API request — no separate vision model or OCR pipeline required. Claude understands diagrams, screenshots, charts, handwriting, and documents.

Supported formats:
  • Images: JPEG, PNG, GIF, WebP (up to 5MB per image through the API)
  • PDFs: sent as base64 or URL; Claude processes both the text and the visual content of each page
  • Up to 100 images per API request (across all content blocks)
  • No separate OCR step — Claude extracts text from images automatically
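
The format and size constraints above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: `validate_image`, `MEDIA_TYPES`, and `MAX_IMAGE_BYTES` are hypothetical names, and the default size limit is an assumption that should be set from the current API documentation.

```python
import os

# Extension -> media type for the four supported image formats.
MEDIA_TYPES = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "gif": "image/gif",
    "webp": "image/webp",
}

# Assumed per-image limit; confirm against the current API documentation.
MAX_IMAGE_BYTES = 5 * 1024 * 1024


def validate_image(path: str, max_bytes: int = MAX_IMAGE_BYTES) -> str:
    """Return the media type for `path`, or raise ValueError if it is unusable."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext or 'none'}")
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes; the limit is {max_bytes}")
    return MEDIA_TYPES[ext]
```

Calling `validate_image("receipt.jpg")` before encoding the file turns an oversized or unsupported upload into a local error instead of a rejected API request.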

Sending an Image: Base64

import anthropic, base64

client = anthropic.Anthropic()

def analyse_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

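    # Detect the media type from the file extension
    ext = image_path.rsplit(".", 1)[-1].lower()
    media_types = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
                   "png": "image/png", "gif": "image/gif", "webp": "image/webp"}
    if ext not in media_types:
        # Fail early rather than mislabel an unsupported file as JPEG
        raise ValueError(f"Unsupported image format: .{ext}")
    media_type = media_types[ext]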

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Examples
print(analyse_image("receipt.jpg",
      "Extract all line items with prices as JSON"))
print(analyse_image("diagram.png",
      "Explain this architecture diagram in plain English"))
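
A single request can carry several images, which is useful for comparisons. The sketch below builds the content list for such a request. `build_multi_image_content` is a hypothetical helper, and the "Image N:" text labels are a convention assumed here so the question can refer to images by number.

```python
import base64


def build_multi_image_content(
    images: list[tuple[bytes, str]], question: str
) -> list[dict]:
    """Build a user-message content list from (raw_bytes, media_type) pairs."""
    content: list[dict] = []
    for index, (raw, media_type) in enumerate(images, start=1):
        # A short text label before each image lets the prompt say "Image 2".
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(raw).decode("utf-8"),
            },
        })
    # The question goes last, after all images.
    content.append({"type": "text", "text": question})
    return content
```

The returned list can be passed as `messages=[{"role": "user", "content": content}]` to `client.messages.create`, in the same way as the single-image example.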

Sending an Image: URL

# The API fetches the image itself, so there is no local download or base64 step.
# Reuses `client` from the previous example.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text",
             "text": "What trend does this chart show? "
                     "What would you predict for next quarter?"}
        ]
    }]
)
print(response.content[0].text)
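
When an application accepts both remote and local images, the choice between the two source types can be made in one place. This is a minimal sketch under that assumption; `image_block` and `IMAGE_MEDIA_TYPES` are hypothetical names, not SDK functions.

```python
import base64
import os

IMAGE_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(location: str) -> dict:
    """Return an image content block for an http(s) URL or a local file path."""
    if location.startswith(("http://", "https://")):
        # Remote image: hand the URL to the API instead of downloading it.
        return {"type": "image", "source": {"type": "url", "url": location}}

    # Local image: check the extension first, then read and encode the file.
    ext = os.path.splitext(location)[1].lower()
    if ext not in IMAGE_MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {location!r}")
    with open(location, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": IMAGE_MEDIA_TYPES[ext],
            "data": data,
        },
    }
```

The returned dictionary drops into the `content` list in place of the hand-written image block used in the examples above.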

Processing PDFs

import anthropic, base64

client = anthropic.Anthropic()

def analyse_pdf(pdf_path: str, question: str) -> str:
    with open(pdf_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Real-world uses
review = analyse_pdf("contract.pdf",
    "List all clauses about liability and indemnification")
summary = analyse_pdf("annual_report.pdf",
    "What were the top 3 risks mentioned? "
    "Summarise each in one sentence")
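
PDFs can also be referenced by URL rather than uploaded as base64. The sketch below builds a document block for either case and rejects local files that do not start with the `%PDF-` header. `pdf_block` is a hypothetical helper, and the header check is a basic sanity test assumed here, not full PDF validation.

```python
import base64


def pdf_block(location: str) -> dict:
    """Return a document content block for an http(s) URL or a local PDF path."""
    if location.startswith(("http://", "https://")):
        # Remote PDF: the API fetches the document from the URL.
        return {"type": "document", "source": {"type": "url", "url": location}}

    with open(location, "rb") as f:
        raw = f.read()
    # Well-formed PDF files begin with the bytes "%PDF-".
    if not raw.startswith(b"%PDF-"):
        raise ValueError(f"{location!r} does not look like a PDF file")
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(raw).decode("utf-8"),
        },
    }
```

The block returned by `pdf_block("contract.pdf")` can replace the inline document block in `analyse_pdf`.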

Vision in a Next.js App

// app/api/analyse-image/route.ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

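// Media types accepted for image blocks
const ALLOWED_TYPES = ['image/jpeg', 'image/png', 'image/gif', 'image/webp'] as const
type ImageMediaType = (typeof ALLOWED_TYPES)[number]

export async function POST(req: Request) {
  const formData = await req.formData()
  const file = formData.get('image')
  const question = formData.get('question')

  // Reject requests with a missing file or a missing question
  if (file === null || typeof file === 'string' || typeof question !== 'string') {
    return Response.json(
      { error: 'Expected an "image" file and a "question" text field' },
      { status: 400 }
    )
  }
  // Reject file types the image block does not accept
  if (!ALLOWED_TYPES.includes(file.type as ImageMediaType)) {
    return Response.json(
      { error: `Unsupported image type: ${file.type}` },
      { status: 415 }
    )
  }

  const buffer = await file.arrayBuffer()
  const base64 = Buffer.from(buffer).toString('base64')

  const response = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image',
          source: { type: 'base64',
                    media_type: file.type as ImageMediaType,
                    data: base64 }
        },
        { type: 'text', text: question }
      ]
    }]
  })

  // Return the text only if the first content block is a text block
  const first = response.content[0]
  return Response.json({
    result: first?.type === 'text' ? first.text : ''
  })
}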

High-Value Vision Use Cases

  • Receipt / invoice OCR — extract structured data from photos of documents
  • Screenshot-to-code — "Implement this UI in React/Tailwind"
  • Chart analysis — trend extraction, anomaly detection in graphs
  • Form processing — extract fields from filled paper forms
  • Diagram explanation — explain architecture diagrams, flowcharts, ERDs
  • Quality inspection — flag defects in product photos
  • Accessibility audit — describe UI screenshots for screen reader compliance checks

Hands-on: Receipt Analyser

Challenge: Build a receipt analyser that takes an image and returns structured expense data.

  1. Take a photo of any receipt (or find one online)
  2. Send it to Claude with the prompt: "Extract all line items, totals, tax, and merchant name as JSON"
  3. Parse the JSON response and display it in a clean table
  4. Add a "category" field — ask Claude to categorise each item (food, transport, office, etc.)
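
Steps 3 and 4 can be sketched as follows. The reply may wrap the JSON in prose or a code fence, so the parser takes the outermost `{...}` span. The function names and the field names `name`, `category`, and `price` are assumptions for illustration; the actual keys depend on the prompt you send.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the outermost JSON object found in a model reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in reply")
    return json.loads(reply[start:end + 1])


def format_table(items: list[dict]) -> str:
    """Render line items as a fixed-width text table."""
    rows = [("Item", "Category", "Price")]
    for item in items:
        rows.append((
            str(item.get("name", "")),
            str(item.get("category", "")),
            f"{float(item.get('price', 0)):.2f}",
        ))
    # Each column is as wide as its longest cell.
    widths = [max(len(row[col]) for row in rows) for col in range(3)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)
```

With the `analyse_image` function from earlier, usage would look like `data = extract_json(analyse_image("receipt.jpg", prompt))` followed by `print(format_table(data["line_items"]))`, assuming the prompt asks for a `line_items` array.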

Stretch: Build a Next.js drag-and-drop page where users upload receipt images and get a monthly expense summary automatically.

Lesson 34 Quick Reference

Image block

type: "image", source: {type: "base64"|"url", ...}

Document block

type: "document", source: {type: "base64", media_type: "application/pdf", data: <base64 string>}

Max images

Up to 100 images per API request, up to 5MB each

No OCR needed

Claude reads text in images natively — no preprocessing required

URL source

Pass a publicly accessible image URL directly; this skips the local download and base64 encoding otherwise needed for remote images

Vision + tools

Combine image input with tool use for structured extraction workflows
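
A minimal sketch of combining an image with a forced tool call is shown below. The tool name `record_receipt`, its schema, and the helper functions are hypothetical and only illustrate the pattern; `tools` and `tool_choice` are the Messages API parameters for tool use.

```python
# Hypothetical tool definition: the schema describes the fields to extract.
RECEIPT_TOOL = {
    "name": "record_receipt",
    "description": "Record structured data extracted from a receipt image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "merchant": {"type": "string"},
            "total": {"type": "number"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                    },
                    "required": ["name", "price"],
                },
            },
        },
        "required": ["merchant", "total", "line_items"],
    },
}


def build_request(image_b64: str, media_type: str,
                  model: str = "claude-opus-4-5") -> dict:
    """Keyword arguments for client.messages.create that force the tool call."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": [RECEIPT_TOOL],
        # Forcing the tool means the reply carries a tool_use block.
        "tool_choice": {"type": "tool", "name": RECEIPT_TOOL["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": image_b64}},
                {"type": "text", "text": "Record this receipt."},
            ],
        }],
    }


def tool_input(response) -> dict:
    """Return the input of the first tool_use block in a response."""
    for block in response.content:
        if block.type == "tool_use":
            return block.input
    raise ValueError("Response contains no tool_use block")
```

Usage would look like `response = client.messages.create(**build_request(image_data, "image/jpeg"))` followed by `receipt = tool_input(response)`. The result is a dictionary shaped by the schema, so no JSON has to be parsed out of free text.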