Multi-Modal: Vision & Docs
Claude can see. Send images, screenshots, PDFs, and diagrams — and get intelligent answers back.
What Claude Can See
Claude 3+ is natively multi-modal. You can send images directly in the API request — no separate vision model or OCR pipeline required. Claude understands diagrams, screenshots, charts, handwriting, and documents.
- Images: JPEG, PNG, GIF, WebP (the API documents a 5 MB per-image limit; claude.ai accepts larger files)
- PDFs: sent as base64 or URL — Claude reads every page
- Multiple images per request, counted across all content blocks (20 is the claude.ai cap; the API documents a higher limit)
- No separate OCR step — Claude extracts text from images automatically
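The format and size constraints in this list can be checked locally before any bytes are sent. The following is a minimal sketch, not part of the SDK: `check_image`, `SUPPORTED_TYPES`, and `MAX_IMAGE_BYTES` are hypothetical names, and the size value is an assumption that should be set to whatever the current API documentation specifies.

```python
import os

# Assumption: adjust to the per-image limit in the current API documentation.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# Extensions for the four image formats listed above.
SUPPORTED_TYPES = {
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "png": "image/png",
    "gif": "image/gif",
    "webp": "image/webp",
}


def check_image(path: str, max_bytes: int = MAX_IMAGE_BYTES) -> str:
    """Return the media type for `path`, or raise ValueError if it is unusable."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported image extension: {ext!r}")
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"image is {size} bytes, over the {max_bytes}-byte limit")
    return SUPPORTED_TYPES[ext]
```

Failing early like this avoids spending a network round trip on a request the API would reject anyway.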
Sending an Image: Base64
import anthropic
import base64

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Supported image extensions and their media types
MEDIA_TYPES = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
               "png": "image/png", "gif": "image/gif", "webp": "image/webp"}

def analyse_image(image_path: str, question: str) -> str:
    # Detect media type from extension; reject unsupported formats
    # rather than silently labelling them as JPEG
    ext = image_path.rsplit(".", 1)[-1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image type: .{ext}")
    media_type = MEDIA_TYPES[ext]

    # The API expects the image bytes as a base64 string
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Examples
print(analyse_image("receipt.jpg",
                    "Extract all line items with prices as JSON"))
print(analyse_image("diagram.png",
                    "Explain this architecture diagram in plain English"))

Sending an Image: URL
# Faster: no need to download and encode the image yourself.
# Reuses the client created in the base64 example.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {"type": "text",
             "text": "What trend does this chart show? "
                     "What would you predict for next quarter?"}
        ]
    }]
)
print(response.content[0].text)

Processing PDFs
import anthropic
import base64

client = anthropic.Anthropic()

def analyse_pdf(pdf_path: str, question: str) -> str:
    # PDFs go in a "document" block, base64-encoded like images
    with open(pdf_path, "rb") as f:
        pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_data,
                    }
                },
                {"type": "text", "text": question}
            ]
        }]
    )
    return response.content[0].text

# Real-world uses
review = analyse_pdf("contract.pdf",
                     "List all clauses about liability and indemnification")
summary = analyse_pdf("annual_report.pdf",
                      "What were the top 3 risks mentioned? "
                      "Summarise each in one sentence")

Vision in a Next.js App
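// app/api/analyse-image/route.ts
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic()

// Media types accepted for image blocks
const IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/gif', 'image/webp'] as const
type ImageType = (typeof IMAGE_TYPES)[number]

export async function POST(req: Request) {
  const formData = await req.formData()
  const file = formData.get('image')
  const question = formData.get('question')

  // Reject missing or unsupported uploads before calling the API
  if (!(file instanceof File) || typeof question !== 'string' ||
      !IMAGE_TYPES.includes(file.type as ImageType)) {
    return Response.json(
      { error: 'Expected a JPEG, PNG, GIF, or WebP image and a question' },
      { status: 400 })
  }

  const buffer = await file.arrayBuffer()
  const base64 = Buffer.from(buffer).toString('base64')

  const response = await client.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image',
          source: { type: 'base64',
                    media_type: file.type as ImageType,
                    data: base64 }
        },
        { type: 'text', text: question }
      ]
    }]
  })

  // Narrow the content block type instead of casting to any
  const first = response.content[0]
  return Response.json({
    result: first?.type === 'text' ? first.text : ''
  })
}

High-Value Vision Use Cases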
- Receipt / invoice OCR — extract structured data from photos of documents
- Screenshot-to-code — "Implement this UI in React/Tailwind"
- Chart analysis — trend extraction, anomaly detection in graphs
- Form processing — extract fields from filled paper forms
- Diagram explanation — explain architecture diagrams, flowcharts, ERDs
- Quality inspection — flag defects in product photos
- Accessibility audit — describe UI screenshots for screen reader compliance checks
Hands-on: Receipt Analyser
Challenge: Build a receipt analyser that takes an image and returns structured expense data.
- Take a photo of any receipt (or find one online)
- Send it to Claude with the prompt: "Extract all line items, totals, tax, and merchant name as JSON"
- Parse the JSON response and display it in a clean table
- Add a "category" field — ask Claude to categorise each item (food, transport, office, etc.)
Stretch: Build a Next.js drag-and-drop page where users upload receipt images and get a monthly expense summary automatically.
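The parsing and display steps of the challenge can be sketched as below. It assumes Claude's reply contains a single JSON object, possibly surrounded by extra text, with `merchant`, `items` (each with `name`, `price`, and `category`), `tax`, and `total` keys. Those field names are illustrative assumptions, so match them to whatever your prompt requests; `parse_receipt` and `format_table` are hypothetical helpers.

```python
import json


def parse_receipt(reply: str) -> dict:
    """Extract the JSON object from a model reply that may include extra text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


def format_table(receipt: dict) -> str:
    """Render line items, tax, and total as a fixed-width text table."""
    rows = [f"{'Item':<20}{'Category':<12}{'Price':>8}"]
    for item in receipt.get("items", []):
        rows.append(
            f"{item['name']:<20}{item.get('category', '-'):<12}{item['price']:>8.2f}"
        )
    rows.append(f"{'Tax':<32}{receipt.get('tax', 0):>8.2f}")
    rows.append(f"{'Total':<32}{receipt.get('total', 0):>8.2f}")
    return "\n".join(rows)


# Hand-written sample reply standing in for a real API response.
sample_reply = """Here is the data:
{"merchant": "Corner Cafe", "items": [{"name": "Coffee", "price": 3.5, "category": "food"}], "tax": 0.28, "total": 3.78}
"""
receipt = parse_receipt(sample_reply)
print(receipt["merchant"])
print(format_table(receipt))
```

Slicing from the first `{` to the last `}` is a deliberately simple way to tolerate a preamble or code fence around the JSON; it will not cope with replies that contain several separate objects.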
Key Takeaways
- Image block: type: "image", source: {type: "base64"|"url", ...}
- PDF block: type: "document", source: {type: "base64", media_type: "application/pdf", data: base64}
- Multiple images per request; the API documents a 5 MB per-image limit
- Claude reads text in images natively, with no preprocessing required
- Pass a public image URL directly; it is faster than base64 for remote images
- Combine image input with tool use for structured extraction workflows
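Pairing image input with tool use, the last point above, can be sketched as follows. This relies on the Messages API `tools` and `tool_choice` parameters; the tool name `record_receipt`, its schema, and the helpers `build_request` and `tool_input` are hypothetical choices for illustration. The request is built as a plain dict so it can be inspected without a network call.

```python
import base64

# Hypothetical tool: the schema tells Claude which fields to return.
RECEIPT_TOOL = {
    "name": "record_receipt",
    "description": "Record structured data extracted from a receipt image.",
    "input_schema": {
        "type": "object",
        "properties": {
            "merchant": {"type": "string"},
            "total": {"type": "number"},
            "items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                    },
                    "required": ["name", "price"],
                },
            },
        },
        "required": ["merchant", "total", "items"],
    },
}


def build_request(image_bytes: bytes, media_type: str) -> dict:
    """Build keyword arguments for client.messages.create()."""
    data = base64.standard_b64encode(image_bytes).decode("utf-8")
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 1024,
        "tools": [RECEIPT_TOOL],
        # Force a call to the tool so the reply is structured, not free text.
        "tool_choice": {"type": "tool", "name": RECEIPT_TOOL["name"]},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": media_type, "data": data}},
                {"type": "text", "text": "Extract this receipt."},
            ],
        }],
    }


def tool_input(response) -> dict:
    """Return the input of the first tool_use block in a response."""
    for block in response.content:
        if block.type == "tool_use":
            return block.input
    raise ValueError("response contains no tool_use block")
```

With a client available, the call would look like `tool_input(client.messages.create(**build_request(image_bytes, "image/png")))`, which returns a dict shaped by the schema instead of free text that has to be parsed.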