Building RAG Systems
Retrieval-Augmented Generation: teach Claude about YOUR data without fine-tuning.
What is RAG and Why Does It Matter?
Claude knows a lot — but not your company wiki, your product docs, or last week's sales data. RAG solves this by retrieving relevant chunks of your data at query time and stuffing them into Claude's context window.
- Ingest — chunk your documents, embed each chunk into a vector
- Store — save vectors in a vector database (Supabase pgvector, Pinecone, etc.)
- Retrieve — embed the user query, find closest chunks by cosine similarity
- Generate — pass retrieved chunks + question to Claude, get grounded answer
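To make the four stages concrete, here is a minimal in-memory sketch. It is a toy under stated assumptions, not a production setup: `toy_embed` (plain word counts) stands in for a real embedding model, a Python list stands in for the vector database, and the final prompt is printed instead of being sent to Claude. All function names and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Ingest + Store: embed each chunk and keep (chunk, vector) pairs in a list.
chunks = [
    "Employees get 20 days of paid vacation per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the fifth of each month.",
]
store = [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieve: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str) -> str:
    """Generate (input side): retrieved chunks plus the question."""
    context = "\n---\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("How many days of paid vacation do employees get?"))
```

The real pipeline in the steps below has the same shape; only the embedding model, the storage layer, and the final call to Claude change.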
Step 1: Chunking Strategy
How you split documents dramatically affects quality. Bad chunking = bad retrieval = hallucinations.
# pip install anthropic supabase voyageai langchain-text-splitters

def chunk_text(text: str, chunk_size: int = 500,
               overlap: int = 50) -> list[str]:
    """
    Sliding window chunker.
    - chunk_size: characters per chunk (~125 tokens)
    - overlap: characters shared between adjacent chunks
      (preserves context at boundaries)
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # slide with overlap
    return chunks

# Better: use LangChain's RecursiveCharacterTextSplitter,
# which tries to split on paragraphs, then lines, then sentences, then words
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(your_document)  # your_document: the raw text as a str

- 500-1000 characters (~125-250 tokens) works well for most docs
- Overlap adjacent chunks by 10-15% so context that straddles a boundary is not lost
- Store metadata (source, page, section) with every chunk
- For structured data (FAQs, tables), chunk by logical unit, not character count
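The last two tips (keep metadata with every chunk, and chunk by logical unit) can be combined in a small paragraph-based chunker. This is a sketch under assumptions, not part of the pipeline above: `chunk_by_paragraph`, the `max_chars` parameter, and the `chunk_index` metadata key are all hypothetical names, and paragraphs are assumed to be separated by blank lines.

```python
def chunk_by_paragraph(text: str, source: str,
                       max_chars: int = 1000) -> list[dict]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut,
    so a logical unit is never split in the middle.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            groups.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        groups.append("\n\n".join(current))
    # Attach metadata so every chunk can be traced back to its origin
    return [
        {"content": g, "metadata": {"source": source, "chunk_index": i}}
        for i, g in enumerate(groups)
    ]


doc = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 40
for chunk in chunk_by_paragraph(doc, source="demo.txt", max_chars=90):
    print(chunk["metadata"], len(chunk["content"]))
```

For FAQs or tables the same idea applies with a different split rule, for example one chunk per question/answer pair or per table row group.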
Step 2: Embeddings + Supabase pgvector
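We'll embed with Voyage AI (the code below uses voyage-3, which works well in Claude workflows) and store the vectors in Supabase, whose free tier includes pgvector. OpenAI's embedding models work too, as long as the vector column's dimension matches the model's output size.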
-- In Supabase SQL editor:
create extension if not exists vector;

create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1024)  -- must match your embedding model: voyage-3 returns 1024 dims (OpenAI text-embedding-3-small returns 1536)
);

-- Build the index after loading data: ivfflat derives its clusters from existing rows
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);  -- start with rows / 1000 up to ~1M rows, sqrt(rows) beyond that

-- Similarity search function
create or replace function match_documents(
  query_embedding vector(1024),
  match_count int default 5,
  match_threshold float default 0.7
)
returns table(id bigint, content text, metadata jsonb, similarity float)
language sql stable as $$
  -- <=> is cosine distance, so 1 - distance is cosine similarity
  select d.id, d.content, d.metadata,
         1 - (d.embedding <=> query_embedding) as similarity
  from documents d
  where 1 - (d.embedding <=> query_embedding) > match_threshold
  order by d.embedding <=> query_embedding
  limit match_count;
$$;

Step 3: Ingest Pipeline (Python)
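import os

import anthropic
import voyageai  # pip install voyageai (great for RAG)
from supabase import create_client

# Credentials are read from environment variables (names are a convention; adjust to your setup)
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
vo = voyageai.Client(api_key=os.environ["VOYAGE_API_KEY"])

def embed(texts: list[str], batch_size: int = 128) -> list[list[float]]:
    # Embed in batches: the API limits how many texts one request may carry.
    # 128 is a conservative choice; check Voyage's current per-request limits.
    embeddings = []
    for i in range(0, len(texts), batch_size):
        result = vo.embed(texts[i:i + batch_size], model="voyage-3",
                          input_type="document")
        embeddings.extend(result.embeddings)
    return embeddings

def ingest_document(text: str, metadata: dict):
    chunks = chunk_text(text)  # from Step 1
    embeddings = embed(chunks)
    rows = [
        # chunk_index records each chunk's position within its source document
        {"content": c, "metadata": {**metadata, "chunk_index": i}, "embedding": e}
        for i, (c, e) in enumerate(zip(chunks, embeddings))
    ]
    supabase.table("documents").insert(rows).execute()

# Usage
with open("handbook.txt", encoding="utf-8") as f:
    ingest_document(f.read(), {"source": "handbook", "version": "2024"})

Step 4: Query + Generate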
def ask(question: str) -> str:
    # Embed the question (input_type="query" pairs with "document" used at ingest)
    q_embedding = vo.embed(
        [question], model="voyage-3", input_type="query"
    ).embeddings[0]

    # Retrieve top-5 chunks from Supabase
    results = supabase.rpc("match_documents", {
        "query_embedding": q_embedding,
        "match_count": 5,
        "match_threshold": 0.7
    }).execute()
    if not results.data:
        return "I don't have information about that."

    # Build context from retrieved chunks, separated by a divider line
    context = "\n---\n".join(
        r["content"] for r in results.data
    )

    # Generate grounded answer
    system_prompt = (
        "You are a helpful assistant. Answer using ONLY the context provided. "
        "If the answer isn't in the context, say \"I don't have that information.\"\n\n"
        "Context:\n" + context
    )
    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": question}]
    )
    return response.content[0].text

print(ask("What is our parental leave policy?"))

RAG Quality Improvements
- Re-ranking: retrieve 20 chunks, use a cross-encoder to re-rank, keep top 5
- Hybrid search: combine vector similarity with keyword (BM25) search
- HyDE: ask Claude to generate a hypothetical answer, embed that for retrieval
- Metadata filtering: filter by date, source, or category before similarity search
- Smaller chunks for retrieval, larger for generation: retrieve small, expand context before feeding to Claude
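As one illustration of the hybrid search idea in the list above, the two result lists (vector and keyword) have to be merged somehow. A common way to do that is Reciprocal Rank Fusion (RRF), sketched below. The source does not prescribe a fusion method, so treat RRF as one reasonable option; the function name, the chunk IDs, and both input rankings are made up, and `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) per item, so chunks that rank
    well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical results: one list from vector search, one from BM25
vector_hits = ["c3", "c1", "c7"]
keyword_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, not scores, which is convenient here because cosine similarity and BM25 scores are on different scales and cannot be added directly.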
Hands-on: Build a Docs Q&A Bot
Challenge: Build a RAG pipeline over a set of markdown docs (your own notes, a project README, or any text file).
- Chunk 3+ documents and store in Supabase with the SQL schema above
- Build a query function that retrieves and passes to Claude
- Test with 5 questions — note where it gets it right vs. wrong
- Add source citation: include the metadata source in the response
Stretch: Add a confidence score — if the highest similarity is below 0.75, have Claude say it's not sure rather than hallucinating.
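A possible starting point for the citation and stretch tasks, offered as a sketch rather than a reference solution. It assumes each retrieved row has the shape returned by match_documents (content, metadata, similarity); the helper names `build_context` and `is_confident` and the sample rows are invented for illustration.

```python
def build_context(rows: list[dict]) -> str:
    """Prefix each chunk with its source so the model can cite it."""
    parts = []
    for r in rows:
        meta = r.get("metadata") or {}  # metadata may be NULL in the table
        parts.append(f"[source: {meta.get('source', 'unknown')}]\n{r['content']}")
    return "\n---\n".join(parts)


def is_confident(rows: list[dict], min_top_similarity: float = 0.75) -> bool:
    """True only if the best match clears the confidence threshold."""
    return bool(rows) and max(r["similarity"] for r in rows) >= min_top_similarity


# Made-up rows in the shape returned by match_documents
rows = [
    {"content": "Parental leave is 16 weeks.",
     "metadata": {"source": "handbook"}, "similarity": 0.71},
    {"content": "Leave requests go through HR.",
     "metadata": None, "similarity": 0.64},
]

print(build_context(rows))
print("confident:", is_confident(rows))
```

One way to use these inside a query function: when `is_confident` returns False, either return a "not sure" message directly or add an instruction to the system prompt telling Claude to say it is unsure, and ask it to quote the `[source: ...]` label for every claim it makes.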
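Key Terms
- RAG: retrieve relevant chunks at query time and pass them to Claude as context
- Chunking: 500-1000 characters, 10-15% overlap, split on paragraphs first
- pgvector: Postgres extension for vector storage, built into Supabase
- Cosine similarity: 1 - (embedding <=> query_embedding); ranges from -1 to 1, and in practice usually falls between 0 and 1 for text embeddings
- Voyage AI: embedding provider recommended in Anthropic's documentation, a good fit for Claude RAG
- HyDE: generate a hypothetical answer and embed it for better retrieval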
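The similarity expression used by match_documents, 1 - (embedding <=> query_embedding), relies on pgvector's `<=>` operator returning cosine distance. The relationship can be checked in plain Python; this is a small illustrative sketch with hypothetical helper names and hand-picked 2-D vectors, not code from the pipeline.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))
```

Vector length does not matter, only direction, which is why the first pair scores as identical. Opposite vectors show that the similarity can go below 0, so a threshold such as 0.7 is a point on a scale from -1 to 1.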