How to Build RAG with Supabase pgvector and Next.js
Retrieval-Augmented Generation (RAG) is transforming how we build AI applications. Instead of relying solely on an LLM's training data, RAG systems retrieve relevant information from your own data sources and inject it into the prompt. This gives you accurate, up-to-date answers without the hallucinations that plague pure LLMs.
In this guide, I'll show you how to build a production-ready RAG system using Supabase pgvector and Next.js. We'll cover everything from database setup to streaming chat interfaces, with real code you can ship today.
What is RAG and Why Use Supabase?
RAG stands for Retrieval-Augmented Generation. Here's the simplified flow:
- User asks a question
- Your app converts the question into a vector embedding
- You search your database for similar content using vector similarity
- You inject the relevant content into the LLM's context
- The LLM generates an answer based on your actual data
Traditional approaches use specialized vector databases like Pinecone or Weaviate. But for most projects, Supabase pgvector is a better choice:
- One database for everything: Store vectors alongside your relational data
- Lower costs: No separate vector database subscription
- Familiar SQL: Write queries you already understand
- Real-time subscriptions: Built-in WebSocket support
- Row-level security: Protect user data at the database level
I've used this pattern for client documentation search, customer support chatbots, and internal knowledge bases. For projects under 1 million vectors, Supabase pgvector performs beautifully.
Architecture Overview
Here's what we're building:
User → Next.js API Route (embed query) → Supabase pgvector (find similar docs) → OpenAI (generate answer)
Component breakdown:
- Embedding table: Stores document chunks with their vector embeddings
- API routes: Handle embedding generation and similarity search
- Chat interface: Streams responses to the user
- Background jobs: Process and embed new content
The beauty of this architecture is that each piece is independently testable and replaceable. Need a different LLM? Swap OpenAI for Anthropic. Want semantic caching? Add Redis. The core pattern stays the same.
Step 1: Setting Up Supabase pgvector
First, enable the pgvector extension in your Supabase project:
```sql
-- Run in Supabase SQL Editor
create extension if not exists vector;

-- Create embeddings table
create table embeddings (
  id uuid primary key default gen_random_uuid(),
  content text not null,
  embedding vector(1536), -- OpenAI ada-002 dimension
  metadata jsonb default '{}'::jsonb,
  created_at timestamp with time zone default now()
);

-- Create index for similarity search
create index on embeddings using ivfflat (embedding vector_cosine_ops)
with (lists = 100);

-- Enable Row Level Security
alter table embeddings enable row level security;

-- Policy: everyone can read embeddings
create policy "Public embeddings are viewable by everyone"
  on embeddings for select
  using (true);
```
The ivfflat index is crucial for performance. It uses approximate nearest neighbor search, which is 10-100x faster than exact search for large datasets. The lists parameter controls the trade-off between speed and accuracy. Start with 100 and increase if you have millions of vectors.
Save this as a Supabase migration in supabase/migrations/.
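As the table grows, you may need to retune the index rather than accept degraded recall or latency. pgvector gives you two knobs: the lists setting on the index and the ivfflat.probes setting at query time. Here's a rough sketch of adjusting both; the numbers are illustrative, not recommendations, and the index name assumes Postgres's default auto-generated name:

```sql
-- Query-time trade-off: more probes = better recall, slower searches.
-- This only affects the current session (or can be set per transaction).
set ivfflat.probes = 10;

-- If the table has grown substantially, rebuild the index with more lists.
-- A common rule of thumb is roughly rows / 1000 for larger tables.
-- "embeddings_embedding_idx" assumes the default auto-generated name;
-- check pg_indexes if you named the index explicitly.
drop index if exists embeddings_embedding_idx;
create index embeddings_embedding_idx
  on embeddings using ivfflat (embedding vector_cosine_ops)
  with (lists = 1000);
```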
Step 2: Generating Embeddings in Next.js
Create an API route to process documents and generate embeddings:
```typescript
// app/api/embeddings/generate/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { Configuration, OpenAIApi } from 'openai';
import { createClient } from '@supabase/supabase-js';

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Split text into chunks (simple implementation)
function chunkText(text: string, maxLength: number = 1000): string[] {
  const chunks: string[] = [];
  const paragraphs = text.split('\n\n');
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    if ((currentChunk + paragraph).length > maxLength) {
      if (currentChunk) chunks.push(currentChunk.trim());
      currentChunk = paragraph;
    } else {
      currentChunk += '\n\n' + paragraph;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());

  return chunks;
}

export async function POST(req: NextRequest) {
  const { content, metadata } = await req.json();

  // Split content into chunks
  const chunks = chunkText(content);

  // Generate embeddings for all chunks
  const embeddingsData = await Promise.all(
    chunks.map(async (chunk) => {
      const response = await openai.createEmbedding({
        model: 'text-embedding-ada-002',
        input: chunk,
      });

      return {
        content: chunk,
        embedding: response.data.data[0].embedding,
        metadata,
      };
    })
  );

  // Insert into Supabase
  const { error } = await supabase
    .from('embeddings')
    .insert(embeddingsData);

  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }

  return NextResponse.json({
    success: true,
    chunksProcessed: chunks.length,
  });
}
```
Pro tip: OpenAI's API has rate limits. For large document sets, implement a queue system like BullMQ or use background jobs with Vercel Cron. I cover resilient API patterns in my post on building a circuit breaker pattern.
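If you don't want to stand up a full queue yet, a lighter-weight stopgap is to batch inputs and retry on rate-limit errors. Here's a rough sketch using the openai client defined above; the embeddings endpoint accepts an array of inputs per request. The batch size and backoff numbers are arbitrary choices, not recommendations:

```typescript
// Sketch: embed chunks in batches, with exponential backoff on failures.
async function embedWithRetry(
  chunks: string[],
  batchSize = 100,
  maxRetries = 5
): Promise<number[][]> {
  const embeddings: number[][] = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);

    for (let attempt = 0; ; attempt++) {
      try {
        // text-embedding-ada-002 accepts an array of inputs in a single request
        const response = await openai.createEmbedding({
          model: 'text-embedding-ada-002',
          input: batch,
        });
        embeddings.push(...response.data.data.map((d) => d.embedding));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        // Back off: 1s, 2s, 4s, 8s, ...
        await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
      }
    }
  }

  return embeddings;
}
```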
Step 3: Similarity Search Implementation
Now create the search endpoint:
```typescript
// app/api/embeddings/search/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { Configuration, OpenAIApi } from 'openai';
import { createClient } from '@supabase/supabase-js';

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

export async function POST(req: NextRequest) {
  const { query, matchCount = 5, threshold = 0.5 } = await req.json();

  // Generate embedding for user query
  const embeddingResponse = await openai.createEmbedding({
    model: 'text-embedding-ada-002',
    input: query,
  });
  const queryEmbedding = embeddingResponse.data.data[0].embedding;

  // Similarity search using pgvector
  const { data: matches, error } = await supabase.rpc('match_embeddings', {
    query_embedding: queryEmbedding,
    match_count: matchCount,
    match_threshold: threshold,
  });

  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }

  return NextResponse.json({ matches });
}
```
Create the matching function in Supabase:
```sql
-- Run in Supabase SQL Editor
create or replace function match_embeddings(
  query_embedding vector(1536),
  match_count int default 5,
  match_threshold float default 0.5
)
returns table (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
)
language sql stable
as $$
  select
    embeddings.id,
    embeddings.content,
    embeddings.metadata,
    1 - (embeddings.embedding <=> query_embedding) as similarity
  from embeddings
  where 1 - (embeddings.embedding <=> query_embedding) > match_threshold
  order by embeddings.embedding <=> query_embedding
  limit match_count;
$$;
```
The <=> operator computes cosine distance, so we subtract it from 1 to get a similarity score (higher is better).
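A quick sanity check you can run in the SQL editor, using tiny 3-dimensional vectors purely for illustration:

```sql
-- Identical vectors: cosine distance 0, similarity 1
select 1 - ('[1,0,0]'::vector <=> '[1,0,0]'::vector);

-- Orthogonal vectors: cosine distance 1, similarity 0
select 1 - ('[1,0,0]'::vector <=> '[0,1,0]'::vector);
```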
Performance tips:
- Cache embeddings for common queries using Redis (sketched below)
- Adjust match_threshold to balance relevance vs. retrieval count
- Monitor query times and increase index lists if searches slow down
- Consider hybrid search (keyword + vector) for better results
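For the first tip, here's a minimal sketch of caching query embeddings in Redis. It assumes the ioredis client, a REDIS_URL environment variable, and the openai client from the search route above; the key scheme and one-hour TTL are my own choices, not a standard:

```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);

// Return a cached embedding for this query if we've seen it before,
// otherwise call OpenAI and cache the result for an hour.
async function getQueryEmbedding(query: string): Promise<number[]> {
  const cacheKey = `embedding:${query.toLowerCase().trim()}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await openai.createEmbedding({
    model: 'text-embedding-ada-002',
    input: query,
  });
  const embedding = response.data.data[0].embedding;

  await redis.set(cacheKey, JSON.stringify(embedding), 'EX', 3600);
  return embedding;
}
```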
Step 4: Building the Chat Interface
Now tie it together with a streaming chat interface using Vercel AI SDK:
```typescript
// app/api/chat/route.ts
import { OpenAIStream, StreamingTextResponse } from 'ai';
// Note: use the fetch-based openai-edge client here; OpenAIStream expects the
// streaming Response it returns, not the axios-based 'openai' v3 client.
import { Configuration, OpenAIApi } from 'openai-edge';

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;

  // Get relevant context from embeddings
  const searchResponse = await fetch(
    `${process.env.NEXT_PUBLIC_APP_URL}/api/embeddings/search`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: lastMessage }),
    }
  );
  const { matches } = await searchResponse.json();

  // Build context from matches
  const context = matches
    .map((match: any) => match.content)
    .join('\n\n---\n\n');

  // Inject context into system message
  const systemMessage = {
    role: 'system' as const,
    content: `You are a helpful assistant. Use the following context to answer the user's question. If the answer isn't in the context, say so.

Context:
${context}`,
  };

  // Generate streaming response
  const response = await openai.createChatCompletion({
    model: 'gpt-4',
    messages: [systemMessage, ...messages],
    stream: true,
  });

  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
```
Frontend component:
```tsx
// app/components/ChatInterface.tsx
'use client';

import { useChat } from 'ai/react';

export default function ChatInterface() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({
      api: '/api/chat',
    });

  return (
    <div className="flex flex-col h-screen max-w-4xl mx-auto p-4">
      <div className="flex-1 overflow-y-auto space-y-4 mb-4">
        {messages.map((message) => (
          <div
            key={message.id}
            className={`p-4 rounded-lg ${
              message.role === 'user'
                ? 'bg-blue-100 ml-auto max-w-md'
                : 'bg-gray-100 mr-auto max-w-2xl'
            }`}
          >
            <p className="text-gray-800">{message.content}</p>
          </div>
        ))}
      </div>

      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask a question..."
          disabled={isLoading}
          className="flex-1 p-3 border border-gray-300 rounded-lg focus:ring-2 focus:ring-blue-500"
        />
        <button
          type="submit"
          disabled={isLoading}
          className="px-6 py-3 bg-emerald-600 text-white rounded-lg hover:bg-emerald-700 disabled:opacity-50"
        >
          Send
        </button>
      </form>
    </div>
  );
}
```
The Vercel AI SDK handles streaming, loading states, and message history automatically. You get a ChatGPT-like experience with minimal code.
Production Considerations
Before launching your RAG system, consider these factors:
Cost estimation:
- OpenAI embeddings: $0.0001 per 1K tokens
- Storage: ~6KB per embedding (1536 dimensions × 4 bytes)
- For 100K documents at 500 tokens each: ~$5 embedding cost, ~600MB storage
Monitoring:
- Track search latency (target: <200ms)
- Monitor relevance scores (avg similarity >0.7 indicates good matches)
- Log failed embeddings for reprocessing
- Set up alerts for API rate limits
Optimization:
- Cache common queries in Redis (10x faster)
- Pre-compute embeddings for frequently accessed content
- Use batching for bulk uploads (100-200 documents per batch)
- Implement request deduplication to prevent duplicate embeddings
Scaling:
- At 1M+ vectors, consider dedicated vector databases
- Use read replicas for high-traffic search endpoints
- Implement semantic caching to reduce OpenAI costs
- Consider fine-tuning embeddings for domain-specific content
Need help building a production RAG system? I specialize in AI-powered applications with Next.js and Supabase. Check out my services or view my portfolio to see similar projects I've shipped.
Next Steps
Once you have basic RAG working, here are advanced patterns to explore:
Hybrid search: Combine vector similarity with full-text search for better results. Use Supabase's built-in FTS alongside pgvector.
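As a starting point, here's an untested sketch of a hybrid matching function that blends both signals. The 0.7/0.3 weights are arbitrary, ts_rank and cosine similarity aren't on the same scale, and a real implementation would store a tsvector column with a GIN index instead of computing it per row:

```sql
create or replace function hybrid_match_embeddings(
  query_text text,
  query_embedding vector(1536),
  match_count int default 5
)
returns table (id uuid, content text, score float)
language sql stable
as $$
  with vector_results as (
    select id, content, 1 - (embedding <=> query_embedding) as vector_score
    from embeddings
    order by embedding <=> query_embedding
    limit match_count * 2
  ),
  keyword_results as (
    select id, content,
           ts_rank(to_tsvector('english', content),
                   websearch_to_tsquery('english', query_text)) as keyword_score
    from embeddings
    where to_tsvector('english', content) @@ websearch_to_tsquery('english', query_text)
    order by keyword_score desc
    limit match_count * 2
  )
  select id,
         coalesce(v.content, k.content) as content,
         coalesce(v.vector_score, 0) * 0.7 + coalesce(k.keyword_score, 0) * 0.3 as score
  from vector_results v
  full outer join keyword_results k using (id)
  order by score desc
  limit match_count;
$$;
```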
Multi-tenant RAG: Add tenant isolation using Row Level Security policies. Each customer's embeddings stay private.
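A minimal sketch of what that isolation could look like, assuming you add a user_id column (the column name and policy wording are illustrative):

```sql
-- Add an owner column and lock reads down to the owning user.
-- This replaces the permissive "viewable by everyone" policy from Step 1.
alter table embeddings add column user_id uuid references auth.users(id);

drop policy "Public embeddings are viewable by everyone" on embeddings;

create policy "Users can read their own embeddings"
  on embeddings for select
  using (auth.uid() = user_id);
```

One caveat: the API routes above use the service-role key, which bypasses RLS, so for true per-user isolation you'd either query with the user's session token or filter explicitly by user_id in the search function.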
Reranking: Use a reranking model like Cohere to improve result quality after initial retrieval.
Metadata filtering: Add WHERE clauses to your similarity queries to filter by document type, date, or user permissions.
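One way to do this is to extend the matching function with an optional jsonb filter. A sketch, assuming you pass the filter as a jsonb object like {"type": "docs"}:

```sql
create or replace function match_embeddings_filtered(
  query_embedding vector(1536),
  filter jsonb default '{}'::jsonb,
  match_count int default 5,
  match_threshold float default 0.5
)
returns table (id uuid, content text, metadata jsonb, similarity float)
language sql stable
as $$
  select
    embeddings.id,
    embeddings.content,
    embeddings.metadata,
    1 - (embeddings.embedding <=> query_embedding) as similarity
  from embeddings
  where embeddings.metadata @> filter
    and 1 - (embeddings.embedding <=> query_embedding) > match_threshold
  order by embeddings.embedding <=> query_embedding
  limit match_count;
$$;
```

Call it the same way via supabase.rpc, passing something like { filter: { type: 'docs' } } to restrict matches to documentation chunks.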
Streaming updates: Use Supabase real-time subscriptions to update embeddings when content changes.
RAG is one of the most practical AI patterns you can implement today. It gives you the power of LLMs without the hallucination risks, and Supabase makes it surprisingly simple to build.
If you're building a RAG system and want expert guidance, get in touch. I help teams ship AI features that actually work in production.