
12 April 2026 · 6 min read · RAG, Next.js, Supabase, AI Engineering, LLM

Building a RAG Pipeline That Works in Production

Struggling with RAG latency and hallucinations? Here is how I architect production-grade Retrieval Augmented Generation systems using Next.js, Supabase, and OpenAI.


Most Retrieval Augmented Generation (RAG) tutorials are dangerous. They show you how to load a PDF into a vector store in five minutes, but they fail to mention what happens when you scale to a million documents, or when your LLM starts hallucinating and your token costs spiral. As a senior full-stack engineer running Thea Tech Solutions LTD, I have seen too many founders burn their budget on OpenAI tokens because their retrieval strategy was fundamentally flawed.

If you are building a RAG pipeline that actually works in production, you need to stop thinking about it as a simple database lookup. It is a distributed system problem involving data ingestion, embedding strategies, and strict guardrails. I have architected these systems for clients ranging from fintechs processing thousands of transactions to e-commerce platforms handling massive catalogs. The difference between a prototype and a production-ready RAG system comes down to three things: latency, relevance, and cost control.

Here is how I build them using my preferred stack: Next.js for the orchestration layer, Supabase (pgvector) for the vector database, Cloudflare Workers for edge processing, and AWS for heavy lifting.

The Architecture: Why Your Stack Matters

Before touching a single line of code, you need to choose a stack that won't collapse under load. I do not use standalone vector databases like Pinecone for most early-stage clients. The operational overhead of managing another service is often unnecessary. Instead, I lean on Supabase.

Supabase uses PostgreSQL under the hood with the pgvector extension. This is a game-changer. It allows you to store relational data (user permissions, metadata) right next to your vector embeddings. This means you can filter your RAG search using SQL before you even calculate vector similarity. For example, "Show me documents about Q3 finances (text search) that belong to group A (metadata filter)."

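To make the filter-then-rank idea concrete, here is a toy in-memory sketch of the same pattern. In production the whole thing runs in SQL, with pgvector's `<=>` operator doing the similarity ranking, but the logic is the same: apply the cheap metadata predicate first, then score only the survivors. The row shape and `group` field here are illustrative assumptions, not a real schema.

```javascript
// Toy illustration of "filter before similarity": restrict the candidate
// set with a cheap metadata predicate first, then rank only the survivors
// by cosine similarity. In Supabase the equivalent runs in SQL, e.g.
//   WHERE metadata->>'group' = 'A' ORDER BY embedding <=> query_embedding
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function filteredSearch(rows, queryEmbedding, groupId, k = 3) {
  return rows
    .filter((row) => row.metadata.group === groupId) // SQL-style pre-filter
    .map((row) => ({ ...row, score: cosineSimilarity(row.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The payoff is that the expensive vector comparison never touches rows the user is not allowed to see, which is both faster and safer than filtering after retrieval.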
For the ingestion pipeline, I usually spin up a microservice on AWS Lambda or a container in ECS. This handles the heavy work: chunking text, generating embeddings via OpenAI or HuggingFace, and pushing them to Supabase. You do not want this running on your main web server, or you will block user requests during heavy ingestion jobs.

Ingestion: The "Garbage In, Garbage Out" Problem

The biggest reason RAG systems fail is poor chunking. If you just split a PDF by every 1000 characters, you are likely breaking sentences in half and losing semantic meaning. This forces the LLM to guess the context, leading to hallucinations.

In production, I use a recursive character splitter combined with semantic windowing. Here is a simplified example of how I handle this in a Next.js API route or an AWS Lambda function:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY // server-side only; never ship to the client
);

async function ingestDocument(documentText) {
  // I prefer RecursiveCharacterTextSplitter for production
  // It respects paragraphs and sentences better than fixed-size chunks
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200, // Critical for maintaining context between chunks
    separators: ['\n\n', '\n', '. ', ' ', ''],
  });

  const chunks = await splitter.createDocuments([documentText]);

  // Generate embeddings (batched in a single request to save costs)
  const embeddingResponse = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      input: chunks.map((c) => c.pageContent),
      model: 'text-embedding-3-small',
    }),
  });

  if (!embeddingResponse.ok) {
    throw new Error(`Embedding request failed: ${embeddingResponse.status}`);
  }

  const embeddings = await embeddingResponse.json();

  // Store chunk text and vectors side by side in Supabase
  const { error } = await supabase.from('document_chunks').insert(
    chunks.map((chunk, i) => ({
      content: chunk.pageContent,
      embedding: embeddings.data[i].embedding,
      metadata: { source: 'legal_doc_v1' },
    }))
  );

  if (error) throw new Error(`Supabase insert failed: ${error.message}`);
}

The chunkOverlap parameter is the secret sauce here. By overlapping the chunks by 200 characters, you ensure that key entities mentioned at the end of one chunk are preserved at the start of the next. This significantly improves the quality of the retrieval.
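To see why the overlap matters, here is a deliberately naive fixed-size splitter (just for illustration; the RecursiveCharacterTextSplitter above does this while also respecting paragraph and sentence boundaries). The tail of each chunk is repeated at the head of the next, so a sentence cut mid-way still appears whole somewhere:

```javascript
// Naive fixed-size splitter with overlap, to make the effect visible
function splitWithOverlap(text, chunkSize, overlap) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const text = 'Acme Corp signed the lease. The lease term is ten years.';
const chunks = splitWithOverlap(text, 30, 12);
// The last 12 characters of chunk N reappear at the start of chunk N+1,
// so "The lease term" survives intact even though chunk 0 cuts it off.
```

Without the overlap, a question about "the lease term" could retrieve a chunk that mentions the lease but not the term, and the LLM would have to guess.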

Retrieval: Hybrid Search is Non-Negotiable

If you rely solely on vector similarity search, your RAG pipeline will struggle with specific keywords. Vector search is great for semantic concepts (e.g., "fruit" finding "apple"), but it is terrible at exact matches (e.g., finding a specific invoice number "INV-2024-001").

A production-ready RAG pipeline must use Hybrid Search. This combines dense retrieval (vectors) with sparse retrieval (BM25/keyword search).

In Supabase, I implement this by combining pgvector for the similarity search and PostgreSQL's built-in full-text search (tsvector) for keyword matching. I then re-rank the results (often using a technique called Reciprocal Rank Fusion or RRF) to push the most relevant documents to the top.
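RRF itself is only a few lines. Each result list contributes 1 / (k + rank) per document, and summing across lists rewards documents that rank well in both the vector results and the keyword results. Here is a minimal sketch (k = 60 is the constant from the original RRF paper):

```javascript
// Reciprocal Rank Fusion: fuse several ranked lists of document ids.
// A document that is mid-ranked in BOTH lists beats a document that
// appears near the top of only one.
function reciprocalRankFusion(resultLists, k = 60) {
  const scores = new Map();
  for (const list of resultLists) {
    list.forEach((docId, index) => {
      const rank = index + 1;
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Example: 'b' ranks second in vector search and first in keyword
// search, so it wins over 'a', which only one list likes strongly.
const fused = reciprocalRankFusion([
  ['a', 'b', 'c'], // vector search order
  ['b', 'd', 'a'], // keyword (tsvector) order
]);
```

You can run this in the Postgres function itself or in the Next.js layer after two separate queries; I usually keep it in SQL so only the fused top-k crosses the wire.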

Here is a conceptual Next.js API route handling the query:

// /api/rag/query

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export default async function handler(req, res) {
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'Method not allowed' });
  }

  const { query } = req.body;

  // 1. Generate embedding for the user query
  // (generateEmbedding wraps the same OpenAI endpoint used during ingestion)
  const queryEmbedding = await generateEmbedding(query);

  // 2. Perform Hybrid Search in Supabase
  // match_documents_hybrid is a Postgres function combining pgvector
  // similarity with tsvector full-text search
  const { data: documents, error } = await supabase.rpc('match_documents_hybrid', {
    query_embedding: queryEmbedding,
    query_text: query,
    match_threshold: 0.78, // Strict threshold to reduce noise
    match_count: 5,
  });

  if (error) return res.status(500).json({ error: error.message });

  // 3. Construct the prompt with strict context
  const contextPrompt = `
    You are a helpful assistant. Answer the question based ONLY on the following context:
    ${documents.map((doc) => doc.content).join('\n---\n')}

    If the answer is not in the context, say "I don't know".
    Question: ${query}
  `;

  // 4. Call the LLM (GPT-4o here; a smaller model works for simple queries)
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: contextPrompt }],
  });

  res.status(200).json({ answer: completion.choices[0].message.content });
}

Production Concerns: Latency and Caching

When you move from a prototype to production, latency becomes your enemy. A standard RAG request involves:

  • User Query -> API
  • Embedding Generation (~200ms)
  • Vector Search (~100ms)
  • Context Construction
  • LLM Generation (~1s - 10s)

Add it up and the total is too slow for a real-time chat interface.

To fix this, I implement two specific strategies in my Next.js applications.

1. Pre-Computed Embeddings for Common Questions

I analyze user queries and cache the embeddings for the top 10% most common questions. This skips the OpenAI embedding call entirely, shaving off 200ms instantly.
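The cache logic is simple: normalize the query so trivial variations hit the same key, and only call the embeddings API on a miss. A rough sketch (the `embedQuery` stub below stands in for the real OpenAI call; in production I back this with Redis rather than an in-process Map):

```javascript
// In-memory sketch of a query-embedding cache. The Map and the fake
// embedQuery are illustrative; swap in Redis/Upstash and the real
// OpenAI embeddings call.
const embeddingCache = new Map();
let apiCalls = 0;

async function embedQuery(text) {
  apiCalls++; // pretend this is the ~200ms OpenAI round trip
  return [text.length, 0.5]; // fake embedding, for illustration only
}

function normalizeQuery(query) {
  return query.trim().toLowerCase().replace(/\s+/g, ' ');
}

async function getQueryEmbedding(query) {
  const key = normalizeQuery(query);
  if (embeddingCache.has(key)) return embeddingCache.get(key);
  const embedding = await embedQuery(key);
  embeddingCache.set(key, embedding);
  return embedding;
}
```

Normalization matters more than it looks: "What is my refund policy?" and "  what is MY refund policy? " should be one cache entry, not two.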

2. Streaming Responses

I never wait for the full LLM response before sending anything to the frontend. I use the Vercel AI SDK or standard Server-Sent Events (SSE) to stream the tokens directly to the React Native or Next.js client. This makes the application feel instantaneous, even if the total generation time is 5 seconds.
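Under the hood, SSE is just a framing convention: each token becomes a `data:` line terminated by a blank line, which the browser's EventSource (or a fetch reader) consumes incrementally. A minimal sketch, with a fake token generator standing in for the LLM stream:

```javascript
// Minimal SSE framing for streamed tokens
function sseFrame(token) {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Stand-in for the LLM's token stream
async function* fakeLlmTokens() {
  yield* ['Hello', ' ', 'world'];
}

// In a real Next.js route handler, each frame would be enqueued onto a
// ReadableStream instead of collected into an array
async function streamToFrames() {
  const frames = [];
  for await (const token of fakeLlmTokens()) {
    frames.push(sseFrame(token));
  }
  frames.push('data: [DONE]\n\n'); // conventional end-of-stream sentinel
  return frames;
}
```

The `[DONE]` sentinel is a convention (OpenAI's streaming API uses the same marker); the client stops reading when it sees it.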

Guardrails: Preventing Hallucinations

A RAG pipeline is useless if the LLM ignores the context and makes things up. To prevent this, I use a strict "citation" mechanism: I instruct the LLM to return the ID of the document chunk it used for every claim.
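The citation format itself is up to you; here is a sketch assuming a hypothetical `[chunk:<id>]` tag that the system prompt asks the model to emit. The important part is the validation step, which catches the model citing a chunk that was never retrieved:

```javascript
// Extract [chunk:<id>] citations from a model answer and check each
// one against the ids actually returned by retrieval. The tag format
// is an assumption; any unambiguous marker works.
function extractCitations(answer) {
  return [...answer.matchAll(/\[chunk:([\w-]+)\]/g)].map((m) => m[1]);
}

function validateCitations(answer, retrievedIds) {
  const cited = extractCitations(answer);
  const known = new Set(retrievedIds);
  return {
    cited,
    uncited: cited.length === 0, // answer with zero citations is suspect
    fabricated: cited.filter((id) => !known.has(id)),
  };
}
```

If `fabricated` is non-empty, or the answer carries no citations at all, I treat the response as unreliable and either retry or fall back to "I don't know".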

Furthermore, I use Cloudflare Workers to run a lightweight "safety check" before the user even sees the response. This worker can filter out PII (Personally Identifiable Information) or block toxic responses at the edge, adding a security layer without adding latency to the main server.
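The PII pass can be as simple as a set of regex redactions run over the streamed response before it leaves the edge. A sketch (the patterns below are illustrative, not exhaustive; real PII detection warrants a dedicated library or model):

```javascript
// Illustrative edge-side PII redaction, e.g. inside a Cloudflare Worker.
// These regexes are deliberately simple examples, not production-grade.
const PII_PATTERNS = [
  { name: 'email', regex: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { name: 'us_ssn', regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  { name: 'card', regex: /\b(?:\d[ -]?){13,16}\b/g },
];

function redactPII(text) {
  let redacted = text;
  for (const { name, regex } of PII_PATTERNS) {
    redacted = redacted.replace(regex, `[REDACTED:${name}]`);
  }
  return redacted;
}
```

Because the worker runs at the edge, this check adds effectively zero load on the origin server and a few milliseconds at most for the user.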

The Cost Reality

Everyone wants to use GPT-4, but it costs roughly an order of magnitude more than GPT-3.5 or Claude Haiku. In production, I use a cascading approach:

  • Route 1: Try to answer using cached responses (Cost: $0).
  • Route 2: Use a smaller, faster model (Haiku or GPT-3.5-Turbo) with high-relevance context (Cost: ~$0.00025 / 1k tokens).
  • Route 3: Only upgrade to GPT-4o if the confidence score from the vector search is low.

This strategy reduced one client's AI bill by 65% while actually improving answer accuracy, because the smaller models are faster and less prone to "waffling" when the context is clear.
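The routing decision itself is a handful of lines. A sketch of the cascade (the model names and the 0.85 similarity threshold are assumptions; tune the threshold against your own eval set):

```javascript
// Cascading router: cache first, small model when retrieval is
// confident, large model only when the similarity signal is weak.
// Threshold and model names are illustrative.
function routeQuery({ cachedAnswer, topSimilarity }) {
  if (cachedAnswer) {
    return { route: 'cache', model: null }; // Route 1: free
  }
  if (topSimilarity >= 0.85) {
    // Route 2: context is clearly relevant; a small model suffices
    return { route: 'small-model', model: 'claude-haiku' };
  }
  // Route 3: weak retrieval signal; pay for the stronger model
  return { route: 'large-model', model: 'gpt-4o' };
}
```

The key insight is that the vector search already gives you a free confidence signal (the top match's similarity score), so you can pick the model before spending a single generation token.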

Conclusion

Building a RAG pipeline that actually works in production is not about copying a Python notebook. It is about building a robust data architecture with Supabase, optimizing ingestion with smart chunking, and implementing hybrid search to ensure relevance. You need to handle latency through streaming and caching, and you must guard your wallet by choosing the right model for the job.

If you are a founder or CTO looking to implement this but worried about the complexity and cost, do not waste months debugging embeddings. Get the architecture right from day one.

Book a free AI audit at theatechsolutions.com/ai-audit
