Last week, a client asked me to integrate an AI sales agent into their existing CRM. They had a working prototype running off a few OpenAI API calls in a Python script, and they assumed moving it to production was just a matter of swapping the API key for a production one.
I had to burst that bubble. A script that hallucinates 10% of the time is a fun experiment; a production agent that hallucinates 10% of the time is a liability that deletes customer data.
We are currently in the 'trough of disillusionment' regarding AI agents. The hype says autonomous agents will replace engineers. The reality is that getting an LLM to reliably execute a multi-step workflow—without forgetting the context or inventing facts—is incredibly difficult. I've spent the last six months building these systems at Thea Tech Solutions, and I've learned that the core challenge isn't the model's intelligence. It's the engineering around it.
Here is how I approach building AI agents that actually survive in the wild.
The Illusion of Intelligence
The biggest mistake I see is treating the LLM like a magic box that understands your codebase. It doesn't. It predicts the next token. If you give it a vague prompt like 'Schedule a meeting for the user,' you are rolling dice every time.
In production, I don't rely on the LLM to know how to interface with my Supabase database or my Google Calendar API. Instead, I use a tool-calling architecture where the model is strictly a router.
I define a strict schema for the tools the agent can access. If the agent needs to book a slot, it doesn't generate free-form text. It outputs a JSON object matching a specific Zod schema.
import { z } from 'zod';

const BookMeetingTool = {
  name: 'book_meeting',
  description: 'Book a meeting on the calendar',
  parameters: z.object({
    user_id: z.string().describe('The unique user identifier'),
    time: z.string().datetime().describe('ISO 8601 datetime string'),
    topic: z.enum(['Sales', 'Support', 'Onboarding']).describe('Type of meeting')
  })
};
By enforcing this schema, I turn a fuzzy natural language request into a deterministic function call. The LLM fails fast if it can't extract the parameters, rather than silently hallucinating a booking for next Tuesday when the office is closed.
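In practice, that enforcement is a single validation step. Here is a minimal sketch, assuming the model's raw arguments arrive as a JSON string (the helper name is my own):

// Validate the model's raw tool-call arguments before touching any real API.
function parseBookMeetingArgs(rawArgs: string) {
  const result = BookMeetingTool.parameters.safeParse(JSON.parse(rawArgs));

  if (!result.success) {
    // Fail fast: surface the validation errors instead of guessing.
    throw new Error(`Invalid book_meeting call: ${result.error.message}`);
  }

  return result.data; // fully typed: { user_id, time, topic }
}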
The State Management Nightmare
State is where most agents break down. In a standard web app, state is managed in a database or a client-side store. In an agentic system, the state of the conversation is the only thing that matters. If the agent loses track of step 1, step 3 becomes impossible.
I used to try to stuff the entire chat history into the context window. This works for a demo, but it quickly gets expensive and slow. More importantly, it creates a 'lost in the middle' problem where the model forgets the initial constraints.
Now, I treat conversation history like a rolling log. I use a vector database—specifically pgvector in Supabase since we already use Postgres—to summarize and compress older interactions. But for the immediate workflow, I maintain a 'working memory' object in my backend.
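The compression side looks roughly like this. A minimal sketch, assuming a conversation_summaries table with a pgvector column and the AI SDK's generateText and embed helpers; the table, column, and model names are my placeholders:

import { createClient } from '@supabase/supabase-js';
import { openai } from '@ai-sdk/openai';
import { embed, generateText } from 'ai';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Compress a closed-out chunk of conversation into one summary row
// that can be recalled later via pgvector similarity search.
export async function archiveOldTurns(conversationId: string, turns: string[]) {
  const { text: summary } = await generateText({
    model: openai('gpt-4o-mini'),
    prompt: `Summarize this conversation. Keep names, dates, and hard constraints:\n\n${turns.join('\n')}`,
  });

  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: summary,
  });

  await supabase.from('conversation_summaries').insert({
    conversation_id: conversationId,
    summary,
    embedding, // stored in a pgvector column
  });
}

That handles the long tail. The immediate workflow is driven by the working memory object.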
This working memory isn't just chat logs. It contains the extracted variables, the current step in the workflow, and error flags.
interface AgentMemory {
  current_step: 'greeting' | 'gathering_info' | 'executing' | 'complete';
  collected_data: {
    name?: string;
    email?: string;
    urgency?: 'low' | 'medium' | 'high';
  };
  errors: string[];
}
Before every LLM call, I inject this memory into the system prompt. This ensures the model knows exactly where it is in the process. It's not just chatting; it's filling out a form dynamically.
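Concretely, the injection can be as simple as serializing the memory object into the prompt. A minimal sketch; the wording is illustrative, not my production prompt:

// Build the system prompt from working memory before each model call.
function buildSystemPrompt(memory: AgentMemory): string {
  return [
    'You are a scheduling assistant working through a fixed workflow.',
    `Current step: ${memory.current_step}`,
    `Collected so far: ${JSON.stringify(memory.collected_data)}`,
    memory.errors.length > 0
      ? `Earlier errors, do not repeat them: ${memory.errors.join('; ')}`
      : 'No errors so far.',
  ].join('\n');
}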
Deterministic vs. Probabilistic Flows
There is a temptation to make everything an 'agent.' But sometimes, a good old-fashioned if statement is better.
I classify workflows into two categories: deterministic and probabilistic.
If the process requires 100% accuracy—like processing a payment or updating a user's subscription tier—I do not let the LLM execute the logic directly. The LLM acts as the parser. It extracts the intent and parameters, but my Next.js API route verifies the data and runs the transaction.
If the process is creative or exploratory—like drafting a marketing email or summarizing a document—I give the LLM more freedom.
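For that first category, here is what the split looks like sketched as a Next.js route handler. The route path, business rule, and schema are illustrative:

// app/api/book-meeting/route.ts: the LLM never runs this logic itself.
import { NextResponse } from 'next/server';
import { z } from 'zod';

const BookMeetingParams = z.object({
  user_id: z.string(),
  time: z.string().datetime(),
  topic: z.enum(['Sales', 'Support', 'Onboarding']),
});

export async function POST(req: Request) {
  // The body is whatever the LLM extracted. Treat it as untrusted input.
  const parsed = BookMeetingParams.safeParse(await req.json());
  if (!parsed.success) {
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 400 });
  }

  // Deterministic business rules run in code, not in the prompt.
  if (new Date(parsed.data.time).getTime() < Date.now()) {
    return NextResponse.json({ error: 'Cannot book in the past' }, { status: 422 });
  }

  // ...call the calendar API here, inside a verified code path.
  return NextResponse.json({ booked: true, at: parsed.data.time });
}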
Mixing these two up is fatal. I once saw a system where an agent was given direct access to a SQL DELETE command. It worked fine until a user asked to 'remove the last item from my cart,' and the agent interpreted 'item' as the database row. We rolled back the database, but I learned my lesson: always sandbox the execution environment.
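Sandboxing, in practice, means the agent never sees a query language at all. A sketch against a Supabase backend, with illustrative table and function names:

import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Instead of a generic run_sql tool, expose narrow, parameterized
// operations. The agent supplies IDs; the table, the predicate, and the
// DELETE itself live in code.
export async function removeCartItem(userId: string, itemId: string) {
  return supabase
    .from('cart_items')
    .delete()
    .eq('user_id', userId)
    .eq('item_id', itemId);
}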
The Vercel AI SDK and LangChain
I've tried building my own orchestration layer, and I've tried heavy frameworks like LangChain.
For most use cases, I prefer the Vercel AI SDK. It abstracts away the streaming logic and the tool-calling plumbing, letting me focus on the prompts and schemas. It integrates seamlessly with Next.js, which is my default stack.
However, I avoid the 'black box' functions. I want to see every tool call, every parameter extraction, and every retry. When an agent fails, I need to know if it was a prompt engineering issue or a code logic issue. The SDK allows me to hook into onToolCall to log these events to Cloudflare Workers analytics, which gives me visibility without the overhead of a heavy tracing platform.
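On the server, the same visibility is available through callbacks on streamText. A sketch assuming AI SDK 4.x, with a stand-in logger where the real Workers analytics call would go:

// app/api/agent/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText, tool } from 'ai';
import { z } from 'zod';

// Stand-in for the real call into Workers analytics.
function logToolEvent(event: { tool: string; args: unknown }) {
  console.log('[tool-call]', JSON.stringify(event));
}

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    tools: {
      book_meeting: tool({
        description: 'Book a meeting on the calendar',
        parameters: z.object({
          user_id: z.string(),
          time: z.string().datetime(),
          topic: z.enum(['Sales', 'Support', 'Onboarding']),
        }),
        execute: async (args) => ({ booked: true, ...args }),
      }),
    },
    // Log every tool call and its parameters for post-mortems.
    onStepFinish: ({ toolCalls }) => {
      for (const call of toolCalls) {
        logToolEvent({ tool: call.toolName, args: call.args });
      }
    },
  });

  return result.toDataStreamResponse();
}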
The Takeaway
The stack for AI agents is still stabilizing, but the principles of software engineering are not.
Don't trust the model. Verify its output with schemas. Don't rely on the context window. Maintain explicit state. And don't confuse a chatbot with an autonomous worker.
The future isn't replacing developers with agents. It's developers who know how to orchestrate these models to handle the boring, repetitive logic while we focus on the complex architecture.
If you are struggling to move your AI prototype from a notebook to a scalable API, stop treating it like a chatbot and start treating it like a distributed system.