
26 March 2026 · 5 min read · AI, Engineering, Next.js, Architecture

Building AI Agents That Actually Work (And Don't Hallucinate)

Moving beyond chat demos to reliable AI agents requires strict tool governance, structured outputs, and a 'human-in-the-loop' architecture.


Everyone is shipping AI agents right now. If you look at Twitter or LinkedIn, it feels like we’ve moved from "chat with your PDF" to "fully autonomous employees" overnight. But here is the reality I see consulting for startups in Bangkok and globally: most of these agents are brittle, expensive, and unreliable.

I’ve spent the last six months integrating LLMs into production systems at Thea Tech Solutions. We’ve moved from simple RAG (Retrieval-Augmented Generation) to actual agentic workflows—systems that can reason, act, and correct themselves. The difference between a cool demo and a production-grade agent is brutal. It comes down to how you handle the "loop" and how strictly you control the data.

The Tool-Calling Trap

Early on, we made the mistake of giving the LLM too much freedom. We exposed a whole library of functions to GPT-4, hoping it would magically know when to query the database, when to call an API, and when to just answer from context. It didn't work. The model would hallucinate parameters, call the wrong tools, or get stuck in loops trying to fix a non-existent error.

Now, I take a much more opinionated approach: Tool Governance.

Instead of exposing raw database schemas or generic API endpoints, I wrap every single action in a strict, purpose-built function. I don't let the model write SQL. I don't let it construct JSON payloads for internal APIs. I give it a menu of pre-validated functions.

// Don't do this: letting the LLM write SQL
const result = await db.sql(model.generateSQL(query));

// Do this: specific, guarded tools
const tools = {
  searchCustomerDatabase: {
    description: "Finds a customer by email address",
    parameters: { type: "object", properties: { identifier: { type: "string" } } },
    execute: async (identifier) => {
      // Sanitized, pre-written query; the model never sees the schema
      return await db.customers.findOne({ where: { email: identifier } });
    }
  }
};

This reduces the surface area for failure. If the agent calls searchCustomerDatabase, I know exactly what query it’s going to run. It makes debugging easier and security much tighter.

The Power of Structured Outputs

The biggest leap in reliability for us wasn't a better model—it was Structured Outputs. OpenAI released structured output support recently, and it changes the game for backend integration.

Before this, parsing LLM responses was a nightmare. You’d ask for JSON, and it would wrap the JSON in markdown code blocks. Or it would add a conversational preamble like "Here is the data you requested." Parsing that in production is fragile.

Now, I define a Zod schema (or Pydantic in Python) and force the model to conform to it. If the model can't fit the data into the schema, the API call fails, and I handle the error gracefully.

import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const CustomerIntent = z.object({
  category: z.enum(["support", "sales", "refund"]),
  urgency: z.enum(["low", "medium", "high"]),
  summary: z.string().max(100),
  next_step: z.string()
});

// The parsed response is guaranteed to match the schema, or the call errors
const completion = await openai.beta.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: userInput }],
  response_format: zodResponseFormat(CustomerIntent, "customer_intent")
});

const classification = completion.choices[0].message.parsed;

This allows me to route the agent programmatically. If urgency is "high", I trigger a PagerDuty alert or a Slack notification to the ops team. If it's "sales", I push a lead to HubSpot. I don't need regex to guess what the model meant; the data is typed and ready for my Next.js API routes.
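Because the output is typed, the routing itself can be an ordinary function. Here is a minimal sketch with placeholder destination names; the real versions would call the PagerDuty, Slack, or HubSpot APIs:

```typescript
// Typed intent shape, mirroring the Zod schema above
type Intent = {
  category: "support" | "sales" | "refund";
  urgency: "low" | "medium" | "high";
  summary: string;
  next_step: string;
};

// Destination names are illustrative stand-ins, not real integrations
function routeIntent(intent: Intent): string {
  // High urgency always escalates to on-call, regardless of category
  if (intent.urgency === "high") return "pagerduty";
  switch (intent.category) {
    case "sales":
      return "hubspot"; // push a lead to the CRM
    case "refund":
      return "approval-queue"; // money movement waits for a human
    default:
      return "support-inbox";
  }
}
```

No regex, no guessing: the switch is exhaustive over values the schema already enforced.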

Human-in-the-Loop is Not Optional

There is a hype cycle around "fully autonomous" agents. In my experience, fully autonomous agents in a business context are a liability. You don't want an AI agent refunding customers $10,000 because it misinterpreted a support ticket.

I always design for Human-in-the-Loop (HITL).

We use a queuing system, usually backed by Redis or Supabase, to hold "critical" actions before they execute.

  • Agent proposes an action: "I think I should refund this user."
  • System pauses: the action is saved to the database with status pending_approval.
  • Human reviews: a simple internal dashboard built with React shows the proposed action and the reasoning.
  • Execution: once approved, the system runs the function.

This builds trust. You can start with strict approval required for everything, and then loosen the reins as you refine the agent's logic. For low-risk tasks (like summarizing a meeting), you can skip the loop. For high-risk tasks (like sending emails or moving money), the loop is mandatory.
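A minimal sketch of that approval queue, using an in-memory Map where production would use Redis or Postgres; the statuses and helper names are illustrative, not a real library:

```typescript
type ProposedAction = {
  id: string;
  tool: string;
  args: Record<string, unknown>;
  status: "pending_approval" | "approved" | "executed";
};

// Stand-in for a Redis/Supabase-backed table
const queue = new Map<string, ProposedAction>();

// The agent never executes directly; it only proposes
function propose(id: string, tool: string, args: Record<string, unknown>): ProposedAction {
  const action: ProposedAction = { id, tool, args, status: "pending_approval" };
  queue.set(id, action);
  return action;
}

// A human, via the internal dashboard, flips the status
function approve(id: string): void {
  const action = queue.get(id);
  if (action?.status === "pending_approval") action.status = "approved";
}

// The executor only runs actions a human has approved
function execute(id: string, run: (a: ProposedAction) => void): boolean {
  const action = queue.get(id);
  if (!action || action.status !== "approved") return false;
  run(action);
  action.status = "executed";
  return true;
}
```

The key property: `execute` refuses anything still in `pending_approval`, so a misfiring agent can propose a $10,000 refund but can never run it.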

The Architecture Stack

If I were to build an agent today from scratch, this is the stack I’d pick:

  • Orchestration: LangGraph. LangChain is okay for simple chains, but LangGraph handles cyclic graphs much better. Agents need to be able to loop: observe, think, act, observe again. LangGraph models this state machine perfectly.

  • Backend: Next.js (App Router). We use API routes to handle the webhooks and serve the agent's logic. It keeps the frontend and backend logic in one repo, which speeds up iteration.

  • Vector Store: Supabase (pgvector). For most RAG use cases, you don't need a separate vector database. Postgres is fast enough if you index correctly, and it keeps your data in one place.

  • Observability: LangSmith. You cannot debug agents by just reading logs. You need to see the trace: every prompt, every tool call, and every intermediate thought. LangSmith is currently the best tool for visualizing these traces.
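That observe-think-act cycle is the whole reason a graph framework matters. Setting LangGraph's actual API aside, the loop it models can be sketched as a hand-rolled state machine; this is an illustration of the pattern, not LangGraph code:

```typescript
type AgentState = { observations: string[]; done: boolean; steps: number };

// A node is any function that transforms agent state
type AgentNode = (state: AgentState) => AgentState;

// The loop is the point: the agent may cycle several times before finishing,
// and a hard step cap prevents the runaway loops mentioned earlier
function runLoop(think: AgentNode, act: AgentNode, maxSteps = 5): AgentState {
  let state: AgentState = { observations: [], done: false, steps: 0 };
  while (!state.done && state.steps < maxSteps) {
    state = act(think(state));
    state.steps += 1;
  }
  return state;
}
```

Frameworks add persistence, branching, and tracing on top, but the cap on `maxSteps` is a guardrail worth keeping no matter what orchestrates the cycle.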

Cost Optimization

LLMs can burn cash fast if you aren't careful. A common pattern I see is developers jumping straight to GPT-4o for every task. That’s a waste.

I use a cascading model strategy:

  • Routing: A small, fast model (like GPT-4o-mini or Llama 3) classifies the incoming request. Is this complex? Does it require reasoning?
  • Execution: If the task is simple (e.g., "extract the email address"), the small model handles it. If it's complex (e.g., "negotiate this contract clause"), pass it to the heavy hitter (GPT-4o or Claude 3.5 Sonnet).
This simple routing step can cut your API costs by 60-70%. Aggressive caching is also essential: if a user asks the same question again, don't hit the API; cache the response locally or use a semantic cache.
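A sketch of that cascade, with a keyword heuristic standing in for the small routing model and an exact-match Map standing in for a real semantic cache; model names and thresholds are assumptions:

```typescript
const CHEAP_MODEL = "gpt-4o-mini";
const STRONG_MODEL = "gpt-4o";

// In production this check would itself be a small-model classification;
// a keyword heuristic keeps the sketch runnable
function looksComplex(task: string): boolean {
  return /negotiate|draft|summari[sz]e a contract|multi-step/i.test(task);
}

function pickModel(task: string): string {
  return looksComplex(task) ? STRONG_MODEL : CHEAP_MODEL;
}

// Exact-match cache on the prompt; a semantic cache would match by embedding
const cache = new Map<string, string>();

function cachedCall(task: string, call: (model: string) => string): string {
  const hit = cache.get(task);
  if (hit !== undefined) return hit; // never pay for the same question twice
  const result = call(pickModel(task));
  cache.set(task, result);
  return result;
}
```

The routing decision is one cheap call; the expensive model only ever sees the requests that need it.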

The Takeaway

AI agents are the future, but we are still in the "wild west" phase of engineering. The models are smart, but they need guardrails. If you are building this today, stop worrying about "prompt engineering" and start worrying about system architecture.

Focus on structured data, strict tool definitions, and keeping a human eye on the critical path. That is how you move from a demo that works "most of the time" to a system that works at scale.

