Designing GenAI Agents That Don't Hallucinate Themselves Into Production Incidents

A “GenAI agent” is a system that takes a goal, plans steps, calls tools, and produces a result. That’s most of the interesting GenAI work happening right now. It’s also the area where it’s easiest to ship something that looks impressive in a demo and quietly does something dangerous in production.

The risk is not that the model hallucinates a fact. The risk is that the model hallucinates a tool call. A retrieval-augmented chatbot that says something wrong is embarrassing. An agent with write access that confidently executes the wrong action is an incident.

Here is how we build agents that avoid that outcome.

Where the failure modes are

It helps to be specific about what can go wrong:

Wrong tool selected. The agent picks transfer_funds when the user asked about checking a balance.
Wrong arguments to the right tool. The agent calls cancel_order(order_id=...) with the wrong order ID, or the wrong account ID.
Right tool, right arguments, wrong context. The agent does what was asked, but the user shouldn’t have been permitted to ask it.
Infinite loops. The agent keeps retrying or alternating between tools without converging.
Hallucinated output. The tool returned nothing useful, but the agent invents a plausible-sounding answer anyway.
Sensitive data leakage. The agent fetches data the calling user shouldn’t see, then includes it in its response.

Different failure modes need different controls. Mixing them up (treating a permissions problem as a prompt problem) is how teams end up with agents that fail in production despite extensive prompt tuning.

The control surfaces

We layer four kinds of controls on every production agent:

┌─────────────────────────────────────────────────────────┐
│  1. Structural controls (what the model can produce)     │
│     - Response schema, tool schemas, output validation   │
│     - Vertex AI safety filters, Model Armor (GA)         │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  2. Operational controls (how the agent runs)            │
│     - Max turns, max tool calls, latency budgets         │
│     - Tool-specific quotas, write-action confirmation    │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  3. Behavioral controls (what the model is told)         │
│     - System prompt, tool descriptions, few-shot         │
│     - Refusal patterns, deterministic routes             │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│  4. Evaluation controls (how we know it's working)       │
│     - Goldens for tool selection, regression suites      │
│     - Production traffic sampling and replay             │
└─────────────────────────────────────────────────────────┘

Structural controls

Tool schemas, taken seriously

Gemini function calling is schema-driven. Every tool has a JSON schema for its arguments. Make those schemas as restrictive as you can.

A loose schema like:

{
  "name": "cancel_order",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": { "type": "string" }
    }
  }
}

invites errors. A tight schema:

{
  "name": "cancel_order",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": {
        "type": "string",
        "pattern": "^ORD-[0-9]{8}$",
        "description": "Order ID in format ORD-XXXXXXXX. Must come from a previous retrieval, never inferred."
      },
      "reason_code": {
        "type": "string",
        "enum": ["customer_request", "fraud_suspected", "out_of_stock", "duplicate"]
      }
    },
    "required": ["order_id", "reason_code"]
  }
}

makes both the structure and the intent legible. The model is less likely to fabricate values that don’t match the pattern, and your downstream validator can reject them before execution.
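As a rough illustration of that downstream validator, here is a minimal sketch using only the Python standard library. The function name and the seen_order_ids parameter are our own assumptions, not part of any Gemini API. The second parameter is one possible way to enforce the "must come from a previous retrieval" rule, which a regex alone cannot check.

```python
import re
from typing import Iterable, List

# Mirrors the cancel_order schema above. The helper name and the
# seen_order_ids parameter are illustrative assumptions.
ORDER_ID_PATTERN = re.compile(r"^ORD-[0-9]{8}$")
REASON_CODES = frozenset(
    {"customer_request", "fraud_suspected", "out_of_stock", "duplicate"}
)
ALLOWED_KEYS = frozenset({"order_id", "reason_code"})


def validate_cancel_order_args(args: dict, seen_order_ids: Iterable[str]) -> List[str]:
    """Return validation errors; an empty list means the call may proceed."""
    errors = []
    # Reject arguments the schema does not declare.
    for key in sorted(set(args) - ALLOWED_KEYS):
        errors.append(f"unexpected argument: {key}")

    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        errors.append("order_id must match ORD-XXXXXXXX")
    elif order_id not in set(seen_order_ids):
        # The pattern only proves the format. This check enforces the
        # "must come from a previous retrieval" rule from the description.
        errors.append("order_id was not returned by an earlier tool call")

    reason_code = args.get("reason_code")
    if not isinstance(reason_code, str) or reason_code not in REASON_CODES:
        errors.append("reason_code is not an allowed value")
    return errors
```

In this sketch the agent loop would collect order IDs from earlier tool results and pass them in as seen_order_ids, so a well-formed but invented ID is rejected before execution.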

Structured output for the final response

Don’t let the agent return free-form text where structured data is expected. Use Gemini’s response schema to constrain the final response to a JSON shape you define, then validate that JSON in your own code before returning it to the caller.
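The validation step might look like the sketch below. The response shape (status, message, sources) and the allowed status values are hypothetical, chosen only for illustration. The point is that the caller receives either a typed object or an error, never raw model text.

```python
import json
from dataclasses import dataclass
from typing import Tuple

# Hypothetical response shape; field names and status values are assumptions.
ALLOWED_STATUSES = ("completed", "needs_confirmation", "failed")


@dataclass(frozen=True)
class AgentReply:
    status: str
    message: str
    sources: Tuple[str, ...]


def parse_agent_reply(raw: str) -> AgentReply:
    """Parse model output into an AgentReply, raising ValueError on any mismatch."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(data) != {"status", "message", "sources"}:
        raise ValueError("unexpected or missing fields")
    if data["status"] not in ALLOWED_STATUSES:
        raise ValueError("status is not an allowed value")
    if not isinstance(data["message"], str) or not data["message"].strip():
        raise ValueError("message must be a non-empty string")
    sources = data["sources"]
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        raise ValueError("sources must be a list of strings")
    return AgentReply(data["status"], data["message"], tuple(sources))
```

A ValueError here would typically map to the same structured fallback response the agent loop uses for other failures.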

Safety filters and Model Armor

Vertex AI’s built-in safety filters (harassment, hate speech, sexually explicit, dangerous content) are on by default and tunable per category. Configure them deliberately. Default thresholds are reasonable for consumer-facing applications and often too restrictive for internal tools where dangerous content has to be discussed (compliance review, security analysis).

Model Armor is GA and natively integrated with the Vertex AI Gemini API’s generateContent method. It handles prompt-injection and jailbreak detection (up to 10K tokens), PII / credentials / secrets filtering via Sensitive Data Protection (150+ types), and IP-leakage protection. It also integrates with Apigee and Agentspace. For agents that process user-supplied text and then take action, treat it as the first line of defense, ahead of any custom prompt-injection check: turn it on, configure the templates, and layer your own logic on top only once you understand what it catches.

Operational controls

Max turns, max tool calls

Every agent loop has hard limits: max turns (default 10), max tool calls per turn (default 5), and max total wall-clock time (default 30 seconds). When a limit is hit, the agent returns a structured “I couldn’t complete this within the available time” response and the request is logged for analysis.

These aren’t suggestions to the model. They’re enforced by the agent loop code:

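import asyncio
import time

# gemini, AGENT_TOOLS, validate_tool_call, execute_tool, and fallback_response
# are defined elsewhere in the agent service.

async def run_agent(query: str, max_turns: int = 10, deadline_s: float = 30.0):
    start = time.monotonic()
    history = [{"role": "user", "content": query}]

    for _ in range(max_turns):
        # Check the wall-clock budget before spending another model call.
        if time.monotonic() - start > deadline_s:
            return fallback_response("deadline_exceeded", history)

        response = gemini.generate_content(history, tools=AGENT_TOOLS)

        if not response.function_call:
            return response.text  # final answer

        tool_name = response.function_call.name
        tool_args = response.function_call.args
        # Record the model turn first so fallback logs include a rejected call.
        history.append({"role": "model", "content": response})

        # Validate against the tool schema before anything executes.
        if not validate_tool_call(tool_name, tool_args):
            return fallback_response("invalid_tool_call", history)

        # Cap tool execution at whatever is left of the overall deadline.
        time_left = deadline_s - (time.monotonic() - start)
        try:
            result = await asyncio.wait_for(execute_tool(tool_name, tool_args), timeout=time_left)
        except asyncio.TimeoutError:
            return fallback_response("deadline_exceeded", history)
        history.append({"role": "tool", "content": result})

    return fallback_response("max_turns_exceeded", history)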
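The loop above handles one function call per model response, so it does not need a per-turn cap. If your model responses can carry several tool calls at once, the per-turn limit needs its own check. One way to keep all three limits in one place is a small budget object, sketched here with hypothetical names (LoopBudget, admit_turn) and an injectable clock for testing.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoopBudget:
    """Tracks the three hard limits for one agent run (illustrative sketch)."""

    max_turns: int = 10
    max_tool_calls_per_turn: int = 5
    deadline_s: float = 30.0
    started_at: float = field(default_factory=time.monotonic)
    turns_used: int = 0

    def admit_turn(self, requested_tool_calls: int, now: Optional[float] = None) -> Optional[str]:
        """Return a fallback reason if the turn breaks a limit, else None."""
        now = time.monotonic() if now is None else now
        if now - self.started_at > self.deadline_s:
            return "deadline_exceeded"
        if self.turns_used >= self.max_turns:
            return "max_turns_exceeded"
        if requested_tool_calls > self.max_tool_calls_per_turn:
            return "max_tool_calls_exceeded"
        # Only turns that pass every check count against the turn limit.
        self.turns_used += 1
        return None
```

The loop would call admit_turn with the number of tool calls in the model response and return the fallback response whenever a reason string comes back.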

Tool-specific quotas and write-action confirmation

Read tools can be called freely. Write tools are different. Every write tool in a production agent should:

Have a per-user, per-time-window quota (no user should be able to trigger 100 cancel_order calls in a minute regardless of what the agent decides).
Require an explicit confirmation pattern. The agent calls propose_cancel_order (which doesn’t execute), the user confirms, then execute_cancel_order runs.
Be auditable. Every write call is logged to Cloud Logging with the full agent state at decision time, the user ID, the tool arguments, and a trace ID.
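The three requirements above can be combined in one small gate in front of every write tool. The sketch below is in-memory and single-process, and the class names (WriteQuota, PendingWrites) are our own. A production version would presumably back the quota and the pending proposals with a shared store, and route the audit record to Cloud Logging rather than the standard logging module used as a stand-in here.

```python
import json
import logging
import secrets
import time
from collections import defaultdict, deque
from typing import Callable, Optional

audit_log = logging.getLogger("agent.audit")  # stand-in for a Cloud Logging handler


class WriteQuota:
    """Sliding-window cap on write calls per user."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = defaultdict(deque)  # user_id -> timestamps of recent writes

    def try_acquire(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[user_id]
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()  # drop timestamps that aged out of the window
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


class PendingWrites:
    """Two-step writes: propose() stores the action, execute() needs its token."""

    def __init__(self, quota: WriteQuota):
        self._quota = quota
        self._pending = {}  # token -> (user_id, tool_name, args)

    def propose(self, user_id: str, tool_name: str, args: dict) -> str:
        token = secrets.token_urlsafe(16)
        self._pending[token] = (user_id, tool_name, dict(args))
        return token

    def execute(self, user_id: str, token: str, run_tool: Callable[[str, dict], object],
                trace_id: str, now: Optional[float] = None):
        proposal = self._pending.get(token)
        if proposal is None or proposal[0] != user_id:
            raise PermissionError("no matching proposal for this user")
        if not self._quota.try_acquire(user_id, now):
            raise RuntimeError("write quota exceeded")
        del self._pending[token]  # tokens are single-use
        _, tool_name, args = proposal
        audit_log.info(json.dumps(
            {"user_id": user_id, "tool": tool_name, "args": args, "trace_id": trace_id},
            default=str,
        ))
        return run_tool(tool_name, args)
```

The design choice that matters is that the token travels through the confirmation UI, not through the model: the agent can propose, but only a user action can supply the token that execute requires.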

Blast-radius controls

If something goes wrong with the agent, what’s the worst it can do? That’s the question to ask at design time. Use it to scope tool permissions.

If the agent can call update_customer_record, can it update any customer record, or only the calling user’s? IAM-aware tools, where the tool itself enforces row-level access against the calling user’s identity, are the right pattern. The agent gets to call the tool; the tool gets to decide if the call is permitted.
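A minimal sketch of an IAM-aware tool follows. The record layout, the support_admin role, and the editable-field list are assumptions made for illustration. The important detail is that the Caller object is built from the authenticated request context and is never taken from model-generated arguments.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# Hypothetical policy: only these fields may be changed through the agent.
EDITABLE_FIELDS = frozenset({"email", "phone"})


@dataclass(frozen=True)
class Caller:
    """Identity taken from the authenticated session, not from the model."""

    user_id: str
    roles: FrozenSet[str] = frozenset()


class CustomerRecordTool:
    def __init__(self, records: Dict[str, dict]):
        self._records = records  # stand-in for the real customer database

    def update_customer_record(self, caller: Caller, customer_id: str, changes: dict) -> dict:
        record = self._records.get(customer_id)
        # Same error for "missing" and "not yours", so the agent cannot be
        # used to probe which customer IDs exist.
        if record is None or not self._may_write(caller, record):
            raise PermissionError("record not found or not accessible")
        blocked = set(changes) - EDITABLE_FIELDS
        if blocked:
            raise ValueError(f"fields not editable: {sorted(blocked)}")
        record.update(changes)
        return dict(record)

    @staticmethod
    def _may_write(caller: Caller, record: dict) -> bool:
        return record["owner_id"] == caller.user_id or "support_admin" in caller.roles
```

With this shape, a wrong customer_id from the model results in a refused call rather than a modified record.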

Behavioral controls

System prompt as policy

The system prompt isn’t just persona. It’s policy. Things we put in every production agent system prompt:

Explicit refusal conditions: “If the user asks you to do X, decline and explain why.”
Tool usage policy: “Use get_account_balance for balance queries. Do not call write tools without explicit user confirmation.”
Citation policy: “Cite the source of any factual claim. If no source supports a claim, do not make the claim.”
Escalation patterns: “If the user expresses frustration or requests human assistance, call escalate_to_human and stop.”

Tool descriptions, written carefully

The tool description is what Gemini reads to decide when to call the tool. Treat it like documentation for someone with strong reasoning but no context.

Bad: "Cancel an order."

Better: "Cancel a customer order. Use only when the user has explicitly requested cancellation and provided the order ID. Do not use to 'fix' problems or as a fallback. If the order is already shipped, this will fail and the user should be directed to returns."

Deterministic routes

Not everything has to be model-driven. If “I want to check my balance” can be matched by an intent classifier with high confidence, route directly to the balance check. The agent loop only handles the cases the deterministic router can’t resolve.

This pattern is what Agent Builder (now part of the Gemini Enterprise Agent Platform, the Cloud Next 2026 umbrella covering Agent Builder, ADK, Agent Engine, Agent Studio, Agent Garden, and Agentspace) calls “deterministic playbooks.” For tasks that should always do the same thing, deterministic routing is faster, cheaper, and safer than agent reasoning.
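If you own the loop yourself, the same idea can be sketched in a few lines. The keyword classifier below is a toy stand-in for a real intent model, and the handler names and the 0.9 threshold are assumptions. This is not how Agent Builder implements its playbooks.

```python
from typing import Callable, Dict, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # query -> (intent, confidence)


def keyword_classifier(query: str) -> Tuple[str, float]:
    """Toy stand-in for a real intent classifier."""
    text = query.lower()
    mentions_write = any(word in text for word in ("transfer", "send", "move"))
    if "balance" in text and not mentions_write:
        return "check_balance", 0.95
    return "unknown", 0.0


def route(
    query: str,
    classifier: Classifier,
    handlers: Dict[str, Callable[[str], str]],
    agent_loop: Callable[[str], str],
    threshold: float = 0.9,
) -> str:
    """Use a deterministic handler when the intent is clear, else the agent."""
    intent, confidence = classifier(query)
    handler = handlers.get(intent)
    if handler is not None and confidence >= threshold:
        return handler(query)
    # Anything ambiguous or unrecognized falls through to the agent loop.
    return agent_loop(query)
```

Keeping the threshold high is deliberate: a misrouted request that skips the agent also skips the agent's refusal and escalation policy.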

Human-in-the-loop fallback

For any tool call that the system flags as high-risk (write actions, financial transactions, anything customer-visible), the path should include a human-in-the-loop step. The agent prepares the action, a human approves it in a separate UI, and only then does the action execute.

This is the pattern almost every production agent ends up with for write operations, after the first incident.

Evaluation controls

Goldens for tool selection

Standard generation evals (does the response contain the right answer) aren’t enough for agents. You also need:

Tool selection goldens: input scenarios with ground-truth correct tool calls. Evaluate whether the agent chose the right tool.
Tool argument goldens: did it pass the right arguments? Particularly for IDs, dates, and enum values.
Tool sequence goldens: for multi-step tasks, did it call the tools in the right order?

These live alongside response-quality goldens in your eval harness. See the full evaluation post for harness design.
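To make the three golden types concrete, here is one way the scoring could be expressed. The data classes and the score_run function are illustrative assumptions, not the harness from the evaluation post. In this sketch, argument correctness is only credited when the sequence is also correct.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, object] = field(default_factory=dict)


@dataclass
class Golden:
    query: str
    expected: List[ToolCall]  # ground-truth calls, in order


def score_run(golden: Golden, actual: List[ToolCall]) -> Dict[str, bool]:
    """Score one agent run against a golden on selection, sequence, arguments."""
    expected_names = [call.name for call in golden.expected]
    actual_names = [call.name for call in actual]
    sequence_ok = actual_names == expected_names
    return {
        # Same tools, same number of times, order ignored.
        "tool_selection": sorted(actual_names) == sorted(expected_names),
        # Same tools in the same order.
        "tool_sequence": sequence_ok,
        # Same order and identical arguments on every call.
        "tool_arguments": sequence_ok and actual == golden.expected,
    }
```

Reporting the three scores separately helps with triage: a selection failure usually points at tool descriptions, while an argument failure more often points at schemas or retrieval.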

Regression suites

Every production incident produces a new golden. The format is consistent: input, what the agent did wrong, what it should have done. Add it to the regression suite. Re-run the suite on every prompt change, every model upgrade, every tool definition change.
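The incident-to-golden format could be stored as one JSON object per line. The sketch below assumes hypothetical field names and represents tool calls as plain dictionaries. The agent argument is any callable that maps an input string to the list of tool calls the agent made.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class IncidentGolden:
    incident_id: str
    user_input: str
    observed_calls: List[dict]  # what the agent did wrong
    expected_calls: List[dict]  # what it should have done


def dump_golden(golden: IncidentGolden) -> str:
    """Serialize one golden as a single JSONL line."""
    return json.dumps(asdict(golden), sort_keys=True)


def load_goldens(lines: Iterable[str]) -> List[IncidentGolden]:
    """Load goldens from JSONL lines, skipping blank lines."""
    return [IncidentGolden(**json.loads(line)) for line in lines if line.strip()]


def failing_incidents(
    goldens: Iterable[IncidentGolden],
    agent: Callable[[str], List[dict]],
) -> List[str]:
    """Return the incident IDs whose expected tool calls the agent no longer produces."""
    return [g.incident_id for g in goldens if agent(g.user_input) != g.expected_calls]
```

Keeping observed_calls in the record is optional for scoring, but it documents why the golden exists when someone revisits the suite later.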

Production traffic sampling and replay

Sample 1% of production agent runs into BigQuery. Review the sample weekly for surprising tool calls or near-misses. The patterns you find in real traffic are the goldens you didn’t think to write.
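One way to pick the 1% is to hash the trace ID rather than draw a random number, so every service that sees the same run makes the same decision and a sampled run can be replayed later. This is a sketch. The function name is ours, and the BigQuery write itself is left out.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of runs, keyed on the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Because the decision depends only on the trace ID, raising the rate later keeps every previously sampled run in the sample.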

When to use Vertex AI Agent Builder vs. roll your own

Vertex AI Agent Builder — now part of the Gemini Enterprise Agent Platform — gives you the agent loop, function calling, tool integration, deterministic playbooks, and a hosted runtime. Alongside it sit Agent Engine (managed runtime), Agent Studio (visual canvas), and the Agent Development Kit (ADK). It’s the default starting point for production agents.

ADK is the open-source, code-first alternative (Python, Go, Java, TypeScript) for engineering teams that want the same agent primitives without committing to the managed runtime. It’s the right pick when your team prefers to own the loop in their own service while keeping compatibility with Agent Engine should you decide to deploy there later.

Roll your own (writing the loop in Cloud Run with direct Gemini calls, with or without ADK) when:

You need very fine-grained control over the prompt construction or tool execution order
Your tools have unusual auth requirements
You’re building research-style agents that need to compose differently from Agent Builder’s model

For most production use cases, Agent Builder is the right call. The “build everything from scratch” instinct costs you in operational maturity later.

How Accelyze helps

We design and build production GenAI agents on Google Cloud, with the layered controls described above. Engagements typically start with a use-case audit (which actions are safe to delegate to a model, which require human-in-the-loop), an agent design with explicit tool schemas, and an eval harness that covers tool selection as well as output quality. If you’re considering an agent for a customer-facing or write-enabled workflow, get in touch.
