- Apr 14, 2026
GenAI for Customer Support: A Reference Build on Vertex AI
Customer support is the GenAI use case with the clearest ROI story. Tickets cost money to handle. Agents are expensive. Deflection rates of 20 to 40% are achievable, and the metrics that prove it (resolution rate, escalation correctness, CSAT) are already instrumented in most support organizations. The economics are unambiguous.
It’s also the use case where shipping badly produces public damage. A support bot that hallucinates a policy, gives the wrong refund instructions, or fails to escalate an angry customer is worse than no bot at all.
This post is the reference build we use for customer support GenAI on Vertex AI, plus the eval and operational discipline that has to come with it.
What “support GenAI” actually means
The term is overloaded. It helps to be specific about what gets built:
| Capability | What it is | Difficulty |
|---|---|---|
| FAQ answering | Respond to common questions from a knowledge base | Low |
| Ticket triage | Classify, route, prioritize incoming tickets | Low |
| Ticket summarization | Generate summaries for agents handling complex tickets | Low |
| Agent assist | Suggest responses to support agents in real time | Medium |
| Self-service resolution | Resolve customer issues end-to-end without human involvement | High |
| Proactive outreach | Detect issues from product signals, reach out before a customer asks | Very high |
We strongly recommend starting in the low-to-medium difficulty rows of this list. Of these, agent assist (suggesting drafts to a human agent who reviews and edits) is the right entry point for most enterprises. The human agent stays in the loop, quality concerns are bounded by the agent’s review, and the data you collect (agent edits to suggested drafts) is exactly what you need to evaluate moving to self-service later.
The reference architecture
┌──────────────────────────────────────────────────────────────────────────┐
│  Customer-facing channels                                                │
│  Web chat, in-app messaging, email, voice (via Conversational Agents)    │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Conversational Agents console (Dialogflow CX + Agent Builder, unified)  │
│  Intent classification, deterministic playbooks, NLU                     │
│  Routes between deterministic flows and Gemini-driven generation         │
└────────┬─────────────────────────────────────────┬───────────────────────┘
         │                                         │
         │ deterministic                           │ generative
         │                                         │
┌────────▼────────────┐         ┌──────────────────▼───────────────────────┐
│ Playbook flows      │         │ Gemini with grounding                    │
│ (auth, lookup,      │         │ - Vertex AI Search over KB               │
│ transactional       │         │ - Customer context tool                  │
│ actions)            │         │ - Order/account tool (read-only)         │
└────────┬────────────┘         └──────────────────┬───────────────────────┘
         │                                         │
         └───────────────────┬─────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Response orchestration                                                  │
│  - Output validation (response schema, safety filters)                   │
│  - Confidence gating (escalate to human if low confidence)               │
│  - Logging to Cloud Logging + BigQuery (every interaction sampled)       │
└────────┬─────────────────────────────────────────┬───────────────────────┘
         │                                         │
         ▼                                         ▼
┌────────────────────┐          ┌────────────────────────────────────┐
│ Customer reply     │          │ Human agent handoff                │
│                    │          │ (via existing helpdesk: Zendesk,   │
│                    │          │ Salesforce Service Cloud, etc.)    │
└────────────────────┘          └────────────────────────────────────┘

Analytics + eval (cross-cutting):
  BigQuery: every conversation logged, classified, joined with helpdesk data
  Vertex AI Evals: nightly run against golden conversations
  Vertex AI Conversation Analytics: aggregate metrics (deflection rate, escalation rate, CSAT)
  Looker Studio: ops dashboard for support managers
The component choices, with rationale:
Dialogflow CX for the conversation surface and intent classification. It’s mature and multi-channel, and it integrates with Genesys, Twilio, and the major contact center platforms. The intent classifier handles requests like “I want to speak to a human” without ever calling Gemini. Note: as of 2025-10-31 the standalone Dialogflow CX console was deprecated in favor of the unified Conversational Agents console, which now surfaces Dialogflow CX and Vertex AI Agent Builder together. Agent Builder is now part of the Gemini Enterprise Agent Platform, the Cloud Next 2026 umbrella covering Agent Builder, ADK, Agent Engine, Agent Studio, Agent Garden, and Agentspace. The underlying CX engine is unchanged, and the technical decomposition described here (CX for NLU, Agent Builder for generation) still applies.
Vertex AI Agent Builder for the generative parts. Function calling, tool integration with backend systems, deterministic playbook routing for high-confidence intents.
Vertex AI Search (now branded Agent Search; APIs still use Vertex AI Search / Discovery Engine endpoints) over the knowledge base. Grounding source for Gemini. Citations come back in the response. KB updates are picked up automatically.
Customer context tool. A function the agent can call to pull the calling user’s account state (subscription tier, recent orders, open tickets). Read-only.
Output validation and confidence gating. Every Gemini response runs through schema validation and a confidence check. Low confidence triggers human handoff with the conversation context.
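A minimal sketch of the validation and gating step, assuming the generation prompt asks Gemini for a JSON object with `answer`, `citations`, and `confidence` fields. That schema, the `gate_response` name, and the 0.7 threshold are illustrative assumptions rather than part of the reference build; a confidence score derived from retrieval relevance would slot in the same way.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; tune it per intent against your golden set.
CONFIDENCE_THRESHOLD = 0.7

# Assumed response schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"answer": str, "citations": list, "confidence": (int, float)}


@dataclass
class Decision:
    action: str                  # "reply" or "handoff"
    reason: str
    answer: Optional[str] = None


def gate_response(raw_model_output: str) -> Decision:
    """Validate the model's JSON output, then gate on confidence."""
    try:
        payload = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return Decision("handoff", "invalid_json")
    if not isinstance(payload, dict):
        return Decision("handoff", "invalid_schema")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected):
            return Decision("handoff", f"invalid_field:{name}")
    if not payload["citations"]:
        # No grounding source means no factual answer goes out.
        return Decision("handoff", "no_citations")
    if payload["confidence"] < CONFIDENCE_THRESHOLD:
        return Decision("handoff", "low_confidence")
    return Decision("reply", "ok", answer=payload["answer"])
```

Every failure path resolves to a handoff, so a malformed or weakly grounded response never reaches the customer. The `reason` string is worth logging, since it tells you later which gate fired.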
The eval design specific to support
Generic eval metrics (BLEU, ROUGE) don’t help here. The metrics that matter:
| Metric | What it measures | Target (varies by domain) |
|---|---|---|
| Resolution rate | % of conversations closed without escalation | 30 to 60% |
| First-response correctness | % of first responses that don’t require correction | 85%+ |
| Hallucination rate | % of responses that contain unsupported factual claims | <1% |
| Escalation correctness | % of escalations to human that should have been escalated | 90%+ |
| CSAT proxy | Sentiment score on customer messages after AI handling | flat or higher vs. human-only baseline |
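Resolution rate and escalation correctness reduce to simple ratios once each conversation carries two labels. The sketch below assumes a hypothetical `ConversationLabel` record in which `escalated` comes from the system logs and `should_escalate` comes from annotators; both field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ConversationLabel:
    escalated: bool          # the system handed off to a human
    should_escalate: bool    # annotator judgment on whether a handoff was right


def resolution_rate(labels: list[ConversationLabel]) -> float:
    """Share of conversations closed without escalation."""
    if not labels:
        return 0.0
    return sum(not c.escalated for c in labels) / len(labels)


def escalation_correctness(labels: list[ConversationLabel]) -> float:
    """Share of escalations that annotators agree should have been escalated."""
    escalations = [c for c in labels if c.escalated]
    if not escalations:
        return 0.0
    return sum(c.should_escalate for c in escalations) / len(escalations)
```

Defined this way, escalation correctness is a precision measure: it says nothing about conversations that should have been escalated and were not, so it is worth reading alongside the count of missed escalations.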
The hallucination metric is the one where naive setups fail loudly. A support bot that confidently states a refund policy that doesn’t exist is a customer service incident with a press-coverage tail risk. The mitigation is structural: ground every factual claim against Vertex AI Search results, refuse to answer when no relevant result was retrieved, and run an automated post-response check that scans for unsupported claims.
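from google import genai
from google.genai import types

# Placeholders: set these to your own project and region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

GROUNDING_CHECK_PROMPT = """
The assistant gave this response: {response}
The retrieved knowledge base sources were: {sources}
Identify any factual claims in the response that aren't supported by the sources.
List each unsupported claim on its own line, or output "NONE" if all claims are supported.
"""

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

def parse_unsupported_claims(text: str) -> list[str]:
    # Illustrative parser: one claim per non-empty line, list markers stripped.
    return [line.strip("-*• \t") for line in text.splitlines() if line.strip()]

def check_grounding(response: str, sources: list[str]) -> list[str]:
    """Return the unsupported claims in `response`; an empty list means fully grounded."""
    check = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=GROUNDING_CHECK_PROMPT.format(response=response, sources="\n".join(sources)),
        config=types.GenerateContentConfig(temperature=0.0),
    )
    verdict = (check.text or "").strip()
    if verdict == "NONE":
        return []
    # Fail closed: an empty or blocked checker reply counts as a failure.
    return parse_unsupported_claims(verdict) or ["<checker returned no verdict>"]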
Wire this into the eval harness. Every golden conversation gets a grounding check on every assistant turn. Run it on production traffic samples too.
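One way to wire the check into a harness is sketched below. The turn format (a dict with `response` and `sources` keys) and the `hallucination_rate` name are assumptions for illustration; the checker is passed in as a parameter so that `check_grounding` can be swapped for a stub in tests.

```python
from typing import Callable

# Assumed shape of one golden turn: {"response": str, "sources": list[str]}
GoldenTurn = dict


def hallucination_rate(
    turns: list[GoldenTurn],
    checker: Callable[[str, list[str]], list[str]],
) -> tuple[float, list[dict]]:
    """Run the grounding checker on every assistant turn.

    Returns the share of turns with at least one unsupported claim,
    plus the failing turns so a reviewer can inspect them.
    """
    failures = []
    for index, turn in enumerate(turns):
        unsupported = checker(turn["response"], turn["sources"])
        if unsupported:
            failures.append({"turn": index, "unsupported_claims": unsupported})
    rate = len(failures) / len(turns) if turns else 0.0
    return rate, failures
```

Calling `hallucination_rate(golden_turns, check_grounding)` gives the number to compare against the target in the metrics table; the same function works on a sample of production turns.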
Building the golden set for support
Real customer conversations are the gold standard for the eval set. How we build it:
- Sample 200 to 500 closed tickets from the past quarter. Have SMEs annotate the ideal AI response for each customer turn.
- Bucket by topic, severity, and outcome. Make sure the golden set covers each bucket proportionally to production volume.
- Include adversarial cases: angry customers, ambiguous requests, multi-issue conversations, edge-case policies.
- Include “should escalate” cases. These are the goldens where the right answer is to hand off, not to attempt resolution.
Store goldens in BigQuery, version them, version the annotation guidelines too. See the full eval guide for the harness mechanics.
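The proportional bucketing can be sketched as a stratified sample. The ticket fields (`topic`, `severity`, `outcome`) and the `sample_goldens` name are assumptions, and the one-ticket floor per bucket is an added choice so that rare buckets are not dropped entirely.

```python
import random
from collections import defaultdict


def sample_goldens(tickets: list[dict], target_size: int, seed: int = 0) -> list[dict]:
    """Stratified sample: each (topic, severity, outcome) bucket gets a share
    of the golden set proportional to its share of the ticket pool."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for ticket in tickets:
        key = (ticket["topic"], ticket["severity"], ticket["outcome"])
        buckets[key].append(ticket)

    sample = []
    for key in sorted(buckets):
        members = buckets[key]
        # At least one ticket per bucket so rare buckets are still covered.
        quota = max(1, round(target_size * len(members) / len(tickets)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

Because quotas are rounded per bucket, the result can differ from `target_size` by a few tickets. Adversarial and "should escalate" cases are usually added by hand on top of this sample, since production volume alone under-represents them.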
The metrics that catch problems in production
Pre-launch evals tell you whether to ship. Production metrics tell you whether the system has drifted. The dashboard we set up on every support deployment:
- Conversation volume by intent. Sudden shifts mean either an external event (product launch, outage) or a classifier change.
- Escalation rate by intent. If “billing dispute” escalations jump from 20% to 60%, something changed in the knowledge base or the model behavior.
- Average turns to resolution. Climbing means the agent is struggling.
- Grounding check failure rate. A trailing indicator of hallucination drift.
- Cost per conversation. Upward drift in token usage is often a sign of repeated tool calls or longer-than-necessary system prompts.
All of these come from BigQuery (fed by Cloud Logging) and live in a Looker Studio dashboard the support ops team owns.
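As an illustration of the escalation-rate signal, the sketch below compares two windows of conversation rows and flags intents whose rate jumped. It assumes rows have already been pulled from BigQuery as dicts with `intent` and `escalated` keys; the function names, the 15-point threshold, and the 50-conversation minimum are hypothetical defaults.

```python
from collections import Counter


def escalation_rates(rows: list[dict]) -> dict[str, float]:
    """Escalation rate per intent from rows with "intent" and "escalated" keys."""
    totals, escalated = Counter(), Counter()
    for row in rows:
        totals[row["intent"]] += 1
        escalated[row["intent"]] += bool(row["escalated"])
    return {intent: escalated[intent] / totals[intent] for intent in totals}


def flag_escalation_drift(baseline_rows, current_rows, max_increase=0.15, min_volume=50):
    """Return intents whose escalation rate rose by more than `max_increase`
    (absolute) versus the baseline window, as {intent: (baseline, current)}."""
    baseline = escalation_rates(baseline_rows)
    current = escalation_rates(current_rows)
    volume = Counter(row["intent"] for row in current_rows)
    alerts = {}
    for intent, rate in current.items():
        if volume[intent] < min_volume or intent not in baseline:
            continue  # too little data, or a new intent with no baseline
        if rate - baseline[intent] > max_increase:
            alerts[intent] = (baseline[intent], rate)
    return alerts
```

A jump from 20% to 60% on one intent would surface here as a single entry, which is the cue to check recent knowledge base edits and model or prompt changes for that intent.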
Self-service vs. agent assist: when to move
Most support deployments start as agent assist (the AI suggests; the human approves and sends) and graduate to self-service for narrow, low-risk intents.
The criteria we use to move an intent to self-service:
- Resolution rate above 80% on that intent in agent-assist mode (the human accepts the AI’s draft more than four times out of five).
- Hallucination rate under 0.5% on that intent.
- Escalation correctness above 95%: when the model says “I should escalate this,” it’s almost always right.
- The intent isn’t financially or legally risky. Refunds over a threshold, account changes, security-sensitive actions stay in agent-assist regardless of metrics.
Move one intent at a time. Monitor for two weeks before moving another. The system learns from its mistakes; you should too.
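The four criteria translate directly into a gate that can run against the per-intent metrics. This is a sketch: the `IntentMetrics` fields and the `ready_for_self_service` name are illustrative, and the `high_risk` flag is assumed to be set by a human policy decision rather than computed.

```python
from dataclasses import dataclass


@dataclass
class IntentMetrics:
    intent: str
    resolution_rate: float          # measured in agent-assist mode
    hallucination_rate: float
    escalation_correctness: float
    high_risk: bool                 # financial, legal, or security-sensitive


def ready_for_self_service(m: IntentMetrics) -> tuple[bool, list[str]]:
    """Apply the four graduation criteria; return the decision and any blockers."""
    blockers = []
    if m.high_risk:
        blockers.append("high-risk intent stays in agent assist regardless of metrics")
    if m.resolution_rate <= 0.80:
        blockers.append("resolution rate not above 80%")
    if m.hallucination_rate >= 0.005:
        blockers.append("hallucination rate not under 0.5%")
    if m.escalation_correctness <= 0.95:
        blockers.append("escalation correctness not above 95%")
    return (not blockers, blockers)
```

Returning the list of blockers, not just a boolean, gives the support ops team a concrete reason when an intent is held back.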
Integration with the existing helpdesk
The system is not the helpdesk. It’s a layer in front of (and a tool inside) the helpdesk. Critical integration points:
- Conversation transcripts written to the helpdesk for every AI interaction, including ones that resolved without human involvement. Auditable.
- Handoff to human includes the full conversation context, the AI’s reasoning at the point of escalation, the customer’s emotional state if detectable, and any tool calls the AI made. The human agent shouldn’t be cold-started.
- Agent feedback loop. When the human agent corrects or rejects an AI draft, that correction is logged. Periodically, those corrections feed into the eval golden set.
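A sketch of what the handoff payload might carry, assuming it is serialized to JSON and posted to the helpdesk's API or webhook. The `HandoffPayload` and `ToolCall` types and their field names are hypothetical; each helpdesk has its own ticket schema, so a real integration would map these fields onto it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result_summary: str


@dataclass
class HandoffPayload:
    conversation_id: str
    transcript: list[dict]                    # [{"role": ..., "text": ...}, ...]
    escalation_reason: str                    # the AI's reasoning at the point of escalation
    customer_sentiment: Optional[str] = None  # None when not detectable
    tool_calls: list[ToolCall] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested ToolCall dataclasses.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping the sentiment field optional reflects the "if detectable" qualifier above: an absent value is better than a guessed one.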
We’ve integrated with Zendesk, Salesforce Service Cloud, Intercom, and Genesys on various engagements. All of them have webhook and API surfaces sufficient for this pattern.
Cost realities
A reasonable cost benchmark, depending on architecture:
- Gemini 2.5 Flash for generation at $0.15 / $0.60 per 1M input/output tokens, which works out to roughly $0.0005 to $0.003 per conversation turn (varies with prompt size, grounding context, and whether you run a separate grounding-check pass).
- Vertex AI Search grounding, per-query pricing.
- Dialogflow CX, per session pricing.
For a 1,000-conversation-per-day deployment, monthly costs are typically four-figure to low-five-figure. The deflection economics work out clearly in favor of the system at almost any reasonable agent cost per ticket.
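The per-turn generation figure can be reproduced with a few lines of arithmetic. The rates below are the ones quoted in this post; the token counts and turns per conversation in the usage note are hypothetical, and Vertex AI Search and Dialogflow CX charges are left out because they are billed per query and per session.

```python
# Rates quoted in this post for Gemini 2.5 Flash, in USD per 1M tokens.
INPUT_RATE = 0.15
OUTPUT_RATE = 0.60


def turn_cost(input_tokens: int, output_tokens: int) -> float:
    """Generation cost in USD for a single conversation turn."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


def monthly_generation_cost(
    conversations_per_day: int,
    turns_per_conversation: int,
    input_tokens: int,
    output_tokens: int,
    days: int = 30,
) -> float:
    """Monthly generation cost in USD, assuming every turn has the same size."""
    per_turn = turn_cost(input_tokens, output_tokens)
    return conversations_per_day * turns_per_conversation * per_turn * days
```

For example, a turn with 3,000 input tokens (system prompt plus grounding context) and 300 output tokens costs about $0.00063, inside the range above. At 1,000 conversations per day and an assumed six turns each, that is roughly $113 per month for generation alone, so under these assumptions the per-query and per-session charges are where the budget deserves the closer look.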
How Accelyze helps
Accelyze designs and builds customer support GenAI systems on Vertex AI, integrated with existing helpdesk platforms. Engagements cover the full reference architecture above, the eval harness specific to support, the production observability, and the agent assist to self-service graduation path. If you’re considering support automation and want a team that has thought through the failure modes as carefully as the deflection rate, get in touch.