- Mar 6, 2026
- 10 min read
A Reference Architecture for Production GenAI on Google Cloud
Most GenAI prototypes look great in a demo. Then someone asks how the system will handle real traffic, where the logs go, what happens when a model call times out, who owns the prompts, how you’ll evaluate it next quarter, and the conversation gets uncomfortable. The gap between “we have a demo” and “this is in production” is almost entirely architectural. It’s not about better prompts.
This post is the architecture we reach for as a default. It isn’t the only valid one. But it’s coherent, and coherence is what matters when you’re trying to get something past the prototype stage without rewriting half of it.
What’s in the stack
The component inventory:
| Layer | Component | Role |
|---|---|---|
| API / serving | Cloud Run | Stateless request handling, scales to zero |
| Orchestration | Vertex AI Agent Builder / ADK / Agent Engine | Multi-step agentic flows, tool routing |
| Models | Vertex AI (Gemini) | Primary generation and reasoning |
| Model serving (custom) | GKE + vLLM | Open-weights models where cost or residency requires it |
| Retrieval (high-scale) | Vertex AI Vector Search | ANN index for large embedding corpora |
| Retrieval (transactional) | AlloyDB with pgvector | Vector and relational in the same query |
| Search + grounding | Vertex AI Search (Agent Search) | Enterprise search, grounding source for Gemini |
| Analytical substrate | BigQuery | Eval datasets, usage analytics, training corpora |
| Async work | Pub/Sub, Cloud Tasks | Fan-out, retries, decoupled processing |
| Observability | Cloud Logging, Cloud Trace, Vertex AI Gen AI evaluation service (Vertex AI Evals) | Request tracing, model quality, drift |
| Secrets and auth | Secret Manager, IAM, Workload Identity | No static credentials |
| Data perimeter | VPC-SC, CMEK | Enterprise data boundary, customer-managed keys |
| Config | Cloud Storage or Firestore (versioned) | Prompt templates, tool definitions, flags |
A naming note before the diagram: at Cloud Next 2026, Google consolidated Vertex AI’s agent tooling and Agentspace under the Gemini Enterprise Agent Platform umbrella, with first-class sub-products for the Agent Development Kit (ADK), Agent Engine, Agent Studio, and Agent Garden. Existing customers don’t have to migrate, and the console and APIs still surface “Vertex AI” throughout. We keep “Vertex AI” as the working term across this series because that’s how the docs still read; where ADK or Agent Engine is the more specific answer (code-first agents, managed runtime), we say so.
The diagram
┌──────────────────────────────────────────────────────────────────────────┐
│                           Client Applications                            │
│               (web, mobile, internal tools, partner APIs)                │
└─────────────────────────────────┬────────────────────────────────────────┘
                                  │ HTTPS / gRPC
                         IAM auth │ (ID token, or API key via Apigee)
┌─────────────────────────────────▼────────────────────────────────────────┐
│  Cloud Run (API layer, inside VPC-SC perimeter)                          │
│  Request validation, rate limiting, auth context propagation             │
│  Workload Identity into downstream services (no static keys anywhere)    │
└─────────┬──────────────────────────────────┬─────────────────────────────┘
          │ synchronous                      │ async fan-out
          │                                  │
┌─────────▼─────────────┐          ┌─────────▼─────────────────────────────┐
│  Vertex AI            │          │  Pub/Sub then Cloud Tasks             │
│  Agent Builder        │          │  (background enrichment, webhook      │
│  (tool routing,       │          │   callbacks, audit event stream)      │
│   multi-step flows)   │          └───────────────────────────────────────┘
└─────────┬─────────────┘
          │
┌─────────▼────────────────────────────────────────────────────────────────┐
│  Vertex AI (Gemini)                                                      │
│  gemini-2.5-flash or gemini-2.5-pro (gemini-3.1-pro / 3-flash ahead)     │
│  Grounding via: Vertex AI Search | AlloyDB pgvector | function calls     │
│  Model Garden: specialist models (vision, code, embeddings, Claude)      │
└─────────┬──────────────────────────┬─────────────────────────────────────┘
          │                          │
┌─────────▼──────────┐   ┌───────────▼─────────────────────────────────────┐
│  Vertex AI         │   │ AlloyDB (PostgreSQL-compatible)                 │
│  Vector Search     │   │ pgvector for embedding storage and ANN          │
│  (high-scale ANN,  │   │ Relational data colocated with vectors          │
│   10M+ vectors)    │   │ Transactional writes from the application layer │
└────────────────────┘   └───────────┬─────────────────────────────────────┘
                                     │
                         ┌───────────▼─────────────────────────────────────┐
                         │ BigQuery                                        │
                         │ Eval datasets and golden sets                   │
                         │ Usage analytics and cost attribution            │
                         │ Training and fine-tune corpus prep              │
                         │ VECTOR_SEARCH for analytical similarity         │
                         └─────────────────────────────────────────────────┘
Observability (cross-cutting):
- Cloud Logging: Log Analytics, BigQuery export
- Cloud Trace: latency distribution per model call and tool
- Vertex AI Gen AI evaluation service: quality metrics and regression detection
- Cloud Monitoring: SLOs and alerting
Security (cross-cutting):
- VPC-SC: perimeter around Vertex AI, BigQuery, AlloyDB, Cloud Storage
- CMEK: customer-managed encryption for BigQuery, Cloud Storage, AlloyDB
- Secret Manager: no static credentials anywhere
- IAM and Workload Identity: per-service, per-environment permissions
The seams that actually matter
Where prompts and tool definitions live
Prompts are code. They should be versioned, reviewed, and deployed like config. We keep them in Cloud Storage under a versioned prefix (gs://accelyze-config/prompts/v1.2.3/contract-review.txt) and load them at cold start. Vertex AI Agent Builder tool definitions live next to them. That means rolling back a prompt regression is a config deploy, not a code deploy. It also means your eval harness can pin a prompt version while you iterate on something else.
This is one of those things that looks like overengineering for the first month and saves you twice in the second month.
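To make the versioned-prefix idea concrete, here is a minimal sketch of a prompt reference plus a cache that loads each template once. `PromptRef`, `PromptStore`, and the injected `fetch` callable are hypothetical names for illustration; in a real service `fetch` would wrap a Cloud Storage read, and the base path is simply the example prefix used above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptRef:
    """Identifies one prompt template at one pinned version (hypothetical type)."""
    name: str     # e.g. "contract-review"
    version: str  # e.g. "v1.2.3"

    def uri(self, base: str = "gs://accelyze-config/prompts") -> str:
        # Versioned prefix layout: <base>/<version>/<name>.txt
        return f"{base}/{self.version}/{self.name}.txt"


class PromptStore:
    """Fetches each prompt at most once per process and caches the text."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # assumption: wraps a Cloud Storage download
        self._cache: dict[PromptRef, str] = {}

    def get(self, ref: PromptRef) -> str:
        if ref not in self._cache:
            self._cache[ref] = self._fetch(ref.uri())
        return self._cache[ref]


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without cloud access.
    store = PromptStore(lambda uri: f"<template loaded from {uri}>")
    print(store.get(PromptRef("contract-review", "v1.2.3")))
```

Because the version is part of the reference, an eval harness can hold one `PromptRef` fixed while the serving path moves to a newer one, and a rollback is a change to a single string.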
Where the eval loop closes
Production traffic flows into Cloud Logging. A Pub/Sub subscription routes sampled requests into BigQuery. Vertex AI Evals runs nightly against that dataset, comparing current output against the golden set and scoring it with an LLM-as-judge. Regressions trip a Cloud Monitoring alert before a customer complaint does. See our upcoming evaluation guide for the full harness.
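Two small pieces of that loop are easy to sketch in isolation: choosing which requests to sample, and deciding whether a nightly run counts as a regression. The following is an illustrative sketch, not the harness itself. The function names, the 0.05 tolerance, and the metric names are assumptions made for the example.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID.

    Hashing (rather than random.random()) means the same request is always
    either in or out of the sample, no matter which instance handles it.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,  # assumed threshold; tune per metric
) -> dict[str, float]:
    """Return the score change for every metric that fell more than `tolerance`."""
    return {
        name: current[name] - base
        for name, base in baseline.items()
        if name in current and base - current[name] > tolerance
    }


if __name__ == "__main__":
    print(should_sample("req-123", rate=0.1))
    # Hypothetical metric names and scores, for illustration only.
    print(regressions({"groundedness": 0.90, "fluency": 0.80},
                      {"groundedness": 0.80, "fluency": 0.79}))
```

A non-empty result from `regressions` is the kind of signal you would turn into a log-based metric or custom metric for Cloud Monitoring to alert on.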
Where caching lives
Three layers. Gemini’s context caching API is the first one to reach for if you have a large repeated system prompt or document. For deterministic responses (FAQ-style stuff), Cloud CDN in front of Cloud Run is dirt cheap and instant. For embeddings, store them in AlloyDB at write time. Never re-embed a document you already have.
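The “never re-embed” rule usually comes down to keying stored vectors by a hash of the content. A minimal sketch, assuming an injected `embed` callable and using an in-memory dict as a stand-in for the AlloyDB table (both are illustrative, not part of the architecture above):

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Computes an embedding only the first time a given text is seen."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed  # assumption: wraps the embedding model call
        self._store: dict[str, list[float]] = {}  # stand-in for an AlloyDB table

    @staticmethod
    def key(text: str) -> str:
        # Content hash: identical text maps to the same row, whatever its source.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        k = self.key(text)
        if k not in self._store:
            self._store[k] = self._embed(text)
        return self._store[k]


if __name__ == "__main__":
    calls = 0

    def fake_embed(text: str) -> list[float]:
        global calls
        calls += 1
        return [float(len(text))]

    cache = EmbeddingCache(fake_embed)
    cache.get("same chunk")
    cache.get("same chunk")
    print(calls)  # the second lookup is served from the store
```

In a real table the hash would typically sit in a unique column next to the vector, so the write path can skip chunks it has already embedded.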
AlloyDB vs. Vertex AI Vector Search
Both are first-class on GCP. The rule we use:
- AlloyDB with pgvector: vectors and relational data in the same query, corpus under about 10M vectors, you want SQL joins between metadata and similarity scores, transactional writes matter.
- Vertex AI Vector Search: corpus over about 10M vectors, you need ANN at high QPS, read-heavy and batch-indexed, you want a managed ScaNN-based index.
Many systems start on AlloyDB and graduate specific indexes to Vertex AI Vector Search as they scale. They’re complementary. We’ve never had to rip one out to put the other in.
Vector Search 2.0 (late 2025) narrows the operational gap further: indexes, endpoints, and feature serving collapse into a single Collection object, auto-embeddings let you store raw text and have vectors produced server-side, and hybrid (vector + full-text with RRF) is built in. ScaNN is still the engine. The AlloyDB-vs-Vector-Search choice now leans more on “do I need transactional writes and joins?” than on “do I want to babysit an index?”.
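For readers who have not met RRF (reciprocal rank fusion), the idea is small enough to show in full: each result list contributes 1 / (k + rank) to a document’s score, and the fused ranking sorts by the sum. This is a generic sketch of the technique, not Vector Search’s internal implementation; `k = 60` is the constant commonly used in the literature and is an assumption here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,  # assumed smoothing constant; larger k flattens rank differences
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]    # hypothetical ANN results, best first
    fulltext_hits = ["b", "d", "a"]  # hypothetical keyword results, best first
    for doc_id, score in reciprocal_rank_fusion([vector_hits, fulltext_hits]):
        print(doc_id, round(score, 5))
```

Documents that appear in both lists rise to the top even if neither list ranked them first, which is the behavior you want from hybrid retrieval.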
End-to-end code sketch
A Cloud Run handler that calls Gemini with vector-retrieved context from AlloyDB, then fires an async event. This is illustrative, not production code.
import os

import asyncpg
from fastapi import FastAPI
from google import genai
from google.cloud import pubsub_v1, storage
from google.genai import types
from pydantic import BaseModel

PROJECT_ID = os.environ["GCP_PROJECT"]
LOCATION = "us-central1"
ALLOYDB_DSN = os.environ["ALLOYDB_DSN"]  # injected from Secret Manager at startup

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
publisher = pubsub_v1.PublisherClient()
app = FastAPI()


def load_prompt(uri: str) -> str:
    """Read a prompt template from a gs://bucket/path URI (minimal helper)."""
    bucket_name, _, blob_name = uri.removeprefix("gs://").partition("/")
    return storage.Client().bucket(bucket_name).blob(blob_name).download_as_text()


# Loaded once at cold start, so every request on this instance uses the same pinned version.
SYSTEM_PROMPT = load_prompt("gs://accelyze-config/prompts/v1.0/qa-system.txt")


class QueryRequest(BaseModel):
    request_id: str
    query: str
    user_id: str


async def embed(text: str) -> list[float]:
    # Use the SDK's async surface; the sync call would block the event loop.
    resp = await client.aio.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
    )
    return resp.embeddings[0].values


@app.post("/query")
async def handle_query(req: QueryRequest):
    query_vec = await embed(req.query)

    # One connection per request keeps the sketch short; use a pool in practice.
    conn = await asyncpg.connect(ALLOYDB_DSN)
    try:
        rows = await conn.fetch(
            """
            SELECT content, source_url,
                   1 - (embedding <=> $1::vector) AS score
            FROM document_chunks
            WHERE 1 - (embedding <=> $1::vector) > 0.75
            ORDER BY embedding <=> $1::vector
            LIMIT 6
            """,
            str(query_vec),  # passed as text and cast to vector by the query
        )
    finally:
        await conn.close()

    context = "\n\n---\n\n".join(
        f"[{row['source_url']}]\n{row['content']}" for row in rows
    )

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=[f"Context:\n{context}\n\nQuestion: {req.query}"],
        config=types.GenerateContentConfig(
            system_instruction=SYSTEM_PROMPT,
            temperature=0.1,
            max_output_tokens=1024,
        ),
    )
    answer = response.text or ""  # guard against a response with no text part

    # Fire-and-forget audit event; extra keyword arguments become message attributes.
    publisher.publish(
        publisher.topic_path(PROJECT_ID, "query-completions"),
        data=answer.encode(),
        request_id=req.request_id,
        user_id=req.user_id,
    )
    return {"answer": answer, "sources": [row["source_url"] for row in rows]}
A real handler adds structured logging with trace IDs, Workload Identity for AlloyDB auth (no DSN with a password in it), response schema validation, and a circuit breaker around the Gemini call. We left those out so the shape of the flow is visible.
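Of the pieces left out, the circuit breaker is the least familiar to many teams, so here is a minimal sketch of one. It is a generic implementation of the pattern, not a library API: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,      # assumed: consecutive failures before opening
        reset_after: float = 30.0,  # assumed: seconds to wait before a trial call
        clock: Callable[[], float] = time.monotonic,
    ):
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open; skipping upstream call")
            # Half-open: allow one trial call; a single failure re-opens the circuit.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

In the handler, the Gemini call would go through `breaker.call(...)`, and `CircuitOpenError` would map to the same structured fallback used for rate-limit and availability errors. An async handler would need an async variant of `call`.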
Enterprise data boundaries
If your customer has data residency or perimeter requirements, the architecture above runs inside a VPC-SC perimeter. Practical bits:
VPC-SC keeps all API calls to Vertex AI, BigQuery, AlloyDB, and Cloud Storage inside the perimeter. Data can’t exfiltrate via these APIs, and access from outside the perimeter gets blocked at the control plane.
CMEK lets you encrypt BigQuery datasets, Cloud Storage buckets, and AlloyDB clusters with a Cloud KMS key you own. Rotation and revocation are yours.
Private connectivity: Cloud Run egress routes through a VPC connector. Vertex AI Private Service Connect keeps model calls off the public internet.
Data residency: pick a region (e.g. europe-west4) and enforce it at the org policy level with constraints/gcp.resourceLocations. Verify that the Vertex AI model variants you’re using are available in that region before you commit.
These controls are dramatically easier to add at design time than to retrofit. We scope them in from day one on any enterprise engagement.
The boring stuff that breaks production
Some things rarely appear in architecture diagrams but always appear in post-launch incidents.
Cost attribution. Tag every Cloud Run service, Vertex AI endpoint, and BigQuery dataset with cost-center, env, and feature labels. Use the BigQuery billing export to build a cost-per-feature dashboard. Gemini inference cost is predictable once you know p50 token counts. Embedding cost is negligible. Where surprises happen is vector search QPS at scale.
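The “predictable once you know p50 token counts” claim is just arithmetic, sketched below. The per-million-token prices are parameters on purpose: the numbers in the usage example are placeholders, not real Gemini pricing, and using p50 rather than the mean gives a rough estimate only.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,   # price per 1M input tokens (look up the current rate)
    usd_per_m_output: float,  # price per 1M output tokens
) -> float:
    """Cost of a single model call from its token counts."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def monthly_cost_usd(
    requests_per_day: int,
    p50_input_tokens: int,
    p50_output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    days: int = 30,
) -> float:
    """Rough monthly estimate: typical request cost times request volume."""
    per_request = request_cost_usd(
        p50_input_tokens, p50_output_tokens, usd_per_m_input, usd_per_m_output
    )
    return per_request * requests_per_day * days


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    print(monthly_cost_usd(10_000, 2_000, 500, usd_per_m_input=1.0, usd_per_m_output=4.0))
```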
Quota management. Vertex AI Gemini quotas (RPM and TPM) are per-project. For multi-tenant systems, put per-tenant rate limiting in your Cloud Run layer before you hit the project quota wall. Request quota increases at project setup, not after you hit the ceiling in prod at 9 PM on a Thursday.
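Per-tenant limiting is commonly done with a token bucket per tenant. The sketch below is one minimal way to write it; the class name, rates, and in-memory storage are assumptions. Because Cloud Run can run many instances, an in-process dict only limits per instance, so a shared store would be needed for a global limit.

```python
import time
from typing import Callable


class TenantRateLimiter:
    """Token bucket per tenant: `burst` capacity, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._rate = rate_per_sec
        self._burst = float(burst)
        self._clock = clock
        # tenant_id -> (tokens remaining, time of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (self._burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed


if __name__ == "__main__":
    limiter = TenantRateLimiter(rate_per_sec=5.0, burst=10)
    print(limiter.allow("tenant-a"))
```

A request that gets `False` should receive a 429 from your own API layer, which keeps one noisy tenant from spending the project-level quota for everyone else.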
Prompt versioning. Every model call should log which prompt template version it used. When a regression shows up in evals, you need to know whether it was a model update, a prompt change, or a retrieval change. Without prompt versioning this turns into a half-day archaeology project.
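One low-effort way to get there is a structured log line per model call. On Cloud Run, a JSON object written to stdout is ingested by Cloud Logging as a structured payload, so the fields become queryable. The field names below are assumptions for illustration, apart from `severity` and `message`, which Cloud Logging treats specially.

```python
import json


def model_call_entry(
    *,
    request_id: str,
    model: str,
    prompt_name: str,
    prompt_version: str,
    retrieval_source: str,
    latency_ms: float,
) -> str:
    """Build one JSON log line describing a model call."""
    return json.dumps(
        {
            "severity": "INFO",
            "message": "model_call",
            "request_id": request_id,
            "model": model,
            "prompt_name": prompt_name,
            "prompt_version": prompt_version,      # which template version was used
            "retrieval_source": retrieval_source,  # which index or table was queried
            "latency_ms": latency_ms,
        }
    )


if __name__ == "__main__":
    # Hypothetical values; print() to stdout is enough on Cloud Run.
    print(model_call_entry(
        request_id="req-123",
        model="gemini-2.5-flash",
        prompt_name="qa-system",
        prompt_version="v1.0",
        retrieval_source="alloydb:document_chunks",
        latency_ms=412.0,
    ))
```

With model, prompt version, and retrieval source on the same line, the “which of the three changed?” question becomes a filter on logs rather than an investigation.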
Graceful degradation. If Vertex AI returns a 429 or 503, the handler should return a structured fallback (“I’m not able to answer that right now, here are some relevant links”) instead of leaking a raw API error. Users should never see a stack trace from an inference call.
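A sketch of that fallback path follows. `UpstreamError` is a hypothetical stand-in for whatever exception your SDK raises on an HTTP error; map the real exception and its status code onto it in your own code. The `Answer` shape and the fallback wording are likewise illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable

DEGRADABLE_STATUS = {429, 503}  # rate-limited or temporarily unavailable


class UpstreamError(Exception):
    """Hypothetical wrapper for an HTTP error from the model API."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


@dataclass
class Answer:
    text: str
    degraded: bool = False
    links: list[str] = field(default_factory=list)


def answer_with_fallback(generate: Callable[[], str], links: list[str]) -> Answer:
    """Return the model's answer, or a structured fallback on 429/503."""
    try:
        return Answer(text=generate())
    except UpstreamError as err:
        if err.status not in DEGRADABLE_STATUS:
            raise  # other errors are bugs or bad requests; let them surface to logs
        return Answer(
            text="I'm not able to answer that right now, here are some relevant links.",
            degraded=True,
            links=links,
        )
```

The `degraded` flag lets the client render the fallback differently and lets you count degraded responses as their own metric, separate from hard failures.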
What this architecture buys you
When we tell a client “we’ll build you a production-grade GenAI system,” this is what we mean concretely: Cloud Run for the API surface, Vertex AI Agent Builder for orchestration, Gemini for generation, AlloyDB and Vertex AI Vector Search for retrieval, BigQuery as the analytical backbone, Pub/Sub for the async event stream. Evals in the Vertex AI Gen AI evaluation service on top. Logging and tracing in Cloud Logging and Cloud Trace. Perimeter controls in VPC-SC and CMEK if the engagement needs them.
The decisions inside this architecture (which retrieval store, whether to put Agent Builder in front of Gemini or call it directly, when to ground via Vertex AI Search) are the ones we will walk through in our build vs. buy vs. fine-tune framework and during the engagement playbook Phase 1 discovery.
How Accelyze helps
We design and build production GenAI systems on this architecture. The engagement covers the whole stack: API layer, retrieval design, eval harness, security controls. As a Google Cloud partner focused on GenAI, we bring both deep product knowledge of the GCP AI stack and the discipline to take an architecture from a diagram to something running with SLOs against it. If you’re planning a GenAI initiative on Google Cloud and want a team that has shipped this stack before, get in touch.