• Mar 6, 2026

A Reference Architecture for Production GenAI on Google Cloud

Most GenAI prototypes look great in a demo. Then someone asks how they’ll handle real traffic, where the logs go, what happens when a model call times out, who owns the prompts, and how you’ll evaluate the system next quarter, and the conversation gets uncomfortable. The gap between “we have a demo” and “this is in production” is almost entirely architectural. It’s not about better prompts.

This post is the architecture we reach for as a default. It isn’t the only valid one. But it’s coherent, and coherence is what matters when you’re trying to get something past the prototype stage without rewriting half of it.

What’s in the stack

The component inventory:

| Layer | Component | Role |
| --- | --- | --- |
| API / serving | Cloud Run | Stateless request handling, scales to zero |
| Orchestration | Vertex AI Agent Builder / ADK / Agent Engine | Multi-step agentic flows, tool routing |
| Models | Vertex AI (Gemini) | Primary generation and reasoning |
| Model serving (custom) | GKE + vLLM | Open-weights models where cost or residency requires it |
| Retrieval (high-scale) | Vertex AI Vector Search | ANN index for large embedding corpora |
| Retrieval (transactional) | AlloyDB with pgvector | Vector and relational in the same query |
| Search + grounding | Vertex AI Search (Agent Search) | Enterprise search, grounding source for Gemini |
| Analytical substrate | BigQuery | Eval datasets, usage analytics, training corpora |
| Async work | Pub/Sub, Cloud Tasks | Fan-out, retries, decoupled processing |
| Observability | Cloud Logging, Cloud Trace, Vertex AI Gen AI evaluation service (Vertex AI Evals) | Request tracing, model quality, drift |
| Secrets and auth | Secret Manager, IAM, Workload Identity | No static credentials |
| Data perimeter | VPC-SC, CMEK | Enterprise data boundary, customer-managed keys |
| Config | Cloud Storage or Firestore (versioned) | Prompt templates, tool definitions, flags |

A naming note before the diagram: at Cloud Next 2026, Google consolidated Vertex AI’s agent tooling and Agentspace under the Gemini Enterprise Agent Platform umbrella, with first-class sub-products for the Agent Development Kit (ADK), Agent Engine, Agent Studio, and Agent Garden. Existing customers don’t have to migrate, and the console and APIs still surface “Vertex AI” throughout. We keep “Vertex AI” as the working term across this series because that’s how the docs still read; where ADK or Agent Engine is the more specific answer (code-first agents, managed runtime), we say so.

The diagram

┌──────────────────────────────────────────────────────────────────────────┐
│                        Client Applications                                │
│               (web, mobile, internal tools, partner APIs)                 │
└─────────────────────────────────┬────────────────────────────────────────┘
                                  │ HTTPS / gRPC
                         IAM auth │ (ID token, or API key via Apigee)
┌─────────────────────────────────▼────────────────────────────────────────┐
│                  Cloud Run  (API layer, inside VPC-SC perimeter)          │
│   Request validation, rate limiting, auth context propagation             │
│   Workload Identity into downstream services (no static keys anywhere)    │
└─────────┬──────────────────────────────────┬─────────────────────────────┘
          │ synchronous                       │ async fan-out
          │                                  │
┌─────────▼─────────────┐          ┌─────────▼─────────────────────────────┐
│  Vertex AI            │          │  Pub/Sub then Cloud Tasks              │
│  Agent Builder        │          │  (background enrichment, webhook       │
│  (tool routing,       │          │   callbacks, audit event stream)       │
│   multi-step flows)   │          └────────────────────────────────────────┘
└─────────┬─────────────┘

┌─────────▼─────────────────────────────────────────────────────────────────┐
│                       Vertex AI  (Gemini)                                  │
│   gemini-2.5-flash  or  gemini-2.5-pro   (gemini-3.1-pro / 3-flash ahead)  │
│   Grounding via: Vertex AI Search │ AlloyDB pgvector │ function calls      │
│   Model Garden: specialist models (vision, code, embeddings, Claude)       │
└─────────┬──────────────────────────┬──────────────────────────────────────┘
          │                          │
┌─────────▼──────────┐   ┌──────────▼──────────────────────────────────────┐
│ Vertex AI          │   │  AlloyDB  (PostgreSQL-compatible)                 │
│ Vector Search      │   │  pgvector for embedding storage and ANN           │
│ (high-scale ANN,   │   │  Relational data colocated with vectors           │
│  10M+ vectors)     │   │  Transactional writes from the application layer  │
└────────────────────┘   └──────────┬──────────────────────────────────────┘

                          ┌──────────▼──────────────────────────────────────┐
                          │  BigQuery                                         │
                          │  Eval datasets and golden sets                    │
                          │  Usage analytics and cost attribution             │
                          │  Training and fine-tune corpus prep               │
                          │  VECTOR_SEARCH for analytical similarity          │
                          └─────────────────────────────────────────────────┘

Observability (cross-cutting):
  Cloud Logging, Log Analytics, BigQuery export
  Cloud Trace, latency distribution per model call and tool
  Vertex AI Gen AI evaluation service, quality metrics and regression detection
  Cloud Monitoring, SLOs and alerting

Security (cross-cutting):
  VPC-SC, perimeter around Vertex AI, BigQuery, AlloyDB, Cloud Storage
  CMEK, customer-managed encryption for BigQuery, Cloud Storage, AlloyDB
  Secret Manager, no static credentials anywhere
  IAM and Workload Identity, per-service, per-environment permissions

The seams that actually matter

Where prompts and tool definitions live

Prompts are code. They should be versioned, reviewed, and deployed like config. We keep them in Cloud Storage under a versioned prefix (gs://accelyze-config/prompts/v1.2.3/contract-review.txt) and load them at cold start. Vertex AI Agent Builder tool definitions live next to them. That means rolling back a prompt regression is a config deploy, not a code deploy. It also means your eval harness can pin a prompt version while you iterate on something else.

This is one of those things that looks like overengineering for the first month and saves you twice in the second month.
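A minimal sketch of the versioned-prefix pattern, assuming a bucket layout like the example path above. The names `PromptRef`, `PromptStore`, and `fake_fetch` are hypothetical, and the Cloud Storage download is injected as a plain function so the example runs offline; a real loader would wrap a storage client call instead.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical bucket name, modeled on the example path in the text.
CONFIG_BUCKET = "accelyze-config"


@dataclass(frozen=True)
class PromptRef:
    """Identifies one immutable prompt template by name and version."""
    name: str
    version: str

    def uri(self) -> str:
        return f"gs://{CONFIG_BUCKET}/prompts/v{self.version}/{self.name}.txt"


class PromptStore:
    """Loads each (name, version) once per process, e.g. at cold start."""

    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` maps a gs:// URI to text. It is injected here (an assumption
        # of this sketch) so the store can be exercised without network access.
        self._fetch = fetch
        self._cache: dict[PromptRef, str] = {}

    def get(self, ref: PromptRef) -> str:
        if ref not in self._cache:
            self._cache[ref] = self._fetch(ref.uri())
        return self._cache[ref]


# In-memory stand-in for the bucket; the contents are made up.
fake_bucket = {
    "gs://accelyze-config/prompts/v1.2.3/contract-review.txt": "Review the contract.",
    "gs://accelyze-config/prompts/v1.2.2/contract-review.txt": "Old wording.",
}
calls: list[str] = []


def fake_fetch(uri: str) -> str:
    calls.append(uri)
    return fake_bucket[uri]


store = PromptStore(fake_fetch)
current = store.get(PromptRef("contract-review", "1.2.3"))
# Rolling back is a matter of asking for the previous version.
rollback = store.get(PromptRef("contract-review", "1.2.2"))
```

Because the version is part of the key, an eval harness can hold one `PromptRef` fixed while the service moves to another.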

Where the eval loop closes

Production traffic flows into Cloud Logging. A log sink publishes sampled requests to Pub/Sub, and a BigQuery subscription lands them in a table. Vertex AI Evals runs nightly against that dataset, scoring current output against goldens and with an LLM-as-judge. Regressions trip a Cloud Monitoring alert before a customer complaint does. See our upcoming evaluation guide for the full harness.
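The regression check at the end of that loop can be as simple as comparing mean scores per metric against a baseline run. The sketch below is an illustration only: the metric names, the 0.05 tolerance, and the score values are all made up, and it does not call the evaluation service itself.

```python
from statistics import mean


def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: drop} for metrics whose mean fell by more than `tolerance`.

    `baseline` and `current` map a metric name to a list of per-example scores.
    """
    regressions = {}
    for metric, base_scores in baseline.items():
        cur_scores = current.get(metric)
        if not cur_scores:
            continue  # metric missing from tonight's run; nothing to compare
        drop = mean(base_scores) - mean(cur_scores)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical scores from a golden-set run and from tonight's sampled traffic.
baseline = {"groundedness": [0.9, 0.92, 0.88], "judge_score": [0.8, 0.8, 0.8]}
tonight = {"groundedness": [0.7, 0.75, 0.8], "judge_score": [0.79, 0.8, 0.81]}

regressed = find_regressions(baseline, tonight)
```

A non-empty result is the signal you would turn into a custom metric or log entry for Cloud Monitoring to alert on.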

Where caching lives

Three layers. Gemini’s context caching API is the first one to reach for if you have a large repeated system prompt or document. For deterministic responses (FAQ-style stuff), Cloud CDN in front of Cloud Run is dirt cheap and instant. For embeddings, store them in AlloyDB at write time. Never re-embed a document you already have.
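The "never re-embed" rule usually comes down to keying embeddings on a content hash. Here is a minimal sketch, assuming an in-memory dict as a stand-in for the AlloyDB table; `EmbeddingCache` and `fake_embed` are hypothetical names, and the whitespace normalization is one possible choice, not a requirement.

```python
import hashlib


class EmbeddingCache:
    """Returns a stored embedding when the same content has been seen before."""

    def __init__(self, embed_fn, store=None):
        self._embed_fn = embed_fn
        # Stand-in for a table keyed by content hash (an assumption of this sketch).
        self._store = {} if store is None else store

    @staticmethod
    def key(text: str) -> str:
        # Collapse whitespace so trivial formatting changes do not force a re-embed.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str):
        k = self.key(text)
        if k not in self._store:
            self._store[k] = self._embed_fn(text)
        return self._store[k]


embed_calls = []


def fake_embed(text):
    # Placeholder embedder: records the call and returns a dummy vector.
    embed_calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
first = cache.get("Clause 4.2 covers termination.")
second = cache.get("Clause 4.2   covers termination.")  # same content, extra spaces
```

In a real pipeline the hash would sit in a column next to the vector, so the write path can skip the embedding call when the hash already exists.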

AlloyDB with pgvector and Vertex AI Vector Search are both first-class on GCP. The rule we use to choose between them:

  • AlloyDB with pgvector: vectors and relational data in the same query, corpus under about 10M vectors, you want SQL joins between metadata and similarity scores, transactional writes matter.
  • Vertex AI Vector Search: corpus over about 10M vectors, you need ANN at high QPS, read-heavy and batch-indexed, you want a managed ScaNN-based index.

Many systems start on AlloyDB and graduate specific indexes to Vertex AI Vector Search as they scale. They’re complementary. We’ve never had to rip one out to put the other in.

Vector Search 2.0 (late 2025) narrows the operational gap further: indexes, endpoints, and feature serving collapse into a single Collection object, auto-embeddings let you store raw text and have vectors produced server-side, and hybrid (vector + full-text with RRF) is built in. ScaNN is still the engine. The AlloyDB-vs-Vector-Search choice now leans more on “do I need transactional writes and joins?” than on “do I want to babysit an index?”.
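For readers who haven't met RRF (reciprocal rank fusion), the standard formulation scores each document as the sum of 1 / (k + rank) over the result lists it appears in. The sketch below shows that textbook version with the commonly used k = 60; it is not a description of how Vector Search implements hybrid ranking internally, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Ties are broken by ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector query and a full-text query.
vector_hits = ["doc-a", "doc-b", "doc-c"]
text_hits = ["doc-c", "doc-a", "doc-d"]

fused = reciprocal_rank_fusion([vector_hits, text_hits])
# doc-a and doc-c appear in both lists, so they outrank doc-b and doc-d.
```

The appeal of RRF is that it only needs ranks, so cosine similarities and full-text relevance scores never have to be put on a common scale.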

End-to-end code sketch

A Cloud Run handler that calls Gemini with vector-retrieved context from AlloyDB, then fires an async event. This is illustrative, not production code.

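import os
from functools import lru_cache

import asyncpg
from fastapi import FastAPI
from google import genai
from google.cloud import pubsub_v1, storage
from google.genai import types
from pydantic import BaseModel

PROJECT_ID = os.environ["GCP_PROJECT"]
LOCATION = "us-central1"
ALLOYDB_DSN = os.environ["ALLOYDB_DSN"]  # injected from Secret Manager at startup

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
publisher = pubsub_v1.PublisherClient()
storage_client = storage.Client()
app = FastAPI()


@lru_cache(maxsize=None)
def load_prompt(uri: str) -> str:
    """Fetch a prompt template from Cloud Storage once per process.

    The download is a blocking call, but it only runs on the first request
    for a given URI; later calls are served from the in-process cache.
    """
    bucket_name, _, blob_name = uri.removeprefix("gs://").partition("/")
    return storage_client.bucket(bucket_name).blob(blob_name).download_as_text()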

class QueryRequest(BaseModel):
    request_id: str
    query: str
    user_id: str

async def embed(text: str) -> list[float]:
    # Use the async client (client.aio) so the call does not block the event loop.
    resp = await client.aio.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
    )
    return resp.embeddings[0].values

@app.post("/query")
async def handle_query(req: QueryRequest):
    query_vec = await embed(req.query)

    conn = await asyncpg.connect(ALLOYDB_DSN)
    try:
        rows = await conn.fetch(
            """
            SELECT content, source_url,
                   1 - (embedding <=> $1::vector) AS score
            FROM document_chunks
            WHERE 1 - (embedding <=> $1::vector) > 0.75
            ORDER BY embedding <=> $1::vector
            LIMIT 6
            """,
            str(query_vec),
        )
    finally:
        await conn.close()

    context = "\n\n---\n\n".join(
        f"[{row['source_url']}]\n{row['content']}" for row in rows
    )

    system_prompt = load_prompt("gs://accelyze-config/prompts/v1.0/qa-system.txt")

    # Async variant of generate_content, so other requests keep being served.
    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=[f"Context:\n{context}\n\nQuestion: {req.query}"],
        config=types.GenerateContentConfig(
            system_instruction=system_prompt,
            temperature=0.1,
            max_output_tokens=1024,
        ),
    )

    publisher.publish(
        publisher.topic_path(PROJECT_ID, "query-completions"),
        data=response.text.encode(),
        request_id=req.request_id,
        user_id=req.user_id,
    )

    return {"answer": response.text, "sources": [row["source_url"] for row in rows]}

A real handler adds structured logging with trace IDs, Workload Identity for AlloyDB auth (no DSN with a password in it), response schema validation, and a circuit breaker around the Gemini call. We left those out so the shape of the flow is visible.
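Of the omitted pieces, the circuit breaker is the least familiar, so here is a minimal sketch of the idea. It is synchronous for brevity, the thresholds are arbitrary, and `CircuitBreaker` and `CircuitOpenError` are hypothetical names rather than anything from a Google library. The clock is injectable so the behavior can be checked without sleeping.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the upstream call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, skipping upstream call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now re-opens the circuit immediately.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result


# Demonstration with a fake clock and a failing upstream.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def failing_call():
    raise ValueError("upstream unavailable")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except ValueError:
        pass

try:
    breaker.call(lambda: "ok")
    skipped = False
except CircuitOpenError:
    skipped = True  # the third call never reaches the upstream
```

The point of failing fast is that a struggling model endpoint stops receiving traffic from this instance, and callers get an immediate answer they can turn into a fallback.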

Enterprise data boundaries

If your customer has data residency or perimeter requirements, the architecture above runs inside a VPC-SC perimeter. Practical bits:

VPC-SC keeps all API calls to Vertex AI, BigQuery, AlloyDB, and Cloud Storage inside the perimeter. Data can’t exfiltrate via these APIs, and access from outside the perimeter gets blocked at the control plane.

CMEK lets you encrypt BigQuery datasets, Cloud Storage buckets, and AlloyDB clusters with a Cloud KMS key you own. Rotation and revocation are yours.

Private connectivity: Cloud Run egress routes through a VPC connector. Vertex AI Private Service Connect keeps model calls off the public internet.

Data residency: pick a region (e.g. europe-west4) and enforce it at the org policy level with constraints/gcp.resourceLocations. Verify that the Vertex AI model variants you’re using are available in that region before you commit.

These controls are dramatically easier to add at design time than to retrofit. We scope them in from day one on any enterprise engagement.

The boring stuff that breaks production

Some things rarely appear in architecture diagrams but always appear in post-launch incidents.

Cost attribution. Tag every Cloud Run service, Vertex AI endpoint, and BigQuery dataset with cost-center, env, and feature labels. Use the BigQuery billing export to build a cost-per-feature dashboard. Gemini inference cost is predictable once you know p50 token counts. Embedding cost is usually negligible by comparison. The surprises tend to come from vector search QPS at scale.

Quota management. Vertex AI Gemini quotas (RPM and TPM) are per-project. For multi-tenant systems, put per-tenant rate limiting in your Cloud Run layer before you hit the project quota wall. Request quota increases at project setup, not after you hit the ceiling in prod at 9 PM on a Thursday.
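Per-tenant limiting is commonly done with a token bucket per tenant. The sketch below keeps the buckets in process memory, which is an assumption made for illustration: Cloud Run runs multiple instances, so a real limiter would need shared state (a Redis-compatible store, for example). `TenantRateLimiter` and the tenant IDs are hypothetical.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, up to `burst` tokens."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self._clock = clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id, cost=1.0):
        now = self._clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at burst.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed


# Demonstration with a fake clock: 1 request/second, bursts of 2.
now = [0.0]
limiter = TenantRateLimiter(rate_per_sec=1.0, burst=2, clock=lambda: now[0])

results = [limiter.allow("tenant-a") for _ in range(3)]  # third is rejected
other_tenant_ok = limiter.allow("tenant-b")  # unaffected by tenant-a
now[0] = 1.0
after_refill = limiter.allow("tenant-a")  # one token has come back
```

Passing a `cost` proportional to expected tokens instead of 1.0 gives a rough per-tenant TPM limit with the same structure.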

Prompt versioning. Every model call should log which prompt template version it used. When a regression shows up in evals, you need to know whether it was a model update, a prompt change, or a retrieval change. Without prompt versioning this turns into a half-day archaeology project.
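One way to make that attribution cheap is to emit a single structured record per model call. The field names below are hypothetical, and the values are placeholders; the only platform behavior assumed is that Cloud Run forwards JSON lines written by the service to Cloud Logging as structured entries.

```python
import json
import logging

logger = logging.getLogger("genai.calls")


def model_call_record(*, request_id, model, prompt_name, prompt_version,
                      retrieval_index, input_tokens, output_tokens):
    """Build the per-call record that ties output to model, prompt, and index."""
    return {
        "event": "model_call",
        "request_id": request_id,
        "model": model,
        "prompt_name": prompt_name,
        "prompt_version": prompt_version,
        "retrieval_index": retrieval_index,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }


def log_model_call(**fields):
    record = model_call_record(**fields)
    # One JSON object per line keeps the entry queryable downstream.
    logger.info(json.dumps(record, sort_keys=True))
    return record


# Placeholder values for illustration.
record = log_model_call(
    request_id="req-123",
    model="gemini-2.5-flash",
    prompt_name="qa-system",
    prompt_version="1.0",
    retrieval_index="document_chunks",
    input_tokens=812,
    output_tokens=143,
)
```

With the model, prompt version, and retrieval index on every record, a regression can be narrowed down by grouping eval scores on those three fields.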

Graceful degradation. If Vertex AI returns a 429 or 503, the handler should return a structured fallback (“I’m not able to answer that right now, here are some relevant links”) instead of leaking a raw API error. Users should never see a stack trace from an inference call.
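A sketch of that fallback path, under stated assumptions: `UpstreamError` is a hypothetical exception standing in for whatever your client library raises, `answer_with_fallback` is an illustrative name, and the example links are made up. Only 429 and 503 are degraded here; other errors still propagate to normal error handling.

```python
class UpstreamError(Exception):
    """Hypothetical stand-in for an SDK error that carries an HTTP status."""

    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code


DEGRADABLE_STATUSES = {429, 503}
FALLBACK_TEXT = "I'm not able to answer that right now, here are some relevant links."


def answer_with_fallback(generate, question, fallback_links):
    """Call `generate(question)`; on 429/503 return a structured fallback."""
    try:
        return {"status": "ok", "answer": generate(question), "links": []}
    except UpstreamError as err:
        if err.status_code not in DEGRADABLE_STATUSES:
            raise  # unexpected failures should not be silently masked
        return {
            "status": "degraded",
            "answer": FALLBACK_TEXT,
            "links": list(fallback_links),
        }


def overloaded(_question):
    raise UpstreamError(429, "quota exceeded")


# Made-up links, e.g. the top retrieval hits gathered before the model call.
links = ["https://example.com/docs/returns", "https://example.com/docs/shipping"]

degraded = answer_with_fallback(overloaded, "What is the return policy?", links)
healthy = answer_with_fallback(lambda q: "30 days.", "What is the return policy?", links)
```

Since retrieval happens before generation in the handler, the sources are already in hand when the model call fails, which is what makes the "relevant links" fallback nearly free.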

What this architecture buys you

When we tell a client “we’ll build you a production-grade GenAI system,” this is what we mean concretely: Cloud Run for the API surface, Vertex AI Agent Builder for orchestration, Gemini for generation, AlloyDB and Vertex AI Vector Search for retrieval, BigQuery as the analytical backbone, Pub/Sub for the async event stream. Evals in the Vertex AI Gen AI evaluation service on top. Logging and tracing in Cloud Logging and Cloud Trace. Perimeter controls in VPC-SC and CMEK if the engagement needs them.

The decisions inside this architecture (which retrieval store to use, whether to put Agent Builder in front of Gemini or call Gemini directly, when to ground via Vertex AI Search) are the ones we walk through in our build vs. buy vs. fine-tune framework and in Phase 1 discovery of the engagement playbook.

How Accelyze helps

We design and build production GenAI systems on this architecture. The engagement covers the whole stack: API layer, retrieval design, eval harness, security controls. As a Google Cloud partner focused on GenAI, we bring both deep product knowledge of the GCP AI stack and the discipline to take an architecture from a diagram to something running with SLOs against it. If you’re planning a GenAI initiative on Google Cloud and want a team that has shipped this stack before, get in touch.