Five RAG Patterns We Keep Building on Google Cloud, and When Each One Is Right

“Just add RAG” has become the default answer to GenAI quality problems, and like most default answers it’s right often enough to be dangerous. Most production retrieval pipelines we’ve seen in the wild are built on a vague memory of a tutorial: chunk the docs, embed them, retrieve the top-k by cosine similarity, stuff the results into the prompt. That’s the naive pattern. It works for some use cases and quietly fails on most others.

This post walks through the five patterns we reach for at Accelyze, mapped to the GCP components that implement them. The point isn’t to pick the most sophisticated one. It’s to pick the simplest one that meets your retrieval quality bar.

The patterns at a glance

Pattern	When it’s right	GCP components
Naive RAG	Small static corpus, simple Q&A, low latency budget	AlloyDB `pgvector` or Vertex AI Vector Search
Hybrid (BM25 + vector)	Exact-match terms matter (codes, names, clause labels)	AlloyDB with `pgvector` + `tsvector`, or Vertex AI Search (Agent Search)
Agentic RAG	Queries that need decomposition or multi-source lookup	Vertex AI Agent Builder / ADK (part of the Gemini Enterprise Agent Platform) + Gemini function calling
GraphRAG	Highly connected data, entity relationships matter	Spanner Graph or Neo4j Aura on GCP, plus AlloyDB
Multi-hop / query decomposition	Questions that span multiple documents or facts	Gemini reasoning + iterative retrieval

Pattern 1: Naive RAG

The pattern everyone starts with. Chunk documents, embed them, store the embeddings, retrieve top-k by cosine similarity, inject into the prompt.

User query
   │
   ├──► Embed (gemini-embedding-001)
   │
   ├──► ANN search (pgvector or Vertex AI Vector Search) → top-k chunks
   │
   └──► Gemini generation with chunks in context

When it works: small to medium corpora (under 100K chunks), Q&A on relatively self-contained passages, latency budget under a second, no need for exact-term matching, no entity relationships.

When it falls over:

Queries that depend on a specific code, model number, or product SKU. Pure semantic similarity will miss these.
Multi-document synthesis. Top-k retrieval is per-query, single-shot. It can’t reason about what to retrieve next.
Long-tail queries on technical content. Embedding models are biased toward their training data distribution, so niche domains underperform.

Python side, naive RAG against AlloyDB looks like this:

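import asyncpg  # assumed driver: the $1 placeholders and conn.fetch() match asyncpg

async def naive_rag(query: str, conn: asyncpg.Connection) -> list[dict]:
    # embed() is assumed to return the query embedding as a plain list of floats.
    embedding = await embed(query)
    # <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
    rows = await conn.fetch(
        """
        SELECT content, source_url,
               1 - (embedding <=> $1::vector) AS score
        FROM document_chunks
        ORDER BY embedding <=> $1::vector
        LIMIT 8
        """,
        str(embedding),  # "[0.1, 0.2, ...]" is cast to vector by $1::vector
    )
    return [dict(r) for r in rows]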
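One detail in the query above is worth a guard. `str(embedding)` produces a valid pgvector literal for a plain Python list, but the string form of a NumPy array has no commas and will not parse. A minimal sketch of a formatting helper follows; `to_pgvector_literal` is a hypothetical name of ours, not part of pgvector or any driver.

```python
from collections.abc import Iterable


def to_pgvector_literal(values: Iterable[float]) -> str:
    """Format numbers as pgvector text input, e.g. '[0.1,0.2,0.3]'."""
    # float() accepts ints and NumPy scalars, so lists and arrays format the same way.
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


literal = to_pgvector_literal([0.5, 1, -2.25])
```

Passing `to_pgvector_literal(embedding)` in place of `str(embedding)` keeps the SQL unchanged.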

TypeScript against Vertex AI Vector Search (still served by MatchServiceClient in the v1 SDK — Vector Search 2.0 adds a Collections API alongside this, see below):

import { MatchServiceClient } from "@google-cloud/aiplatform";

const client = new MatchServiceClient({ apiEndpoint: "..." });

async function naiveRag(query: string) {
  const queryEmbedding = await embed(query);
  const [response] = await client.findNeighbors({
    indexEndpoint: "projects/.../indexEndpoints/...",
    deployedIndexId: "...",
    queries: [{
      datapoint: { featureVector: queryEmbedding },
      neighborCount: 8,
    }],
  });
  // nearestNeighbors can be empty or undefined, so fall back to an empty list.
  return response.nearestNeighbors?.[0]?.neighbors ?? [];
}

The embed() helper above wraps the unified @google/genai SDK — the TypeScript counterpart of google-genai. Calling gemini-embedding-001 looks like:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ vertexai: true, project: PROJECT, location: LOCATION });

async function embed(text: string): Promise<number[]> {
  const resp = await ai.models.embedContent({
    model: "gemini-embedding-001",
    contents: text,
  });
  return resp.embeddings![0].values!;
}

Pattern 2: Hybrid (BM25 + vector)

The single highest-leverage upgrade from naive RAG, and almost always worth implementing.

The insight: semantic similarity is good for meaning, bad for exact strings. BM25 (or any keyword/lexical scoring) is good for exact strings, bad for meaning. Combine them. Run both retrievers in parallel, fuse the rankings (Reciprocal Rank Fusion works well as a default), return the merged top-k.

User query
   │
   ├──► Vector retrieval ──┐
   │                       ├──► RRF rank fusion ──► top-k ──► Gemini
   └──► BM25 retrieval ────┘
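If the two retrievers live in different systems and fusion has to happen in application code, RRF is only a few lines. The sketch below assumes each retriever returns an ordered list of chunk IDs; `rrf_fuse` and the sample IDs are illustrative names, and `k = 60` is the commonly used smoothing constant. A chunk absent from one list simply gets no contribution from that list.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_k: int = 8
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each list adds 1 / (k + rank) for the IDs it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


vector_hits = ["a", "b", "c"]   # hypothetical vector retrieval order
lexical_hits = ["c", "a", "d"]  # hypothetical BM25 retrieval order
fused = rrf_fuse([vector_hits, lexical_hits])
```

Here "a" and "c" rise to the top because both retrievers returned them, which is the behavior hybrid retrieval is after.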

On AlloyDB, both retrievers live in the same query. PostgreSQL has full-text search via tsvector and tsquery built in. You can run lexical and vector retrieval in the same SQL statement:

WITH vector_results AS (
  SELECT id, content,
         1 - (embedding <=> $1::vector) AS vec_score,
         row_number() OVER (ORDER BY embedding <=> $1::vector) AS vec_rank
  FROM chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 50
),
lexical_results AS (
  SELECT id, content,
         ts_rank_cd(tsv, websearch_to_tsquery($2)) AS lex_score,
         row_number() OVER (ORDER BY ts_rank_cd(tsv, websearch_to_tsquery($2)) DESC) AS lex_rank
  FROM chunks
  WHERE tsv @@ websearch_to_tsquery($2)
  ORDER BY lex_score DESC
  LIMIT 50
)
SELECT COALESCE(v.id, l.id) AS id,
       COALESCE(v.content, l.content) AS content,
       (1.0 / (60 + COALESCE(v.vec_rank, 1000))) +
       (1.0 / (60 + COALESCE(l.lex_rank, 1000))) AS rrf_score
FROM vector_results v
FULL OUTER JOIN lexical_results l USING (id)
ORDER BY rrf_score DESC
LIMIT 8;

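That’s a single query, no application-layer fusion code, and it leans on AlloyDB’s query planner. The constant 60 is the usual RRF smoothing term, and a chunk missing from one list gets a placeholder rank of 1000, so that list contributes almost nothing to its score.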

For Vertex AI Search, hybrid retrieval is a configuration option on the data store. The product handles BM25 plus vector blending and surfaces a single ranked list. If you’re using it anyway, turn it on.

Vertex AI Vector Search 2.0 (late 2025) closes most of the gap with Vertex AI Search for self-built retrieval: the new Collections API gives you auto-embeddings (store text, vectors are produced server-side), built-in full-text search alongside vector, native hybrid with RRF in a single query, and self-tuning ANN parameters. If you’d otherwise be choosing between hand-rolling RRF on AlloyDB and running Vertex AI Search, Vector Search 2.0 is now a credible third option that removes a lot of the operational lift.

When hybrid is the right pick: queries that mix conceptual and exact-match terms (“what does our policy say about MFA on AWS root accounts”), domains with lots of identifiers (legal clauses, product SKUs, regulatory codes), corpora where naive RAG misses obvious matches.

Pattern 3: Agentic RAG

Single-shot retrieval is a hard limit. Agentic RAG turns retrieval into a tool the model can call multiple times, with different queries, deciding what to fetch next based on what it’s already seen.

User query
   │
   ▼
┌──────────────────────────────────────────┐
│  Agent (Gemini via Agent Builder)        │
│                                          │
│   Plan ──► retrieve_v1(query='X')        │
│             │                            │
│             ▼                            │
│   Reason ──► retrieve_v2(query='Y')      │
│             │                            │
│             ▼                            │
│   Answer with synthesized context        │
└──────────────────────────────────────────┘

Vertex AI Agent Builder, with Agent Engine as the managed runtime, is the natural home for this. You expose retrieval as a function (or several functions, one per index or data source), Gemini’s function calling decides when and what to retrieve, and the agent loop handles the multi-turn reasoning.

Worth building when:

A single query embedding doesn’t capture what to retrieve (the user’s actual information need is implicit)
You have multiple knowledge sources with different retrieval semantics (a code repo, a wiki, a ticketing system)
Some queries can be answered without retrieval at all, and the agent should know when to skip it

Common failure modes:

Latency multiplication. Each round is a model call plus a retrieval call, and the rounds run sequentially, so every round adds seconds. Three rounds can put you near 10 seconds end to end.
Cost. Multiple Gemini calls per user query.
Infinite loops or runaway tool use. Set a max-turns limit, log every tool call, monitor in Cloud Trace.
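The max-turns limit and per-call logging are easy to show in isolation. The sketch below is framework-agnostic and does not use the Agent Builder or ADK APIs: `decide` stands in for the model's function-calling decision and `retrieve` for your retrieval tool, and both are hypothetical callables you would supply.

```python
import logging
from collections.abc import Callable
from dataclasses import dataclass, field

logger = logging.getLogger("agentic_rag")

# A step is ("retrieve", query) or ("answer", text). In a real system this
# comes from the model's function-calling response.
Step = tuple[str, str]


@dataclass
class AgentState:
    question: str
    context: list[str] = field(default_factory=list)
    tool_calls: int = 0


def run_agent(
    question: str,
    decide: Callable[[AgentState], Step],
    retrieve: Callable[[str], list[str]],
    max_turns: int = 4,
) -> str:
    state = AgentState(question=question)
    for turn in range(max_turns):
        action, payload = decide(state)
        if action == "answer":
            return payload
        # Log every tool call so runaway behavior is visible in traces.
        logger.info("turn=%d tool=retrieve query=%r", turn, payload)
        state.tool_calls += 1
        state.context.extend(retrieve(payload))
    # Turn budget exhausted: stop instead of looping forever.
    return "I could not find a complete answer within the retrieval budget."
```

The hard cap bounds both latency and cost: the worst case is `max_turns` retrieval calls, whatever the model decides.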

See our agent guardrails post for the discipline around this.

Pattern 4: GraphRAG

When relationships between entities matter, vector search alone misses the structure. GraphRAG (the Microsoft Research framing of an older idea) builds a knowledge graph from your corpus and uses both graph traversal and vector retrieval to assemble context.

Documents ──► entity + relation extraction (Gemini)
   │
   ▼
Graph store (Spanner Graph / Neo4j on GCP Marketplace)
   │
   │   ┌─────────────────────────────────────────┐
   ▼   ▼                                          │
User query ──► retrieve community summaries ──►   │
              + traverse to relevant entities ──► │
              + vector retrieve supporting docs ──┘
                              │
                              ▼
                         Gemini answer

Right pick when: highly entity-centric domains (M&A intelligence, fraud investigation, scientific literature, org charts), questions like “how is X connected to Y,” “what’s the chain of accountability between these two events,” “summarize the relationships in this region of the data.”

Heavyweight to build. Don’t reach for it unless naive and hybrid have measurably failed and the retrieval failure is about structure, not similarity.

Spanner Graph (now GA) is the GCP-native transactional option, and BigQuery Graph — which shares Spanner Graph’s schema — is the analytical complement when you want to run graph queries alongside warehouse-scale data. Neo4j Aura on Google Cloud Marketplace is the more mature graph database if you can tolerate the third-party dependency. Entity and relation extraction is a Gemini job. Run it asynchronously via Cloud Run jobs or Cloud Workflows.
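To make the “how is X connected to Y” case concrete, here is a small in-memory sketch of the traversal step. It is a stand-in for a real graph store, not Spanner Graph or Neo4j code: it assumes extraction has already produced (subject, relation, object) triples, and the entity names are invented for illustration.

```python
from collections import defaultdict, deque
from typing import Optional

Triple = tuple[str, str, str]  # (subject, relation, object)


def build_graph(triples: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Build an adjacency list that can be walked in both directions."""
    graph: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((f"inverse:{rel}", subj))
    return graph


def find_connection(
    graph: dict[str, list[tuple[str, str]]],
    start: str,
    goal: str,
    max_hops: int = 4,
) -> Optional[list[Triple]]:
    """Breadth-first search for the shortest chain of relations from start to goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for rel, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, rel, neighbor)]))
    return None


graph = build_graph([
    ("Acme", "acquired", "Beta"),
    ("Beta", "employs", "Dana"),
    ("Dana", "sits_on_board_of", "Gamma"),
])
path = find_connection(graph, "Acme", "Gamma")
```

The returned chain of triples is the kind of context you would hand to Gemini alongside the supporting chunks from vector retrieval.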

Pattern 5: Multi-hop / query decomposition

When the user’s question requires answering several sub-questions first. “Did we ship more units of product X in Q3 than we forecasted in our Q2 board deck” needs: the Q3 unit number, the Q2 forecast, and a comparison.

Original query
   │
   ▼
Gemini: decompose into sub-questions
   │
   ├─► Retrieve for sub-question 1 ──► partial answer 1
   ├─► Retrieve for sub-question 2 ──► partial answer 2
   └─► ...
   │
   ▼
Gemini: synthesize partial answers into final response

This is structurally similar to agentic RAG but more constrained. The agent doesn’t decide what to retrieve; the planner does, up front. Easier to reason about, easier to bound latency.

When it’s right: business intelligence Q&A, complex compliance questions, anything where the answer is a synthesis of facts that live in different places.

Implementation: Gemini for the decomposition step, then a parallel fan-out of retrieval calls (using asyncio.gather in Python or Promise.all in TypeScript), then a final Gemini synthesis call. The parallel fan-out is what keeps latency in check compared to agentic RAG’s sequential turn loop.
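A minimal sketch of that shape in Python, under stated assumptions: `decompose`, `retrieve`, and `synthesize` are hypothetical async callables you would back with Gemini and your retriever, and the cap on sub-questions is our addition to keep the fan-out bounded.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def answer_multi_hop(
    question: str,
    decompose: Callable[[str], Awaitable[list[str]]],
    retrieve: Callable[[str], Awaitable[list[str]]],
    synthesize: Callable[[str, dict[str, list[str]]], Awaitable[str]],
    max_sub_questions: int = 5,
) -> str:
    # Step 1: the planner produces sub-questions up front (capped to bound cost).
    sub_questions = (await decompose(question))[:max_sub_questions]
    # Step 2: retrieve for all sub-questions concurrently.
    # asyncio.gather returns results in the same order as its inputs.
    results = await asyncio.gather(*(retrieve(sq) for sq in sub_questions))
    evidence = dict(zip(sub_questions, results))
    # Step 3: a single synthesis call over the collected evidence.
    return await synthesize(question, evidence)
```

Because the retrievals run concurrently, retrieval latency is roughly that of the slowest sub-question instead of the sum of all of them.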

Choosing between AlloyDB pgvector and Vertex AI Vector Search

This question comes up in every engagement.

Factor	AlloyDB pgvector	Vertex AI Vector Search
Corpus size sweet spot	Up to ~10M vectors	5M+, scales to billions
Latency at 95th percentile	50 to 200ms	Under 50ms at scale
Hybrid search built in	Yes, via tsvector	Native in Vector Search 2.0; otherwise via the Vertex AI Search product
Transactional writes	Yes, ACID	No; batch or streaming index updates
Joins with relational data	Native SQL	Application-layer join
Operational model	Managed PostgreSQL	Fully managed ANN service
Best at	Application-integrated retrieval	High-QPS, read-heavy retrieval

A common pattern in practice: start on AlloyDB while the system is forming, graduate specific high-traffic indexes to Vertex AI Vector Search once the access patterns stabilize. They live together fine.

Common failure modes across all patterns

A few things we see fail in production regardless of pattern:

Chunking strategy not thought through. Default character-count chunking destroys semantic boundaries. Chunk by structure where you can (clauses, paragraphs, sections). Test multiple chunk sizes against your eval set. There’s no universal right answer.
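As one illustration of chunking by structure, the sketch below packs whole paragraphs into chunks up to a character budget instead of cutting at a fixed offset. The function name and the 1200-character default are assumptions for the example, not a recommendation; the right size comes from your eval set.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks without splitting inside a paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Fits in the current chunk, or is a lone oversized paragraph kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_by_paragraph("aaaa\n\nbbbb\n\ncccc", max_chars=10)
```

The same packing loop works for clauses or sections if you swap the split pattern for one that matches your document structure.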

Embedding model never revisited. gemini-embedding-001 is the current default (it leads MTEB-Multilingual and outputs 3072 dimensions by default, truncatable to 1536 or 768 via Matryoshka Representation Learning), but it isn’t one-size-fits-all. Multilingual corpora, code, or domain-specialized text often benefit from a different embedding model. Evaluate, don’t assume.
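If you truncate embeddings client-side, one detail matters: a prefix of a unit-length vector is generally no longer unit-length, which affects inner-product comparisons (cosine distance is unaffected by scale). A minimal sketch that truncates and rescales, with `truncate_embedding` as a hypothetical helper name:

```python
import math


def truncate_embedding(values: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(values):
        raise ValueError(f"dims must be between 1 and {len(values)}")
    head = values[:dims]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [v / norm for v in head]


short = truncate_embedding([3.0, 4.0, 12.0], 2)
```

Whatever size you pick, store and query at the same dimensionality, and rerun the retrieval eval after changing it.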

No reranking. Top-k retrieval at the embedding level is a coarse filter. A reranker on the top 30 results, returning top 5, makes a measurable quality difference. The managed option is the Vertex AI ranking API (in Discovery Engine); the do-it-yourself options are a cross-encoder model or Gemini Flash scoring chunk relevance directly. Often cheap.
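The rerank step itself is small once a relevance scorer exists. In the sketch below the scorer is injected as a plain callable; `overlap_score` is a deliberately crude placeholder so the example runs on its own, and in practice it would be replaced by a cross-encoder, the ranking API, or a Gemini Flash relevance prompt.

```python
from collections.abc import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-order first-stage candidates by a stronger relevance score."""
    scored = [(score(query, chunk), i, chunk) for i, chunk in enumerate(candidates)]
    # Highest score first; ties keep the first-stage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_n]]


def overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query tokens that appear in the chunk."""
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = [
    "reset your password here",
    "billing and invoices",
    "how to reset mfa password",
]
top = rerank("reset mfa password", candidates, overlap_score, top_n=2)
```

Retrieve wide (for example 30 candidates), rerank, and pass only the top few to generation.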

Stale corpus. Whoever wrote the ingestion job is usually not the one who notices three months later that it stopped running. Monitor ingestion freshness as part of your observability. Alert on documents not updated within a defined window.
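A freshness check can be as simple as comparing each source's last successful ingestion time against a window. This sketch assumes you already record those timestamps somewhere (the dictionary below is hypothetical sample data); wiring the result into an alert is left to your monitoring stack.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def find_stale_sources(
    last_ingested: dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return sources whose last successful ingestion is older than max_age."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return sorted(src for src, ts in last_ingested.items() if ts < cutoff)


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
stale = find_stale_sources(
    {
        "wiki": datetime(2025, 6, 29, tzinfo=timezone.utc),
        "policies": datetime(2025, 3, 1, tzinfo=timezone.utc),
    },
    max_age=timedelta(days=7),
    now=now,
)
```

Run it on a schedule and alert whenever the returned list is non-empty.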

No retrieval evaluation. End-to-end evals catch generation problems. Retrieval-specific evals (was the right chunk in the top-k) catch the upstream problem. Build both. See our evaluation guide.
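Two retrieval metrics cover most of what you need early on: hit rate at k (was any relevant chunk in the top-k) and mean reciprocal rank (how high the first relevant chunk landed). A minimal sketch, assuming an eval set of retrieved chunk IDs per query paired with the set of IDs a human marked relevant; the sample IDs are invented.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not retrieved or len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must be non-empty and the same length")
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1 / rank of the first relevant chunk (0 when none is retrieved)."""
    if not retrieved or len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must be non-empty and the same length")
    total = 0.0
    for got, want in zip(retrieved, relevant):
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in want:
                total += 1.0 / rank
                break
    return total / len(retrieved)


retrieved = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = [{"b"}, {"z"}, {"q"}]
hit2 = hit_rate_at_k(retrieved, relevant, k=2)
mrr = mean_reciprocal_rank(retrieved, relevant)
```

Tracking these per pattern makes it clear whether a move from naive to hybrid or agentic retrieval actually improved the upstream step.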

How Accelyze helps

We design and build retrieval pipelines for GenAI applications across all five patterns. Engagements typically start with the naive baseline, measure against the eval harness, and graduate to hybrid, agentic, or graph patterns where the data justifies it. We cover the whole stack: chunking strategy, embedding model selection, AlloyDB or Vertex AI Vector Search setup, reranking, retrieval eval design. If your RAG isn’t performing the way you expected, get in touch.
