- Mar 26, 2026
- 10 min read
Build vs. Buy vs. Fine-Tune: A Decision Framework for Enterprise GenAI on Vertex AI
“Should we fine-tune the model?” is almost never the right first question. The right first questions are: what are we trying to do, what data do we have, what does good enough look like, and what are the latency, cost, and governance constraints. The answer to “should we fine-tune” falls out of those. Usually, it’s no.
This post is the decision framework we use at Accelyze when scoping a new GenAI system on Vertex AI. It’s a decision tree, not a formula. The right answer depends on specifics, and we’ll say so where the specifics matter.
The ladder
We think of implementation options as a ladder. Each rung is more complex, more expensive, and more powerful than the one below. The goal is to find the lowest rung that meets your requirements.
Rung 6: Open-weights model on GKE (Llama, Mistral, Gemma, Claude via Vertex)
│
Rung 5: Supervised fine-tune on Vertex AI (full fine-tune or LoRA adapter)
│
Rung 4: Retrieval-augmented generation (RAG) with a custom pipeline
│
Rung 3: Gemini plus grounding via Vertex AI Search (Agent Search)
│
Rung 2: Gemini plus prompt engineering (few-shot, chain-of-thought, system prompt)
│
Rung 1: Stock Gemini (zero-shot, minimal prompt)
Most enterprise use cases land on rungs 2 through 4. Rungs 5 and 6 are correct in specific circumstances. Those circumstances are narrower than marketing would suggest.
Rung 1: Stock Gemini
When to use it: exploration, baseline measurement, cases where the task sits well within Gemini’s pre-training distribution.
Gemini 2.5 Flash and Gemini 2.5 Pro (and Gemini 3.1 Pro on the reasoning frontier) are extraordinarily capable out of the box. Summarization, classification, extraction, translation, code generation, question-answering on common domains. These often need nothing more than a well-formed prompt. Step one in every engagement is measuring how well stock Gemini performs on your eval set (see how to evaluate a GenAI application). If it clears your threshold, stop there.
Cost: lowest. Gemini 2.5 Flash runs at roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens (verify current pricing in the GCP console; 2.5 Flash-Lite and the 3.1 Flash-Lite preview are cheaper still for high-throughput tiers). For most enterprise query volumes, the model spend is negligible next to engineering cost.
Latency: Gemini 2.5 Flash, p50 around 600ms to 1.5s with thinking disabled. Gemini 2.5 Pro is higher latency, use when quality justifies it.
Governance: Google Cloud enterprise data protection applies. Input and output aren’t used to train Google’s models under enterprise terms. Data residency is enforced by the regional endpoint you call.
Rung 2: Gemini plus prompt engineering
When to use it: when Rung 1 is close but not quite there, you have clear examples of desired behavior, the task needs consistent format or tone, or it requires multi-step reasoning.
Prompt engineering on Gemini is significantly more powerful than it was on earlier LLM generations. A well-constructed system prompt with a few-shot block can close most quality gaps short of actual domain knowledge limits.
What works:
- System prompt. Persona, output format constraints, behavioral guardrails, domain context.
- Few-shot examples. 3 to 10 input/output pairs in the prompt. Pick examples that cover the failure modes, not just the easy cases.
- Chain-of-thought. For reasoning tasks, asking the model to show its work before the final answer (“think step by step”) meaningfully improves accuracy on multi-step problems.
- Structured output mode. Gemini supports constrained JSON via response schema. Use it for extraction tasks. It eliminates parsing failures and output format drift.
Cost, latency, and governance are the same as Rung 1, with modestly higher cost from longer prompts.
Rung 3: Gemini plus grounding via Vertex AI Search
When to use it: when the task needs knowledge that’s proprietary, recent, or not in Gemini’s training data. When you want cited, verifiable sources. When a hallucinated factual claim is unacceptable.
Vertex AI Search with grounding is the fastest path to a knowledge-grounded Gemini application. You ingest documents (PDF, HTML, BigQuery, Cloud Storage) into a Vertex AI Search data store, and Gemini retrieves and cites them at inference time via the grounding API. You don’t manage embeddings, an index, or a retrieval pipeline. Vertex AI Search does it.
Right choice when:
- Corpus is relatively static
- You want citation links in responses
- You need enterprise search behavior (ranking, boosts, metadata filters)
- You want a managed solution with minimal operational overhead
Not enough when:
- You need sub-second retrieval latency at very high QPS
- Your retrieval logic is complex (multi-hop, hybrid BM25 plus vector, graph traversal)
- You need to join retrieved content with relational data
- You want full control over chunking, embedding, and ranking
For those cases, move to Rung 4.
Cost: Vertex AI Search has ingestion (per GB) and query (per request) pricing. At moderate query volumes, substantially cheaper than building a custom pipeline.
Latency: adds 200 to 600ms to generation latency depending on corpus size and document structure.
Rung 4: Retrieval-augmented generation with a custom pipeline
When to use it: when Rung 3 doesn’t give you enough control over retrieval quality, latency, or logic. When you need hybrid search. When your retrieval is multi-hop or graph-structured.
This is where most enterprise GenAI systems end up. The pipeline: embed queries, retrieve from AlloyDB pgvector or Vertex AI Vector Search, rerank, inject into Gemini’s context. See our five RAG patterns on Google Cloud for the full pattern library.
Custom RAG is more powerful than Rung 3, but it carries operational cost. You own the embedding pipeline, the index, the retrieval logic, the performance tuning. That’s engineering work, not configuration work. It’s the right trade when retrieval quality is the difference between a system that works and one that doesn’t.
Cost: embedding costs are low (gemini-embedding-001 is $0.15 per 1M input tokens, with batch pricing half that). Index cost depends on corpus size and QPS. Dominant cost is usually Gemini generation, not retrieval.
Latency: with AlloyDB plus pgvector, retrieval adds 50 to 200ms at typical corpus sizes. With Vertex AI Vector Search, ANN retrieval comes in under 50ms at scale.
Rung 5: Supervised fine-tuning on Vertex AI
When to use it: specific conditions all need to be true, not just one or two.
Fine-tuning is correct when:
- The task is outside Gemini’s pre-training distribution. Not “we want better performance” but “the style, vocabulary, or domain is genuinely outside what the model has seen.”
- You have high-quality labeled data. Fine-tuning on a small or noisy dataset degrades performance. Rule of thumb: at least 500 high-quality examples for an adapter, thousands for a full fine-tune.
- Rungs 1 through 4 have been measured and fall short. Not assumed to fall short. Measured, against your eval harness.
- The gain justifies the cost and the maintenance burden. A fine-tuned model needs to be re-fine-tuned when the base model is updated. That’s ongoing engineering cost.
Vertex AI supports supervised fine-tuning (full and LoRA adapter) for Gemini models via the tuning API — now surfaced under client.tunings in the unified google-genai SDK. The process: upload training data in JSONL to Cloud Storage, submit a tuning job, deploy the tuned model to a Vertex AI endpoint.
Things fine-tuning is not good for:
- Adding new knowledge (use RAG instead)
- Fixing factual errors in the base model’s training (use grounding instead)
- Improving performance on tasks the model can already do with better prompting (measure first)
Cost: tuning compute is charged per token during training, plus ongoing endpoint hosting for the tuned model. Substantially higher than RAG for most use cases.
Latency: same as the base model, since you’re serving a new model checkpoint. LoRA adapters can be served more efficiently than full fine-tunes on shared infrastructure.
Rung 6: Open-weights models on GKE
When to use it: data sovereignty that prevents sending data to Google’s inference endpoints, cost optimization at very high token volumes, specialized model families not in Vertex AI Model Garden, or use cases where open-weights demonstrably outperform Gemini (narrow code, specific languages, etc.).
GKE with GPU node pools (A100, H100, L4) running vLLM or TGI is the standard pattern. Models include Llama 4, Mistral, Gemma 3, DeepSeek, and others. Vertex AI Model Garden offers pre-optimized containers for GKE deployment.
Anthropic Claude is also available on Vertex AI through Model Garden — including Claude Opus 4.7 (GA April 2026) as the most capable tier, Sonnet 4.6 as the cost/quality default, and Opus 4.6 still available, all with 1M-token context. Anthropic serves the model, GCP bills it. This matters for use cases where Claude’s long-context reasoning or instruction following beats Gemini. We’ll recommend it when it’s the right tool. See multi-model architectures on GCP for that discussion.
Cost: GPU compute is expensive. L4 is the cost-efficient inference tier. A100s and H100s are for high-throughput or very large models. At low to moderate volumes, Rung 6 is almost never cheaper than Vertex AI managed inference.
Latency: controllable, you own the serving stack. With batching and quantization, you can hit very competitive numbers.
Operational burden: high. GPU node pool, serving container, model updates, autoscaling, hardware quota. That’s infrastructure engineering, not model engineering.
Worked example: a contract review assistant
Let’s run a hypothetical through the ladder.
Use case. A legal team wants an assistant that reviews commercial contracts. Sample queries: “what’s the termination clause,” “does this include a limitation of liability,” “flag any clauses that deviate from our standard template.”
Data. 2,000 contracts in PDF. Standard template on file. Legal team can produce 200 annotated examples of “deviation flags” with ground truth.
Eval harness (built first). 150 golden Q/A pairs across all three query types. Success threshold: 80% semantic accuracy on LLM-as-judge, hallucination rate under 5% on clause existence claims.
Rung 1 measurement. Stock Gemini Flash, full contract in context. Result: 71% accuracy, 12% hallucination. Below threshold.
Rung 2 measurement. Add a system prompt with legal query patterns and three few-shot examples. Result: 76% accuracy, 8% hallucination. Closer, still below.
Rung 3 measurement. Ingest contracts into Vertex AI Search. Ground Gemini against retrieved clauses with citations. Result: 83% accuracy, 3% hallucination. Passes.
Decision: stop at Rung 3. It meets the threshold, the infrastructure is managed, and it cites its sources (important for legal users). No custom RAG pipeline, no fine-tuning.
What if Rung 3 hadn’t passed? We’d move to Rung 4: a custom RAG pipeline over AlloyDB with contract-specific chunking (by clause rather than by character count) and hybrid search combining BM25 keyword matching (for clause labels like “limitation of liability”) with vector similarity. That likely pushes accuracy past 85%.
Fine-tuning (Rung 5) wouldn’t be our next move even then, because the gap is in retrieval, not in the model’s reasoning. Fine-tuning a model that’s retrieving the wrong context doesn’t fix retrieval.
The governance overlay
Enterprise buyers have governance requirements that can push you up or down the ladder regardless of quality:
| Requirement | Implication |
|---|---|
| Data must not leave EU | Vertex AI regional endpoints in europe-west4. AlloyDB in the same region. Confirm with Google’s DPA. |
| Cannot send data to third-party APIs | Rules out Anthropic via their direct API. Anthropic on Vertex AI is fine (GCP DPA applies). |
| Must be explainable / auditable | RAG with cited sources (Rung 3 or 4) is far more auditable than a fine-tuned black box. |
| SOC 2 or ISO 27001 required | GCP is certified. VPC-SC perimeter around all data services. |
| IP ownership of the fine-tuned model | Vertex AI fine-tuned models are customer-owned. Read the terms carefully on base model weights. |
How to use this in practice
Start at Rung 1. Build the eval harness first (see the engagement playbook). Measure Rung 1 against the eval. If it passes, you’re done. If not, climb to Rung 2 and measure again. Keep going until you clear the threshold.
The instinct to jump to Rung 5 (“we need to fine-tune on our proprietary data”) is understandable but almost always premature. Fine-tuning is a real investment in engineering time and ongoing maintenance. Rungs 2 through 4 are faster, cheaper, and easier to keep running. They’re the right answer most of the time.
How Accelyze helps
We run this framework, including building the eval harness, measuring each rung, and recommending the lowest-complexity approach that meets your requirements, as part of the standard engagement playbook. If you’re unsure whether your use case needs fine-tuning, grounding, or a custom RAG pipeline, that’s exactly the question we answer in Phase 1 and Phase 2 of an engagement. Get in touch to talk through your use case.