- May 2, 2026
Multi-Model GenAI Architectures on GCP: When Gemini Isn't the Right Tool
We work on Google Cloud, and most of what we build runs on Gemini. That doesn’t change the fact that Gemini isn’t the best tool for some workloads. We say so to clients, because we’d rather have the credibility than the vendor-alignment points.
Buyers can feel the difference between consultancy and salesmanship within the first 20 minutes of a scoping call. The consultancies that earn trust are the ones that can defend why a particular model is the right one for a particular workload, not the ones that route everything through the model their slide deck happens to feature.
This post is the model selection framework we use, and the patterns we deploy when a system needs more than one model.
What’s available on Vertex AI
Vertex AI is the unified entry point. You can route to multiple model families from a single auth boundary, with the same enterprise data protection, the same monitoring stack, the same billing.
Google models
- Gemini 2.5 Flash (GA). Cheap, fast, very strong general capability with thinking enabled by default. The workhorse — our default for anything that doesn’t need more.
- Gemini 2.5 Pro (GA). Higher reasoning capability, longer context, slower and more expensive. Use when Flash falls short.
- Gemini 3.1 Pro and Gemini 3 Flash (preview). The forward path — stronger reasoning, more capable agent behaviour. Worth piloting now; the 2.5 line has a retirement date of 2026-10-16 so plan the move.
- Gemini 2.5 Flash-Lite (GA) and Gemini 3.1 Flash-Lite (preview). Cheapest tier, high-throughput classification, routing, lightweight summarisation.
- Gemini 2.5 Flash-Image. Inline image generation in a regular text turn.
- Gemini Embedding 001 (GA). The default embedding model — 3072 dims by default, truncatable to 1536 or 768, leads MTEB-Multilingual.
- Imagen, Veo, Lyria. Image, video, audio generation.
Anthropic via Model Garden
- Claude Opus 4.7 (GA April 2026). Anthropic’s most capable model. The right pick when you need the strongest Claude — deep reasoning, hardest agentic workflows.
- Claude Sonnet 4.6. The cost/quality default and the model in the code sample below.
- Claude Opus 4.6. Still available for workloads already running on it. All three carry 1M-token context on Vertex.
- All of the above are available natively in Vertex AI: Anthropic serves them, GCP bills them, and enterprise data protection applies. No data leaves your VPC-SC perimeter.
Open-weights via Model Garden
- Llama 4, Mistral, Gemma 3, DeepSeek, others. Available as Vertex AI managed endpoints or for self-hosted deployment on GKE with pre-built containers.
Specialist models
- Third-party specialists in Model Garden (Mistral Codestral, others). The previously-listed Codey / Code Gecko line has been retired in favour of Gemini for code.
When Gemini is the right answer
For most of what most enterprises build, Gemini is the right model. Specifically:
- General Q&A, summarisation, classification, extraction
- Multimodal tasks (Gemini’s native multimodality is genuinely strong)
- Tasks where 1M-token context is useful
- Cost-sensitive high-volume workloads (Flash is among the cheapest credible models)
- Anything that integrates tightly with Google Cloud services (BigQuery, Workspace, Vertex AI Search grounding)
We default to Gemini 2.5 Flash unless we have a measured reason to pick something else.
When Claude on Vertex AI is worth a look
Specific patterns where we’ve seen Claude outperform Gemini in side-by-side evals:
- Long-context careful reasoning over dense technical material. Claude tends to be more accurate at extracting subtle constraints from long documents (regulatory text, complex contracts, technical specifications).
- Strict instruction-following. When the output has to satisfy many simultaneous constraints, Claude’s adherence to detailed instructions tends to be tighter.
- Code generation in unusual languages or frameworks. Claude has historically been strong on less-common stacks.
- Tool-use chains where the model’s planning is the bottleneck. Multi-step agent workflows often benefit from how Claude plans and sequences chained tool calls.
This isn’t a leaderboard claim. It’s an observation from running eval harnesses on real client use cases. Sometimes Gemini wins. We measure.
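A side-by-side eval of this kind can be sketched in a few lines. This is a minimal illustration, not our production harness: `side_by_side`, `exact_match`, and the shape of `cases` are hypothetical names chosen for the example, and each model is passed in as a plain callable so that any client wrapper can be plugged in.

```python
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on an exact match after trimming, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


def side_by_side(
    cases: list[tuple[str, str]],
    models: dict[str, Callable[[str], str]],
    score: Callable[[str, str], float] = exact_match,
) -> dict[str, float]:
    """Run every model on the same (prompt, expected) cases and return mean scores."""
    if not cases:
        raise ValueError("need at least one eval case")
    results = {}
    for name, generate in models.items():
        scores = [score(generate(prompt), expected) for prompt, expected in cases]
        results[name] = sum(scores) / len(scores)
    return results
```

Real use cases usually need a task-specific scorer (rubric grading, constraint checks) in place of exact match; the point is that every candidate model sees the same cases.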
The important thing: Claude on Vertex AI is not a multi-cloud architecture. It’s a single-cloud architecture using a model that happens to be made by Anthropic. The data, the billing, the perimeter, the observability are all GCP. Enterprise buyers care a lot about this distinction, and it’s part of what makes Vertex AI’s Model Garden a real competitive advantage.
When open-weights on GKE is worth it
We’ve already covered this in the build vs. buy vs. fine-tune framework. The short version: it’s worth it for data sovereignty requirements that can’t be met any other way, for very high token volumes where managed inference cost becomes the dominant cost line, and for specialist models that aren’t available as managed endpoints.
For most enterprises, it’s not worth it. GPU operations are real engineering work, and managed inference is dramatically more cost-effective at low to moderate scale than people assume.
Routing patterns
Most non-trivial GenAI systems end up routing across multiple models. Three common patterns:
Cost-aware routing
User query
    │
    ▼
Classifier (Gemini 2.5 Flash-Lite, cheap)
    │
    ├── "simple Q&A" ────────► Gemini 2.5 Flash
    │
    ├── "complex reasoning" ─► Gemini 2.5 Pro
    │
    └── "code generation" ───► Claude Sonnet 4.6
A cheap classifier decides where each query goes. Gemini 2.5 Flash-Lite (or Flash) suits the role because classification is a short call that runs on every query, so its cost and latency are paid on all traffic. The actual generation goes to whichever model is best for the task type.
When to use it: heterogeneous workloads where most queries are simple and a small fraction are hard. Routing keeps your average cost low without compromising on quality for the hard cases.
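The routing step itself can be very small. The sketch below assumes the classifier returns one of three labels; `pick_model`, the label strings, and `keyword_classifier` are hypothetical names for illustration, and the stub classifier stands in for a real Flash-Lite call. The model values are display names, not API model IDs.

```python
from typing import Callable

# Intent label -> model, mirroring the diagram above (display names, not API IDs).
ROUTE_TABLE = {
    "simple_qa": "Gemini 2.5 Flash",
    "complex_reasoning": "Gemini 2.5 Pro",
    "code_generation": "Claude Sonnet 4.6",
}
# Assumption: anything the classifier cannot label goes to the cheap default.
DEFAULT_MODEL = "Gemini 2.5 Flash"


def pick_model(query: str, classify: Callable[[str], str]) -> str:
    """Return the model for a query, using a cheap classifier to get the label."""
    label = classify(query).strip().lower()
    # Classifier output is free text, so unexpected labels fall back to the default.
    return ROUTE_TABLE.get(label, DEFAULT_MODEL)


def keyword_classifier(query: str) -> str:
    """Hypothetical stub standing in for a Flash-Lite classification call."""
    text = query.lower()
    if "function" in text or "code" in text:
        return "code_generation"
    if "why" in text or "compare" in text:
        return "complex_reasoning"
    return "simple_qa"
```

Falling back to the cheap model on an unknown label is one design choice; a system that cares more about quality than cost might default to the stronger model instead.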
Quality-aware fallback
User query
    │
    ▼
Primary model (Gemini 2.5 Flash)
    │
    ▼
Confidence check
    │
    ├── high confidence ──► return response
    │
    └── low confidence ───► retry with Gemini 2.5 Pro or Claude Opus 4.7
Try the cheap model first. If the output passes a confidence check (structured output validation, response score from a verifier model, downstream tool success), return it. If not, escalate to a more expensive model.
When to use it: when most queries are answerable by a cheaper model and you want to spend the higher cost only on the cases that need it.
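As a rough sketch of the fallback pattern, the example below uses structured output validation as the confidence check. The required keys, `parse_if_valid`, and `generate_with_fallback` are hypothetical, and `cheap` and `strong` are plain callables standing in for the two model calls.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical output schema for illustration


def parse_if_valid(raw: str) -> Optional[dict]:
    """Confidence check: accept only a JSON object that has every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
        return data
    return None


def generate_with_fallback(
    prompt: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
) -> tuple[dict, str]:
    """Try the cheap model first and escalate once if its output fails validation."""
    parsed = parse_if_valid(cheap(prompt))
    if parsed is not None:
        return parsed, "cheap"
    parsed = parse_if_valid(strong(prompt))
    if parsed is None:
        raise ValueError("both models returned output that failed validation")
    return parsed, "strong"
```

Returning which tier answered makes the escalation rate measurable, which is what tells you whether the cheap model is actually carrying most of the traffic.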
Latency-tier routing
Use this for real-time applications where the latency budget varies by code path. Conversation turns get Gemini 2.5 Flash (or Flash-Lite for the cheapest path). Background enrichment jobs get Gemini 2.5 Pro or Claude. Same system, different SLOs.
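One way to express latency tiers is a static table keyed by code path. In this sketch the path names and the budgets in seconds are illustrative assumptions rather than measured SLOs, and `tier_for` and `within_budget` are hypothetical helpers.

```python
# Code path -> (model, latency budget in seconds).
# Path names and budgets are illustrative assumptions, not measured SLOs.
TIERS = {
    "conversation": ("Gemini 2.5 Flash", 2.0),
    "conversation_low_cost": ("Gemini 2.5 Flash-Lite", 1.0),
    "background_enrichment": ("Gemini 2.5 Pro", 60.0),
}


def tier_for(code_path: str) -> tuple[str, float]:
    """Look up the model and latency budget configured for a code path."""
    if code_path not in TIERS:
        raise ValueError(f"no latency tier configured for {code_path!r}")
    return TIERS[code_path]


def within_budget(code_path: str, observed_seconds: float) -> bool:
    """True if an observed call latency met the budget for its code path."""
    _, budget = tier_for(code_path)
    return observed_seconds <= budget
```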
Practical implementation on Vertex AI
The clean architecture: a thin router layer in Cloud Run that decides which Vertex AI endpoint to call, with the same auth and observability for all routes.
from anthropic import AsyncAnthropicVertex
from google import genai
from google.genai import types

# PROJECT_ID, LOCATION, tracer (for example an OpenTelemetry tracer) and
# log_model_call are assumed to be defined elsewhere in the service.

# All authenticated via Workload Identity, same VPC-SC perimeter
genai_client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
# Async client, so the Claude route returns an awaitable like the Gemini routes
claude = AsyncAnthropicVertex(region="us-east5", project_id=PROJECT_ID)


async def call_gemini(model: str, prompt: str):
    return await genai_client.aio.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(max_output_tokens=2048),
    )


# intent -> (route label for traces and logs, function returning an awaitable)
ROUTES = {
    "fast_qa": ("gemini_flash", lambda p: call_gemini("gemini-2.5-flash", p)),
    "deep_reasoning": ("gemini_pro", lambda p: call_gemini("gemini-2.5-pro", p)),
    "code": ("claude_sonnet", lambda p: claude.messages.create(
        model="claude-sonnet-4-6@20251022",
        max_tokens=2048,
        messages=[{"role": "user", "content": p}],
    )),
}


async def route_and_generate(query: str, intent: str, trace_id: str):
    # Unknown intents fall back to the cheapest route
    route, generator = ROUTES.get(intent, ROUTES["fast_qa"])
    with tracer.start_span("model_call", attributes={"route": route, "trace_id": trace_id}):
        result = await generator(query)
    log_model_call(route, query, result, trace_id)
    return result
Gemini calls go through the unified google-genai SDK (the older vertexai.generative_models module is being removed on 2026-06-24). Claude continues to use the AnthropicVertex SDK, which is still the supported path. Swap claude-sonnet-4-6@20251022 for the equivalent dated stable Opus 4.7 model ID when you need maximum capability rather than cost/quality balance.
The observability is the same for all routes (Cloud Trace span, structured log to Cloud Logging, eventual Gen AI evaluation run against the result). The routing decision becomes one more dimension to slice your evals by.
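Slicing evals by route can be as simple as grouping eval records on the route label. The sketch below assumes each record is a dict with hypothetical `route` and `passed` fields; a real pipeline would read these from the structured logs rather than an in-memory list.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_route(records: Iterable[dict]) -> dict[str, float]:
    """Group eval records by route label and return the pass rate for each route."""
    counts = defaultdict(lambda: [0, 0])  # route -> [passed, total]
    for record in records:
        counts[record["route"]][0] += int(record["passed"])
        counts[record["route"]][1] += 1
    return {route: passed / total for route, (passed, total) in counts.items()}
```

A route whose pass rate sits well below the others points either at the model behind it or at the classifier sending it the wrong queries.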
Anti-patterns
“Let’s use the best model for everything.” The best model is the one that meets your quality bar at your cost and latency budget. If Gemini 2.5 Flash hits 90% accuracy on your evals and Claude Opus 4.7 hits 91%, the right choice depends on how much one percentage point of accuracy is worth at your query volume.
“We need to be multi-model from day one.” No. Build with a single model first, measure, add a second only when you have eval evidence that it’s worth the complexity. Multi-model architectures have higher operational cost: more eval surfaces, more routing logic, more failure modes.
“We can’t use Claude because we’re a GCP shop.” Claude on Vertex AI is a GCP service. Anthropic is one of the model providers in Model Garden. Using it doesn’t make your architecture multi-cloud, doesn’t take you outside your VPC-SC perimeter, and doesn’t change your billing or compliance posture. The “we’re a GCP shop” objection is a thing partners say, not a real architectural constraint.
“Open-weights will be cheaper.” Often not. By the time you’ve paid for GPU node-pools, on-call rotation, model update cycles, and the engineering time to keep it running, managed inference comes out ahead for moderate workloads. Run the actual numbers, don’t assume.
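Running the numbers for the open-weights case can start with a break-even estimate. The sketch below is a deliberate simplification: it treats self-hosting as a fixed monthly cost up to cluster capacity, and every figure in the usage note is a hypothetical placeholder, not a real price.

```python
def managed_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Managed inference: cost scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def self_hosted_monthly_cost(gpu_nodes: int, cost_per_node: float, ops_cost: float) -> float:
    """Self-hosting: modelled as fixed cost (GPU nodes plus on-call and engineering)."""
    return gpu_nodes * cost_per_node + ops_cost


def break_even_tokens(fixed_monthly_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which managed spend equals the fixed self-hosted cost."""
    return fixed_monthly_cost / price_per_million * 1_000_000
```

With placeholder inputs of 2 GPU nodes at 3,000 per month each, 8,000 per month of operations cost, and a managed price of 0.50 per million tokens, the fixed cost is 14,000 and the break-even volume is 28 billion tokens per month. Below that volume, managed inference is cheaper under these assumptions.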
How Accelyze helps
We design multi-model GenAI systems on Vertex AI, including model selection (measured against your eval harness), routing layer implementation, and the observability that makes a multi-model system supportable in production. Our default is Gemini, our default is to keep things simple, and we’ll only recommend a multi-model architecture when the eval evidence makes the case. Get in touch if you’re considering it.