The GenAI Consulting Engagement Playbook: How We Scope, De-Risk, and Ship in 90 Days

Plenty of consulting firms do GenAI work. Fewer can scope it correctly, run a disciplined prototype-to-production cycle, and hand off a system the client can actually keep running. This post lays out our engagement methodology at Accelyze: four phases, roughly 12 weeks for a standard build, with defined deliverables at each gate.

We publish it because process transparency is a kind of credibility. Anyone can claim to build production GenAI. Fewer can describe, specifically, how they do it.

The four phases

Phase 1: Discovery and use-case ranking (weeks 1 and 2)

The most expensive mistake in a GenAI engagement is building the wrong thing with great engineering. Phase 1 exists to prevent that.

We run a structured discovery built around one core artifact: a use-case ranking matrix. Every candidate use case is scored on three dimensions.

Dimension	What we’re assessing
Business value	Revenue impact, cost reduction, or strategic position. Quantified where we can. “10% deflection rate improvement” beats “better customer experience.”
Feasibility	Does the data exist? Is it accessible? What’s the expected quality bar? Have comparable systems been built before?
Data readiness	Is there a labeled dataset for evaluation? Where does ground truth come from? What data governance constraints apply?

The matrix forces a conversation most clients haven’t had explicitly. A use case that scores high on value but low on data readiness (“an AI that predicts customer churn” when there’s no labeled churn data and no historical record of intervention outcomes) is not a 90-day project. It’s a 6-month data infrastructure project that happens to have a model at the end. We say that clearly, and we steer toward the use cases that can actually be shipped.
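
A minimal sketch of how such a matrix might be scored and sorted. The 1-to-5 scale, the equal weighting, the readiness floor of 3, and the example use cases are all assumptions made for illustration, not a fixed rubric.

```python
from dataclasses import dataclass

# Assumption: each dimension is scored from 1 (poor) to 5 (strong).
READINESS_FLOOR = 3  # hypothetical cutoff for "shippable in 90 days"


@dataclass(frozen=True)
class UseCase:
    name: str
    business_value: int
    feasibility: int
    data_readiness: int

    @property
    def total(self) -> int:
        # Equal weights are an illustrative choice.
        return self.business_value + self.feasibility + self.data_readiness

    @property
    def shippable(self) -> bool:
        # Low data readiness disqualifies a use case regardless of value.
        return self.data_readiness >= READINESS_FLOOR


def rank(candidates: list[UseCase]) -> list[UseCase]:
    """Shippable use cases first, then by descending total score."""
    return sorted(candidates, key=lambda u: (not u.shippable, -u.total))


candidates = [
    UseCase("churn prediction", business_value=5, feasibility=3, data_readiness=1),
    UseCase("invoice extraction", business_value=3, feasibility=5, data_readiness=5),
    UseCase("internal search", business_value=3, feasibility=4, data_readiness=4),
]

ranked = rank(candidates)
for u in ranked:
    print(f"{u.name:20} total={u.total:2} shippable={u.shippable}")
```

Treating data readiness as a gate rather than just another addend is what keeps a high-value, low-readiness use case from floating to the top of the list.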

Phase 1 deliverables:

Use-case ranking matrix with scoring rationale
Data audit: what exists, what needs collecting or labeling, what’s blocked by governance
Recommended primary use case plus one or two runner-up alternatives
Initial architecture sketch (linking to our reference architecture) with components identified per use case
Go/no-go recommendation with written rationale

A pattern we see often: the use case the client wants to build is the third-best one they have. The best is usually something operational and unglamorous (document extraction, classification, internal search) with clear ground truth and measurable impact. We push for that one.

Phase 2: Eval harness first (weeks 3 and 4)

No model work until the eval harness is running.

This is our most consistent differentiator, and the thing clients push back on most. “Why are we building evaluation infrastructure before we’ve built anything to evaluate?” Because the eval harness defines what “good” means. Without it, you’ll spend weeks iterating on a model with no principled way to compare today’s version against last week’s.

The harness we build in this phase isn’t sophisticated. It’s:

A golden set. 50 to 150 input/output pairs that represent the desired behavior. Built from existing examples, SME-labeled cases, or synthetic generation from representative real data.
A scoring function. A combination of exact-match metrics (for structured outputs), embedding similarity (for free text), and LLM-as-judge (for semantic correctness where exact match is meaningless).
A baseline score. Run your simplest possible implementation (stock Gemini, no retrieval, minimal prompt) against the golden set. Every later iteration is measured as delta from baseline.
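
The three pieces above can be sketched locally in a few lines. This is an illustration under stated assumptions, not the managed service: the golden set is hypothetical, `difflib` stands in for embedding similarity, the 50/50 weighting is arbitrary, and the LLM-as-judge component is left out.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable

# Hypothetical golden set; a real one holds 50 to 150 reviewed pairs.
GOLDEN_SET = [
    {"input": "Invoice total for INV-001?", "expected": "1200.00"},
    {"input": "Vendor on INV-001?", "expected": "Acme Corp"},
    {"input": "Due date on INV-001?", "expected": "2024-07-01"},
]


def exact_match(expected: str, actual: str) -> float:
    return 1.0 if expected.strip() == actual.strip() else 0.0


def text_similarity(expected: str, actual: str) -> float:
    # Stand-in for embedding similarity: character-level ratio in [0, 1].
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def score(expected: str, actual: str) -> float:
    # Equal weighting of the two metrics is an assumption.
    return 0.5 * exact_match(expected, actual) + 0.5 * text_similarity(expected, actual)


def evaluate(system: Callable[[str], str]) -> float:
    """Mean score of a system over the golden set."""
    return mean(score(ex["expected"], system(ex["input"])) for ex in GOLDEN_SET)


def baseline(prompt: str) -> str:
    # Stand-in for the simplest implementation (no retrieval, minimal prompt).
    return "unknown"


def candidate(prompt: str) -> str:
    # Stand-in for a later iteration; one answer is deliberately imperfect.
    answers = {
        "Invoice total for INV-001?": "1200.00",
        "Vendor on INV-001?": "ACME Corp.",
        "Due date on INV-001?": "2024-07-01",
    }
    return answers.get(prompt, "unknown")


baseline_score = evaluate(baseline)
candidate_score = evaluate(candidate)
print(f"baseline={baseline_score:.3f} candidate={candidate_score:.3f} "
      f"delta={candidate_score - baseline_score:+.3f}")
```

The point of the sketch is the last line: every iteration is reported as a delta against the same baseline on the same golden set.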

The harness runs in the Vertex AI Gen AI evaluation service (the product Google’s docs used to refer to informally as “Vertex AI Evals”). Golden sets live in BigQuery. Scoring is reproducible and version-tracked.

Phase 2 deliverables:

Golden set (50 to 150 examples) with annotation guidelines
Eval harness running in the Vertex AI Gen AI evaluation service, integrated with BigQuery
Baseline score for the simplest implementation
Written eval framework doc: what’s being measured, why, and what “good enough to ship” looks like as a number

The anti-pattern this prevents: “we’ll figure out success criteria later.” That’s how you end up in week 14 of a six-month engagement negotiating what success means with a prototype already built. The eval harness is the end condition, defined before the first experiment.

Phase 3: Thin-slice prototype (weeks 5 through 10)

With a use case chosen and an eval harness running, we build. But thin.

A thin-slice prototype is the minimum viable version of the primary use case, running on production-grade GCP infrastructure in a non-prod environment (not localhost), evaluated against the golden set, and accessible to a small group of internal users. It’s not a full-featured application. It deliberately leaves out the long tail of edge cases, the admin UI, the reporting dashboard, and the integration with every downstream system. Those come in Phase 4.

Three two-week sprints:

Sprint	Focus
Sprint 1 (weeks 5 and 6)	Core pipeline: prompt, retrieval, generation. Target: meaningful improvement over baseline.
Sprint 2 (weeks 7 and 8)	Retrieval quality, prompt refinement, structured output validation. Target: within 10% of the “good enough” threshold.
Sprint 3 (weeks 9 and 10)	Hardening for internal pilot: auth, logging, error handling, cost instrumentation. Target: at or above threshold.
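
The sprint targets can be expressed as a simple check against the eval score. This sketch assumes "within 10%" means a relative 10% below the threshold and reduces "meaningful improvement" to any improvement over baseline; both readings, and the function name, are illustrative.

```python
def sprint_target_met(sprint: int, score: float, baseline: float, threshold: float) -> bool:
    """Return True if the eval score meets the target for the given sprint."""
    if sprint == 1:
        # Assumption: any gain over baseline counts as meaningful.
        return score > baseline
    if sprint == 2:
        # Assumption: "within 10%" is relative to the threshold.
        return score >= 0.9 * threshold
    if sprint == 3:
        return score >= threshold
    raise ValueError(f"unknown sprint: {sprint}")


# Hypothetical numbers: baseline 0.50, ship threshold 0.85.
for sprint, score in [(1, 0.62), (2, 0.78), (3, 0.78)]:
    print(sprint, sprint_target_met(sprint, score, baseline=0.50, threshold=0.85))
```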

End of Sprint 3, a small group of internal users (5 to 15 people) gets access. Their feedback is structured. We’re not asking “do you like it.” We’re asking them to complete five specific tasks and tell us where the system failed.

Architecture is the Accelyze reference stack from day one. Cloud Run, Vertex AI, AlloyDB or Vertex AI Vector Search depending on corpus size, Pub/Sub for async events, Cloud Logging for observability. No localhost infrastructure that needs replacing later.

Phase 3 deliverables:

Running prototype on GCP (non-prod environment)
Eval score at or above threshold
Internal pilot report: structured feedback from 5 to 15 users
Performance baseline: p50 and p95 latency, cost per request, error rate
Gap analysis: what’s missing for production, ranked by impact
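
For the performance baseline, a sketch of how the four numbers might be computed from per-request records. The record shape and the sample values are hypothetical; in practice the records would come from structured logs.

```python
from statistics import quantiles

# Hypothetical per-request records from the internal pilot.
records = [
    {"latency_ms": 420, "cost_usd": 0.0018, "ok": True},
    {"latency_ms": 510, "cost_usd": 0.0021, "ok": True},
    {"latency_ms": 480, "cost_usd": 0.0019, "ok": True},
    {"latency_ms": 650, "cost_usd": 0.0025, "ok": True},
    {"latency_ms": 700, "cost_usd": 0.0027, "ok": True},
    {"latency_ms": 530, "cost_usd": 0.0020, "ok": True},
    {"latency_ms": 2100, "cost_usd": 0.0031, "ok": False},
    {"latency_ms": 590, "cost_usd": 0.0022, "ok": True},
    {"latency_ms": 610, "cost_usd": 0.0023, "ok": True},
    {"latency_ms": 1250, "cost_usd": 0.0040, "ok": True},
]


def percentile(values: list[float], pct: int) -> float:
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th one.
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def summarize(rows: list[dict]) -> dict:
    latencies = [r["latency_ms"] for r in rows]
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "cost_per_request_usd": sum(r["cost_usd"] for r in rows) / len(rows),
        "error_rate": sum(not r["ok"] for r in rows) / len(rows),
    }


summary = summarize(records)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

Reporting p95 alongside p50 matters here because a handful of slow requests, like the two outliers in the sample, barely move the median.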

Phase 4: Production hardening (weeks 11 and 12 onward)

Phase 4 converts the prototype into a production system. The gap analysis from Phase 3 drives the work. Typical items:

Reliability and observability. SLOs defined and implemented in Cloud Monitoring. Cloud Trace instrumented across the full request path (Cloud Run, Vertex AI, AlloyDB, response). Alerting on latency, error rate, and cost anomalies. Graceful degradation for Vertex AI availability blips.
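
One way to sketch graceful degradation is retry with jittered exponential backoff, followed by a fallback response. The exception class, function names, and retry settings below are hypothetical; a real service would catch whatever transient errors its client library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class UpstreamUnavailable(Exception):
    """Hypothetical error standing in for a transient model-endpoint failure."""


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Retry `primary` with jittered exponential backoff, then degrade."""
    for attempt in range(attempts):
        try:
            return primary()
        except UpstreamUnavailable:
            if attempt < attempts - 1:
                # Delay doubles each retry; jitter spreads out retry bursts.
                time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    return fallback()


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a fake model call that fails `failures` times, then succeeds."""
    remaining = [failures]

    def call() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise UpstreamUnavailable("simulated blip")
        return "model answer"

    return call


def degraded() -> str:
    return "The assistant is temporarily unavailable. Please try again shortly."


print(call_with_fallback(make_flaky(2), degraded, base_delay_s=0.01))
print(call_with_fallback(make_flaky(5), degraded, base_delay_s=0.01))
```

The first call recovers on its third attempt; the second exhausts its retries and returns the degraded message instead of surfacing an error to the user.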

Security and data controls. VPC-SC perimeter if required. CMEK on BigQuery datasets and Cloud Storage buckets. Secret Manager replacing any env vars with credentials. IAM tightened to least-privilege. Data residency verified against org policy.

Cost controls. Per-request cost tracking via BigQuery billing export plus Cloud Logging structured logs. Gemini context caching (via the caches client in the unified google-genai SDK) for repeated system prompts. Embedding caching in AlloyDB for documents already indexed. Cost alerting at defined thresholds.
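
A sketch of the per-request side of cost tracking: compute a cost from token counts and emit it as one JSON line, which Cloud Run forwards to Cloud Logging as a structured entry. The prices, field names, and function names are placeholders for illustration; substitute the current rate card for the model in use.

```python
import json

# Hypothetical prices in USD per 1M tokens, not a real rate card.
PRICE_PER_M_INPUT = 0.30
PRICE_PER_M_CACHED_INPUT = 0.075
PRICE_PER_M_OUTPUT = 2.50


def request_cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one request; cached input tokens are billed at the cached rate."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * PRICE_PER_M_INPUT
        + cached_tokens * PRICE_PER_M_CACHED_INPUT
        + output_tokens * PRICE_PER_M_OUTPUT
    )
    return total / 1_000_000


def log_request_cost(request_id: str, input_tokens: int,
                     cached_tokens: int, output_tokens: int) -> dict:
    """Print one JSON log line to stdout and return the entry."""
    entry = {
        "severity": "INFO",
        "message": "genai_request_cost",
        "request_id": request_id,
        "input_tokens": input_tokens,
        "cached_tokens": cached_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(request_cost_usd(input_tokens, cached_tokens, output_tokens), 6),
    }
    print(json.dumps(entry))
    return entry


# Same request with and without a cached system prompt.
log_request_cost("req-cached", input_tokens=10_000, cached_tokens=8_000, output_tokens=500)
log_request_cost("req-uncached", input_tokens=10_000, cached_tokens=0, output_tokens=500)
```

Logging the cached token count alongside the cost makes it possible to see, per request, how much a repeated system prompt is actually saving.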

Handover. The final deliverable isn’t a system. It’s a system plus the knowledge to operate it. We produce: runbooks for common operational scenarios, an architecture decision record (ADR) documenting key choices and the alternatives we considered, a guide to extending the eval harness as the system evolves, and a 30-day hypercare plan with defined escalation paths.

Phase 4 deliverables:

Production system deployed in the client’s GCP environment
Monitoring dashboards and alerting configured
Security review completed (VPC-SC, CMEK, IAM)
Runbooks, ADR, eval harness extension guide
30-day hypercare plan

The 12-week timeline

Week  1  2  3  4  5  6  7  8  9  10 11 12
      ├──┤                                   Phase 1: Discovery, use-case ranking
            ├──┤                             Phase 2: Eval harness
                  ├───────────────┤          Phase 3: Thin-slice prototype (3 sprints)
                                    ├──┤    Phase 4: Production hardening

Gates:
  End of week 2:  Use-case chosen, data audit complete, go/no-go
  End of week 4:  Eval harness running, baseline score established
  End of week 10: Prototype at threshold, internal pilot complete
  End of week 12: Production system live, handover complete

Real engagements compress or expand this timeline based on integration complexity, client data readiness, and approval cycles. The sequence does not change: we don’t start Phase 3 without a running eval harness. The gates are real.

Anti-patterns this prevents

“POC first, success criteria later.” We’ve seen more GenAI projects derail on this than on any model quality issue. If you can’t articulate what good looks like before you build, you’ll be negotiating what good means at the end, with a prototype already built and a deadline looming. Define the threshold in Phase 2. Use it to close out prototyping cleanly at the end of Phase 3.

“Let’s pilot it with real customers.” A GenAI system that hasn’t passed an eval threshold isn’t ready for customer-facing use. The internal pilot in Phase 3 Sprint 3 is the buffer. Real customer traffic comes after Phase 4 hardening, not before.

Model-first scoping. “We want to fine-tune Gemini on our data” isn’t a use case. It’s a technique choice made before a problem has been defined. We evaluate fine-tuning against grounding and retrieval-augmented approaches in the context of a specific use case and a running eval harness. Sometimes fine-tuning is the right answer. Most of the time, especially in a 90-day engagement, it isn’t. See our build vs. buy vs. fine-tune framework for how we make that call.

Infrastructure left as an exercise. Prototype and production system run on the same GCP infrastructure from day one. The only difference is scale tier and security controls. Phase 4 is a matter of configuring what’s already in place, not replacing what was built.

Skipping the handover. The client should be able to operate and extend the system after we leave. An engagement that ends with a working prototype but no runbooks, no ADR, and no eval extension guide has delivered a liability, not an asset. The 30-day hypercare and structured knowledge transfer aren’t optional scope.

How Accelyze helps

Accelyze runs this four-phase engagement for clients building GenAI applications on Google Cloud. We handle the full scope: discovery, eval harness design, prototype delivery, production hardening, handover. Defined timeline, defined deliverables at each gate. If you’re planning a GenAI initiative and want a team that will tell you what done looks like before you start building, get in touch.

GenAI Strategy & Readiness

Pilot to Production Delivery

MLOps & Platform Enablement

GenAI Risk & Governance