How to Evaluate a GenAI Application Before You Ship It

If you can’t tell the difference between version 7 of your system and version 8 by looking at a number, you’re not really iterating. You’re guessing. Most of the GenAI teams we work with arrive without a working eval harness, and the first month of an engagement is often about building one before anything else makes sense.

This post is the eval design we apply on every engagement. The mechanics are specific to the Vertex AI Gen AI evaluation service (the product Google still informally calls “Vertex AI Evals”) and BigQuery, but the patterns translate.

What an eval harness has to do

In one sentence: produce a score that tells you whether the system got better or worse, on inputs that matter, in a way that’s reproducible.

That has three parts:

A dataset of inputs that represent real usage, with ground-truth references where they exist.
A scoring function that maps system output to a number, ideally one that correlates with what humans would say.
A pipeline that runs the system over the dataset, applies the scorer, stores the results, and lets you compare runs.
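
Those three parts fit in a few lines of code. The following is a minimal sketch, not the Vertex AI API: `Golden`, `run_eval`, and the toy system and scorer are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Golden:
    golden_id: str
    input: str
    expected: str


def run_eval(
    goldens: list[Golden],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run the system over every golden, score each output, and aggregate."""
    rows = []
    for g in goldens:
        actual = system(g.input)
        rows.append({"golden_id": g.golden_id, "score": scorer(actual, g.expected)})
    return {"overall": mean(r["score"] for r in rows), "rows": rows}


# Toy stand-ins so the sketch runs end to end.
goldens = [Golden("g1", "2+2", "4"), Golden("g2", "3+3", "6")]


def toy_system(question: str) -> str:
    return "4"  # always answers "4", so it gets exactly one golden right


def exact(actual: str, expected: str) -> float:
    return float(actual == expected)


result = run_eval(goldens, toy_system, exact)
print(result["overall"])  # 0.5
```

Storing `rows` alongside the aggregate is what makes run-to-run comparison possible later: you can see which goldens flipped, not only that the average moved.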

Every part of this can be wrong in interesting ways.

Building the golden set

The golden set is the dataset. 50 to 150 examples is the right starting size for most use cases. Fewer than 50 and your scores are too noisy to compare runs. More than 150 in v1 is wasted effort, since you’ll discover what the dataset is missing as soon as you run it.
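
To see why size matters, treat each golden as a pass/fail trial. Under that simplifying binomial assumption (real scores are rarely this clean), the standard error of a pass rate p over n examples is sqrt(p(1-p)/n). A rough sketch:

```python
import math


def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate p over n independent examples."""
    return math.sqrt(p * (1 - p) / n)


# Standard error at an 80% pass rate for three golden-set sizes.
stderr_by_size = {n: round(pass_rate_stderr(0.8, n), 3) for n in (20, 50, 150)}
print(stderr_by_size)  # {20: 0.089, 50: 0.057, 150: 0.033}
```

At an 80% pass rate, 20 examples give a standard error of roughly 9 points, 50 give about 5.7, and 150 about 3.3. A two-point difference between runs on a 20-example set is indistinguishable from noise.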

Sources, in order of preference:

Real production traffic with SME labels. Highest signal. The closest to what users actually ask. Requires you have an existing system or pilot to sample from.
Real user-submitted queries from competing or predecessor systems. Almost as good. Forum questions, support tickets, search logs.
SME-generated examples. A domain expert writes inputs and ground-truth answers. Useful when no production data exists. Watch out for SME bias toward easy or canonical cases.
Synthetic generation from real documents. Use Gemini to produce Q/A pairs from your corpus. Cheap, fast, useful for bootstrapping. Lower signal than the others. Always seed with a few human examples to anchor style.

The right mix is usually a portfolio. 60% real traffic, 25% SME-generated edge cases, 15% synthetic stress tests.

Store goldens in BigQuery, in a table per use case. Versioned. The schema we use:

CREATE TABLE evals.goldens_contract_review (
  golden_id          STRING NOT NULL,
  version            STRING NOT NULL,    -- 'v1.0', 'v1.1', ...
  category           STRING,             -- 'termination_clause', 'liability', ...
  difficulty         STRING,             -- 'easy', 'medium', 'hard', 'edge_case'
  input              JSON NOT NULL,      -- request payload
  expected_output    JSON,               -- ground truth answer
  expected_tool_calls ARRAY<STRING>,     -- for agents
  metadata           JSON,               -- annotation source, SME, date
  created_at         TIMESTAMP NOT NULL,
  is_active          BOOL NOT NULL
);

Versioning matters. When you correct a golden (because the ground truth was wrong, or the use case evolved), you bump the version. Each past run stays tied to the golden version it was scored against, so old scores remain interpretable.
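
One way to read "bump the version" is: never update a row, append a corrected one. The sketch below illustrates that on in-memory rows shaped like the table above. `correct_golden` is a hypothetical helper, and deactivating the old row while appending the new one is an assumption about how the `version` and `is_active` columns are used.

```python
from datetime import datetime, timezone


def correct_golden(
    rows: list[dict], golden_id: str, new_expected: dict, new_version: str
) -> list[dict]:
    """Return a new row list where the correction is appended, not applied in place."""
    updated = []
    replacement = None
    for row in rows:
        if row["golden_id"] == golden_id and row["is_active"]:
            # Keep the old row, deactivated, so past runs can still join to it.
            updated.append({**row, "is_active": False})
            replacement = {
                **row,
                "version": new_version,
                "expected_output": new_expected,
                "created_at": datetime.now(timezone.utc),
            }
        else:
            updated.append(row)
    if replacement is None:
        raise KeyError(f"no active golden with id {golden_id!r}")
    updated.append(replacement)
    return updated


rows = [{
    "golden_id": "g-001",
    "version": "v1.0",
    "is_active": True,
    "expected_output": {"termination_days": 30},
    "created_at": datetime(2025, 1, 1, tzinfo=timezone.utc),
}]
rows_v11 = correct_golden(rows, "g-001", {"termination_days": 60}, "v1.1")
```

After the call, `rows_v11` holds both the original v1.0 row (inactive) and the v1.1 correction (active), and the input list is untouched.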

The scoring function

This is where teams overcomplicate things. There are three layers of scoring, used together:

Layer 1: Exact-match and structural checks

For anything with structured output (JSON, classifications, extractions), exact match on the relevant fields is cheap and unambiguous. Always include.

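def score_extraction(actual: dict, expected: dict) -> dict:
    """Field-level exact-match checks plus a structural check on the output."""
    # validate_schema and EXPECTED_SCHEMA are project-specific and assumed to be
    # defined elsewhere; they are not part of any library.
    return {
        "termination_days_match": actual.get("termination_days") == expected.get("termination_days"),
        "liability_cap_match": actual.get("liability_cap") == expected.get("liability_cap"),
        "structure_valid": validate_schema(actual, EXPECTED_SCHEMA),
    }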

If you can score 80% of your outputs at this layer, do it. Don’t reach for LLM-as-judge when a regex will do.
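
The `validate_schema` helper used in `score_extraction` is not defined in this post. Below is a minimal stdlib-only sketch of what it could look like, assuming a hypothetical `EXPECTED_SCHEMA` that maps field names to allowed Python types. A real project might prefer a JSON Schema library.

```python
# Hypothetical schema: field name -> tuple of allowed Python types.
EXPECTED_SCHEMA = {
    "termination_days": (int, type(None)),
    "liability_cap": (int, float, type(None)),
}


def validate_schema(actual: dict, schema: dict) -> bool:
    """True if the fields match the schema exactly and every value has an allowed type."""
    if set(actual) != set(schema):
        return False  # missing or unexpected fields
    for field, allowed in schema.items():
        value = actual[field]
        # bool is a subclass of int in Python; reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            return False
    return True


print(validate_schema({"termination_days": 30, "liability_cap": 1_000_000}, EXPECTED_SCHEMA))
print(validate_schema({"termination_days": "30", "liability_cap": None}, EXPECTED_SCHEMA))
```

The first call passes; the second fails because `"30"` is a string, which is exactly the kind of silent format drift a structural check is there to catch.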

Layer 2: Embedding similarity

For free-text outputs where exact match doesn’t apply (summaries, explanations, paraphrases), embedding similarity between actual and expected gives you a continuous score. Useful for detecting whether the system is staying on topic.

Watch out: high embedding similarity doesn’t guarantee correctness. Two sentences can be semantically similar and one of them wrong. Use it for detecting regression, not for establishing correctness.

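import math

from google import genai

# PROJECT_ID and LOCATION are placeholders for your own GCP project and region.
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

def embed(text: str) -> list[float]:
    """Return the embedding vector for a single piece of text."""
    resp = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
    )
    return resp.embeddings[0].values

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def embedding_similarity(actual: str, expected: str) -> float:
    return cosine_similarity(embed(actual), embed(expected))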

Layer 3: LLM-as-judge

For things humans would judge subjectively (is this answer helpful, is this summary faithful, does this response address the user’s question), LLM-as-judge is the practical tool. A separate Gemini call rates the output against criteria.

JUDGE_PROMPT = """
You are evaluating whether an AI assistant's answer correctly addresses the user's question.

Question: {question}
Reference answer: {reference}
Actual answer: {actual}

Rate the actual answer on a scale of 1 to 5:
1 = wrong or off-topic
2 = partial answer with significant gaps
3 = mostly correct, minor issues
4 = correct and complete
5 = correct, complete, and well-explained

Output your reasoning, then the score. Format:
REASONING: ...
SCORE: N
"""
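
Because the prompt asks for a fixed `SCORE: N` line, the judge's reply can be parsed strictly instead of trusted. Here is a small sketch; `parse_judge_score` is a hypothetical helper, and returning `None` on anything malformed is a design choice so that bad judge replies show up as missing data, not as silent scores.

```python
import re
from typing import Optional

# Matches a line consisting only of "SCORE: N" where N is a single digit 1-5.
SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_judge_score(judge_text: str) -> Optional[int]:
    """Extract the 1-5 score from a judge reply; None if missing, invalid, or ambiguous."""
    matches = SCORE_RE.findall(judge_text)
    if len(matches) != 1:
        return None
    return int(matches[0])


print(parse_judge_score("REASONING: Covers the clause correctly.\nSCORE: 4"))  # 4
print(parse_judge_score("REASONING: Unsure.\nSCORE: 7"))  # None
```

Tracking how often the parser returns `None` is itself a useful health signal for the judge prompt.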

Critical: calibrate the judge. Take a sample of 30 to 50 outputs, have a human rate them on the same scale, run the judge over the same outputs, and measure how often the two agree. If the judge agrees with humans 80% of the time, it’s usable. If it agrees 50% of the time, your judge prompt needs work.
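
The calibration check reduces to comparing two lists of scores. Below is a minimal sketch with made-up ratings; `agreement_rate` is a hypothetical helper, and whether "agree" means an exact match or within one point on the 1 to 5 scale is a choice you should fix before looking at the numbers.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 0) -> float:
    """Share of items where judge and human scores differ by at most `tolerance`."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return hits / len(human)


# Illustrative ratings only, not real calibration data.
human_scores = [5, 4, 3, 2, 1, 4, 3, 5, 2, 4]
judge_scores = [5, 4, 4, 2, 1, 3, 3, 5, 2, 5]

exact = agreement_rate(human_scores, judge_scores)
within_one = agreement_rate(human_scores, judge_scores, tolerance=1)
print(exact, within_one)  # 0.7 1.0
```

In this toy sample the judge matches the human exactly on 7 of 10 items and is never off by more than one point. Looking at the disagreements by direction (does the judge score higher or lower?) usually tells you what to change in the prompt.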

Pairwise judging (giving the model two outputs and asking which is better) is more reliable than absolute scoring. Use it when comparing model versions or prompt changes.
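
A sketch of pairwise comparison follows. The `judge` callable is hypothetical; in practice it would wrap a Gemini call that returns which of two answers is better. Running each pair twice with the order swapped is a common mitigation for position bias and is an addition here, not something the judge prompt above requires.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def compare_versions(
    question: str, answer_a: str, answer_b: str, judge: PairwiseJudge
) -> str:
    """Ask the judge twice with the order swapped; only count a consistent verdict."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)
    a_wins_first = verdict_ab == "first"
    a_wins_second = verdict_ba == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive


# Toy judges for illustration only.
def longer_is_better(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def always_first(question: str, first: str, second: str) -> str:
    return "first"


print(compare_versions("q", "long answer", "short", longer_is_better))  # A
print(compare_versions("q", "long answer", "short", always_first))      # tie
```

The second toy judge always prefers whatever it sees first; the swap turns that bias into a "tie" instead of a false win.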

The pipeline

The Vertex AI Gen AI evaluation service is the GCP-native tool for running this. It handles golden dataset management, scoring (including built-in metrics and custom LLM-as-judge), and run comparison. At Cloud Next 2026 it also picked up agent-trajectory evaluation: scoring the full sequence of tool calls and intermediate reasoning steps an agent takes, not just the final answer. Relevant to anything you build with ADK or Agent Engine.

The end-to-end flow we set up:

┌──────────────────────────────────────────────────────────────────┐
│  Golden set (BigQuery)                                            │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  Vertex AI Evals pipeline                                          │
│   1. Load golden set                                               │
│   2. Run system under test (call your prod or staging API)         │
│   3. Score (exact match + embedding + LLM-as-judge)                │
│   4. Aggregate scores per category and difficulty                  │
│   5. Compare against previous run                                  │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  Results table (BigQuery)                                          │
│   run_id, system_version, prompt_version, model_version,           │
│   golden_id, score_components, overall_score                        │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  Looker Studio dashboard                                           │
│   Per-version scores, category breakdowns, regression alerts        │
└──────────────────────────────────────────────────────────────────┘

We trigger evals from Cloud Build on every prompt change, every tool definition change, and weekly on a schedule against production. Cloud Build compares the new run’s score against the previous run and fails the build if a regression of more than 3% on any category appears.
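
The gate itself is a small comparison over per-category scores. In this sketch, `find_regressions` and `gate` are hypothetical names, and "3%" is interpreted as a relative drop against the previous run; if you mean three percentage points, compare the raw difference instead. The build step would exit with the returned code.

```python
def find_regressions(
    previous: dict[str, float], current: dict[str, float], max_drop: float = 0.03
) -> dict[str, float]:
    """Return categories whose score fell by more than max_drop, relative to the previous run."""
    regressions = {}
    for category, prev_score in previous.items():
        if category not in current:
            regressions[category] = 1.0  # assumption: a missing category counts as a full regression
            continue
        if prev_score <= 0:
            continue  # nothing to regress from
        drop = (prev_score - current[category]) / prev_score
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return regressions


def gate(previous: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for the CI step: non-zero fails the build."""
    regressions = find_regressions(previous, current)
    for category, drop in regressions.items():
        print(f"REGRESSION {category}: -{drop:.1%}")
    return 1 if regressions else 0


# Illustrative scores only.
previous_run = {"termination_clause": 0.90, "liability": 0.80}
current_run = {"termination_clause": 0.91, "liability": 0.76}
exit_code = gate(previous_run, current_run)
```

Here `liability` fell from 0.80 to 0.76, a 5% relative drop, so the gate returns 1 even though `termination_clause` improved. Gating per category is what stops an improvement in one area from hiding a regression in another.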

Where teams get this wrong

They wait too long to start. The eval harness feels like the lowest-leverage thing to build until you have a system, at which point it becomes the highest-leverage thing. Build it first. See the engagement playbook on this.

Their goldens are too easy. A 95% pass rate on goldens that consist of softball questions doesn’t mean the system is good. Goldens should include the cases the system struggles with, weighted appropriately. The “edge_case” difficulty bucket is where the interesting signal lives.

They don’t version the golden set. When a golden is wrong (and some always are), they edit it in place. Then a past run’s score is no longer comparable. Always version.

They trust the LLM-as-judge without calibration. Judges have biases. They prefer longer answers, more formal tone, certain phrasings. If you haven’t checked how well your judge agrees with humans, you don’t know what your scores mean.

They run evals manually. Manual evals get skipped when the team is busy. Automated evals run on every change. Wire them into Cloud Build.

They don’t sample production. The eval set you ship with diverges from real usage within weeks. Sample 1% of production into the eval pipeline, and periodically promote interesting cases into the golden set.
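
One way to take a stable 1% sample is to hash a request identifier, so the same request is always in or out of the sample regardless of which service instance handles it. A minimal sketch, assuming each request carries a stable string id (`in_eval_sample` is a hypothetical helper):

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select about `rate` of requests, keyed on a stable request id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


# Rough check of the sampling rate on synthetic ids.
sampled = sum(in_eval_sample(f"req-{i}") for i in range(100_000))
print(sampled / 100_000)
```

Hashing instead of calling a random number generator keeps the decision reproducible, which helps when you later need to explain why a given request did or did not reach the eval pipeline.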

Operational evals: drift detection in production

Pre-ship evals are about whether to ship. Post-ship evals are about whether the system has drifted.

Track these continuously in production:

Refusal rate. How often is the system declining to answer? Sudden spikes mean something upstream changed.
Output length distribution. Sudden shifts in output length often indicate a prompt or model change.
Tool selection distribution (for agents). Are tool call frequencies stable, or has one tool started getting called twice as often?
Latency p50 / p95. Latency drift often indicates retrieval slowdowns or upstream model changes.
Cost per request. Token counts drifting upward means longer prompts, longer outputs, or both.

These are Cloud Monitoring metrics, not eval scores. Alerting on them catches the operational drift that quality evals miss.
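
As an illustration of the tool-selection check, the sketch below compares today's tool-call mix against a baseline using total variation distance. The function names and the sample data are hypothetical, and the resulting number is something you could export as a custom metric and alert on; the alert threshold is yours to tune.

```python
from collections import Counter


def tool_distribution(calls: list[str]) -> dict[str, float]:
    """Normalise a list of tool names into call-frequency shares."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two distributions: 0 = identical, 1 = disjoint."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


# Illustrative data: "lookup" doubles its share while "search" halves.
baseline = tool_distribution(["search"] * 60 + ["lookup"] * 30 + ["calc"] * 10)
today = tool_distribution(["search"] * 30 + ["lookup"] * 60 + ["calc"] * 10)
drift = total_variation(baseline, today)
print(round(drift, 3))  # 0.3
```

A single distance value is easier to alert on than a set of per-tool frequencies, and it also reacts when a tool appears or disappears entirely.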

A note on Vertex AI’s built-in evaluators

The Gen AI evaluation service ships with built-in metrics: BLEU, ROUGE, exact match, embedding similarity, plus pre-built LLM-as-judge templates for common tasks (summarization quality, question-answering helpfulness, instruction following) and the newer agent-trajectory metrics. They’re a good starting point. Layer your custom domain-specific judge on top of them.

The built-in templates are particularly useful because they’ve been validated against human judgments on benchmark datasets. Using them as one component of your overall score gives you a defensible baseline.

How Accelyze helps

We design and build eval harnesses as part of every GenAI engagement, in Phase 2 of our engagement playbook. The work includes golden set construction (often involving SME interviews), scorer design and calibration, Vertex AI Evals pipeline setup, and Cloud Build integration. We also help teams retrofit eval harnesses onto existing GenAI applications that shipped without one. If your team is shipping GenAI without measurable quality criteria, get in touch.
