
Evaluate AI Agents at Scale With Structured Checks

Published June 3, 2026 · 10 min read · Alex Handsaker

An AI agent ships, the dashboard turns green, and a few weeks later a customer quotes it saying something your product has never done. The evaluation meant to catch it returned a confident paragraph for every conversation, none of which you could turn into a threshold, an alert, or a trend line. That is the normal state of AI agent evaluation: output that reads as intelligent and tells you almost nothing you can act on.

The judge is rarely the weak link. What is missing is a fixed output. To evaluate AI agents at scale you need a locked, deterministic schema: a versioned set of typed checks that scores every conversation the same way and returns fields instead of prose, such as whether the agent answered the question that was asked, stayed inside its role, and made only claims your documentation supports. Run that schema against 100% of conversations, ground it in your own knowledge so accuracy is judged against approved answers rather than plausibility, and the result is evaluation you operate on rather than read, at full volume and without a manual annotation queue.

Figure (hand-sketched comparison of freeform judge prose, with no threshold or trend, and a locked contract returning typed fields for thresholds, alerts, and trends): Agent evaluation becomes operational when judge output is constrained to typed fields instead of prose.

Why freeform evaluation doesn’t scale

The standard advice when scores feel unreliable is to write a stronger LLM-as-judge prompt or label more data. Neither fixes the underlying issue, which is what the model is asked to return. When evaluation asks the judge to “assess the quality of this conversation and provide feedback”, two separate runs on the same conversation produce two different analyses, each internally coherent and neither directly comparable with the other. You can’t run aggregate statistics on prose, you can’t set a pass threshold on a paragraph, and you can’t detect drift by diffing two sets of unstructured text.

Manual annotation solves the consistency problem but creates a volume problem. AI agents generate far more conversations than a human annotation team can keep pace with, and coverage falls further behind each time the underlying model is updated. Evaluation needs to run at full conversation volume, consistently, without degrading as the deployment scales, and a labelling queue moves in the opposite direction.

Keyword matching catches surface patterns but misses semantic accuracy. An agent can use the right words in the wrong order, or state an approved fact in a context that makes it misleading, and the match returns green. None of the standard alternatives (manual annotation, a freeform LLM judge, or keyword matching) produces what a production AI evaluation pipeline actually needs: a locked schema running the same checks and returning typed values every run.

How to evaluate AI agents at scale without manual annotation

The reframe is in the unit of evaluation. Instead of asking a judge to react to the freeform text of the agent's response, you score a set of structured signals, each assessed the same way on every run. A deterministic evaluation contract is a set of criteria, each defined as a specific answerable question, each returning a typed value: a boolean, a score on a defined scale, an extracted string, or a categorical flag. The question must be answerable from evidence in the conversation, and the same conversation is scored against the same typed question every run, so results are comparable rather than re-interpreted.
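As a rough sketch of what such a contract can look like in code, the Python below defines each criterion as a question plus an output type and rejects any judge output that does not fit. The `Check` class, the `validate` function, the check names, and the 1 to 5 scale are illustrative assumptions, not any product's API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical types, for illustration only.
OutputType = Literal["boolean", "score", "string", "category"]


@dataclass(frozen=True)
class Check:
    name: str
    question: str                   # one specific, evidence-answerable question
    output_type: OutputType
    scale: Optional[tuple] = None   # (low, high), used only by "score" checks
    categories: tuple = ()          # allowed labels, used only by "category" checks


def validate(check: Check, value: object) -> bool:
    """Return True only if `value` is a legal output for `check`."""
    if check.output_type == "boolean":
        return isinstance(value, bool)
    if check.output_type == "score":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if check.scale is None or isinstance(value, bool) or not isinstance(value, int):
            return False
        low, high = check.scale
        return low <= value <= high
    if check.output_type == "string":
        return isinstance(value, str)
    return value in check.categories  # "category"


answered = Check("answered_question",
                 "Did the agent answer the question the user actually asked?",
                 "boolean")
adherence = Check("instruction_adherence",
                  "Did the agent stay within its defined role?",
                  "score", scale=(1, 5))
```

Under these assumptions `validate(answered, True)` is accepted, while a prose reply such as `validate(answered, "Mostly, yes")` is rejected before it can reach a dashboard.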

Once the schema is locked, evaluation runs consistently regardless of which LLM executes the checks, because each check is defined in terms of observable evidence rather than holistic judgement. “Did the agent answer the question the user actually asked?” is deterministic and returns a boolean. “Was the agent's response high quality?” returns prose that reflects whatever the judge interpreted as quality on that particular run. Constraining the output to typed values is what makes the evaluation operationalisable: an LLM running a well-defined check returns a value in a defined range with supporting evidence, and the stability comes from the schema rather than from replacing the model with a different mechanism.

Building the contract with Bricks

In Semarize, each evaluation criterion is a Brick: a single typed check the API applies to a conversation and returns one concrete value for. A Brick asks one specific, evidence-answerable question and returns one typed output, a boolean, a score, an extracted string, or a categorical flag, and it doesn’t summarise, interpret, or produce prose. The specificity requirement is what keeps results consistent across runs, because a narrow question has a checkable answer.

For agent quality, Bricks map directly onto the dimensions a platform team cares about. A response relevance Brick checks whether the answer addressed the user's actual question rather than an adjacent one. An instruction adherence Brick scores whether the agent stayed within its defined role and followed the system rules it was given. A tone Brick flags whether the response matched the required register and brand voice. A safety Brick flags whether the response avoided prohibited or unsafe content. A claim accuracy Brick checks whether a factual statement the agent made is supported by your approved documentation. Each one is a single typed field, scored the same way on every conversation.

Kits group related Bricks into versioned evaluation schemas. A Kit for support agent quality might include Bricks for response relevance, instruction adherence, claim accuracy, tone, and safety, and running it against every conversation returns one structured JSON object with one typed field per Brick: an object of the same shape, from the same contract, on every run. Kits are versioned, so the evaluation contract is stable across deployments and any change to it is explicit rather than silent, which is what lets you trust a comparison between numbers taken two months apart.
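To make the "same shape on every run" idea concrete, here is a minimal sketch of a shape check on one result. The Kit name, the version string, the `kit_version` and `fields` keys, and the field types are assumptions made up for this example; the real response format may differ.

```python
import json

# Hypothetical Kit definition: names, version format, and field types are
# illustrative assumptions, not an actual schema.
KIT = {
    "name": "support_agent_quality",
    "version": "1.2.0",
    "bricks": {
        "response_relevance": bool,
        "instruction_adherence": int,
        "claim_accuracy": bool,
        "tone": str,
        "safety": bool,
    },
}


def matches_kit(result: dict, kit: dict) -> bool:
    """True if the result names this Kit version and has exactly one correctly typed field per Brick."""
    if result.get("kit_version") != kit["version"]:
        return False
    fields = result.get("fields", {})
    if set(fields) != set(kit["bricks"]):
        return False
    # `type(...) is` keeps a boolean from passing as an integer score.
    return all(type(fields[name]) is expected for name, expected in kit["bricks"].items())


raw = """{"kit_version": "1.2.0",
          "fields": {"response_relevance": true, "instruction_adherence": 4,
                     "claim_accuracy": false, "tone": "on_brand", "safety": true}}"""
result = json.loads(raw)
print(matches_kit(result, KIT))
```

Checking the version alongside the fields is the design choice that matters here: a result produced under a different contract version is treated as a different measurement rather than silently mixed into the same trend.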

Grounding evaluation in your knowledge base

The failure mode that matters most is the confident wrong claim: the agent states something that sounds accurate, receives no pushback from the user, and is only identified as wrong when a downstream consequence makes it visible. Generic evaluation can’t catch this because it has no reference for what “correct” looks like for your product, your docs, or your approved answers. It can only assess plausibility, and plausibility is exactly the test a hallucination passes.

Grounded evaluation changes the check from “does this sound right” to “does this match what you have defined as right.” Knowledge grounding lets each Brick read the document that defines the correct answer for that dimension: a claim accuracy Brick reads your approved product docs, a pricing Brick reads your rate card, a qualification Brick reads your ICP definition. When the agent states something the grounding document doesn’t support, the Brick checks the claim against that document and returns a flag with the transcript quote and the document reference, so the result is verifiable evidence rather than a sentiment read. That evidence is what makes hallucination detection defensible when someone asks why a conversation was flagged.
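One way to picture a grounded result is as a small record that carries its own evidence. The sketch below is an assumption about shape only: `GroundedResult`, its field names, and the example quote and document path are invented for illustration and are not the actual response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GroundedResult:
    """Hypothetical shape for one grounded check on one conversation."""
    brick: str
    supported: bool                          # False means the claim is not backed by the document
    transcript_quote: Optional[str] = None   # what the agent actually said
    document_ref: Optional[str] = None       # where the approved answer lives

    def is_auditable(self) -> bool:
        """A flagged claim is auditable only if both pieces of evidence are attached."""
        return self.supported or bool(self.transcript_quote and self.document_ref)


# Invented example values, not real product data.
flag = GroundedResult(
    brick="claim_accuracy",
    supported=False,
    transcript_quote="Yes, the Starter plan includes SSO.",
    document_ref="product-docs/plans#starter",
)
print(flag.is_auditable())
```

Treating a flag without a quote and a reference as incomplete keeps every failed check answerable when someone asks why it fired.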

The practical discipline is keeping grounding documents bounded in purpose. A document that covers too many claims gives the Brick too broad a scope to check precisely, and the answer gets fuzzy. One document, one dimension, one Brick is the pattern that produces auditable hallucination detection at scale, and it’s also what lets a non-engineer read the evidence and agree with the flag.

Figure (hand-sketched pipeline showing an agent conversation evaluated by one Brick check grounded against a knowledge document and returning a typed result plus evidence): Grounded checks compare each claim against the document that defines the approved answer.

Operationalising structured scores into thresholds, alerts, and feedback loops

Typed scores are only useful once you act on them, and that’s where the behavioural shift lives: stop sampling, stop reading judge prose, and operationalise the structured output. The contract returns typed JSON per run via API and webhook, and from there each field has an obvious job. Thresholds turn fields into alerts: a safety flag that trips or a response relevance score under your floor pages the on-call owner or opens a ticket. Distributions turn fields into dashboards: the share of conversations passing instruction adherence, the rate of grounding failures on product claims, plotted over time so trends are legible at a glance.
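A minimal sketch of both jobs, assuming each conversation's result arrives as a flat dictionary. The field names `safety_passed` and `response_relevance`, the 1 to 5 scale, and the floor of 3 are hypothetical; substitute whatever your own contract defines.

```python
# Hypothetical field names and floor; tune these to your own contract.
RELEVANCE_FLOOR = 3  # assumed 1-5 scale


def alerts_for(result: dict) -> list:
    """Return the names of fields in one conversation's result that should raise an alert."""
    tripped = []
    if result.get("safety_passed") is False:
        tripped.append("safety_passed")
    score = result.get("response_relevance")
    if score is not None and score < RELEVANCE_FLOOR:
        tripped.append("response_relevance")
    return tripped


def pass_rate(results: list, field: str) -> float:
    """Share of conversations in which a boolean field is True (the number a dashboard would plot)."""
    values = [r[field] for r in results if field in r]
    if not values:
        raise ValueError(f"no values for field {field!r}")
    return sum(values) / len(values)
```

`alerts_for` runs per conversation and feeds paging or ticketing, while `pass_rate` runs over a window of conversations and feeds the trend line. Neither function needs to read any prose.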

Results then feed back into the system that produced them. A cluster of low response relevance scores on a particular intent points at a retrieval or prompt gap; a rise in tone failures points at a register the system prompt doesn’t enforce; a grounding failure with its quote and document reference attached tells the team exactly which doc to correct or which guardrail to add. Because every conversation carries its evidence, the loop from a flagged field to a concrete fix is short, and the fix is verifiable on the next run.

Detecting drift and measuring stability over time

Vendor model updates aren’t always announced with enough specificity to tell you whether they affect the dimensions you care about. Without a locked schema running on 100% of production conversations, quality changes after an update stay invisible until a pattern of complaints accumulates, by which point the regression has been running for weeks. A deterministic contract gives you the baseline to catch it, because the same schema runs against every conversation and produces the same output structure, so you can track each field's distribution over time.

Stability is the thing you are actually measuring: the same conversation, scored against the same schema, should land in the same place in the distribution run after run. When instruction adherence scores shift after a model update, or grounding failures on product claims climb, the change is visible in the data before it reaches users. The evaluation reports what changed in the agent's behaviour against your standards, which is the signal that matters, regardless of what shifted inside the vendor's model. That is also the line between this and a vendor analytics dashboard: completion rates, sentiment scores, and escalation counts measure the agent against the vendor's model of a good interaction, while a locked contract measures it against yours.
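One simple way to decide whether a shift in a pass rate is more than noise is a two-proportion z-test between a baseline period and the current period. The article does not prescribe a statistical method, so treat the test itself, the function names, and the cut-off of 3.0 as assumptions chosen for this sketch.

```python
import math


def drift_z(baseline_pass: int, baseline_total: int,
            current_pass: int, current_total: int) -> float:
    """z-statistic for the drop in pass rate from baseline to current (positive means worse)."""
    if baseline_total <= 0 or current_total <= 0:
        raise ValueError("both periods need at least one conversation")
    p_base = baseline_pass / baseline_total
    p_now = current_pass / current_total
    pooled = (baseline_pass + current_pass) / (baseline_total + current_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total))
    if se == 0:
        return 0.0  # every conversation passed, or every one failed, in both periods
    return (p_base - p_now) / se


def has_drifted(z: float, threshold: float = 3.0) -> bool:
    """Flag only drops that are large relative to sampling noise."""
    return z > threshold


# Illustrative counts: 940 of 1,000 passing before an update, 880 of 1,000 after.
z = drift_z(940, 1000, 880, 1000)
print(round(z, 2), has_drifted(z))
```

Because the contract scores every conversation, the totals are real production counts rather than a sample, which is what makes a comparison like this meaningful within days of an update.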

Figure (hand-sketched drift monitoring chart showing pass rate over time dropping after a model update and triggering an alert before complaints): A locked schema gives you a baseline for catching quality drift after model or prompt changes.

Applying the same contract to human and agent conversations

One immediate benefit of the contract-based approach is that the same Kit runs against human rep conversations and AI agent conversations without any change to the schema. A team already scoring human calls with a Kit can extend evaluation to agent conversations by adding a new conversation source, not by building new infrastructure, and the fields, grounding documents, and output types stay identical. The relationship between deterministic rubrics and full-coverage scoring is covered in more depth in scoring 100% of conversations without manual review.

The comparison then becomes direct. Pull the field distributions for human rep conversations and AI agent conversations across the same period and compare them dimension by dimension. Where the agent outperforms on instruction adherence, that is signal on where the human motion has gaps; where the agent underperforms on claim accuracy, that is where the grounding documents or the agent prompts need tightening. The shared evaluation contract is what makes the comparison interpretable rather than directional, because both sides were scored against the same typed checks.
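The dimension-by-dimension comparison can be sketched in a few lines, assuming each scored conversation is a dictionary with a `source` key of either `"human"` or `"agent"` plus one boolean per field. Those key names and values are assumptions for the example.

```python
from collections import defaultdict


def pass_rates_by_source(rows: list, fields: list) -> dict:
    """Return {source: {field: pass_rate}} for boolean fields."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for row in rows:
        for field in fields:
            tally = counts[row["source"]][field]
            tally[0] += int(row[field])
            tally[1] += 1
    return {
        source: {field: passed / total for field, (passed, total) in per_field.items()}
        for source, per_field in counts.items()
    }


def agent_minus_human(rates: dict) -> dict:
    """Positive means the agent passes more often than human reps on that field."""
    return {field: rates["agent"][field] - rates["human"][field] for field in rates["human"]}
```

The subtraction is only meaningful because both sides were scored by the same Kit version; with two different rubrics the gap would measure the rubrics as much as the conversations.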

Semarize runs deterministic evaluation contracts against AI agent and human rep conversations at production scale, returning typed JSON with evidence for every check.


Common questions

What does a locked evaluation contract look like in practice?

It is a versioned Kit of typed Bricks, where each Brick asks one specific question about a conversation and returns one typed value: a boolean, a score on a defined scale, an extracted string, or a categorical flag. Running the Kit returns a JSON object of the same shape every time, with one field per Brick, so each conversation is scored against the same typed questions and results stay comparable rather than re-interpreted. Because the Kit is versioned, any change to the contract is explicit rather than silent, which is what lets you compare a score from this month against a score from last month and trust the comparison.

How do you ground hallucination detection in a company knowledge base?

Each Brick that checks a factual claim is attached to the document that defines what the correct claim looks like. A Brick checking product claim accuracy reads your approved product docs, and a Brick checking pricing accuracy reads your rate card. The Brick compares what the agent stated against what the document says and returns a boolean flag with the transcript quote and the document reference as evidence. One question, one document, one Brick produces auditable hallucination detection rather than a broad plausibility check, so anyone can read the evidence and agree with the flag.

Can LLM-as-judge be used if the outputs are structured JSON?

Yes. Where LLM-as-judge fails, it is usually because the implementation asks the judge to return freeform prose, which can’t be thresholded, trended, or routed. When the judge is constrained to return structured JSON with defined fields (a boolean for pass or fail, a score within a defined range, and a text field for the specific evidence), the output is operationalisable. The evaluation contract defines what the judge is allowed to return and the LLM executes the check, so stability comes from the schema rather than from replacing the LLM.

How do you set thresholds without overfitting to a small label set?

Set thresholds against the distribution you observe across 100% of conversations, not against a handful of hand-labelled examples. Once the same schema scores every conversation, you can read the real spread of each field and place a threshold where it separates acceptable from unacceptable outcomes at volume. Start with a conservative floor on safety and response relevance, watch the alert rate, and adjust against the live distribution rather than a fixed label set, so the threshold tracks production behaviour rather than a sample that may not represent it.
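As a hedged sketch of placing a floor against the observed distribution, the function below picks the score under which roughly a chosen share of conversations fall. Working backwards from a target alert share is an assumption of this example, as are the function names; it is one reasonable starting point, not the only one.

```python
def floor_for_alert_share(scores: list, alert_share: float) -> float:
    """Return a floor such that roughly `alert_share` of observed scores fall below it."""
    if not scores:
        raise ValueError("need at least one observed score")
    if not 0 <= alert_share < 1:
        raise ValueError("alert_share must be in [0, 1)")
    ordered = sorted(scores)
    index = min(int(len(ordered) * alert_share), len(ordered) - 1)
    return ordered[index]


def share_below(scores: list, floor: float) -> float:
    """Share of observed scores that would alert at this floor."""
    return sum(score < floor for score in scores) / len(scores)
```

On a coarse scale with many tied scores, the share that actually falls below the returned floor can be well under the target, so it is worth checking `share_below` against live data before wiring the floor to an alert.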

What does quality drift monitoring measure and how often should it run?

Quality drift monitoring tracks the distribution of evaluation field values over time against a baseline. For each Brick in your Kit you establish a baseline from a normal production period, then monitor for shifts, so a drop in instruction adherence pass rate or a rise in grounding failures on product claims signals a change in agent behaviour. For high-volume deployments, daily aggregates catch changes within a day or two of a model update. For lower-volume deployments, weekly cohort comparisons against the baseline serve the same purpose.
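The daily aggregate described here can be sketched as a group-by on date. The input shape, a sequence of `(day, passed)` pairs for one Brick, and the tolerance parameter are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def daily_pass_rates(rows):
    """rows: iterable of (day, passed) pairs. Returns {day: pass rate} in date order."""
    tallies = defaultdict(lambda: [0, 0])  # day -> [passed, total]
    for day, passed in rows:
        tallies[day][0] += int(passed)
        tallies[day][1] += 1
    return {day: passed / total for day, (passed, total) in sorted(tallies.items())}


def days_below_baseline(rates, baseline, tolerance):
    """Days whose pass rate sits more than `tolerance` under the baseline rate."""
    return [day for day, rate in rates.items() if rate < baseline - tolerance]
```

For a lower-volume deployment the same functions work with a week-start date in place of the day, which gives the weekly cohort comparison without a second code path.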
