Bricks and Kits: How to Build a Stable Evaluation Schema That Never Drifts

Published June 23, 2026 · 7 min read · Alex Handsaker

The most common problem with AI-based call evaluation is not that the model is wrong. It is that the evaluation produces different results on different runs of the same transcript, or that scores from three months ago are not comparable to scores from today because something in the evaluation logic changed between runs. Both problems share a root cause: the evaluation schema was never explicitly defined, versioned, or governed. When the schema is implicit, embedded in a prompt or in a model’s general behaviour, it drifts every time the model is updated or the prompt is adjusted, and nobody notices until the reporting starts producing anomalies.

Bricks and Kits are Semarize’s approach to the schema problem. Bricks are the typed building blocks of an evaluation: individual, specific criteria that produce concrete typed output. Kits are versioned groups of Bricks that together constitute a stable evaluation schema. The combination produces a system where the evaluation logic is explicit, the output is typed, and changes to the schema are tracked rather than accumulated silently.

What a Brick actually is

A Brick is a single evaluation criterion with a defined question, a defined output type, and explicit scoring logic. It tests for one specific observable thing in a transcript and returns a concrete value of a declared type: boolean, score on a defined scale, categorical value from a defined set, or extracted string. A Brick for “economic buyer identified” returns a boolean. A Brick for “discovery quality” returns a score from one to five against defined anchors. A Brick for “competitor mentioned” returns a string with the competitor name or a null.
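As a rough illustration of what "typed output" means here, a Brick can be modelled as a small immutable record that knows how to validate a candidate value. This is a minimal sketch, not Semarize's actual data model: the `Brick` class, its field names, and the three example Bricks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union

BrickValue = Union[bool, int, str, None]


@dataclass(frozen=True)
class Brick:
    """Hypothetical model of a Brick: one question, one declared output type."""
    key: str                          # field name used in the evaluation output
    question: str                     # the single observable thing being tested
    output_type: str                  # "boolean" | "score" | "category" | "string"
    allowed: Tuple[str, ...] = ()     # valid values when output_type == "category"
    scale: Tuple[int, int] = (1, 5)   # inclusive bounds when output_type == "score"
    nullable: bool = False            # whether null is a valid value

    def validate(self, value: BrickValue) -> BrickValue:
        """Return the value unchanged if it fits the declared type, else raise."""
        if value is None:
            if self.nullable:
                return None
            raise ValueError(f"{self.key}: null is not allowed")
        if self.output_type == "boolean":
            ok = isinstance(value, bool)
        elif self.output_type == "score":
            # bool is a subclass of int in Python, so exclude it explicitly
            ok = (isinstance(value, int) and not isinstance(value, bool)
                  and self.scale[0] <= value <= self.scale[1])
        elif self.output_type == "category":
            ok = isinstance(value, str) and value in self.allowed
        elif self.output_type == "string":
            ok = isinstance(value, str)
        else:
            raise ValueError(f"{self.key}: unknown output type {self.output_type!r}")
        if not ok:
            raise ValueError(f"{self.key}: {value!r} is not a valid {self.output_type}")
        return value


# Illustrative Bricks matching the three examples in the text
economic_buyer = Brick("economic_buyer_identified",
                       "Was an economic buyer identified by name or role?", "boolean")
discovery_quality = Brick("discovery_quality",
                          "How complete was discovery, against defined anchors?", "score")
competitor = Brick("competitor_mentioned",
                   "Which competitor, if any, was named?", "string", nullable=True)
```

Under these assumptions, `discovery_quality.validate(4)` passes and `discovery_quality.validate(6)` raises, which is the point: a value outside the declared type never reaches a downstream system.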

The specificity requirement is what distinguishes a Brick from a prompt. A prompt asks the model to assess something and relies on the model to interpret what that assessment means. A Brick specifies exactly what evidence in the transcript would confirm the criterion, what type the output takes, and what the valid values are. Applied to the same transcript on two separate runs, a well-designed Brick returns the same value. Applied to the same transcript by two different reviewers, a well-designed Brick produces the same conclusion. Consistency across runs and reviewers is the test that distinguishes a usable Brick from a vague prompt.

Bricks can be reused across multiple Kits. A discovery quality Brick defined once can appear in a sales coaching Kit, a MEDDIC scoring Kit, and a pipeline health Kit, applying the same criterion and returning the same output type in each context. Updating a Brick definition does not change any Kit version that is already locked: the current Kit version continues to use the Brick as it was defined when the Kit was locked, and a Kit adopts the updated Brick only when a new Kit version is created.

What a Kit does

A Kit is a versioned group of Bricks that together define the evaluation schema for a specific use case. A discovery Kit groups the Bricks that together assess whether a discovery call covered the necessary qualification criteria. A QA Kit groups the Bricks that together check whether a call met the team’s compliance and process requirements. A win/loss Kit groups the Bricks that together capture the competitive and deal health signals useful for post-deal analysis.

The output of running a Kit against a transcript is a JSON object with one typed field per Brick in the Kit. Every call processed against the same Kit version produces an output with the same fields, the same types, and the same possible values. This consistency is what makes the output usable for CRM enrichment, warehouse analytics, and any downstream system that needs to reason about call data programmatically.
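A downstream consumer can check that contract mechanically. The sketch below assumes a hypothetical three-Brick discovery Kit and an example output payload; the field names and the `conforms` helper are illustrative, not part of any documented Semarize API.

```python
import json

# Hypothetical Kit schema: output field name -> accepted Python types
DISCOVERY_KIT_V1 = {
    "economic_buyer_identified": (bool,),
    "discovery_quality": (int,),
    "competitor_mentioned": (str, type(None)),
}


def conforms(kit_fields: dict, output: dict) -> bool:
    """True if the output has exactly the Kit's fields, each with an accepted type."""
    if set(output) != set(kit_fields):
        return False
    for name, types in kit_fields.items():
        value = output[name]
        # bool is a subclass of int in Python, so reject booleans in int fields
        if isinstance(value, bool) and bool not in types:
            return False
        if not isinstance(value, types):
            return False
    return True


# Example payload, invented for illustration
raw = ('{"economic_buyer_identified": true, '
       '"discovery_quality": 4, '
       '"competitor_mentioned": null}')
sample_output = json.loads(raw)
sample_ok = conforms(DISCOVERY_KIT_V1, sample_output)
```

A check like this would typically sit at the boundary of the CRM sync or warehouse load, so that an output with a missing or mistyped field is rejected rather than written as a silent null.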

The versioning is what makes the output comparable over time. A Kit version is locked when it is deployed to production: the Brick definitions, output types, and valid values are fixed at that point. Running calls against the same Kit version two months apart produces outputs that were evaluated against identical criteria. When the evaluation logic needs to change, a new Kit version is created, and the old version remains available for any calls that are still processed against it. Historical calls can be explicitly rescored against the new version, so that old and new calls can be compared on identical criteria.
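One way to picture a locked version is as an immutable record: changing the criteria means constructing a new record, never editing the old one. The sketch below is an assumption about how such locking could be modelled; `KitVersion`, `fingerprint`, and `next_version` are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Iterable, Tuple

BrickSpec = Tuple[str, str]  # (output field name, output type)


@dataclass(frozen=True)
class KitVersion:
    """Hypothetical locked Kit version; frozen, so it cannot be edited in place."""
    name: str
    version: int
    bricks: Tuple[BrickSpec, ...]

    def fingerprint(self) -> str:
        """Short hash of the definition, usable to confirm two runs used the same schema."""
        payload = json.dumps(
            {"name": self.name, "version": self.version, "bricks": self.bricks},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def next_version(current: KitVersion, bricks: Iterable[BrickSpec]) -> KitVersion:
    """Create a new version with changed Bricks; the current version is left untouched."""
    return KitVersion(current.name, current.version + 1, tuple(bricks))


v1 = KitVersion("discovery", 1, (
    ("economic_buyer_identified", "boolean"),
    ("discovery_quality", "score"),
))
v2 = next_version(v1, v1.bricks + (("competitor_mentioned", "string"),))
```

Storing the fingerprint alongside each scoring record is one possible design choice: it gives reporting a cheap way to confirm that two scores were produced under an identical definition.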

Why schema drift happens and what it costs

Schema drift in call evaluation happens in four common ways. The first is model updates: when an AI model used for evaluation is updated, its behaviour changes, and evaluation outputs that depended on specific model behaviour start returning different values without any change to the criteria. The second is prompt changes: when the instructions passed to the model are adjusted, even slightly, the outputs shift in ways that are difficult to predict and impossible to detect without running the same calls before and after the change.

The third is criterion creep: new evaluation criteria are added informally over time, old ones are quietly removed, and the schema the downstream systems expect starts diverging from the schema the evaluation actually produces. The fourth is field name changes: a criterion is renamed or restructured, the CRM field that was mapped to the old name no longer receives values, and the gap in the data is only noticed months later when a report starts showing unexpected nulls.

The cost of each of these is a dataset that looks continuous but contains invisible breaks. Time-series analyses that compare this month’s scores to last month’s are comparing values produced by different criteria against different logic, which makes the trend meaningless. Before-and-after training assessments that should show whether coaching improved performance show noise instead. Pipeline health models built on conversation signals start producing outliers that nobody can explain.
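Criterion creep and field renames, the third and fourth kinds of drift described above, are the easiest to detect mechanically: compare the fields a downstream system expects with the fields the evaluation actually emits. A minimal sketch, with invented field names and a hypothetical `diff_schema` helper:

```python
def diff_schema(expected: dict, actual: dict) -> dict:
    """Compare two {field name: output type} mappings and report the differences."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),     # expected but no longer produced
        "unexpected": sorted(set(actual) - set(expected)),  # produced but never agreed
        "retyped": sorted(k for k in shared if expected[k] != actual[k]),
    }


# What the CRM mapping expects (illustrative)
expected_fields = {
    "economic_buyer_identified": "boolean",
    "discovery_quality": "score",
    "competitor_mentioned": "string",
}

# What the evaluation currently emits after informal changes (illustrative)
actual_fields = {
    "economic_buyer_identified": "boolean",
    "discovery_quality": "category",   # output type changed
    "competitor_name": "string",       # field renamed
}

drift = diff_schema(expected_fields, actual_fields)
# drift == {"missing": ["competitor_mentioned"],
#           "unexpected": ["competitor_name"],
#           "retyped": ["discovery_quality"]}
```

A check of this kind catches structural drift only. Drift caused by model updates or prompt changes leaves the field names intact and needs a repeatability test on the values themselves.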

Designing Bricks that do not drift

A Brick that does not drift is one whose output is determined by evidence in the transcript rather than by model interpretation. The design principle is to make the criterion answerable from observable transcript content, with as little room for interpretation as possible. A criterion that asks whether the buyer described a specific financial consequence of their current problem is answerable from transcript evidence. A criterion that asks whether the rep demonstrated strong rapport requires the model to interpret what strong rapport means and is susceptible to drift with every model update.

The test is repeatability: run the same transcript against the Brick twice with a gap between runs. If the output is the same both times, the Brick is stable. If it varies, the criterion is too interpretive and needs to be tightened. For teams building Bricks for the first time, starting with the most concrete, evidence-grounded criteria and working toward more interpretive ones as the evaluation programme matures is a more reliable path than attempting to capture complex qualitative judgements in early Bricks.
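The repeatability test can be written down as a small harness. In the sketch below the two evaluators are stand-ins, not real model calls: `mentions_cost` is a deterministic keyword check, and `rapport_score` deliberately alternates its answer to imitate an interpretive criterion. All names are hypothetical, and the time gap between runs is not modelled.

```python
from itertools import cycle
from typing import Any, Callable


def is_repeatable(evaluate: Callable[[str], Any], transcript: str, runs: int = 2) -> bool:
    """Run the same evaluator on the same transcript several times and compare results."""
    results = [evaluate(transcript) for _ in range(runs)]
    return all(result == results[0] for result in results)


def mentions_cost(transcript: str) -> bool:
    """Stand-in for an evidence-grounded Brick: answerable from the text alone."""
    return "costing us" in transcript.lower()


_alternating = cycle([3, 4])


def rapport_score(transcript: str) -> int:
    """Stand-in for an interpretive criterion whose answer changes between runs."""
    return next(_alternating)


transcript = "Buyer: The manual process is costing us about two days a month."
stable = is_repeatable(mentions_cost, transcript)
unstable = is_repeatable(rapport_score, transcript)
```

In practice `evaluate` would wrap whatever call produces the Brick's value, and the harness would run over a fixed sample of transcripts rather than a single one.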

Common questions

How many Bricks does a typical Kit contain?

Most production Kits contain between five and fifteen Bricks. Fewer than five and the Kit typically does not provide enough signal for the use case. More than fifteen and the Kit starts covering ground that belongs in a separate Kit for a different purpose: a MEDDIC Kit and a QA Kit are more useful as separate schemas than as a single twenty-Brick Kit that mixes qualification and compliance criteria. The right size is determined by what the Kit’s output will be used for and who will act on the results.

What happens to historical call data when a Kit is updated?

Historical calls retain the scores produced under the Kit version that was active when they were processed. A new Kit version does not overwrite them. If historical calls need comparable scores under the new version, those calls can be reprocessed against it explicitly. The old and new scores are stored separately, with the Kit version as a field on each scoring record, so reporting models can filter by version when comparison consistency is required.
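As a sketch of what filtering by version looks like in a reporting model, assume each scoring record carries a `kit_version` field. The record layout, call IDs, and scores below are invented for illustration.

```python
from statistics import mean

# Hypothetical scoring records; call c1 was scored under v1 and later rescored under v2
records = [
    {"call_id": "c1", "kit": "discovery", "kit_version": 1, "discovery_quality": 3},
    {"call_id": "c1", "kit": "discovery", "kit_version": 2, "discovery_quality": 4},
    {"call_id": "c2", "kit": "discovery", "kit_version": 2, "discovery_quality": 5},
]


def scores_for_version(rows: list, kit: str, version: int, field: str) -> list:
    """Return one field's values, restricted to a single Kit version."""
    return [row[field] for row in rows
            if row["kit"] == kit and row["kit_version"] == version]


v1_scores = scores_for_version(records, "discovery", 1, "discovery_quality")
v2_scores = scores_for_version(records, "discovery", 2, "discovery_quality")
v2_average = mean(v2_scores)  # averages only scores produced under version 2
```

Because the version 1 score for call `c1` is kept rather than overwritten, both values remain available: the version 2 average reflects one set of criteria, and the version 1 record still shows what was reported at the time.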

Can the same Brick appear in multiple Kits?

Yes. Bricks are reusable across Kits, which is part of what makes the schema model efficient to maintain. A discovery quality Brick defined once can be included in a MEDDIC Kit, a coaching Kit, and a pipeline health Kit. When the Brick is updated, each Kit that includes it can adopt the update by creating a new Kit version with the updated Brick, or continue using the current version with the original Brick definition until a version update is appropriate.
