100% QA Scoring Without Manual Review: Deterministic Rubrics for Every Call
Manual QA sampling at 2–5% of calls has a coverage problem and a consistency problem. The coverage problem is obvious: reviewing one in twenty calls means coaching from a sample that may not represent what's actually happening across the team. The consistency problem is subtler but just as damaging: different reviewers assessing the same call against the same rubric regularly reach different scores. Coaching built on inconsistent scores produces inconsistent direction - and often, no change at all.
The standard response is to add reviewers and tighten rubric definitions. That can reduce disagreement at the margins, but it doesn't solve the coverage problem - you still can't manually review every call at scale - and it doesn't eliminate the interpretive element that causes inconsistency in the first place. The problem is structural. Manual review is the measurement instrument, and at any useful scale, the instrument is unreliable.
What 100% QA actually means
Automated QA scoring across 100% of calls means running every recorded call through the same deterministic rubric, with no human reviewer required to generate the score. The rubric is a fixed evaluation schema with defined fields and defined output types. The same call, evaluated twice, produces the same score. Coverage is complete because the process runs automatically on everything, not selectively on what someone has time to review.
The fields in a contact centre QA rubric typically cover five areas: resolution quality (did the issue get resolved on this call), empathy demonstrated (specific language patterns that indicate acknowledgement of the customer's experience), script adherence (required disclosures and process steps completed), escalation handling (was a potential escalation identified and addressed correctly), and compliance disclosure (required regulatory language spoken at the required point in the call). Each field returns a boolean or a score, with the supporting evidence from the transcript attached.
That last point matters. The evaluation doesn't require a human to interpret whether the agent “sounded empathetic.” It checks whether specific language that operationalises empathy appeared in the transcript and returns the result with the quote that supports it. The score is grounded in evidence, not impression.
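To make that shape concrete, here is a minimal sketch of what a typed result for a single call could look like. The field names, types, and structure are illustrative assumptions, not a fixed specification:

from dataclasses import dataclass

@dataclass
class FieldResult:
    """One rubric criterion: a typed value plus the transcript evidence behind it."""
    value: bool | int     # boolean for pass/fail checks, integer for scored dimensions
    evidence: str | None  # verbatim transcript quote supporting the value, if any

@dataclass
class CallScore:
    """Deterministic rubric output for one call; every call gets the same fields."""
    call_id: str
    resolution_quality: FieldResult     # did the issue get resolved on this call?
    empathy_demonstrated: FieldResult   # acknowledgement language present before resolution?
    script_adherence: FieldResult       # required disclosures and process steps completed?
    escalation_handling: FieldResult    # potential escalation identified and addressed?
    compliance_disclosure: FieldResult  # regulatory language spoken at the required point?

Because every call produces the same fields with the same types, results can be compared across agents and across time without asking whose interpretation produced them.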

Why QA scorecards become theatre
Most QA programmes end up measuring the wrong thing. Teams that invest in conversation intelligence to improve coaching typically find themselves measuring what the agent did - script adherence, disclosure language, structured call steps - rather than whether the customer's problem was actually resolved. These are inputs to a good call, not evidence that the outcome was achieved. When QA optimises for process compliance, agents learn to produce calls that score well without necessarily solving the customer's problem.
The consequence is a QA scorecard that looks detailed but doesn't change outcomes. Scores improve. Customer satisfaction doesn't follow. This is the pattern that makes AI scorecards theatre: the evaluation is measuring whether the process was performed, not whether the customer understood anything, got their issue resolved, or left the call with confidence in the resolution. Shifting QA toward outcome evidence - resolution confirmed, understanding demonstrated - requires rubric fields that ask different questions than script adherence tracking does.
The difference between deterministic and subjective rubrics
Most QA rubrics are written for human reviewers, which means they allow for interpretive latitude. “Did the agent demonstrate empathy?” is a judgment call. “Did the agent use language acknowledging the customer's frustration before moving to resolution?” is an observable fact. The first produces reviewer disagreement. The second produces consistent results regardless of who - or what - runs the evaluation.
Deterministic rubrics translate every scoring criterion into an observable form. For each criterion: what specific evidence would indicate this was met? What would the transcript contain if the agent followed the required script at this point? If the criterion can't be defined as an evidence test in the transcript, it either needs reformulation or explicit recognition as a judgment call that requires human review. The practical test: can two people run the same criterion against the same transcript and reach the same result without discussing it first? If not, the definition needs tightening before automation can handle it reliably. The QA and compliance use case covers rubric design for automated scoring, including field definitions that hold up at scale.
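As an illustrative sketch of that translation - the wording and evidence standard below are assumptions, not a prescribed rubric - the deterministic version can be written so that the criterion names the evidence it requires:

# Illustrative only: a subjective criterion rewritten as an evidence test.
EMPATHY_SUBJECTIVE = "Did the agent demonstrate empathy?"  # judgment call, reviewer-dependent

EMPATHY_DETERMINISTIC = {
    "question": (
        "Did the agent use language acknowledging the customer's frustration "
        "before moving to resolution?"
    ),
    "output_type": "boolean",
    "evidence_standard": (
        "The transcript contains an acknowledgement statement spoken by the agent "
        "before the first resolution step; the supporting quote is returned with the result."
    ),
}

Two reviewers - or two automated runs - applying the deterministic version to the same transcript should reach the same answer, which is exactly the practical test described above.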

Where human review still belongs
Automated QA at 100% coverage doesn't eliminate human review - it changes what it's for. When scoring is automated, reviewers should focus on two things: calibrating the rubric (reviewing low-confidence scores and edge cases to refine the evaluation logic over time) and handling genuine judgment calls that the rubric can't capture deterministically (calls with regulatory implications, ambiguous context, or unusual circumstances where the evidence alone doesn't resolve the question).
In practice, this means manual review drops from covering a sample of all calls to covering a targeted slice of genuinely complex cases - roughly 2–3% rather than 5%, but with much higher signal density. The reviewer is no longer the measurement instrument. They're maintaining the accuracy of the instrument - ensuring the rubric stays calibrated as call patterns evolve and edge cases surface.
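A minimal sketch of that routing, assuming each field result carries a confidence value and the flags shown here - both assumptions layered on top of the rubric output, not features of any particular tool:

LOW_CONFIDENCE = 0.7  # assumed threshold, tuned during calibration

RUBRIC_FIELDS = (
    "resolution_quality", "empathy_demonstrated", "script_adherence",
    "escalation_handling", "compliance_disclosure",
)

def needs_manual_review(call_score: dict) -> bool:
    """Route a scored call to a human when the evidence alone doesn't settle it."""
    for field in RUBRIC_FIELDS:
        # calibration case: low-confidence results are reviewed to refine the rubric
        if call_score[field].get("confidence", 1.0) < LOW_CONFIDENCE:
            return True
    # judgment-call case: regulatory implications or unusual circumstances
    return bool(call_score.get("regulatory_flag") or call_score.get("unusual_circumstances"))

Everything that doesn't trip either condition is scored and filed without a reviewer touching it, which is what keeps the targeted slice small.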

What changes operationally
Complete coverage changes what coaching conversations are based on. Instead of working from a 5% sample, managers see a complete record. If an agent consistently defers escalation language, that pattern appears across fifty calls a month, not in the two that happened to get sampled. Coaching from the complete pattern is more accurate, more specific, and harder to dismiss as anecdotal. Agents can't argue that the reviewed calls weren't representative - because every call was reviewed.
Compliance management changes in parallel. Regulatory disclosure requirements that need to be met on every applicable call can be verified across every applicable call - producing an audit-ready record rather than a sampling note. When a disclosure was missed, you know which calls, which agents, and how consistently. That evidence base supports remediation, training decisions, and regulatory documentation in ways that a manually reviewed sample never can.
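A minimal sketch of the kind of question that evidence base can answer, assuming per-call results are stored as records with an agent identifier (the record shape is illustrative):

from collections import Counter

def missed_disclosures_by_agent(call_scores) -> list[tuple[str, int]]:
    """Rank agents by the number of calls where the compliance disclosure check failed.

    Assumes each record looks like:
    {"agent_id": "A17", "compliance_disclosure": {"value": False, "evidence": None}, ...}
    """
    misses = Counter()
    for score in call_scores:
        if score["compliance_disclosure"]["value"] is False:
            misses[score["agent_id"]] += 1
    return misses.most_common()

Because the input is every applicable call rather than a sample, the resulting counts are an audit record, not an extrapolation.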
How Semarize supports 100% QA scoring
Semarize is built around the evaluation contract model that makes deterministic QA possible at scale. Each rubric criterion is defined as a Brick: one specific question, one defined output type (boolean, score, or text), one evidence standard. A QA Kit collects the Bricks for your full rubric - resolution quality, empathy demonstrated, script adherence, escalation handling, compliance disclosure - and the same Kit runs against every call. The output is typed JSON: consistent fields, supporting quotes, the same structure every time. Because the schema is locked and versioned, model updates don't change scoring behaviour - results stay stable as your call volume grows.
Knowledge grounding is what separates deterministic from generic. Attach your compliance standards and the compliance Brick checks against your specific disclosure requirements - not a model's inference of what regulatory language looks like in general. Attach your escalation policy and the escalation Brick checks against your defined threshold. Attach your call script and the script adherence Brick checks for your specific required language at each point in the call. Critically, each Brick accesses only the knowledge relevant to its specific question - attention isn't diluted across the full knowledge base, so the evaluation is precise and calibrated to your standards rather than an industry average. The structured outputs land in your QA tooling and coaching workflows via the API after each call, without a manual review step.
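The exact payload depends on the Bricks in your Kit; as an illustrative sketch only (the field names, quotes, and structure here are assumed, not Semarize's documented schema), a per-call result might arrive as JSON along these lines:

{
  "call_id": "c_0193",
  "resolution_quality":    { "value": true,  "evidence": "Perfect, that's sorted now - thank you." },
  "empathy_demonstrated":  { "value": true,  "evidence": "I can see why that was frustrating." },
  "script_adherence":      { "value": false, "evidence": null },
  "escalation_handling":   { "value": true,  "evidence": "I'm flagging this to our billing team today." },
  "compliance_disclosure": { "value": false, "evidence": null }
}

The typed values drive reporting; the attached quotes are what make an individual score coachable.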
Semarize runs deterministic QA rubrics against every call and returns structured scores with supporting evidence. Define the rubric once; score every call the same way.
Common questions
How do you make a QA rubric deterministic instead of subjective?
Translate each scoring criterion into a specific observable fact in the transcript. Instead of “did the agent demonstrate empathy?” (a judgment call), write “did the agent use language acknowledging the customer's frustration before moving to resolution?” (an observable fact). For each criterion, define what evidence would need to appear in the transcript for it to score positively. If the criterion can't be written as an evidence test, it either needs reformulation or a flag for human review. The practical test: two people running the same criterion against the same call should reach the same result without discussing it.
What if our current QA categories are based on reviewer style rather than customer outcomes?
That's the most common starting condition. Most rubrics accumulated over time as individual QA managers added criteria that matched their intuitions about good calls. The fix is to audit each field: does it score what the agent did, or what the customer experienced as a result? Fields that score agent behaviour - tone of voice, fluency, energy - are reviewer-style criteria. Fields that score customer outcomes - problem confirmed resolved, next steps clearly communicated, follow-up action agreed - are outcome criteria. Start by moving even one or two fields to outcome evidence; that shift alone often changes what coaching focuses on.
How do we score empathy and resolution quality without relying on human judgment?
Empathy scoring from transcripts works best when operationalised as specific language patterns: acknowledgement phrases, apology language, validation statements before redirection. Resolution quality is scored by checking whether the stated problem was addressed before the call ended - a yes/no based on whether the customer's issue appears resolved in the closing exchange or confirmed resolution language. Both become deterministic when the evidence standard is specific enough. The key is defining what the transcript must contain for each field to score positively, not asking the evaluation to make a general judgment about call quality.
Do we still need any manual review at all, or can we go fully automated?
Manual review still matters for two specific purposes: rubric calibration and complex judgment calls. Calibration means reviewing low-confidence scores and edge cases to refine the evaluation logic as call patterns evolve. Complex judgment calls are cases where the transcript evidence is ambiguous or context the rubric can't capture is relevant - calls with potential regulatory implications, unusual complaint patterns, or circumstances where the agent's decision was defensible despite not matching the rubric expectation. Expect manual review to settle around 2–3% of calls, focused on these cases rather than general coverage.
What structured outputs should we store so coaching and reporting stay consistent?
Store the field-level outputs as typed values - booleans for compliance and adherence checks, scores for quality dimensions - alongside the supporting quote for each field. The quote is what makes the score actionable: a coaching conversation built on “your empathy score was 2” produces a different outcome than one built on “here's the specific moment where acknowledgement was missing and what the customer said next.” Store these at the call level, linked to the agent and date, so you can trend fields across time, compare cohorts, and identify whether coaching is moving the scores you trained it to move.
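A minimal sketch of that storage shape, assuming a relational store and illustrative table and column names:

import sqlite3

conn = sqlite3.connect("qa_scores.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS call_scores (
        call_id   TEXT,
        agent_id  TEXT,
        call_date TEXT,     -- ISO date so fields can be trended over time
        field     TEXT,     -- e.g. 'empathy_demonstrated'
        value     INTEGER,  -- 0/1 for boolean checks, 1-5 for scored dimensions
        evidence  TEXT      -- supporting transcript quote, if any
    )
""")

# Example trend: monthly average for one field per agent, to check whether
# coaching is moving the score it was meant to move.
trend = conn.execute("""
    SELECT agent_id, substr(call_date, 1, 7) AS month, AVG(value) AS avg_score
    FROM call_scores
    WHERE field = 'empathy_demonstrated'
    GROUP BY agent_id, month
    ORDER BY agent_id, month
""").fetchall()

One row per field per call keeps booleans and scores in the same table while still allowing per-field trends, cohort comparisons, and evidence lookup for individual coaching conversations.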