
Conversation Intelligence Doesn't Fail on Calls. It Fails on Knowledge.

7 min read · Alex Handsaker

Conversation intelligence has evolved a lot from its early iterations - the first generation of tools was built on meeting bots, transcription tools, and machine learning classifiers, not large language models. Those systems couldn't actually read a conversation - they could recognise patterns in it. Talk ratio, question rate, keyword frequency, filler word detection, acoustic sentiment scoring. The metrics weren't arbitrary: they were the outer limit of what the technology could reliably detect at scale.

Those metrics became the standard. A generation of sales teams built coaching programmes around talk ratios and question counts, not because those things were the best measure of a good conversation, but because they were the only ones available. The technology shaped the definition of quality, and the definition stuck.

What ML-based CI could and couldn't do

Machine learning classifiers are good at one thing: recognising patterns in data they've been trained on. Applied to call recordings, this means detecting speaking time, counting question inflections, flagging keywords, and scoring sentiment from tone and pace. These are real signals - they correlate with good calls often enough to be worth tracking. The problem is that a classifier cannot understand a conversation; it can only measure features of it.

That distinction mattered a lot in practice. A rep could ask six open questions, hit the right talk ratio, and avoid every flagged keyword while the buyer left the call with a complete misunderstanding of the product. The classifier would score it well. The conversation didn't work. But the scorecard had no way of knowing that, because it was measuring observable features rather than conversational outcomes.

Coaching built on those scores improved feature compliance, not conversation quality. Reps learned to ask more questions, talk less, say the right words at the right time - and the underlying problem of whether buyers actually understood anything remained invisible.

ML-era conversation intelligence could measure surface features, not buyer understanding.

What LLMs unlocked

Large language models changed the game. A model that can read and understand language can answer questions that ML classifiers never could: did the buyer articulate a specific problem in their own words? Did they demonstrate understanding of the pricing model? Did they commit to a next step, or just agree to think about it? These are semantic questions - they require reading the meaning of what was said, not just the features of how it was said.

For the first time, it became possible to evaluate a call against outcomes rather than behaviours. Not “did the rep ask a question?” but “did the buyer confirm they understood the solution?” Not “were next steps mentioned?” but “was a next step committed, with a specific owner and date?” The shift from measuring inputs to measuring outputs - from rep behaviour to buyer understanding - became technically achievable.

The possibilities expanded dramatically. Qualification depth, competitive handling, pricing comprehension, stakeholder alignment - these are all questions a well-designed LLM evaluation can answer reliably. The tooling available to revenue teams is genuinely different now, and the gap between what early CI could measure and what modern evaluation can measure is substantial.

The new risk: model knowledge

But LLMs introduced a risk that ML classifiers never had: they come with their own knowledge baked in. A classifier detects patterns; an LLM interprets meaning. And to interpret meaning, it draws on everything it learned during training - which means it has views about what good looks like, what qualified means, what pricing should be, what a strong discovery call produces. Those views are based on the vast body of text the model trained on: sales books, blog posts, call transcripts from across the internet, generic frameworks for how B2B sales is supposed to work.

This is model knowledge, and relying on it is a risky game. The model's understanding of “good discovery” is a generalisation across thousands of different sales motions, product types, and buyer contexts. It has nothing to do with your qualification criteria, your pricing structure, or your definition of what a qualified opportunity looks like in your market. When the system evaluates a call against model knowledge rather than your knowledge, it produces outputs that sound plausible and authoritative - but may be systematically wrong for your business.

The failure is subtle precisely because LLMs are fluent. A classifier that can't assess buyer understanding simply doesn't report on it. An LLM will report on it confidently, based on what it infers from training - and the inference might be wrong in ways that take months to notice. A rep who is flagged for weak qualification might be following your process correctly; the model just disagrees because it's comparing against a different standard. A deal scored as low-risk might be high-risk by your criteria, but the model doesn't know your criteria.

Why deal-level dashboards make this worse

Most CI platforms organise their outputs around deals: which calls happened, which signals were detected, which opportunities are at risk. This framing is useful for pipeline visibility, but it makes the model knowledge problem harder to see.

When the output is a deal board or a risk summary, the underlying inference is invisible. A deal flagged as at risk looks like a data point, not like an opinion formed from a generic model assumption. The platform surfaces a signal; it doesn't surface the fact that the signal was formed without access to your rate card, your ICP definition, or your deal desk rules. Those things exist in documents in your systems - not in the model's training data - and the dashboard never tells you which one the evaluation was based on.

Domain knowledge vs model knowledge

Model knowledge is the AI's trained understanding of sales in general. Domain knowledge is what your organisation specifically knows: your pricing tiers, your ICP definition, your qualification playbook, your competitive positioning, your compliance requirements. These are two different things, and conflating them is where most modern CI implementations go wrong.

A model might know that budget is a qualification signal, but it cannot assess whether the number the buyer mentioned represents actual alignment with your pricing model. It knows a competitor mention matters, but it cannot evaluate how your product compares to the specific competitor raised - or whether the rep handled it correctly according to your battle cards. It can identify that a next step was mentioned, but it cannot confirm whether the proposed next step matches your deal desk requirements for advancing a deal.

The gap between model knowledge and domain knowledge is where evaluations go quietly wrong. Not wrong in a way that's immediately visible - the scores look reasonable, the coaching summaries read sensibly - but wrong in the sense that the system is applying one set of standards when your business runs on another.

Fluent evaluation is not the same as evaluation grounded in your operating rules.

What knowledge grounding actually means

Knowledge grounding means attaching your organisation's documents to the evaluation - pricing sheets, ICP criteria, qualification playbooks, approved product claims - so that scoring runs against your reality rather than a generic model assumption. When the system evaluates whether pricing was discussed accurately, it checks against your rate card. When it assesses qualification depth, it checks against your criteria.

This matters because the signals you actually want - “did the buyer demonstrate understanding of our pricing model?”, “were the right qualification criteria addressed?”, “did the rep accurately represent the product for this buyer's use case?” - can only be answered reliably with that grounding in place. Without it, the system is pattern-matching to its training rather than checking against your definitions of what good looks like.

When your pricing changes, you update the document. The evaluation logic stays the same; the accuracy improves because the knowledge the system retrieves is current and specific, rather than inferred from training data that knows nothing about your business.
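
To make the mechanism concrete, here is a minimal sketch of a grounded check in Python. The document path, the question, and the call_model callable are illustrative placeholders for whatever retrieval and LLM client your stack uses - not any particular vendor's API.

```python
# Minimal sketch of a knowledge-grounded check. The rate-card path, the
# question, and the `call_model` callable are illustrative placeholders,
# not a specific product's API.
from pathlib import Path
from typing import Callable


def grounded_check(
    transcript: str,
    question: str,
    grounding_doc: str,
    call_model: Callable[[str], str],
) -> str:
    """Ask one narrow question, using the attached document as the standard."""
    reference = Path(grounding_doc).read_text()
    prompt = (
        "Evaluate the sales call transcript below.\n"
        "Use ONLY the reference document as the definition of what is correct. "
        "Do not rely on general assumptions about pricing or qualification.\n\n"
        f"Reference document:\n{reference}\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\n"
        "Answer yes or no, and quote the transcript lines that support it."
    )
    return call_model(prompt)


# Example usage (hypothetical document path and model client):
# verdict = grounded_check(
#     transcript,
#     "Did the rep quote pricing consistent with the current rate card?",
#     "knowledge/rate_card.md",
#     call_model=my_llm_client,
# )
```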

Grounded evaluation checks the transcript against your documents and returns structured evidence.

Closing the knowledge gap

The practical step is to audit your current scoring rubrics and identify where they assume domain knowledge that hasn't actually been supplied. Which criteria could only be evaluated accurately if the system knew your actual pricing? Which qualification signals require your specific definition of qualified? Where is the system currently inferring from model knowledge instead of checking against your rules?
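
One way to run that audit is to tag each criterion with whether it depends on domain knowledge and whether a grounding document is actually attached. A rough sketch, with made-up criterion names and document paths:

```python
# Sketch of a rubric audit. Criterion names and document paths are examples.
rubric = {
    "talk_ratio": {"needs_domain_knowledge": False},
    "open_question_count": {"needs_domain_knowledge": False},
    "pricing_discussed_accurately": {
        "needs_domain_knowledge": True,
        "grounding_document": None,  # gap: no rate card attached yet
    },
    "qualification_depth": {
        "needs_domain_knowledge": True,
        "grounding_document": "knowledge/qualification_playbook.md",
    },
}

ungrounded = [
    name
    for name, spec in rubric.items()
    if spec["needs_domain_knowledge"] and not spec.get("grounding_document")
]
print("Evaluated from model knowledge only:", ungrounded)
# -> ['pricing_discussed_accurately']
```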

Once those gaps are identified, the work is to build the knowledge base that fills them - one document per domain area, structured for retrieval, attached to the evaluation schema. The question you're trying to answer then shifts: from “did the rep follow the framework?” to “did the buyer demonstrate understanding of the problem, the product, and the next step?” - grounded in what your business defines those things to mean.

Teams that wire their domain knowledge into the evaluation layer get back fundamentally different signal than teams that rely on model knowledge. Scoring becomes specific rather than generic, comparable over time, and connected to the standards your business actually runs on. Coaching connects to real gaps in buyer understanding rather than to what a model inferred from a generalised view of sales. That's the difference between a CI programme that steers decisions and one that produces plausible-sounding noise.

How Semarize grounds conversation intelligence with knowledge

Semarize is built around the problem this article describes. Rather than running evaluation against model assumptions, it gives you a structured schema you control - built from Bricks: discrete evaluation units, each asking one specific question and returning one consistent output. You define what you want to measure; the model evaluates against that definition, not against its own.

Knowledge grounding is the layer that makes those evaluations accurate rather than plausible. Attach your pricing sheets, ICP criteria, qualification playbooks, and competitive battle cards to a Kit, and every Brick in that Kit evaluates against your documents rather than against what the model assumes is generally true. When your pricing changes, you update the document - the evaluation logic stays the same, and the accuracy stays current.
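
To make the shape concrete - with class and field names that are purely illustrative, not the actual Semarize schema - a Brick-style unit and a Kit that groups it with grounding documents might be modelled like this:

```python
# Hypothetical sketch of Brick-style evaluation units grouped into a Kit with
# grounding documents. Class and field names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Brick:
    name: str
    question: str     # the single thing this unit evaluates
    answer_type: str  # e.g. "boolean", "enum", "scale_1_to_5"


@dataclass
class Kit:
    name: str
    bricks: list[Brick]
    grounding_documents: list[str] = field(default_factory=list)


discovery_kit = Kit(
    name="discovery_v2",
    bricks=[
        Brick("problem_articulated",
              "Did the buyer describe a specific problem in their own words?",
              "boolean"),
        Brick("pricing_comprehension",
              "Did the buyer demonstrate understanding of the pricing model?",
              "boolean"),
    ],
    grounding_documents=["knowledge/rate_card.md", "knowledge/icp_criteria.md"],
)
```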

The output is structured data: consistent fields, confidence scores, and evidence spans pointing to the exact lines in the transcript that support each answer. That structure is what makes the results comparable across reps and over time, and what makes it possible to connect scoring to deal outcomes rather than to scorecard compliance.
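
A grounded result for one of those Bricks could then look something like this - again an illustrative shape, not a documented response format:

```python
# Illustrative result: a consistent answer field, a confidence score, and
# evidence spans pointing back to the transcript and the grounding document.
result = {
    "brick": "pricing_comprehension",
    "answer": False,
    "confidence": 0.82,
    "evidence": [
        {
            "transcript_lines": "142-147",
            "quote": "So it's a flat fee per seat, right?",
        },
    ],
    "grounded_against": "knowledge/rate_card.md",
}
```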

For teams moving off ML-era metrics - or off freeform LLM prompts that drift - Semarize provides the evaluation layer that bridges the gap: semantic understanding grounded in your knowledge, not in a generic model's best guess.


Common questions

Why did early conversation intelligence tools score calls on ratios and question counts?

Early CI platforms were built on machine learning classifiers, which detect patterns rather than understand meaning. Talk ratio, question rate, and keyword frequency were the signals those systems could measure reliably. LLMs changed what's technically possible - but many scoring frameworks inherited from the ML era haven't been updated to reflect it.

If LLMs can understand language, why aren't they automatically measuring buyer understanding correctly?

LLMs can read meaning, but they evaluate against their training data unless you supply specific domain knowledge. Without your pricing, ICP criteria, and qualification playbook attached to the evaluation, the model infers what “good” looks like from a generalised view of B2B sales - not from your business. The result is outputs that sound authoritative but may be systematically wrong for your specific context.

Which knowledge bases should we attach first?

Start with the documents that define what qualified looks like for your organisation: your ICP definition, your qualification criteria, and your pricing model. These are the areas where model inferences diverge most from your actual standards - and where inaccurate scoring does the most damage to coaching quality and forecast accuracy.

Where does Semarize fit if we already have Gong or Chorus?

Gong and Chorus produce call summaries, activity dashboards, and rep behaviour scores - largely based on the pattern-detection approach that predates knowledge-grounded evaluation. Semarize is an evaluation API: you define the schema, attach your domain knowledge, send transcripts, and get structured data back. Teams use it to build scoring that goes beyond what pattern-based tools support - buyer understanding signals, knowledge-grounded qualification checks, and consistent outputs that feed directly into CRM fields or coaching workflows.
