
Conversation Intelligence for Developers: Don't Build a Fragile Pipeline, Don't Buy a Black Box

Alex Handsaker · 8 min read

Engineering teams that decide to add call intelligence to a product typically reach for one of two options: build an LLM pipeline internally, or buy a conversation intelligence platform and use its API. Both options have failure modes that show up in production well after the initial build. The underlying problem in both cases is the same: the integration is fragile or uncontrolled. Model quality is rarely what breaks things. Structure is.

The signal problem is specific. You need to reliably extract and measure what buyers actually understood during a call: whether they confirmed a timeline, articulated a specific pain, named a decision process. If the output format isn't stable, or the fields measure rep behaviour instead of buyer understanding, the integration produces noise you can't act on. The question isn't whether the AI can read a transcript. It's whether it can return a field that holds its meaning run after run, so your downstream systems don't break and your coaching doesn't drift.

What “API-first” actually means for conversation intelligence

An API-first conversation intelligence layer has three defining properties. First, one integration surface: transcript in, structured signals out. No bespoke NLP pipeline to maintain, no transformer to fine-tune. You send a POST with a transcript and a Kit code; the API processes it and returns structured JSON. Second, deterministic output you can version and test: the same Kit against the same transcript returns the same fields, in the same shape, every time. You can write assertions against the output in a standard CI pipeline and fail the build if field types shift or required fields go missing. Third, control over processing paths: async mode for batch workloads where the product flow doesn't wait on evaluation, and sync mode for real-time use cases where the next step needs the output before it can continue.

API-first CI gives developers one surface: transcript and Kit in, typed JSON out.
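To make that surface concrete, here is a minimal sketch in Python with the requests library. The /v1/runs path is the endpoint described later in this post; the base URL, auth header, and request body field names are illustrative assumptions, not the documented contract (the developer quickstart has the real request format).

```python
import requests

# Illustrative only: base URL, auth header, and body field names are assumptions.
API_BASE = "https://api.semarize.com"  # hypothetical base URL
API_KEY = "sk-..."                     # your API key

def evaluate_transcript(kit_code: str, transcript: str) -> dict:
    """Send one transcript against one Kit and return the structured output."""
    response = requests.post(
        f"{API_BASE}/v1/runs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"kit": kit_code, "transcript": transcript},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()  # typed JSON: one named output per Brick
```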

Why the two default paths break in production

The internal LLM pipeline fails on structure. An LLM pipeline that ingests transcripts and returns prose analysis gives you insight but not schema. Output fields change between model versions. A freeform response that contained the signal you needed last month may structure it differently this month. The rest of your stack (dashboards, CRM field updates, scoring triggers) breaks when the output format shifts. The pipeline looks solid in development and becomes a maintenance burden in production.

The platform API fails on control. Most CI platforms return structured data within the constraints of their own data model: their call topics, their scoring dimensions, their deal signals. Their fields typically measure what the rep did: talk ratio, question count, framework adherence. These are behaviours the evaluation can detect easily. They don't predict deal outcomes reliably because they measure the rep's inputs, not whether the buyer understood anything as a result. You can't define your own extraction schema or add the specific fields your product needs. The API is real, but it's a read API on top of someone else's model, built for their use case, not yours. The evaluation contract framing covers why the schema is the variable that determines whether CI produces useful signals at all.

The production failure mode is usually structure and control, not model capability.

The integration that engineering can actually ship

For engineers embedding call intelligence, the integration pattern that works in production is: assemble the Kit (the collection of specific questions and typed outputs you need), then send transcripts to the API and route the structured JSON into your product's data layer. In Semarize, Kits are built in the app; the evaluation design is separate from the API call. The developer integration is a single POST to /v1/runs with the Kit code and the transcript as input. The API handles the evaluation and returns a consistent JSON structure: one named output per Brick in the Kit, each with a typed value, a confidence score, a reason, and supporting evidence from the transcript.
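Sketched as a Python dict, the response for a two-Brick Kit might look like the example below. The per-output fields (value, confidence, reason, evidence) are the ones described above; the Brick names and example values are hypothetical.

```python
# Illustrative response for a Kit with two Bricks; Brick names and values are
# hypothetical, the per-output fields follow the structure described above.
run_output = {
    "timeline_confirmed": {
        "value": True,                       # boolean Brick
        "confidence": 0.92,
        "reason": "Buyer committed to a Q3 rollout decision.",
        "evidence": "We need this live before the Q3 kickoff.",
    },
    "pain_specificity": {
        "value": 4,                          # score Brick
        "confidence": 0.81,
        "reason": "Buyer quantified the manual reporting cost.",
        "evidence": "My team loses about ten hours a week stitching reports.",
    },
}

# Downstream routing is plain field access on typed values.
if run_output["timeline_confirmed"]["value"]:
    print("Timeline confirmed:", run_output["timeline_confirmed"]["evidence"])
```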

Processing is available in both async and sync modes. Async (the recommended path for production) returns a run ID immediately and delivers results to a webhook or poll endpoint when evaluation completes. Sync blocks and returns the full output inline, falling back to async if it exceeds the timeout. The signal routing is standard: connect Brick outputs to Zapier, Make, n8n, or Clay to populate CRM fields, trigger coaching tasks, or feed a data warehouse row. The complete integration surface is one API call per transcript; everything downstream is typed JSON fields and standard routing logic. The AI evaluation use case covers the production rollout pattern in more detail.
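A sketch of the async path, reusing the API_BASE and API_KEY from the earlier example. The run-status endpoint, the status value, and the response field names here are assumptions for illustration, not the documented API.

```python
import time
import requests

# Hypothetical endpoints and field names; the real paths are in the API docs.
def submit_async(kit_code: str, transcript: str) -> str:
    resp = requests.post(
        f"{API_BASE}/v1/runs",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"kit": kit_code, "transcript": transcript, "mode": "async"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]  # returned immediately; evaluation runs in the background

def poll_run(run_id: str, interval: float = 5.0) -> dict:
    """Poll until the run completes, then return the structured outputs."""
    while True:
        resp = requests.get(
            f"{API_BASE}/v1/runs/{run_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "completed":  # assumed status value
            return body["outputs"]
        time.sleep(interval)
```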

For teams who want to build and iterate on Kits using AI, MCP is the second integration route. Connect Claude Desktop, Cursor, or a custom agent as a machine agent with draft and inspection permissions. From that context, Claude can list existing Bricks, propose new Kit structures, and draft evaluations for a new call type or rubric. Publishing remains human-governed in the app: MCP is a building and review surface, not a publishing mechanism. The two routes are complementary; use MCP to build, use the REST API to run.

Stop judging demos, start judging control

The questions that matter for engineering evaluation are different from the questions that matter for a sales buyer. Skip the demo. Ask the vendor to run the same transcript through the same evaluation twice and compare the outputs. If field values differ on re-run, the evaluation isn't deterministic and you can't write reliable tests against it. An evaluation that isn't deterministic produces scorecards that can't be trusted for trend analysis, coaching decisions, or downstream automation. The call-level noise exceeds the signal you're trying to surface.
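That re-run check is a few lines of code. A sketch reusing the evaluate_transcript helper from earlier; the Kit code is hypothetical, and the comparison logic is the point.

```python
# Run the same transcript through the same Kit twice and diff the outputs.
# If any field value differs between runs, the evaluation is not deterministic.
first = evaluate_transcript("discovery-kit-v1", transcript)   # hypothetical Kit code
second = evaluate_transcript("discovery-kit-v1", transcript)

diffs = {
    name: (first[name]["value"], second[name]["value"])
    for name in first
    if first[name]["value"] != second[name]["value"]
}
assert not diffs, f"Non-deterministic fields: {diffs}"
```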

The evaluation contract question is the other critical one: what happens when the underlying model updates? If schema changes are silent rather than versioned, the API will create maintenance burden in production: the same maintenance burden you were trying to escape. Start with the smallest viable loop in production: one Kit, a handful of fields, one call type. Validate that outputs are landing correctly in their downstream destination. Expand the schema once the loop is stable. The common failure mode is over-scoping the initial extraction, trying to capture twenty fields before validating that the first five are reaching where they should.

Structured CI output is useful to engineering because it can be tested like any other contract.

How Semarize is built for developers

Semarize is designed around this integration model. A Kit is a collection of Bricks; each Brick asks one specific question and returns one typed output: boolean, score, text, or categorical label. The Kit is built in the Semarize app (or drafted by an AI agent via MCP and published by the developer). Once built, the developer integration is a single API call: POST /v1/runs with the Kit code and transcript. The response is consistent JSON, one named output per Brick, each with value, confidence, reason, and evidence. Because the Kit is versioned in the app, model updates don't change field names, types, or output structure. Your downstream systems keep working. Your CI assertions keep passing.
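Because the output structure is fixed, you can model it once in your own codebase and write everything downstream against that definition. A sketch using Python's typing module; these type names are ours, not part of the Semarize API.

```python
from typing import TypedDict, Union

# Our own types, not shipped by the API: they mirror the documented output shape
# so downstream code and tests share one definition of what a Brick returns.
BrickValue = Union[bool, float, str]  # boolean, score, text, or categorical label

class BrickOutput(TypedDict):
    value: BrickValue
    confidence: float
    reason: str
    evidence: str

RunOutput = dict[str, BrickOutput]  # one named output per Brick in the Kit
```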

The signal architecture is buyer-side by design. Bricks ask about what the buyer did and said (whether they confirmed a timeline, articulated specific pain, stated a decision process) and not just what the rep did. That's the compounding benefit of the output structure: not just stability, but signal quality that's worth building on. See the developer quickstart for the full request format, processing modes, and MCP setup.

Semarize returns typed structured JSON from every transcript against a versioned Kit. One API call, deterministic output, testable schema.

Start building →

Common questions

What signals should we extract from calls if we care about buyer understanding?

Start with buyer-side signals that directly reflect comprehension and commitment: pain articulation specificity (score), next step agreed (yes/no with supporting quote), timeline confirmed (yes/no with date), competitors mentioned (list), and decision criteria stated (yes/no). These tell you what the buyer understood and committed to, not what the rep said. They're also consistently extractable because buyers tend to state them explicitly when the right questions are asked. Avoid leading with rep-behaviour fields (talk ratio, question count) as your primary signals; they're inputs, not outcomes, and correlate weakly with deal results.
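Expressed as an extraction schema, that starting set might look like the sketch below; the Brick names and output types are illustrative, not a shipped Kit.

```python
# Illustrative starting Kit: buyer-side signals and the output type each Brick returns.
STARTER_KIT = {
    "pain_specificity": "score",            # how concretely the buyer articulated pain
    "next_step_agreed": "boolean",          # supporting quote lands in the evidence field
    "timeline_confirmed": "boolean",        # stated date lands in the evidence field
    "competitors_mentioned": "text",        # named alternatives, if any
    "decision_criteria_stated": "boolean",
}
```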

How do schema-stable JSON outputs prevent scorecard drift?

When the evaluation Kit is versioned (field names, types, and output structure locked until you explicitly change the Kit), outputs don't change when the underlying model updates. You can run tests against a held set of calls and assert that outputs haven't changed unexpectedly. Freeform LLM output doesn't give you this: the response structure depends on the prompt and the model version, neither of which you fully control across releases. Schema stability is what makes conversation intelligence testable in a standard CI pipeline.

Can we run conversation intelligence incrementally without slowing down our product?

Yes. The standard pattern is async processing: the transcript is submitted and the API returns a run ID immediately. Processing happens in the background; results are delivered to your webhook or available via polling when evaluation completes. The product flow doesn't wait. Typical processing time is seconds to a few minutes depending on call length and Kit complexity. For real-time use cases (in-call scoring, live guidance), sync mode is available: results are returned inline when evaluation completes, falling back to async on timeout.
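A minimal webhook receiver for the async path, sketched with Flask; the payload shape (run ID plus one named output per Brick) follows the structure described above, and the exact field names are assumptions.

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/webhooks/semarize")
def handle_run_completed():
    # Assumed payload shape: run ID plus one named output per Brick.
    payload = request.get_json()
    run_id = payload["run_id"]
    outputs = payload["outputs"]

    # Route typed fields into your own data layer; this is the only custom code.
    save_call_signals(run_id, outputs)   # your persistence function
    return "", 204

def save_call_signals(run_id: str, outputs: dict) -> None:
    ...  # write to CRM, warehouse, or coaching queue
```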

How do we write tests for conversation intelligence outputs in a CI pipeline?

Maintain a held set of calls (ten to twenty transcripts with known characteristics) and run your Kit against them in your standard CI pipeline. Assert on field types (a boolean Brick should never return a string), required field presence (fields that should always produce a value should never return null), and value distributions (a score Brick should stay within its expected range). If any assertion fails after a Kit update, you've caught the regression before production. This is the deterministic testing that freeform LLM output doesn't support; the structured schema is what makes the assertions writable.
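A sketch of those assertions with pytest, reusing the evaluate_transcript helper from earlier in the post; the transcript paths, Kit code, and expected schema are placeholders for your own held set.

```python
import pathlib
import pytest

# Held set: transcripts with known characteristics, checked into the repo.
HELD_SET = sorted(pathlib.Path("tests/held_transcripts").glob("*.txt"))

# Expected contract for each Brick: (type of value, allowed range or None).
EXPECTED_SCHEMA = {
    "timeline_confirmed": (bool, None),
    "pain_specificity": ((int, float), (0, 5)),
}

@pytest.mark.parametrize("path", HELD_SET, ids=lambda p: p.stem)
def test_kit_output_contract(path):
    outputs = evaluate_transcript("discovery-kit-v1", path.read_text())
    for name, (expected_type, value_range) in EXPECTED_SCHEMA.items():
        assert name in outputs, f"missing required field {name}"
        value = outputs[name]["value"]
        assert isinstance(value, expected_type), f"{name} returned {type(value)}"
        if value_range is not None:
            low, high = value_range
            assert low <= value <= high, f"{name} out of range: {value}"
```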

How do we evaluate a CI vendor's API beyond the demo?

Run your own transcripts through the API, not the vendor's curated examples. Check whether the output schema matches what you need: are field types correct, are required fields always present, does null handling work as expected? Run the same transcripts twice and compare outputs; if field values differ on re-run, the evaluation isn't deterministic. Ask what happens to output schema when the underlying model updates: if changes are silent rather than versioned releases, you'll inherit maintenance burden in production. Also assess the processing modes available and the fallback behaviour when sync processing exceeds its timeout.
