Which AI Models Know Sales? A Synthetic Call Benchmark

salesevals.com · Evaluated Apr 30, 2026

Leaderboard

Same 92.0 avg up top. Range is the delta. The top averages are tied when rounded. The winner had the tightest score range.

#	Model	Range	Span	Avg	Cost
1	gpt-5.4 high (leader)	87-96	9 pt	92.0	$0.07
2	gpt-5.5 none	76-96	20 pt	92.0	$0.17
3	gpt-5.5 xhigh	76-97	21 pt	92.0	$0.18
4	gpt-5.4 xhigh	82-96	14 pt	92.0	$0.07
5	gpt-5.5 high	72-97	25 pt	91.7	$0.18

Live leaderboard: salesevals.com

Everyone is running coding benchmarks. I wanted to know which models actually know sales.

So I built salesevals.com, a benchmark where models read sales calls and write coaching notes. The first run has 25 synthetic B2B calls, 18 model configurations, and 450 judged coaching outputs.

The winner was gpt-5.4 high, barely. I shared the launch thread on X.

Sales coaching is hard to evaluate because a plausible answer is cheap. Any decent model can summarize a call, compliment the seller, name two generic risks, and sound useful.

I wanted to test sales instincts. Can the model notice that the rep did shallow discovery? Can it catch a technical miss? Can it praise a strong sales engineer without inventing fake criticism? Can it tell the difference between a weak objection answer and a pretty good one?

That requires ground truth.

Start with the answer key

Most sales call evals accidentally test whether the model can write a polished recap.

You give it a transcript. It writes a coaching note. Then you eyeball the note and ask, "does this sound right?" That is useful for product demos, but it is a soft eval. A confident model can pass by sounding like a sales manager.

The trick was generating the hidden answer key before the transcript exists.

Each call starts with a scenario:

seller company
buyer company
call type
duration
target turn count
quality profile
persona notes

The quality profile is flawed, excellent, or mixed. A flawed call might have shallow BANT discovery, bad research, poor objection handling, a missed technical point, or a seller who talks too much. An excellent call might have strong account prep, open-ended discovery, a great solutions consultant, or a crisp mutual action plan.

The benchmark needs both. If every call is flawed, models learn to criticize everything.

Before writing the transcript, an LLM designs the hidden coaching truth. I call each hidden item a needle. A needle can be a flaw or a strength. It includes the behavior to look for, where it should appear, what evidence counts, what evidence should be rejected, and what the coaching implication is.

The schema looks roughly like this:

import { z } from 'zod'

// One hidden coaching item (a "needle") that the judge scores against.
export const coachingNeedleSchema = z.object({
  kind: z.enum(['flaw', 'strength']),
  category: z.enum([
    'discovery',
    'research',
    'technical_knowledge',
    'objection_handling',
    'communication_style',
    'qualification',
    'executive_alignment',
    'next_steps',
    'value_alignment',
    'customer_enablement',
  ]),
  label: z.string(),
  evaluatorNeedle: z.string(),
  transcriptPlacement: z.string(),
  subtlety: z.enum(['obvious', 'moderate', 'subtle']),
  expectedEvidence: z.array(z.string()),
  antiEvidence: z.array(z.string()),
  coachingImplication: z.string(),
})

Now the judge can score behavior, evidence, and false positives instead of polish.

The tested model never sees this answer key. It only sees the visible call package. The judge sees both.
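To make the schema concrete, here is what one needle might look like for a flawed discovery call. This is only an illustration: every value in `exampleNeedle` is invented, and the `CoachingNeedle` type is a plain TypeScript mirror of the Zod schema so the sketch runs without dependencies.

```typescript
// Plain TypeScript mirror of coachingNeedleSchema (assumption: kept in sync by hand here).
type NeedleCategory =
  | 'discovery'
  | 'research'
  | 'technical_knowledge'
  | 'objection_handling'
  | 'communication_style'
  | 'qualification'
  | 'executive_alignment'
  | 'next_steps'
  | 'value_alignment'
  | 'customer_enablement'

interface CoachingNeedle {
  kind: 'flaw' | 'strength'
  category: NeedleCategory
  label: string
  evaluatorNeedle: string
  transcriptPlacement: string
  subtlety: 'obvious' | 'moderate' | 'subtle'
  expectedEvidence: string[]
  antiEvidence: string[]
  coachingImplication: string
}

// Hypothetical needle: all values below are invented for illustration.
const exampleNeedle: CoachingNeedle = {
  kind: 'flaw',
  category: 'discovery',
  label: 'Shallow BANT discovery',
  evaluatorNeedle:
    'The seller asks about budget and timeline but never asks why the project matters now.',
  transcriptPlacement: 'First third of the call, during discovery',
  subtlety: 'moderate',
  expectedEvidence: [
    'Seller moves from a budget question straight into the demo',
    'Buyer hints at an internal deadline and the seller does not follow up',
  ],
  antiEvidence: [
    'Generic advice to "ask more questions" with no cited moment from the call',
  ],
  coachingImplication:
    'Ask at least one follow-up about business impact before moving to the demo.',
}
```

The `antiEvidence` field is what lets a judge reject coaching that sounds right but points at nothing specific in the call.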

The personas carry the truth

Once the hidden answer key exists, I use it to write the personas.

Each persona gets a role, backstory, objectives, objections, speaking style, private call behavior, and coaching signals. The most important field is callBehavior. That is where a label becomes a person.

If the call has a seller flaw, a seller persona naturally carries that tendency. If the call has a seller strength, a seller persona has the prep, skill, or instinct needed to show it. Buyer personas create the moments needed for the eval: skepticism, technical pushback, budget pressure, executive priorities, confusion, urgency.

A seller who talks too much does it because their persona behaves that way. A buyer who is confused about a technical point creates the moment for a strong solutions consultant to help. A CFO concern only half-resolves because the buyer keeps pressure on it.

That is when the calls started to feel real.
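A persona record along these lines might look like the sketch below. The field names follow the list above, but the types, the `side` field, and the example values are assumptions made for illustration, not the generator's actual shape.

```typescript
// Field names follow the persona description; the types are assumptions.
interface Persona {
  side: 'seller' | 'buyer' // hypothetical field, added to make the example readable
  role: string
  backstory: string
  objectives: string[]
  objections: string[]
  speakingStyle: string
  callBehavior: string // private: how the hidden flaw or strength shows up in speech
  coachingSignals: string[]
}

// Hypothetical seller persona that carries a "talks too much" flaw.
const talkativeSeller: Persona = {
  side: 'seller',
  role: 'Account Executive',
  backstory: 'Recently promoted from an SDR role and eager to prove product knowledge.',
  objectives: ['Book a technical deep dive'],
  objections: [],
  speakingStyle: 'Fast, enthusiastic, long answers',
  callBehavior:
    'Answers every question with a feature walkthrough and rarely pauses for the buyer.',
  coachingSignals: ['Long seller turns after short buyer questions'],
}
```

The point of `callBehavior` is that the flaw is never stated in the transcript; it only shows up in how the persona talks.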

Write the call one turn at a time

My first transcripts were bad.

Every message was the same length. People skipped real introductions. Closings sounded like a memo. Nobody said the boring useful things people say on actual calls, like "great meeting y'all, I'll send over a Slack message with next steps."

I changed the generation shape.

Each transcript turn is its own LLM call. The model gets the call context, company research, personas, hidden eval design, allowed speaker IDs, recent turns, and a small naturalness context. It chooses the next speaker and writes only that person's next spoken contribution.

That naturalness context tracks:

conversation stage
recent turn lengths
how many short turns have happened
whether the call is in opening, discovery, middle discussion, or closing

Then the prompt asks for the parts that make a transcript feel like a call:

true introductions
uneven turn lengths
quick acknowledgements
brief handoffs
buyer caveats
normal wrap-up language
concrete next-step logistics
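The naturalness context described above is small enough to compute deterministically from the turns written so far. The following is a minimal sketch, not the generator's code: the type names, the word-count threshold, the window size, and the stage cut-offs are all assumptions.

```typescript
type CallStage = 'opening' | 'discovery' | 'middle' | 'closing'

interface NaturalnessContext {
  stage: CallStage
  recentTurnLengths: number[] // word counts of the most recent turns
  shortTurnCount: number // how many short turns have happened so far
}

// Assumed tuning values; the real thresholds are not given in the post.
const SHORT_TURN_WORDS = 8
const RECENT_WINDOW = 5

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length
}

// Map progress through the call to a stage (cut-offs are assumptions).
function stageFor(turnIndex: number, turnCount: number): CallStage {
  const progress = turnCount > 0 ? turnIndex / turnCount : 0
  if (progress < 0.1) return 'opening'
  if (progress < 0.4) return 'discovery'
  if (progress < 0.85) return 'middle'
  return 'closing'
}

function buildNaturalnessContext(
  turnTexts: string[],
  turnCount: number,
): NaturalnessContext {
  const lengths = turnTexts.map(countWords)
  return {
    // The next turn to write has index turnTexts.length.
    stage: stageFor(turnTexts.length, turnCount),
    recentTurnLengths: lengths.slice(-RECENT_WINDOW),
    shortTurnCount: lengths.filter((n) => n <= SHORT_TURN_WORDS).length,
  }
}
```

Passing this small object into each turn prompt gives the model a reason to vary length, for example to write a quick acknowledgement after several long turns.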

For a 40-turn call, that is at least 40 LLM calls just for the transcript. Add research, hidden eval design, personas, artifacts, coaching, judging, and retries, and a single benchmark case quickly becomes a workflow.

That is why the generator uses Vercel Workflow DevKit. The workflow can run for a while, save artifacts as it goes, and resume when one model call fails.
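The save-and-resume idea can be illustrated without any workflow engine. This is a plain TypeScript sketch, not the Workflow DevKit API: `generateWithResume`, the `saved` array, and the synchronous `generateTurn` callback are hypothetical stand-ins. Real model calls would be async, and durable storage would replace the array.

```typescript
// Hypothetical stand-in for durable storage: turns that finished stay in `saved`.
function generateWithResume(
  turnCount: number,
  saved: string[],
  generateTurn: (turnIndex: number) => string,
): string[] {
  // Start from the first turn that has not been saved yet.
  for (let turnIndex = saved.length; turnIndex < turnCount; turnIndex += 1) {
    // If generateTurn throws, earlier turns remain in `saved` for the next attempt.
    saved.push(generateTurn(turnIndex))
  }
  return saved
}
```

With a 40-turn call, a failure on turn 30 then costs one retried model call instead of 30 repeated ones.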

The high-level shape is simple:

export async function generateMockSalesCallWorkflow(
  input: CallGenerationInput,
) {
  'use workflow'

  const normalized = await normalizeInputStep(input)
  const research = await researchCompaniesStep(normalized)
  const evalDesign = await designCoachingEvalStep({
    input: normalized,
    research,
  })
  const personas = await draftPersonasStep(normalized, research, evalDesign)
  const transcript: TranscriptTurn[] = []

  for (let turnIndex = 0; turnIndex < normalized.turnCount; turnIndex += 1) {
    const turn = await generateTranscriptTurnStep({
      input: normalized,
      research,
      personas,
      evalDesign,
      turnsSoFar: transcript,
      turnIndex,
    })

    transcript.push(turn)
  }

  const { manifest } = await buildArtifactsStep({
    input: normalized,
    research,
    personas,
    evalDesign,
    transcript,
  })

  return persistCallStep(manifest)
}

Yes, this is a lot of model calls. A single prompt asking for "a realistic sales call" will overfit to whatever the model thinks a sales call sounds like. Turn-by-turn generation gives the transcript room to wander, recover, interrupt, clarify, and close.

Use LLMs for judgment

For this benchmark, the semantic steps belong to models.

Classification, routing, tagging, bucketing, grading, and coaching are judgment problems. Brittle string checks collapse good discovery into phrase matching. A model should read the situation and explain itself against a schema.

All the AI calls go through Vercel AI Gateway with the AI SDK. The shared helper calls generateText, requests structured output, and validates the response with Zod.

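import { generateText, Output } from 'ai'

// Excerpt from inside the shared helper: selectedModelId, system, prompt,
// temperature, schema, and tags come from the enclosing function.
const { output } = await generateText({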
  model: selectedModelId,
  system,
  prompt,
  temperature,
  output: Output.object({
    name: 'StructuredResponse',
    schema,
  }),
  providerOptions: {
    gateway: {
      tags: ['mocked-zoom', ...tags],
    },
  },
})

return schema.parse(output)

Deterministic code handles IDs, storage, retries, routing, and artifact shape. Models handle the semantic work.

I used that pattern for research synthesis, hidden eval design, persona drafting, call-type bucketing, coaching, and judging.

The app has no fake fallback data for missing AI Gateway credentials. If the key is missing, generation fails. Silent fallback data would poison the eval.
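A fail-fast check of that kind can be very small. The sketch below is an assumption about how it might look: `requireGatewayKey` is a hypothetical helper, and it assumes the key is provided in an environment variable named `AI_GATEWAY_API_KEY`.

```typescript
// Hypothetical helper. Assumption: the key is read from AI_GATEWAY_API_KEY.
function requireGatewayKey(env: Record<string, string | undefined>): string {
  const key = env['AI_GATEWAY_API_KEY']
  if (key === undefined || key.trim() === '') {
    // Throw instead of returning canned data, so a misconfigured run cannot
    // produce results that look real.
    throw new Error(
      'AI_GATEWAY_API_KEY is missing; refusing to generate with fallback data.',
    )
  }
  return key
}
```

In a Node process this would be called once at startup as `requireGatewayKey(process.env)`.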

The coach sees the call

After the generator creates a call, the coach workflow builds a visible prompt for each model configuration.

The coach sees:

setup
company research
participant context
speaker-labeled transcript

The hidden fields stay with the judge:

hidden ground truth
needle labels
evaluator notes
quality profile
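One way to enforce that split in code is an explicit allowlist when building the coach prompt input. This is a sketch under assumptions: `BenchmarkCase`, `VisibleCallPackage`, and `toVisibleCallPackage` are hypothetical names, and the field types are guesses based on the two lists above.

```typescript
// Hypothetical shape of a stored case: visible fields plus judge-only fields.
interface BenchmarkCase {
  setup: string
  companyResearch: string
  participantContext: string
  transcript: { speaker: string; text: string }[]
  // Judge-only fields below.
  hiddenGroundTruth: string
  needleLabels: string[]
  evaluatorNotes: string
  qualityProfile: 'flawed' | 'excellent' | 'mixed'
}

type VisibleCallPackage = Pick<
  BenchmarkCase,
  'setup' | 'companyResearch' | 'participantContext' | 'transcript'
>

function toVisibleCallPackage(fullCase: BenchmarkCase): VisibleCallPackage {
  // Copy fields one by one (an allowlist) so a hidden field added later
  // cannot leak into the coach prompt by default.
  return {
    setup: fullCase.setup,
    companyResearch: fullCase.companyResearch,
    participantContext: fullCase.participantContext,
    transcript: fullCase.transcript,
  }
}
```

An allowlist is safer here than deleting hidden keys from a copy, because forgetting to update a denylist would silently hand the coach part of the answer key.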

The prompt asks for useful sales coaching. It tells the model to cite evidence, avoid invented facts, prioritize the biggest coaching points, and allow great calls to be mostly positive.

That last part matters. A great call should earn mostly positive coaching. Sometimes the right answer is "this rep nailed it, keep doing these two things."

The judge sees the answer key

The judge workflow loads the hidden answer key, the visible call, and the coach output.

It scores whether the coach found each needle, partially found it, missed it, contradicted it, or invented unsupported claims. Then it produces an eight-axis scorecard:

overall score
needle recall
evidence grounding
false-positive control
prioritization
actionability
sales instinct
technical accuracy
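The per-needle verdicts and the scorecard could be typed roughly as follows. In the benchmark the judge model produces the scores; the `needleRecall` arithmetic here is only one plausible way to turn verdicts into a number, and the half credit for partial matches is an assumption.

```typescript
type NeedleVerdict = 'found' | 'partial' | 'missed' | 'contradicted'

// The eight axes listed above (field names are assumptions).
interface Scorecard {
  overallScore: number
  needleRecall: number
  evidenceGrounding: number
  falsePositiveControl: number
  prioritization: number
  actionability: number
  salesInstinct: number
  technicalAccuracy: number
}

// Assumed credit per verdict; the post does not state how partial matches are weighted.
const VERDICT_CREDIT: Record<NeedleVerdict, number> = {
  found: 1,
  partial: 0.5,
  missed: 0,
  contradicted: 0,
}

// Recall on a 0-100 scale: earned credit divided by the number of hidden needles.
function needleRecall(verdicts: NeedleVerdict[]): number {
  if (verdicts.length === 0) return 0
  const credit = verdicts.reduce((sum, v) => sum + VERDICT_CREDIT[v], 0)
  return Math.round((credit / verdicts.length) * 100)
}
```

For example, two found needles, two partial matches, and one miss earn 3 of 5 possible credits, which gives a recall of 60.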

The judge model for this dataset was openai/gpt-5.5 with high reasoning.

LLM-as-judge is imperfect. I still prefer it here because the task is semantic. A string matcher would reward phrase overlap instead of sales judgment.

Because the judge is an OpenAI model, I would not overread tiny OpenAI-vs-non-OpenAI gaps. The result I trust most is the cluster: the top frontier models are close, cheaper models can still be useful, and the weaker runs miss more of the hidden coaching truth.

The first run

The first suite has 25 synthetic calls across startups, growth companies, large enterprises, and Fortune 10 buyers. The sellers are tech companies. The calls vary by type, duration, turn count, and quality profile.

The dataset has:

Item	Count
Synthetic calls	25
Model configurations	18
Judged coach outputs	450
Hidden needles	135
Flaws	64
Strengths	71
Score axes	8

The quality mix is intentionally balanced:

Quality profile	Calls
Flawed	9
Excellent	9
Mixed	7

Durations range from 18 to 74 minutes. The average is 43.5 minutes.

The model set included GPT-5.4, GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, and Gemini 3.1 Pro Preview, with multiple reasoning levels where supported.

Every coach run used the same visible case material. Every judge run used the same hidden ground truth for that case.

The top five configs on overall score were:

Rank	Model config	Avg score
1	gpt-5.4 high	92.04
2	gpt-5.5 none	92.00
3	gpt-5.5 xhigh	92.00
4	gpt-5.4 xhigh	91.96
5	gpt-5.5 high	91.72
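The range, span, and average columns are simple aggregates over a config's per-call scores. Here is a sketch of that arithmetic. The ordering rule (higher average first, then tighter span) is an assumption that happens to match the published table, and `summarize` and `rankConfigs` are hypothetical names.

```typescript
interface ConfigSummary {
  config: string
  min: number
  max: number
  span: number // max - min, in points
  avg: number
}

function summarize(config: string, scores: number[]): ConfigSummary {
  if (scores.length === 0) throw new Error('no scores for ' + config)
  const min = Math.min(...scores)
  const max = Math.max(...scores)
  const avg = scores.reduce((sum, s) => sum + s, 0) / scores.length
  return { config, min, max, span: max - min, avg }
}

// Assumed ordering: higher average first; on an exact tie, the tighter span wins.
function rankConfigs(summaries: ConfigSummary[]): ConfigSummary[] {
  return summaries.slice().sort((a, b) => b.avg - a.avg || a.span - b.span)
}

// Ranges and averages taken from the tables in this post.
const published: ConfigSummary[] = [
  { config: 'gpt-5.5 xhigh', min: 76, max: 97, span: 21, avg: 92.0 },
  { config: 'gpt-5.4 xhigh', min: 82, max: 96, span: 14, avg: 91.96 },
  { config: 'gpt-5.4 high', min: 87, max: 96, span: 9, avg: 92.04 },
  { config: 'gpt-5.5 none', min: 76, max: 96, span: 20, avg: 92.0 },
]
const ranked = rankConfigs(published).map((s) => s.config)
```

With 25 integer scores per config, averages move in steps of 0.04, which is why the top four sit at 92.04, 92.00, 92.00, and 91.96 and all display as 92.0.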

You can browse the current results at salesevals.com.

Cost should match the user task

The results site estimates the cost a real user would pay to receive the coaching.

That means the cost includes the coach model input and output. It excludes benchmark generation and judge costs because those are my eval harness, not the user's coaching product.
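As a sketch, the per-call cost is just the coach model's token usage multiplied by its prices. The type names, the per-million pricing fields, and the numbers in the usage note are assumptions for illustration, not the site's actual rates.

```typescript
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

// Hypothetical pricing shape: US dollars per one million tokens.
interface ModelPricing {
  inputUsdPerMillion: number
  outputUsdPerMillion: number
}

// Cost of one coaching response: coach input plus coach output only.
// Generation and judge tokens are deliberately not part of this number.
function coachCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  return (
    (usage.inputTokens * pricing.inputUsdPerMillion +
      usage.outputTokens * pricing.outputUsdPerMillion) /
    1e6
  )
}
```

With made-up numbers, a 20,000-token call package and a 1,500-token coaching note at $2 and $10 per million tokens would cost 0.04 + 0.015 = $0.055.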

The whole experiment was more expensive than that. Between transcript generation, hidden answer keys, judges, retries, and cross-model runs, I burned about 24 million tokens and $197 getting the first suite together. I posted the cost breakdown on X.

Drew Bredvick

@DBredvick

This thread brought to you by 24m tokens, 197 dollars, and the AI gateway making evals cross model easy to write and codex, don't forget to thank codex

Drew Bredvick

@DBredvick

Working on GTM evals for sales calls, needle in the haystack style with LLM as judge. ⚠️This is for a single call (transcript fully generated from sub agents + an orchestrator), so not conclusive findings yet. The call has two folks on the account team (AE + SA) and two folks

7:02 PM · Apr 30, 2026

Make the artifacts shaped like production

The benchmark came from a larger mocked Zoom project.

The downstream app ingests Zoom recordings, so each generated call has Zoom-shaped artifacts: meeting metadata, participants, recording files, a VTT transcript, optional audio, and a stable recording.completed webhook payload.
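The VTT transcript is the easiest of those artifacts to sketch. The code below renders timed turns as WebVTT cues. `TimedTurn`, `formatTimestamp`, and `toWebVtt` are hypothetical names, and the numbered cues with a `Speaker: text` payload are an assumption about what a downstream parser expects.

```typescript
interface TimedTurn {
  speaker: string
  text: string
  startSeconds: number
  endSeconds: number
}

// Left-pad a non-negative integer with zeros (width of 2 or 3 is enough here).
function pad(value: number, width: number): string {
  return ('000' + value).slice(-width)
}

// Format seconds as a WebVTT timestamp: hh:mm:ss.ttt
function formatTimestamp(totalSeconds: number): string {
  const ms = Math.round(totalSeconds * 1000)
  const hours = Math.floor(ms / 3600000)
  const minutes = Math.floor((ms % 3600000) / 60000)
  const seconds = Math.floor((ms % 60000) / 1000)
  return (
    pad(hours, 2) + ':' + pad(minutes, 2) + ':' + pad(seconds, 2) + '.' + pad(ms % 1000, 3)
  )
}

function toWebVtt(turns: TimedTurn[]): string {
  const cues = turns.map((turn, index) =>
    [
      String(index + 1), // cue identifier
      formatTimestamp(turn.startSeconds) + ' --> ' + formatTimestamp(turn.endSeconds),
      // Collapse whitespace so a turn never contains a blank line, which would end the cue.
      turn.speaker + ': ' + turn.text.replace(/\s+/g, ' ').trim(),
    ].join('\n'),
  )
  // A WebVTT file starts with the WEBVTT header; cues are separated by blank lines.
  return ['WEBVTT'].concat(cues).join('\n\n') + '\n'
}
```

Turn timings have to come from somewhere. A generator could estimate them from word counts and the scenario's target duration, but that step is outside this sketch.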

That matters because the synthetic case can move through the same kind of ingestion flow as a historical Zoom recording. The current public benchmark scores transcript-grounded coaching, but the media shape leaves room for audio, video, screen-share, and asset-observation evals later.

I care a lot about this part. Toy evals are easy to make. Useful evals need to sit close to the product surface they are testing.

Decimals can make a spreadsheet look scientific. A useful benchmark lets a good sales leader open the transcript, read the coaching, and say whether the model caught the thing that mattered.

P.S. If any failed startup is reading this, I will give you $500 for your sales call transcripts. I am only half joking.
