Which models know sales?
Everyone is running coding benchmarks. I wanted to know which models actually know sales.
So I built salesevals.com, a benchmark where models read sales calls and write coaching notes. The first run has 25 synthetic B2B calls, 18 model configurations, and 450 judged coaching outputs.
The winner was gpt-5.4 high, barely. I shared the launch thread on X.
Sales coaching is hard to evaluate because a plausible answer is cheap. Any decent model can summarize a call, compliment the seller, name two generic risks, and sound useful.
I wanted to test sales instincts. Can the model notice that the rep did shallow discovery? Can it catch a technical miss? Can it praise a strong sales engineer without inventing fake criticism? Can it tell the difference between a weak objection answer and a pretty good one?
That requires ground truth.
Start with the answer key
Most sales call evals accidentally test whether the model can write a polished recap.
You give it a transcript. It writes a coaching note. Then you eyeball the note and ask, "does this sound right?" That is useful for product demos, but it is a soft eval. A confident model can pass by sounding like a sales manager.
The trick was to generate the hidden answer key before the transcript existed.
Each call starts with a scenario:
- seller company
- buyer company
- call type
- duration
- target turn count
- quality profile
- persona notes
The quality profile is flawed, excellent, or mixed. A flawed call might have shallow BANT discovery, bad research, poor objection handling, a missed technical point, or a seller who talks too much. An excellent call might have strong account prep, open-ended discovery, a great solutions consultant, or a crisp mutual action plan.
The benchmark needs both. If every call is flawed, a model that criticizes everything gets rewarded for it.
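A minimal sketch of the scenario shape and a balance check, assuming one field per item in the list above. The type and function names (`CallScenario`, `countByProfile`, `missingProfiles`) are made up for this example and are not the project's real code.

```typescript
// Hypothetical scenario shape; field names are illustrative assumptions.
type QualityProfile = 'flawed' | 'excellent' | 'mixed'

interface CallScenario {
  sellerCompany: string
  buyerCompany: string
  callType: string
  durationMinutes: number
  targetTurnCount: number
  qualityProfile: QualityProfile
  personaNotes: string
}

// Count scenarios per quality profile so a lopsided suite is easy to spot.
function countByProfile(scenarios: CallScenario[]): Record<QualityProfile, number> {
  const counts: Record<QualityProfile, number> = { flawed: 0, excellent: 0, mixed: 0 }
  for (const scenario of scenarios) {
    counts[scenario.qualityProfile] += 1
  }
  return counts
}

// Return the profiles that have no calls at all in the suite.
function missingProfiles(scenarios: CallScenario[]): QualityProfile[] {
  const counts = countByProfile(scenarios)
  return (Object.keys(counts) as QualityProfile[]).filter((profile) => counts[profile] === 0)
}
```

A suite where `missingProfiles` returns anything is one where a model can score well with a single habit, such as always criticizing or always praising.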
Before writing the transcript, an LLM designs the hidden coaching truth. I call each hidden item a needle. A needle can be a flaw or a strength. It includes the behavior to look for, where it should appear, what evidence counts, what evidence should be rejected, and what the coaching implication is.
The schema looks roughly like this:
import { z } from 'zod'
// One hidden coaching item (a "needle") that the judge scores against.
export const coachingNeedleSchema = z.object({
  kind: z.enum(['flaw', 'strength']),
  category: z.enum([
    'discovery',
    'research',
    'technical_knowledge',
    'objection_handling',
    'communication_style',
    'qualification',
    'executive_alignment',
    'next_steps',
    'value_alignment',
    'customer_enablement',
  ]),
  label: z.string(),
  // The behavior the judge should look for.
  evaluatorNeedle: z.string(),
  // Where in the call the behavior should appear.
  transcriptPlacement: z.string(),
  subtlety: z.enum(['obvious', 'moderate', 'subtle']),
  // Evidence that counts, and evidence that should be rejected.
  expectedEvidence: z.array(z.string()),
  antiEvidence: z.array(z.string()),
  coachingImplication: z.string(),
})
Now the judge can score behavior, evidence, and false positives instead of polish.
The tested model never sees this answer key. It only sees the visible call package. The judge sees both.
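To make the schema concrete, here is what a single needle could look like. This is a hypothetical example written for illustration, not an item from the benchmark. The `CoachingNeedle` interface is a plain TypeScript mirror of the schema fields so the sketch runs without Zod, and `category` is loosened to `string` for brevity.

```typescript
// Plain TypeScript mirror of the schema fields, so the sketch has no dependencies.
type NeedleKind = 'flaw' | 'strength'
type Subtlety = 'obvious' | 'moderate' | 'subtle'

interface CoachingNeedle {
  kind: NeedleKind
  category: string
  label: string
  evaluatorNeedle: string
  transcriptPlacement: string
  subtlety: Subtlety
  expectedEvidence: string[]
  antiEvidence: string[]
  coachingImplication: string
}

// Hypothetical needle for illustration only; it is not from the dataset.
const exampleNeedle: CoachingNeedle = {
  kind: 'flaw',
  category: 'discovery',
  label: 'Shallow budget discovery',
  evaluatorNeedle:
    'Seller accepts a vague budget answer and moves on without a follow-up question.',
  transcriptPlacement: 'Early discovery, shortly after introductions',
  subtlety: 'moderate',
  expectedEvidence: [
    'Buyer gives a vague budget answer',
    'Seller changes topic without probing',
  ],
  antiEvidence: [
    'Seller asks who owns the budget',
    'Seller asks about the approval process',
  ],
  coachingImplication:
    'Ask at least one follow-up on budget ownership before moving to the demo.',
}
```

The `antiEvidence` list is what lets the judge penalize a coach that claims the flaw on a call where the seller actually did probe.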
The personas carry the truth
Once the hidden answer key exists, I use it to write the personas.
Each persona gets a role, backstory, objectives, objections, speaking style, private call behavior, and coaching signals. The most important field is callBehavior. That is where a label becomes a person.
If the call has a seller flaw, a seller persona naturally carries that tendency. If the call has a seller strength, a seller persona has the prep, skill, or instinct needed to show it. Buyer personas create the moments needed for the eval: skepticism, technical pushback, budget pressure, executive priorities, confusion, urgency.
A seller who talks too much does it because their persona behaves that way. A buyer who is confused about a technical point creates the moment for a strong solutions consultant to help. A CFO concern only half-resolves because the buyer keeps pressure on it.
That is when the calls started to feel real.
Write the call one turn at a time
My first transcripts were bad.
Every message was the same length. People skipped real introductions. Closings sounded like a memo. Nobody said the boring useful things people say on actual calls, like "great meeting y'all, I'll send over a Slack message with next steps."
I changed the generation shape.
Each transcript turn is its own LLM call. The model gets the call context, company research, personas, hidden eval design, allowed speaker IDs, recent turns, and a small naturalness context. It chooses the next speaker and writes only that person's next spoken contribution.
That naturalness context tracks:
- conversation stage
- recent turn lengths
- how many short turns have happened
- whether the call is in opening, discovery, middle discussion, or closing
Then the prompt asks for the parts that make a transcript feel like a call:
- true introductions
- uneven turn lengths
- quick acknowledgements
- brief handoffs
- buyer caveats
- normal wrap-up language
- concrete next-step logistics
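A rough sketch of how the naturalness context could be derived from the turns written so far. The thresholds (`SHORT_TURN_WORDS`, `RECENT_WINDOW`, and the stage cutoffs) are assumptions picked for the example, and the function names are hypothetical.

```typescript
type Stage = 'opening' | 'discovery' | 'middle' | 'closing'

interface Turn {
  speakerId: string
  text: string
}

interface NaturalnessContext {
  stage: Stage
  recentTurnLengths: number[]
  shortTurnCount: number
}

// Assumed thresholds; tune these against real transcripts.
const SHORT_TURN_WORDS = 8
const RECENT_WINDOW = 5

function wordCount(text: string): number {
  const trimmed = text.trim()
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length
}

// Map progress through the target turn count onto a conversation stage.
function stageFor(turnIndex: number, targetTurnCount: number): Stage {
  const progress = targetTurnCount > 0 ? turnIndex / targetTurnCount : 0
  if (progress < 0.1) return 'opening'
  if (progress < 0.4) return 'discovery'
  if (progress < 0.85) return 'middle'
  return 'closing'
}

// Summarize pacing so the next-turn prompt can ask for uneven lengths.
function buildNaturalnessContext(
  turnsSoFar: Turn[],
  targetTurnCount: number,
): NaturalnessContext {
  const lengths = turnsSoFar.map((turn) => wordCount(turn.text))
  return {
    stage: stageFor(turnsSoFar.length, targetTurnCount),
    recentTurnLengths: lengths.slice(-RECENT_WINDOW),
    shortTurnCount: lengths.filter((n) => n <= SHORT_TURN_WORDS).length,
  }
}
```

Keeping this part deterministic means the model only has to decide what the next person says, not where the call is.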
For a 40-turn call, that is at least 40 LLM calls just for the transcript. Add research, hidden eval design, personas, artifacts, coaching, judging, and retries, and a single benchmark case quickly becomes a workflow.
That is why the generator uses Vercel Workflow DevKit. The workflow can run for a while, save artifacts as it goes, and resume when one model call fails.
The high-level shape is simple:
export async function generateMockSalesCallWorkflow(
  input: CallGenerationInput,
) {
  'use workflow'
  const normalized = await normalizeInputStep(input)
  const research = await researchCompaniesStep(normalized)
  const evalDesign = await designCoachingEvalStep({
    input: normalized,
    research,
  })
  const personas = await draftPersonasStep(normalized, research, evalDesign)
  const transcript: TranscriptTurn[] = []
  for (let turnIndex = 0; turnIndex < normalized.turnCount; turnIndex += 1) {
    const turn = await generateTranscriptTurnStep({
      input: normalized,
      research,
      personas,
      evalDesign,
      turnsSoFar: transcript,
      turnIndex,
    })
    transcript.push(turn)
  }
  const { manifest } = await buildArtifactsStep({
    input: normalized,
    research,
    personas,
    evalDesign,
    transcript,
  })
  return persistCallStep(manifest)
}
Yes, this is a lot of model calls. A single prompt asking for "a realistic sales call" will overfit to whatever the model thinks a sales call sounds like. Turn-by-turn generation gives the transcript room to wander, recover, interrupt, clarify, and close.
Use LLMs for judgment
For this benchmark, the semantic steps belong to models.
Classification, routing, tagging, bucketing, grading, and coaching are judgment problems. Brittle string checks collapse good discovery into phrase matching. A model should read the situation and explain itself against a schema.
All the AI calls go through Vercel AI Gateway with the AI SDK. The shared helper calls generateText, requests structured output, and validates the response with Zod.
const { output } = await generateText({
  model: selectedModelId,
  system,
  prompt,
  temperature,
  output: Output.object({
    name: 'StructuredResponse',
    schema,
  }),
  providerOptions: {
    gateway: {
      tags: ['mocked-zoom', ...tags],
    },
  },
})
return schema.parse(output)
Deterministic code handles IDs, storage, retries, routing, and artifact shape. Models handle the semantic work.
I used that pattern for research synthesis, hidden eval design, persona drafting, call-type bucketing, coaching, and judging.
The app has no fake fallback data for missing AI Gateway credentials. If the key is missing, generation fails. Silent fallback data would poison the eval.
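A minimal sketch of that fail-fast check. It assumes the key lives in an environment variable named `AI_GATEWAY_API_KEY`, and `requireGatewayKey` is a hypothetical helper name, not the app's actual code.

```typescript
// Assumption: the gateway key is supplied via AI_GATEWAY_API_KEY.
// Takes the environment as a parameter so it can be tested without real secrets.
function requireGatewayKey(env: Record<string, string | undefined>): string {
  const key = env.AI_GATEWAY_API_KEY?.trim()
  if (!key) {
    // Fail loudly instead of substituting canned data that would poison the eval.
    throw new Error('AI_GATEWAY_API_KEY is missing; refusing to generate with fallback data.')
  }
  return key
}
```

In a Node process this would be called once at startup as `requireGatewayKey(process.env)`, before any workflow step runs.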
The coach sees the call
After the generator creates a call, the coach workflow builds a visible prompt for each model configuration.
The coach sees:
- setup
- company research
- participant context
- speaker-labeled transcript
The hidden fields stay with the judge:
- hidden ground truth
- needle labels
- evaluator notes
- quality profile
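One way to enforce that split is an allowlist: build the coach's view by copying only the visible fields, so a new hidden field can never leak by default. The `CallPackage` shape and `toCoachView` name below are assumptions for this sketch.

```typescript
// Hypothetical call package; the real artifact shape may differ.
interface CallPackage {
  // Visible to the coach
  setup: string
  companyResearch: string
  participants: string[]
  transcript: { speaker: string; text: string }[]
  // Hidden, judge only
  hiddenGroundTruth: string[]
  needleLabels: string[]
  evaluatorNotes: string
  qualityProfile: 'flawed' | 'excellent' | 'mixed'
}

type CoachView = Pick<
  CallPackage,
  'setup' | 'companyResearch' | 'participants' | 'transcript'
>

// Allowlist copy: anything not named here stays out of the coach prompt.
function toCoachView(pkg: CallPackage): CoachView {
  return {
    setup: pkg.setup,
    companyResearch: pkg.companyResearch,
    participants: pkg.participants,
    transcript: pkg.transcript,
  }
}
```

An allowlist is safer here than deleting hidden keys from a copy, because forgetting to update it hides too much rather than too little.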
The prompt asks for useful sales coaching. It tells the model to cite evidence, avoid invented facts, prioritize the biggest coaching points, and allow great calls to be mostly positive.
That last part matters. A great call should earn mostly positive coaching. Sometimes the right answer is "this rep nailed it, keep doing these two things."
The judge sees the answer key
The judge workflow loads the hidden answer key, the visible call, and the coach output.
It scores whether the coach found each needle, partially found it, missed it, contradicted it, or invented unsupported claims. Then it produces an eight-axis scorecard:
- overall score
- needle recall
- evidence grounding
- false-positive control
- prioritization
- actionability
- sales instinct
- technical accuracy
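The scores themselves come from the judge model, so the following is only a sketch of how per-needle verdicts could roll up into a recall number. The credit weights, especially half credit for a partial find, are assumptions for illustration and not the benchmark's actual formula.

```typescript
type NeedleVerdict = 'found' | 'partial' | 'missed' | 'contradicted'

// Assumed credit per verdict; the real judge may weigh these differently.
const VERDICT_CREDIT: Record<NeedleVerdict, number> = {
  found: 1,
  partial: 0.5,
  missed: 0,
  contradicted: 0,
}

// Needle recall as a 0-100 score: earned credit over the number of hidden needles.
function needleRecall(verdicts: NeedleVerdict[]): number {
  if (verdicts.length === 0) return 0
  const credit = verdicts.reduce((sum, verdict) => sum + VERDICT_CREDIT[verdict], 0)
  return (credit / verdicts.length) * 100
}
```

Under these weights, a coach that finds one needle, partially finds another, and misses two scores 37.5. Invented claims are not part of recall; they belong on the false-positive axis.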
The judge model for this dataset was openai/gpt-5.5 with high reasoning.
LLM-as-judge is imperfect. I still prefer it here because the task is semantic. A string matcher would reward phrase overlap instead of sales judgment.
Because the judge is an OpenAI model, I would not overread tiny OpenAI-vs-non-OpenAI gaps. The result I trust most is the cluster: the top frontier models are close, cheaper models can still be useful, and the weaker runs miss more of the hidden coaching truth.
The first run
The first suite has 25 synthetic calls across startups, growth companies, large enterprises, and Fortune 10 buyers. The sellers are tech companies. The calls vary by type, duration, turn count, and quality profile.
The dataset has:
| Item | Count |
|---|---|
| Synthetic calls | 25 |
| Model configurations | 18 |
| Judged coach outputs | 450 |
| Hidden needles | 135 |
| Flaws | 64 |
| Strengths | 71 |
| Score axes | 8 |
The quality mix is intentionally balanced:
| Quality profile | Calls |
|---|---|
| Flawed | 9 |
| Excellent | 9 |
| Mixed | 7 |
Durations range from 18 to 74 minutes. The average is 43.5 minutes.
The model set included GPT-5.4, GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, and Gemini 3.1 Pro Preview, with multiple reasoning levels where supported.
Every coach run used the same visible case material. Every judge run used the same hidden ground truth for that case.
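The run count follows from crossing every case with every model configuration: 25 calls times 18 configs gives the 450 judged outputs. A small sketch, with hypothetical names:

```typescript
interface CoachRun {
  caseId: string
  modelConfig: string
}

// Cross every case with every model config so each config sees identical material.
function buildRunMatrix(caseIds: string[], modelConfigs: string[]): CoachRun[] {
  const runs: CoachRun[] = []
  for (const caseId of caseIds) {
    for (const modelConfig of modelConfigs) {
      runs.push({ caseId, modelConfig })
    }
  }
  return runs
}
```

Building the matrix up front also makes it easy to spot missing or duplicate runs before any results are averaged.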
The top five configs on overall score were:
| Rank | Model config | Avg score |
|---|---|---|
| 1 | gpt-5.4 high | 92.04 |
| 2 | gpt-5.5 none | 92.00 |
| 3 | gpt-5.5 xhigh | 92.00 |
| 4 | gpt-5.4 xhigh | 91.96 |
| 5 | gpt-5.5 high | 91.72 |
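To put "barely" in numbers, this sketch computes the gap between the best and worst average in the table above. The scores are copied from the table; the helper name is made up for the example.

```typescript
interface ConfigScore {
  config: string
  avgScore: number
}

// Scores copied from the top-five table.
const topFive: ConfigScore[] = [
  { config: 'gpt-5.4 high', avgScore: 92.04 },
  { config: 'gpt-5.5 none', avgScore: 92.0 },
  { config: 'gpt-5.5 xhigh', avgScore: 92.0 },
  { config: 'gpt-5.4 xhigh', avgScore: 91.96 },
  { config: 'gpt-5.5 high', avgScore: 91.72 },
]

// Gap between the highest and lowest average, rounded to two decimals.
function scoreSpread(rows: ConfigScore[]): number {
  if (rows.length === 0) return 0
  let min = rows[0].avgScore
  let max = rows[0].avgScore
  for (const row of rows) {
    if (row.avgScore < min) min = row.avgScore
    if (row.avgScore > max) max = row.avgScore
  }
  return Math.round((max - min) * 100) / 100
}
```

For the top five, `scoreSpread(topFive)` is 0.32 points on a 100-point scale.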
You can browse the current results at salesevals.com.
Cost should match the user task
The results site estimates the cost a real user would pay to receive the coaching.
That means the cost includes the coach model input and output. It excludes benchmark generation and judge costs because those are my eval harness, not the user's coaching product.
The whole experiment was more expensive than that. Between transcript generation, hidden answer keys, judges, retries, and cross-model runs, I burned about 24 million tokens and $197 getting the first suite together. I posted the cost breakdown on X.
That cost belongs to the benchmark builder. The product question is cheaper and more useful: if a user uploads one call and asks for coaching, what does that coach model cost?
In this dataset, estimated coach-run costs ranged from about $0.004 to $0.21 per call. The average was about $0.11 per call. Total estimated coach cost across 450 runs was $49.30.
That is an estimate. It is still useful for comparing configurations.
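A sketch of the per-run estimate, assuming cost is token counts times a per-million-token price for input and output. The prices in the usage note are placeholders, not real rates for any model, and the function names are hypothetical.

```typescript
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

// Prices in USD per one million tokens; actual rates vary by model and provider.
interface ModelPricing {
  inputUsdPerMillion: number
  outputUsdPerMillion: number
}

// Coach-only cost: what a user would pay for one coaching note.
function estimateCoachCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  const inputCost = usage.inputTokens * pricing.inputUsdPerMillion
  const outputCost = usage.outputTokens * pricing.outputUsdPerMillion
  return (inputCost + outputCost) / 1e6
}

// Average cost per run across a whole suite.
function averageCostUsd(totalUsd: number, runs: number): number {
  if (runs <= 0) throw new Error('runs must be positive')
  return totalUsd / runs
}
```

With placeholder prices of $2 in and $10 out per million tokens, a 20,000-token transcript and a 1,500-token note would cost about $0.055. The dataset average works the same way: $49.30 across 450 runs is about $0.11 per call.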
Make the artifacts shaped like production
The benchmark came from a larger mocked Zoom project.
The downstream app ingests Zoom recordings, so each generated call has Zoom-shaped artifacts: meeting metadata, participants, recording files, a VTT transcript, optional audio, and a stable recording.completed webhook payload.
That matters because the synthetic case can move through the same kind of ingestion flow as a historical Zoom recording. The current public benchmark scores transcript-grounded coaching, but the media shape leaves room for audio, video, screen-share, and asset-observation evals later.
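As an example of one such artifact, here is a minimal sketch that renders timed turns as a WebVTT transcript. It assumes each cue carries a numeric identifier and a `Speaker: text` payload; the `TimedTurn` shape and function names are hypothetical, and the generator's real output may include more detail.

```typescript
interface TimedTurn {
  speaker: string
  text: string
  startSeconds: number
  endSeconds: number
}

// Left-pad a non-negative integer with zeros.
function pad(value: number, width: number): string {
  let s = String(value)
  while (s.length < width) s = '0' + s
  return s
}

// Format seconds as the WebVTT timestamp hh:mm:ss.ttt.
function formatTimestamp(totalSeconds: number): string {
  const totalMs = Math.round(totalSeconds * 1000)
  const ms = totalMs % 1000
  const secs = Math.floor(totalMs / 1000) % 60
  const mins = Math.floor(totalMs / 60000) % 60
  const hours = Math.floor(totalMs / 3600000)
  return `${pad(hours, 2)}:${pad(mins, 2)}:${pad(secs, 2)}.${pad(ms, 3)}`
}

// Build a WebVTT document: header, then one numbered cue per turn.
function toVtt(turns: TimedTurn[]): string {
  const cues = turns.map((turn, index) =>
    [
      String(index + 1),
      `${formatTimestamp(turn.startSeconds)} --> ${formatTimestamp(turn.endSeconds)}`,
      `${turn.speaker}: ${turn.text}`,
    ].join('\n'),
  )
  return ['WEBVTT'].concat(cues).join('\n\n') + '\n'
}
```

Because the transcript is emitted in the same format an ingestion pipeline already parses, the synthetic call needs no special-case code path downstream.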
I care a lot about this part. Toy evals are easy to make. Useful evals need to sit close to the product surface they are testing.
Decimals can make a spreadsheet look scientific. A useful benchmark lets a good sales leader open the transcript, read the coaching, and say whether the model caught the thing that mattered.
P.S. If any failed startup is reading this, I will give you $500 for your sales call transcripts. I am only half joking.
