Last Week in AI: Anthropic, more evals, and usage revenue

Hey there,

I started doing a weekly review of my Reflect notes and figured I'd publish a lookback at what actually mattered from the week. Here's what made the cut:

Anthropic raised $65B at a $965B valuation and shipped Claude Opus 4.8 the same day. Claude Code now has dynamic workflows that can run hundreds of parallel subagents in a session. I also ran Opus 4.8 through sales evals this week, and it landed mid-pack.
Snowflake reported Q1 FY2027 results and basically showed the SaaSpocalypse counterexample in public. Product revenue was $1.33B, up 34% year over year, and Snowflake raised full-year product revenue guidance to $5.84B. I keep coming back to the same point: every agent wants context. Usage shows up somewhere.
DeepSWE is my new favorite coding benchmark and not just because it confirms my experience. The design of these evals feels more right: it's all from scratch, it's multi-language, and the verifiers test actual software behavior.

Two updates from me:

1. I published "Which models know sales?" this week.

I built salesevals.com, a benchmark where models read synthetic B2B sales calls and write coaching notes. It's a needle-in-a-haystack approach: I generate call transcripts with known strengths and weaknesses to see if the coach catches them.
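Here's a rough sketch of how that kind of needle-in-a-haystack scoring can work. To be clear, this is not the actual salesevals.com scoring code: the labels, the `score_coach` function, and the idea that a judge has already mapped the coaching notes to labels are all assumptions for illustration.

```python
# Hypothetical sketch: score a coach against the findings planted in a transcript.
# Assumes a judge step has already turned free-text coaching notes into labels.

def score_coach(planted: set[str], caught: set[str]) -> tuple[float, float]:
    """Return (recall, precision) of the coach's findings vs. the planted ones."""
    hits = planted & caught
    # Recall: how many planted strengths/weaknesses did the coach find?
    recall = len(hits) / len(planted) if planted else 0.0
    # Precision: how much of what the coach flagged was actually planted?
    precision = len(hits) / len(caught) if caught else 0.0
    return recall, precision


# Made-up labels for one synthetic call.
planted = {"no_next_step", "strong_discovery", "skipped_pricing_objection"}
caught = {"no_next_step", "strong_discovery", "talked_too_much"}

recall, precision = score_coach(planted, caught)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The point of planting known findings is that you get a ground truth to grade against, so a coach that only sounds convincing still scores low on recall.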

The live leaderboard now has 25 call transcripts and 575 judged outputs. GPT-5.4 high is still leading. Surprisingly, the new Opus 4.8 runs landed in the middle: Opus 4.8 high scored 89.4.

My read: labs (especially Anthropic) are focused on coding, and sales is still its own game. A model can sound like a sales manager pretty easily. Telling the difference between "good sales" and "bad sales" is still hard.

2. I also published "Why the SaaSpocalypse is fake news."

tl;dr: agents create a ton of usage, which increases revenue.

A human asks one question. The agent pulls from Snowflake, Salesforce, Slack, Notion, Gong, docs, email, and whatever else has the context. It retries. It asks follow-ups. It might even run on a recurring schedule. One user action becomes many billable activities.

The SaaS doomer story starts with seat compression. Usage-based pricing can flip that math: if 80% of the seats remain and each one uses 10x more, that's 8x the total usage, which means MORE revenue, not less.
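The arithmetic is simple enough to sketch. The seat count, per-seat price, units per seat, and per-unit price below are made-up numbers (in cents to keep the math exact); only the 80% and 10x figures come from the argument above.

```python
# Back-of-the-envelope comparison of seat-based vs. usage-based revenue.
# All inputs are hypothetical; prices are in cents.

def seat_revenue(seats: int, price_per_seat: int) -> int:
    """Revenue when you charge a flat price per seat."""
    return seats * price_per_seat


def usage_revenue(seats: int, units_per_seat: int, price_per_unit: int) -> int:
    """Revenue when you charge per unit of usage (queries, credits, API calls)."""
    return seats * units_per_seat * price_per_unit


# Before agents: 100 seats, each consuming 1,000 units.
# After agents: 80% of the seats, each consuming 10x more.
seat_change = seat_revenue(80, 5_000) / seat_revenue(100, 5_000)
usage_change = usage_revenue(80, 10_000, 10) / usage_revenue(100, 1_000, 10)

print(f"seat-based revenue:  {seat_change:.1f}x")   # 0.8x
print(f"usage-based revenue: {usage_change:.1f}x")  # 8.0x
```

Same seat compression in both cases. The pricing model decides whether it shows up as a 20% drop or an 8x increase.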

Dan Shipper has an interesting corollary to this on Lenny’s podcast: if the user brings the tokens, the app isn’t paying for all the intelligence it routes. OpenAI seems to be going this route by being more open about where you can spend your subscription usage.

As for next week, rumor has it we might be getting GPT-5.6 and Apple is hosting WWDC, although I'm not entirely sure Apple belongs in an update about AI?

LFG,
Drew
