Drew Bredvick

Building the future of GTM with AI


Knowledge Bases

One of the hardest problems we face while building out agents at Vercel in our GTM Engineering group is building and maintaining an accurate knowledge base.

All companies essentially need to re-create their own mini-Google internally across:

  • Slack channel content
  • Emails
  • Sales call recordings
  • CRM updates
  • Product usage info

This is a lot of data. Making it searchable is not an easy task; it is a hard engineering problem that cannot be vibe coded away.

There are great tools out there that make building these easier, but no one has a batteries-included version yet. Let me outline some of the work we've done to build ours internally, what I see as the opportunity, and the products in the space that seem to be headed in the right direction.

What we built

The first use case we covered at Vercel was account-focused knowledge bases. Think of this as a per-customer knowledge base that includes:

  • Gong call recordings
  • Emails
  • Slack channel messages

All of this data gets indexed into a customer-specific namespace that can be queried with both vector embeddings (cosine similarity) and full-text search.
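
As a rough sketch, a hybrid query against one of these namespaces looks something like the code below. The Turbopuffer query parameters and response shape, the embedding model, and the queryAccountKnowledge helper itself are illustrative assumptions, not our exact implementation.

import { Turbopuffer } from '@turbopuffer/turbopuffer';
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

const tpuf = new Turbopuffer({ apiKey: process.env.TURBOPUFFER_API_KEY! });

// Hypothetical helper: hybrid search over a per-customer namespace.
export async function queryAccountKnowledge(sfdcAccountId: string, question: string) {
  const ns = tpuf.namespace(`account-${sfdcAccountId}`);

  // Semantic side: embed the question and rank by cosine similarity.
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: question,
  });
  const semantic = (await ns.query({
    vector: embedding,
    top_k: 10,
    include_attributes: ['text', 'source'],
  })) as any[]; // response assumed to be an array of rows

  // Full-text side: BM25 ranking over the raw text attribute.
  const fullText = (await ns.query({
    rank_by: ['text', 'BM25', question],
    top_k: 10,
    include_attributes: ['text', 'source'],
  })) as any[];

  // Naive merge; in practice we de-duplicate and re-rank.
  return [...semantic, ...fullText].map((row) => ({
    text: row.attributes?.text as string,
    source: row.attributes?.source as string,
  }));
}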

We created this pipeline using Vercel's new Workflow SDK, which is triggered both on a schedule and by webhooks:

  • Daily for Slack messages
  • As soon as recordings are ready for sales calls

Workflows makes this code very simple to write because you don't have to worry about the traditional headaches associated with serverless compute on Vercel. These can be long-running background tasks that have built-in retries and durability.

Let's look at one high-level example:

import { stepGetGongData, stepSaveGongJob, stepUpsertGongData } from './steps';
import { GongWebhook } from '@/lib/types';

/**
 * On each workflow run:
 * - get gong data and parse as markdown
 * - upsert gong data to turbopuffer
 */

export async function workflowGongIngest(data: GongWebhook) {
  'use workflow';

  const sfdcAccountId = data.callData.context
    ?.find((context) => context.system === 'Salesforce')
    ?.objects?.find((object) => object.objectType === 'Account')?.objectId;

  const markdown = await stepGetGongData(data);

  if (sfdcAccountId && markdown) {
    await stepUpsertGongData(sfdcAccountId, markdown, data);

    await stepSaveGongJob(data, markdown, sfdcAccountId);
  }

  return true;
}
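
For context, a step like stepUpsertGongData is roughly the shape below. This is a minimal sketch: the chunking, the embedding model, the callId field on the webhook payload, and the Turbopuffer upsert parameters are assumptions for illustration, not our production code.

import { Turbopuffer } from '@turbopuffer/turbopuffer';
import { embedMany } from 'ai';
import { openai } from '@ai-sdk/openai';
import { GongWebhook } from '@/lib/types';

export async function stepUpsertGongData(
  sfdcAccountId: string,
  markdown: string,
  data: GongWebhook,
) {
  'use step';

  const tpuf = new Turbopuffer({ apiKey: process.env.TURBOPUFFER_API_KEY! });
  const ns = tpuf.namespace(`account-${sfdcAccountId}`);

  // Naive fixed-size chunking for illustration; real chunking should follow
  // the structure of the call (speakers, topics, highlights).
  const chunks = markdown.match(/[\s\S]{1,2000}/g) ?? [];

  const { embeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: chunks,
  });

  // callId is a hypothetical field on the webhook payload, used here only to
  // give chunks stable ids.
  const callId = (data as any).callData?.callId ?? 'unknown-call';

  await ns.upsert({
    vectors: chunks.map((text, i) => ({
      id: `${callId}-${i}`,
      vector: embeddings[i],
      attributes: { text, source: 'gong' },
    })),
    distance_metric: 'cosine_distance',
  });
}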

This process writes the data into Turbopuffer, a new take on the vector database that scales much more cost-effectively by using novel storage techniques. Turbopuffer has a few large customers you probably know, including Cursor and Notion, and the team has been very helpful in designing some of our queries and giving us feedback on our implementation.

How we use it

This knowledge base can now be queried across our many different applications that serve GTM teams.

The most useful is an application called Dealbot, which reviews sales activities, call recordings, and Slack messages to coach sales reps. It means we can query semantically and pull relevant context into deal reviews, postmortems on closed deals, and notes.

If you've worked in sales before, you know it's impossible to stay on top of every single call recording. This tooling helps us get up to speed quickly on important deals and coach our teams at scale.
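
As a sketch of what a Dealbot-style lookup might look like: the './knowledge' import refers to the hypothetical queryAccountKnowledge helper from the earlier sketch, and the model and prompt are illustrative.

import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
// Hypothetical module from the earlier sketch: hybrid search over an account namespace.
import { queryAccountKnowledge } from './knowledge';

export async function draftDealReview(sfdcAccountId: string) {
  // Pull the most relevant call, email, and Slack context for this account.
  const context = await queryAccountKnowledge(
    sfdcAccountId,
    'current deal status, blockers, champions, and next steps',
  );

  // Ground the review in the retrieved context only.
  const { text } = await generateText({
    model: openai('gpt-4o'),
    prompt: [
      'You are a sales coach. Using only the context below, write a short deal review:',
      ...context.map((c) => `[${c.source}] ${c.text}`),
    ].join('\n\n'),
  });

  return text;
}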

The great part of building out this knowledge base is that it will be used not only by Dealbot but by every other application we build going forward. It is a centralized service that all of our applications can call today, entirely in code, and the value we're seeing is outstanding.

The opportunity

Given all of the brutally hard engineering work required just to get to the starting line of being able to query the data, you might assume it's worth looking at paid options before building this yourself. I agree.

The problem with existing solutions

The opportunity, and the problem, in this space is that you can't simply index data from different sources without thinking about:

  • Your querying strategy
  • Your chunking strategy
  • The format of the content

To make this concrete: we make 4-5 Gong API calls per transcript to build up a canonical representation of a sales call in the shape we want. That representation is what gives us the best query performance.

Simply taking the response from a single API call and dumping it in isn't sufficient. You need to build these canonical representations for each data source (see the sketch after this list). While some of this is specific to your business, most of it is specific to the integration:

  • Gong calls should all be represented in a similar way
  • There is a particular representation of Slack threads that works best
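
Here is a sketch of what building that canonical Gong representation could look like. The gongGet helper, endpoint paths, and response shapes are assumptions for illustration; check Gong's v2 API docs for the real contracts.

// Build one canonical markdown document per sales call from several Gong API responses.
type GongParty = { name: string; affiliation: string };
type GongSentence = { speakerName: string; text: string };

// Hypothetical authenticated client for Gong's REST API.
declare function gongGet<T>(path: string, body: unknown): Promise<T>;

export async function buildCanonicalGongMarkdown(callId: string): Promise<string> {
  // Several requests to assemble one representation: call metadata,
  // participants, and the transcript itself.
  const meta = await gongGet<{ title: string; started: string; parties: GongParty[] }>(
    '/v2/calls/extensive',
    { filter: { callIds: [callId] } },
  );
  const transcript = await gongGet<GongSentence[]>('/v2/calls/transcript', {
    filter: { callIds: [callId] },
  });

  return [
    `# ${meta.title} (${meta.started})`,
    '',
    '## Participants',
    ...meta.parties.map((p) => `- ${p.name} (${p.affiliation})`),
    '',
    '## Transcript',
    ...transcript.map((s) => `**${s.speakerName}:** ${s.text}`),
  ].join('\n');
}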

What the market needs

The opportunity here is to build out a catalog of integrations where you not only connect to each service but also ship the ideal chunking, embedding, query rewriting, and re-ranking for it, all in one.

You obviously need this to be extensible as well, but shipping good defaults here should not be that hard. A service like this provides a lot more value than the raw compute and storage that a vector database solution provides.
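
To make that concrete, the unit such a service would sell might look something like the hypothetical connector shape below (not any existing product's API):

// Hypothetical shape of a "connector" in the catalog described above:
// every integration ships its retrieval pipeline alongside the sync.
interface Connector<Raw> {
  name: 'gong' | 'slack' | 'notion' | 'salesforce';

  // Pull raw records from the source (webhooks, cron backfills, etc.).
  sync(since: Date): AsyncIterable<Raw>;

  // Turn a raw record into the canonical text representation for this source.
  toCanonical(record: Raw): string;

  // Source-aware defaults, overridable per customer.
  chunk(canonical: string): string[];
  embeddingModel: string;
  rewriteQuery(query: string): string;
  rerank(query: string, hits: string[]): string[];
}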

This domain expertise is defensible, and it can serve as:

  • Integration marketing
  • Partnership-led distribution
  • A technical moat powered by the sheer volume of integrations and domain expertise that maps to embeddings, retrieval, etc.

The good news is the market for this problem is massive. Your potential customers range from a 10-person startup all the way to the Fortune 10. Everyone has a knowledge base that they need to build in order to power their agents. And there's not a single solution I've found that seems built for this purpose yet.

Top contenders

While no one has shipped this perfectly yet, I think there are some folks poised to seize the moment.

Glean

One is obviously Glean, but after some sales conversations with them about the API, it's clear it's not their main priority. Glean is leaning into its agent builder, chat, and org-wide knowledge base, but those are exposed as an internal ChatGPT competitor, not as a developer-first tool.

The sales org they've built is not developer-first either, so pivoting into this would likely be difficult. Plus, developer-led tool sales would likely come with lower seat counts, which doesn't align well with their current monetization model.

Vector databases

The next companies poised to benefit here are the vector databases themselves. Think:

  • Turbopuffer
  • Pinecone
  • pgvector in Postgres

Obviously, the more embeddings created in the world, the better for these folks, so the fact that semantic search is the best option for these knowledge bases will continue to be a tailwind.

One problem, though: these smaller internal use cases provide a ton of value, but the pricing is geared toward serving large volumes like Cursor and Notion. So for an individual company, using something like Turbopuffer is a steal relative to the value you're getting.

Chroma

One recent company that seems to get this is Chroma. Chroma recently shipped WebSync as well as GitHub sync, which do exactly what I'm suggesting here: batteries-included indexing plus straightforward query advice.

If Chroma continues to build out these connectors, I think it's a very compelling offering. They'll have to figure out how to price the value correctly without being too expensive, which could be tricky, but these premium integrations and add-ons could give them a great way to segment customers by propensity to pay.

For example, maybe the Notion integration is cheap or free in Chroma's indexing, but connecting to Salesforce is expensive. Integrations lend themselves naturally to this kind of tiering, and Zapier has done it historically, so there is precedent.

LlamaIndex

LlamaIndex was a project that seemed to be leaning into this, though with an open-source approach. They seem to be over-rotating into PDFs and documents specifically, but I think the adapter concept makes a lot of sense as a primary focus, not just one of many things your library handles.

Conclusion

Every company will need to build and maintain their knowledge base as a source of grounding for their agentic tooling over the next decade. The early adopters will build this in-house (like us), but when the mass market is ready, an off-the-shelf tool sure would be nice.
