TL;DR

  • Token cost in proposal workflows is dominated by input, not output: long questionnaires, fat system prompts, oversized retrieval, and re-runs across reviewers.

  • Prompt caching, embedding caches, and semantic answer caches each save different money. They are complementary, not interchangeable.

  • Model routing — small models for classification and intent, frontier models for synthesis and edge cases — is where most teams find their first 40 to 60 percent of savings.

  • A governed answer library skips generation entirely for repeat questions. The cheapest token is the one you never spend.

  • Measure cost per accepted answer, not cost per call. Cost per call ignores rework.

  • Bottom line:The fastest cost wins come from spending less by retrieving better and reusing more — not from buying a cheaper model. Tribble is one approach to combining governed retrieval, model routing, and a reuse-first answer library for RFPs, DDQs, and security questionnaires.

Where token cost actually hides in proposal workflows

Most teams discover their AI token bill the same way: a finance partner forwards an invoice with a question mark in the subject line. The drafting tool that looked free in a pilot is now five figures a month, and nobody can explain why. The answer is almost always the same. Proposal and RFP workflows are unusually expensive for AI not because the model is doing anything exotic but because the shape of the work pushes every token-cost lever in the wrong direction.

Three structural facts make this a high-cost category. RFPs and security questionnaires arrive as large documents — often 80 to 400 questions in a single workbook — and a model needs context to answer each one. Each question typically pulls supporting passages from a knowledge base, multiplying retrieved content several times over. And the work is collaborative, which means the draft gets regenerated, re-prompted, edited, and re-reviewed by multiple humans, each of whom may trigger their own model calls.

Input tokens, not output tokens, dominate the bill. A typical answer might emit 200 to 600 tokens of generated text, but the request that produced it can include 6,000 to 30,000 tokens of context: the question itself, retrieved passages, system instructions, examples, and prior turn history. In retrieval-heavy workloads the input-to-output ratio frequently runs 20:1 or higher. Pricing this asymmetry rather than headline per-token cost is the first move toward sanity.

The cost stacks that surprise teams

When teams audit their spend honestly, the same patterns show up. Each is fixable, but only if you see it.

Oversized system prompts.A 2,500-token system prompt that includes the entire style guide, sample answers, formatting rules, persona, and constraints is sent on every call. Multiply that by 300 questions and the system prompt alone has consumed three quarters of a million tokens before the model has produced a single output character. Most of that prompt is redundant for most questions. Prompt caching mitigates this when the provider supports it, but only if you structure the request so the cacheable prefix is stable.

Retrieval bloat.A naive RAG setup pulls the top 10 chunks for every question because the developer was hedging against missing context. Each chunk is 400 to 800 tokens. That is 4,000 to 8,000 tokens of retrieved context per question, of which two or three chunks were actually relevant. The model still reads it all. Top-k tuning, reranking, and recall-vs-precision instrumentation pay for themselves quickly.

Embedding recomputation.Every time an ingest pipeline runs, it embeds the corpus again — sometimes the entire corpus, sometimes only what changed if the pipeline is careful. Embeddings are cheap per token but the corpora are large, and "we re-embed everything weekly" turns into real money on a 50 GB knowledge base.

Tool-call orchestration.An agentic workflow that fetches, retrieves, summarizes, drafts, and checks may make six to ten model calls per question. Each call carries its own system prompt and context. The total tokens for a single answered question can exceed 50,000 if no one is watching.

Reviewer re-runs.The proposal manager regenerates the draft because the tone is wrong. The SME re-asks a clarifying question. The deal lead pastes the answer into a separate ChatGPT to "improve" it. Each step duplicates input tokens. The economic unit is not "first draft generated" but "answer accepted and shipped," and the ratio between the two is often 3:1.

Long-context experiments.Someone decides to "just throw the whole questionnaire and the whole knowledge base into a long-context model and let it figure it out." It works in the demo. Then the bill arrives, because a 200K-token context window used at full size is expensive on every call.

Caching, and what each kind actually saves

"Just add caching" is the advice junior engineers give and senior engineers dread, because there are at least four distinct cache layers in a serious proposal pipeline and they save different money in different ways. Treating them as one thing causes teams to over-invest in the wrong layer.

Prompt caching at the model provider.Both Anthropic and OpenAI now offer prefix caching with discounted rates on the cached portion of the input. The pattern is simple: structure your request so the long, stable prefix — system prompt, style guide, persona, schema — comes first, and the variable portion — the actual question and the freshly retrieved passages — comes last. The cached prefix bills at a fraction of the standard input rate when it is hit. The trap is that any byte-level change to the prefix invalidates the cache. Teams that splice timestamps, request IDs, or per-user metadata into the system prompt get zero cache hits and never figure out why.

Embedding cache.Embeddings are deterministic for a given model and input. A simple keyed store — content hash to vector — saves the embedding cost on any unchanged chunk. This is straightforward and almost always underused. The savings show up at ingest time, not query time, but they compound quickly on large corpora that re-ingest on schedule.

Semantic answer cache.The cheapest call is the one you do not make. If a buyer asks "Where are your data centers located?" and your team has answered the same question 40 times in the last year with a stable answer, you should return the cached answer rather than re-generating it. Implementations vary — exact match, embedding-similarity lookup, hybrid of both — but the underlying insight is that most RFPs include 60 to 80 percent overlap with prior RFPs your team has already answered.

KV-cache and session reuse.Less applicable in stateless API workflows, but for self-hosted inference and multi-turn conversations within a single deal, reusing the KV cache across turns saves real compute. Most teams using hosted APIs will not touch this layer directly, but it shows up in provider pricing as session discounts where offered.

The discipline is to measure each cache's hit rate independently. A team that says "our cache hit rate is 60 percent" without specifying which cache has not actually measured anything. Prefix cache hit rate, embedding cache hit rate, and semantic answer cache hit rate are three different numbers, and the leverage from each is different.

Model routing: small for triage, frontier for synthesis

The largest single lever for most teams is not caching. It is using the right model for each subtask. A frontier model — the most capable, most expensive tier from any provider — is overkill for question classification, intent detection, simple lookups, or formatting transformations. A smaller, faster model handles these tasks for a fraction of the cost.

A practical routing pattern looks like this. A small model reads the incoming question and classifies it: is this a security question, a pricing question, a feature question, a reference question, or other? The same small model checks whether the answer is already in the curated answer library by semantic similarity. If a high-confidence match exists, the workflow returns the library answer without invoking any drafting model at all. If not, the small model decides which knowledge base sections to retrieve from. Only the final synthesis step — composing the answer from retrieved passages — uses a frontier model. And even synthesis can step down to a mid-tier model on standard questions, reserving the frontier tier for novel, multi-source, or compliance-critical answers.

The savings here are not marginal. A frontier-only pipeline that uses an Opus-class or GPT-5-class model for every step might cost two to four dollars per fully answered RFP question. A well-routed pipeline that reserves the frontier model for synthesis on the 20 percent of questions that need it can run at 40 to 60 cents per question. Multiply by an annual RFP volume of 12,000 questions across 60 RFPs and the difference is roughly the cost of a senior engineer.

The trap with routing is judging quality on each step individually rather than end-to-end. A cheap classifier that miscategorizes 8 percent of questions can push downstream costs up if the misclassification triggers expensive retrieval against the wrong index. Routing requires measurement: classification accuracy, retrieval relevance, and final answer acceptance must all be tracked together.

How a governed knowledge base cuts cost

Cost optimization conversations usually focus on tokens, models, and caching. The bigger lever — the one that surprises engineering leaders — is governance. A governed knowledge base, meaning a curated, deduplicated, source-of-truth answer library with version control and approval workflow, cuts cost in several ways simultaneously.

First, deduplication. Most enterprise knowledge bases are 60 to 75 percent redundant. Three product teams have each written their own version of "How do you handle data residency?" Two of them are stale. A governed library forces consolidation. After dedup, retrieval pulls fewer chunks because there is less to pull. Each query is shorter. Each cache hit is more likely.

Second, structured answers. A governed library doesn't only store source documents. It stores canonical, approved answers attached to questions, with citations back to source documents. When a buyer asks the same question, the library returns the approved answer directly — no retrieval, no generation, no spend. Generation is reserved for new questions or questions that need synthesis across multiple sources.

Third, freshness. Stale entries do not just produce wrong answers; they produce expensive rework. A wrong answer that ships in an RFP triggers a re-draft, a clarification email, sometimes a lost deal. A governance layer that flags stale entries and forces refresh keeps the corpus lean and accurate.

Fourth, retrieval precision. Approved answers tagged with topic, audience, and stage make retrieval surgical. The model receives 1,500 tokens of highly relevant context rather than 8,000 tokens of "maybe relevant."

The cumulative effect is striking. Teams that move from a raw RAG-over-everything setup to a governed answer library typically see input tokens per answered question fall by 60 to 80 percent and the fraction of questions answered without generation rise from near zero to 35 to 55 percent. The cheapest tokens are the ones never spent.

Measuring cost-per-response that you can defend

If you cannot measure it, you cannot optimize it. The right unit for proposal AI is cost per accepted answer — meaning the all-in token, embedding, and orchestration cost for every question that ultimately shipped in a delivered RFP, including rework. Tracking cost per API call is misleading because it ignores reruns, regenerations, and rejected drafts.

A defensible measurement stack has four layers. Per-request token attribution: every model call tagged with the workflow step, the question ID, the deal ID, and the model used. Cache instrumentation: hit and miss counters for each cache layer, surfaced as percentages, broken down by question category. Quality coupling: link each generated answer to its eventual disposition — accepted as drafted, edited then accepted, rejected and regenerated, abandoned. Cost dashboards rolled up by workflow, by team, by deal, with weekly trend lines.

Two anti-patterns to avoid. Do not report a single global "cost per query" number. It hides the variance that is the optimization opportunity. And do not optimize cost without coupling to quality. A 30 percent cost reduction that drops acceptance from 78 percent to 61 percent has cost you more in human time than it saved in API spend.

The ROI math for an enterprise RFP team

The honest ROI calculation has three components: direct API and infrastructure spend, human time, and outcome variance. Most cost analyses overweight the first and underweight the other two.

Direct spend on a poorly optimized pipeline for a mid-market enterprise running 60 RFPs and 200 security questionnaires per year typically lands at $80,000 to $200,000 annually in raw API costs. The same workload on a well-optimized pipeline — routed models, governed library, semantic answer cache, prompt caching — runs $20,000 to $50,000. The delta is real but it is not the headline number.

Human time is. A proposal analyst billing at fully loaded $130,000 spends, in the raw-LLM workflow, four to six hours per RFP correcting hallucinated answers, hunting for citations, and re-prompting the model. In a governed workflow that prevents hallucinations through source-backed answers and skips generation on repeat questions, the same analyst spends one to two hours per RFP on review and edit. Across 60 RFPs the time delta is roughly 180 to 240 analyst hours, which is a quarter of an FTE.

Outcome variance is the largest factor but the hardest to measure. A hallucinated answer in a regulated DDQ or a security questionnaire can cost a six-figure or seven-figure deal. Governance that prevents this is insurance whose premium is the cost of running the platform. The teams who calculate ROI honestly include the expected value of avoided incidents, not just the realized savings.

Raw LLM vs governed AI platform: where the money goes

The table below shows where cost differs between a workflow built directly on a model API and a workflow built on a governed AI platform with the optimizations above. Numbers are directional, drawn from common patterns in mid-market enterprise deployments; your mileage varies by volume, model choice, and corpus shape.

Comparison table

Cost dimension: Input tokens per answered question | Raw LLM workflow: 8,000–30,000 | Governed AI platform: 1,500–6,000

Cost dimension: Output tokens per answered question | Raw LLM workflow: 200–800 | Governed AI platform: 200–600

Cost dimension: Fraction of questions skipping generation (library hit) | Raw LLM workflow: 0–5 percent | Governed AI platform: 35–55 percent

Cost dimension: Prompt cache hit rate | Raw LLM workflow: 0–20 percent (unstructured prefixes) | Governed AI platform: 50–80 percent (stable prefix discipline)

Cost dimension: Embedding recomputation cadence | Raw LLM workflow: Full corpus on each refresh | Governed AI platform: Delta-only with content-hash cache

Cost dimension: Retrieval precision | Raw LLM workflow: Top-10 unfiltered chunks | Governed AI platform: Reranked, tagged, top-3 to top-5

Cost dimension: Hallucination rework cost (human hours / 60 RFPs) | Raw LLM workflow: 240–360 hours | Governed AI platform: 60–120 hours

Cost dimension: Audit and citation cost | Raw LLM workflow: Manual, often skipped | Governed AI platform: Inline, automatic

Cost dimension: Effective cost per accepted answer | Raw LLM workflow: $1.20–$3.50 | Governed AI platform: $0.20–$0.70

The table also reveals a non-obvious dynamic. A naive cost calculation focuses on the per-question line and concludes that the difference is "a few dollars." But the cost-per-accepted-answer line includes rework, and the gap there is closer to 5x than 2x. Governance is the multiplier.

Where Tribble fits

Tribble is an AI knowledge platform built for revenue teams that combines several of the cost-optimization patterns above into one system. The platform maintains a governed answer library with source citations on every answer, so common RFP, DDQ, and security questionnaire questions return approved answers without invoking a drafting model. Connectors to Salesforce, Gong, Slack, and document repositories keep the underlying corpus current without manual maintenance. The retrieval layer uses tagged, deduplicated content so each question pulls a tighter, more relevant context window rather than a wide top-k. Drafting is reserved for genuinely new questions and uses a routed model selection appropriate to the question class. Approval workflow, audit trail, and role-based access ensure that the answers feeding cost-saving reuse remain accurate enough to reuse. The net effect, for teams running heavy RFP and questionnaire workloads, is lower direct token spend and substantially lower human rework spend, with auditability that satisfies compliance review.

Frequently asked questions

A well-optimized pipeline lands somewhere between 4,000 and 12,000 input tokens per answered question, with output of 200 to 600 tokens. For a 200-question RFP that is roughly 1 to 2.5 million input tokens and 40,000 to 120,000 output tokens. The cost depends entirely on the model mix and cache hit rate; with prompt caching and small-model routing for triage, the all-in cost per RFP commonly runs $30 to $120. Unoptimized pipelines that route everything to a frontier model with no caching can easily run 5x to 10x that.

Smaller models are sufficient for question classification, intent detection, simple lookups, formatting transformations, and routing decisions. They are also sufficient for many standard, well-precedented answers where the supporting source content is unambiguous. Reserve a frontier model for synthesis across multiple sources, conflict resolution, novel questions, and any question where the cost of a subtly wrong answer is high. A useful test: if the same question were asked to a junior analyst with the right document open, would they need to think? If not, a small model is enough.

It pays back, but only with discipline. Both Anthropic and OpenAI now bill cached prefix tokens at a fraction of the standard rate, which is meaningful in retrieval-heavy workloads where the system prompt and instructions are stable across calls. The catch is that any change to the prefix — including injected timestamps, per-user identifiers, or reordered sections — invalidates the cache. Teams that achieve high hit rates structure their requests so the stable, cacheable content comes first and the variable content comes last. Teams that don't see savings usually have not separated the two.

Governance saves cost in two ways. It lets you skip generation entirely for repeat questions by returning approved answers from a curated library, and it lets retrieval be surgical because the underlying corpus is deduplicated and tagged. Quality goes up rather than down because the library entries have been reviewed and approved, source citations are required, and stale entries are flagged before they reach a draft. The trade-off is the upfront effort of curating the library and the ongoing discipline of approval workflow; the trade-off pays back quickly on any team running more than a handful of RFPs per quarter.

Track four metrics together: cost per accepted answer (not per call), cache hit rate broken down by cache layer, human hours per RFP from intake to ship, and answer acceptance rate by reviewers. Cost per accepted answer captures rework. Cache hit rate identifies which layer to invest in next. Human hours are the largest dollar component on most teams. Acceptance rate ensures cost reductions are not masking quality regressions. Report these together monthly. A single global "cost per query" number hides the variance you need to manage.

Almost always, yes. Long context is priced by what you actually send, not by the size of the window the model supports. Putting more content into context only helps when the additional content is relevant; if it is filler, you are paying for tokens the model dilutes its attention across. The right move is rarely "use a longer context model" and almost always "retrieve better." Long context is a tool for genuinely long documents, multi-document synthesis, and conversation history — not a substitute for retrieval discipline.