Enterprises pay for every token they send to a model, and most send far more than the task requires. System prompts, retrieved passages, conversation history, and tool schemas accumulate into context that is rarely audited and almost never optimized. This paper quantifies where context bloat comes from, models how it compounds as adoption scales, and shows what a gateway-level approach recovers in cost, latency, and control. It includes a worked monthly cost example, a measurement framework, and an implementation playbook.
- 1. Input tokens, not output, dominate cost in retrieval and agent workloads.
- 2. Context bloat is the sum of reasonable defaults that no one revisits after launch.
- 3. Cost compounds as usage scales and as agents fan a single request into many calls.
- 4. You cannot manage at the application layer a cost that is generated at the request layer.
- 5. A gateway makes compression consistent, measurable, and reversible, and gates it on evaluation so quality is never traded blindly.
- 6. The most valuable thing a gateway recovers for a regulated enterprise is a precise record of what was sent.
Executive summary
The completion a user reads is small. The context an enterprise sends to produce it is large, growing, and almost never measured. In retrieval-augmented and agentic workloads, the prompt is usually the dominant and faster-growing share of spend, yet it is the part no dashboard shows and no owner controls.
Context bloat is not one mistake. It is the sum of reasonable engineering defaults (generous retrieval, replayed history, verbose tool schemas, and prompts that only ever grow) that no one returns to trim once a feature ships. Multiplied by call volume and by the fan-out of agentic workflows, those defaults bend the cost curve the wrong way at exactly the moment leadership starts watching the AI line item.
This paper makes the economics concrete with a worked example, identifies the five places bloat accumulates, and argues that the durable fix lives at the gateway, where all context passes and where compression can be applied consistently, measured honestly, and reversed safely. The recovered asset is not only money. It is an auditable record of exactly what was sent to which model, and why.
I. The unit economics of context
Hosted language models bill by the token, and they bill for input as well as output. For conversational features the two are roughly balanced. For the workloads enterprises actually deploy at scale (retrieval-augmented generation and tool-using agents), the balance tips hard toward input. Every retrieved passage, every line of replayed history, and every tool definition is paid for on every call, whether or not the model needed it.
Input is the larger meter
Most teams reason about cost as though the answer is the expensive part. It rarely is. A grounded answer of two hundred tokens can ride on top of six thousand tokens of context. Because providers often price input below output per token, teams assume input is cheap and stop thinking about it. At volume, the cheaper per-token rate on a far larger token count is still the majority of the bill.
| Component | Tokens | Rate / 1K | Cost |
|---|---|---|---|
| System prompt + instructions | 900 | $0.0025 | $0.0023 |
| Few-shot examples | 1,200 | $0.0025 | $0.0030 |
| Retrieved passages (top-8) | 3,400 | $0.0025 | $0.0085 |
| Conversation history | 1,600 | $0.0025 | $0.0040 |
| Tool / function schemas | 1,100 | $0.0025 | $0.0028 |
| User message | 120 | $0.0025 | $0.0003 |
| Completion (output) | 210 | $0.0100 | $0.0021 |
| Total | 8,530 | n/a | $0.0230 |
In this example the user message and the completion together account for roughly a tenth of the cost (about $0.0024 of $0.0230). The other ninety percent or so is context the application assembled and the team never sees itemized on an invoice.
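The arithmetic behind the table can be reproduced in a few lines. The following is a minimal sketch in Python; the token counts and rates are copied from the table, while the variable and function names are illustrative assumptions rather than part of any real billing API.

```python
# Rates and token counts are taken from the per-call table above.
INPUT_RATE_PER_1K = 0.0025
OUTPUT_RATE_PER_1K = 0.0100

input_tokens = {
    "system_prompt": 900,
    "few_shot": 1200,
    "retrieved_passages": 3400,
    "history": 1600,
    "tool_schemas": 1100,
    "user_message": 120,
}
output_tokens = 210


def cost(tokens: int, rate_per_1k: float) -> float:
    """Dollar cost of a token count at a per-1K-token rate."""
    return tokens / 1000 * rate_per_1k


input_cost = sum(cost(t, INPUT_RATE_PER_1K) for t in input_tokens.values())
output_cost = cost(output_tokens, OUTPUT_RATE_PER_1K)
total_cost = input_cost + output_cost

# Share of the bill attributable to what the user typed and what the user reads.
visible_cost = cost(input_tokens["user_message"], INPUT_RATE_PER_1K) + output_cost
visible_share = visible_cost / total_cost

print(f"total per call: ${total_cost:.4f}")
print(f"user message + completion share: {visible_share:.1%}")
```

The unrounded total lands just under the table's figure, because the table adds up rows that were rounded individually.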
II. A worked example: one assistant, one month
Consider a single internal assistant handling forty thousand requests per month. The per-call breakdown above puts its run-rate near nine hundred dollars per month. That is unremarkable in isolation. The problem is that the same pattern repeats across every AI feature an enterprise ships, and each one carries its own quietly growing context.
| Context component | Tokens (before) | Tokens (after) | Monthly cost (before) | Monthly cost (after) |
|---|---|---|---|---|
| System + instructions | 900 | 420 | $90 | $42 |
| Few-shot examples | 1,200 | 300 | $120 | $30 |
| Retrieved passages | 3,400 | 1,200 | $340 | $120 |
| Conversation history | 1,600 | 450 | $160 | $45 |
| Tool schemas | 1,100 | 350 | $110 | $35 |
| Output (unchanged) | 210 | 210 | $84 | $84 |
| Total | 8,410 | 2,930 | $904 | $356 |
Compression here removes roughly two-thirds of the context tokens (8,200 down to 2,720) and about sixty percent of cost, with no change to the model and, when gated by evaluation, no measurable change to answer quality. Across a portfolio of ten such features, the same discipline turns a six-figure annual surprise into a managed budget line.
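The monthly figures follow directly from the per-call rates and the request volume. As a rough sketch, assuming the same rates as the per-call table and forty thousand requests per month (names below are illustrative only):

```python
REQUESTS_PER_MONTH = 40_000
INPUT_RATE_PER_1K = 0.0025
OUTPUT_RATE_PER_1K = 0.0100
OUTPUT_TOKENS = 210  # unchanged by compression

# (tokens before, tokens after) per context component, from the table above.
context = {
    "system": (900, 420),
    "few_shot": (1200, 300),
    "retrieval": (3400, 1200),
    "history": (1600, 450),
    "tool_schemas": (1100, 350),
}


def monthly_cost(context_tokens: int, output_tokens: int) -> float:
    """Monthly dollar cost for one workflow at the assumed volume and rates."""
    per_call = (context_tokens / 1000 * INPUT_RATE_PER_1K
                + output_tokens / 1000 * OUTPUT_RATE_PER_1K)
    return per_call * REQUESTS_PER_MONTH


context_before = sum(before for before, _ in context.values())
context_after = sum(after for _, after in context.values())

cost_before = monthly_cost(context_before, OUTPUT_TOKENS)
cost_after = monthly_cost(context_after, OUTPUT_TOKENS)

context_reduction = 1 - context_after / context_before
cost_reduction = 1 - cost_after / cost_before

print(f"context tokens: {context_before} -> {context_after} ({context_reduction:.0%} less)")
print(f"monthly cost: ${cost_before:.0f} -> ${cost_after:.0f} ({cost_reduction:.0%} less)")
```

Cost falls by less than context does because the output tokens, billed at the higher rate, are untouched.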
III. Where the bloat accumulates
Bloat is rarely a single bad decision. It is an accumulation of defaults that were sensible when written and never revisited once the feature worked.
System prompts that only grow
Every production incident adds a sentence to the system prompt. None are ever removed. Over a quarter the prompt doubles, and a large share of it addresses edge cases that occur in a fraction of a percent of calls but is paid for on every call.
Retrieval that over-returns
Top-k is set generously during development to make demos work, then frozen. The model is handed eight passages when three would answer the question, and each passage is longer than it needs to be because chunking favored recall over precision.
Full history replay
Multi-turn features replay the entire conversation on every turn rather than summarizing it. By the tenth turn the history alone can exceed the rest of the prompt combined.
Verbose tool and function schemas
Agents are handed the full catalog of tools on every call, with complete JSON schemas, even when the step at hand can only use one of them. The schema tax is paid on every reasoning step of every agent run.
Duplicated instructions
The same guardrails appear in the system prompt, in a middleware wrapper, and in the retrieval template. Each was added by a different person solving a different bug, and together they triple the instruction budget.
IV. Why it compounds
Per-call bloat is multiplied by call volume, and both rise together as AI moves from a pilot to a platform. The pilot serves one team and a few thousand calls a month. The platform serves the organization, and the same oversized context now rides on every one of millions of calls.
Agentic fan-out
Agents change the arithmetic. A single user request no longer maps to one model call. It maps to a plan, a sequence of tool selections, several retrievals, and a synthesis, each its own call carrying the same context. A workflow that looked like one call in design becomes a dozen in production.
| Workload | Calls per action | Context reuse |
|---|---|---|
| Single-shot classification | 1 | Low |
| RAG question answering | 1–2 | Medium |
| Multi-turn assistant | 1 per turn | High and growing |
| Tool-using agent | 5–20 | High, repeated |
| Multi-agent workflow | 20–100+ | Very high, repeated |
The result is a cost curve that bends the wrong way precisely as leadership starts paying attention to the AI line item, and it bends fastest for the agentic workloads everyone is racing to deploy.
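To make the fan-out concrete, the sketch below multiplies an assumed per-call context by the calls-per-action ranges from the table. It assumes every call in a workflow carries the same 8,200-token context at the input rate used earlier in this paper; real agents vary call by call, so treat the output as an order-of-magnitude illustration. The multi-turn assistant is left out because its call count depends on conversation length.

```python
INPUT_RATE_PER_1K = 0.0025
CONTEXT_TOKENS = 8_200  # assumption: context size from the worked example

# (low, high) calls per user action, from the table above.
# The multi-agent upper bound is open-ended in the table; 100 is used here.
calls_per_action = {
    "single_shot_classification": (1, 1),
    "rag_question_answering": (1, 2),
    "tool_using_agent": (5, 20),
    "multi_agent_workflow": (20, 100),
}


def input_cost_per_action(calls: int, context_tokens: int = CONTEXT_TOKENS) -> float:
    """Input cost of one user action when every call resends the same context."""
    return calls * context_tokens / 1000 * INPUT_RATE_PER_1K


for name, (low, high) in calls_per_action.items():
    print(f"{name}: ${input_cost_per_action(low):.4f} to "
          f"${input_cost_per_action(high):.4f} input cost per action")
```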
V. The measurement gap
The reason bloat persists is that almost no one measures it. Provider dashboards report aggregate tokens and aggregate spend. They do not attribute cost to a workflow, decompose a prompt into its parts, or tell you which part grew. The most consequential and fastest-growing cost in the enterprise's AI spend is also the least instrumented.
Teams rarely have answers to the questions that would let them act:
- How many tokens does each workflow send per request, and how has that changed month over month?
- What share of each prompt is system instruction, retrieval, history, and tool schema?
- Which workflow, team, or customer is responsible for which share of total spend?
- What would the same workload cost with the context it actually needs, rather than the context it currently sends?
You cannot reduce on purpose what you do not measure. The first move is not compression. It is attribution: decomposing every request into its components at the point it is made.
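Attribution at the point of request could look something like the sketch below. It is illustrative only: `RequestParts`, `decompose`, and `attribute` are hypothetical names, and `estimate_tokens` is a crude whitespace count standing in for whatever tokenizer the target model actually uses.

```python
from dataclasses import dataclass, fields


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


@dataclass
class RequestParts:
    """The components of one prompt, kept separate until the request is sent."""
    system: str = ""
    retrieval: str = ""
    history: str = ""
    tool_schemas: str = ""
    user_message: str = ""


def decompose(parts: RequestParts) -> dict:
    """Return per-component token counts and each component's share of the prompt."""
    counts = {f.name: estimate_tokens(getattr(parts, f.name)) for f in fields(parts)}
    total = sum(counts.values())
    shares = {name: (n / total if total else 0.0) for name, n in counts.items()}
    return {"tokens": counts, "shares": shares}


def attribute(workflow: str, team: str, parts: RequestParts) -> dict:
    """Build one attribution record: who sent the request and what it was made of."""
    return {"workflow": workflow, "team": team, **decompose(parts)}
```

Logged on every request, records like these answer the questions above: tokens per workflow over time, the composition of each prompt, and which team owns which share.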
VI. What teams try, and why it falls short
Faced with a rising bill, teams reach for the tactics nearest to hand. Each helps a little, and none holds.
- Hand-trimming prompts. A one-time saving that drifts back as new edge cases are patched in, with no measurement of the quality impact.
- Switching to a cheaper model. Useful, but it treats the symptom and often trades away the quality the workload was chosen for.
- Bolting on a cache. Valuable where inputs repeat exactly, which in open-ended enterprise workloads is the minority of traffic.
- Cutting top-k blindly. Saves tokens until it quietly drops the passage that held the answer, and no one notices until a user does.
The common flaw is structural. These fixes are applied per application, by whoever has time, with no shared measurement and no record of what changed. They cannot compound because nothing remembers them.
You cannot manage at the application layer a cost that is generated at the request layer.
VII. The gateway approach
Context is best optimized where all of it passes, at a gateway between applications and models. Applied centrally, compression becomes consistent across every workflow, measurable on every request, and reversible by configuration rather than by code change.
Summarize history
Replace verbatim replay with a running summary that preserves the facts the next turn needs and drops the turns it does not. History stops growing without bound.
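One way to sketch this: keep the most recent turns verbatim and fold everything older into a single summary entry. In the example below the summarizer is a pluggable function; `placeholder_summarizer` is a deliberately naive stand-in (it keeps the first few words of each turn) where a real deployment would presumably call a model. All names are hypothetical.

```python
from typing import Callable

Turn = tuple[str, str]  # (role, text)


def placeholder_summarizer(turns: list[Turn]) -> str:
    """Naive stand-in for a model-written summary: first 8 words of each turn."""
    return " | ".join(f"{role}: {' '.join(text.split()[:8])}" for role, text in turns)


def compact_history(
    turns: list[Turn],
    keep_recent: int = 2,
    summarize: Callable[[list[Turn]], str] = placeholder_summarizer,
) -> list[Turn]:
    """Replace all but the last `keep_recent` turns with one summary turn."""
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(recent)
    summary = ("system", "Summary of earlier conversation: " + summarize(older))
    return [summary] + list(recent)
```

With this shape the history sent per turn is bounded by the summary plus `keep_recent` turns, rather than growing with the length of the conversation.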
Prune and rerank retrieval
Rerank retrieved passages and keep only those that carry signal for the current question, trimming each to the span that matters. Recall is preserved where it counts and paid for nowhere else.
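A simplified sketch of the pruning step follows. It assumes each passage already carries a relevance score from some reranker (not shown), and it uses a whitespace word count as a rough token estimate. The thresholds are placeholder values, and trimming each passage to its relevant span is omitted.

```python
def prune_passages(
    scored: list[tuple[float, str]],
    min_score: float = 0.5,
    max_passages: int = 3,
    token_budget: int = 1200,
) -> list[str]:
    """Keep the highest-scoring passages that clear a score floor and fit a token budget."""
    kept: list[str] = []
    used = 0
    for score, text in sorted(scored, key=lambda p: p[0], reverse=True):
        if score < min_score or len(kept) >= max_passages:
            break  # everything after this point scores lower or exceeds the cap
        tokens = len(text.split())  # rough estimate, not a real tokenizer
        if used + tokens > token_budget:
            continue  # too large to fit; a shorter passage may still fit
        kept.append(text)
        used += tokens
    return kept
```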
Normalize and dedupe instructions
Collapse duplicated guardrails into a single canonical instruction set, and remove the sediment of past incidents that no longer applies.
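As a rough illustration, near-verbatim duplicates can be collapsed by normalizing each instruction and keeping its first occurrence. The sketch below only catches copies that differ in case, spacing, or a trailing period; paraphrased duplicates would need semantic comparison, which is not shown. Function names are hypothetical.

```python
import re


def normalize(line: str) -> str:
    """Canonical form for comparison: collapsed whitespace, no trailing period, lowercase."""
    return re.sub(r"\s+", " ", line).strip().rstrip(".").lower()


def dedupe_instructions(*sources: list[str]) -> list[str]:
    """Merge instruction lists in order, keeping the first occurrence of each instruction."""
    seen: set[str] = set()
    merged: list[str] = []
    for source in sources:
        for line in source:
            key = normalize(line)
            if key and key not in seen:
                seen.add(key)
                merged.append(line.strip())
    return merged
```

Passing the system prompt, the middleware wrapper, and the retrieval template as the three sources yields one canonical list in which each guardrail appears once.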
Trim tool schemas to the task
Hand an agent only the tools the current step can use, with schemas scoped to what it needs, rather than the entire catalog on every reasoning step.
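In its simplest form, scoping tools to a step is a filter over the catalog. The sketch below assumes each tool is a dictionary with a `name` key and that something upstream (a planner or a static per-step mapping, not shown) supplies the set of tool names a step may use. Both are assumptions for illustration.

```python
def tools_for_step(catalog: list[dict], allowed: set[str]) -> list[dict]:
    """Return only the tool definitions the current step is allowed to call."""
    return [tool for tool in catalog if tool["name"] in allowed]


# Hypothetical catalog; the schema layout is illustrative.
catalog = [
    {"name": "search_tickets", "description": "Search support tickets", "parameters": {"query": "string"}},
    {"name": "create_ticket", "description": "Open a new ticket", "parameters": {"title": "string"}},
    {"name": "send_email", "description": "Send an email", "parameters": {"to": "string", "body": "string"}},
]

lookup_step_tools = tools_for_step(catalog, {"search_tickets"})
print([tool["name"] for tool in lookup_step_tools])
```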
Every reduction is a hypothesis about what the task does not need. It must be tested against a held-out evaluation set before it ships, and promoted only when quality is preserved. Compression without an evaluation gate is not savings. It is deferred risk.
VIII. Measuring compression honestly
A credible compression program reports three things on every change: how many input tokens were removed, what that saved, and whether quality held. The third is the one most programs skip and the one that makes the rest trustworthy.
| Metric | Baseline | Candidate | Change |
|---|---|---|---|
| Tokens / request (input + output) | 8,410 | 2,930 | −65% |
| Cost / request | $0.0226 | $0.0089 | −61% |
| p95 latency | 1,980 ms | 1,410 ms | −29% |
| Eval pass rate (held-out) | 96.0% | 96.2% | +0.2 pts |
| Decision | n/a | n/a | Promote |
A change that cut cost but moved the eval pass rate down would not promote. Cost control is treated as a quality-constrained optimization, not a race to the smallest prompt.
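The promotion rule can be expressed as a small check. This is a minimal sketch under stated assumptions: the `Measurement` fields mirror the rows of the table, and `max_quality_drop` is a hypothetical tolerance that defaults to zero, meaning any drop in the eval pass rate blocks promotion.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    tokens_per_request: int
    cost_per_request: float
    p95_latency_ms: float
    eval_pass_rate: float  # fraction of held-out cases passed, 0.0 to 1.0


def decide(baseline: Measurement, candidate: Measurement, max_quality_drop: float = 0.0) -> str:
    """Promote only if the candidate is cheaper and quality held within tolerance."""
    quality_held = candidate.eval_pass_rate >= baseline.eval_pass_rate - max_quality_drop
    cheaper = candidate.cost_per_request < baseline.cost_per_request
    return "Promote" if quality_held and cheaper else "Reject"


baseline = Measurement(8410, 0.0226, 1980, 0.960)
candidate = Measurement(2930, 0.0089, 1410, 0.962)
print(decide(baseline, candidate))
```

Quality acts as a constraint and cost as the objective, so a cheaper candidate with a lower pass rate is rejected no matter how large the saving.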
IX. An implementation playbook
The path from a rising bill to a managed one is sequential, and each step de-risks the next.
- 1. Observe only. Route traffic through the gateway and change nothing. Establish a per-workflow baseline of tokens, cost, latency, and prompt composition.
- 2. Attribute. Tag every request to a workflow, team, and customer. Decompose each prompt into system, retrieval, history, and schema shares.
- 3. Find the contributors. Rank workflows by spend and by context growth. The largest few almost always account for most of the opportunity.
- 4. Compress behind a gate. For the top contributors, apply summarization, retrieval pruning, instruction dedupe, and schema trimming as candidate configurations, scored against a held-out evaluation set.
- 5. Promote and monitor. Promote only changes that preserve quality. Keep the before-and-after on record, and watch for drift as documents and usage change.
None of these steps requires rewriting an application. Adoption is a configuration change, pointing a base URL at the gateway, and value arrives before any code is touched.
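The third step of the playbook, ranking workflows by spend, might be sketched as follows. The record format (a `workflow` tag and a `cost` per request) is an assumption about what the observe and attribute steps would log, and the sample data is invented purely to show the shape of the output.

```python
from collections import defaultdict


def rank_by_spend(records: list[dict]) -> list[tuple[str, float, float]]:
    """Return (workflow, total cost, share of total spend), largest spender first."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["workflow"]] += record["cost"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, cost, cost / grand_total if grand_total else 0.0)
        for name, cost in ranked
    ]


# Invented sample of per-request records, for illustration only.
records = [
    {"workflow": "support_agent", "cost": 0.41},
    {"workflow": "doc_search", "cost": 0.02},
    {"workflow": "support_agent", "cost": 0.35},
    {"workflow": "classifier", "cost": 0.01},
]

for name, cost, share in rank_by_spend(records):
    print(f"{name}: ${cost:.2f} ({share:.0%} of spend)")
```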
X. What it recovers
A context gateway recovers three things at once: cost, latency, and control.
For a regulated enterprise the third is often the most valuable. The ability to show, after the fact, exactly what context reached which model on which request turns AI from an unexplained expense into a governed system. Cost control becomes a property of the architecture rather than a periodic clean-up.
XI. Conclusion
Context bloat is invisible by construction. It hides in the part of the prompt no one reads, grows through defaults no one revisits, and compounds through the agentic workloads everyone is deploying. The fix is not heroics in each application. It is a single place where all context passes, where it can be measured, compressed, and proven, on every request. The enterprises that put that layer in early will spend less and, more importantly, will be able to say exactly what their AI did.
