Most prompts sent to a language model carry far more context than the task in front of it requires. Compression removes the surplus, but done carelessly it removes the answer along with the noise. This paper treats compression as a quality-constrained optimization rather than a race to the smallest prompt. It gives a taxonomy of methods, a comparison of their reductions and risks, an evaluation gate that decides what ships, an honest before-and-after measurement template, the failure modes that catch teams out, a tuning loop, a per-workload strategy, and the cases where the right amount of compression is none.
- 1.The objective is the smallest context that preserves quality, not the smallest context. The two are different targets and only one of them is safe.
- 2.Compression methods divide into lossless reshaping, which carries no quality risk, and lossy selection, which carries the real risk and the real savings.
- 3.Nothing promotes without passing a held-out evaluation set. The eval gate is what separates a measured reduction from a quiet regression.
- 4.Report compression honestly: tokens removed, cost saved, latency change, and the eval pass rate side by side, with an explicit promote or hold decision.
- 5.The dangerous failures are silent: the dropped passage, the summarized-away constraint, the over-pruned multi-hop question, the biased reranker.
- 6.Different workloads tolerate different compression. A FAQ assistant and a contract-review pipeline do not get the same profile.
- 7.Some context should not be compressed at all. Low volume, legally required verbatim text, and high-stakes single-shot decisions are cases where restraint is the correct setting.
Executive summary
A model is handed a large context and uses a small part of it. The retrieved passage that holds the answer is in there, surrounded by passages that are not. The one constraint that governs the response is in there, surrounded by instructions that the current question never touches. Compression is the practice of finding and keeping the part that carries the answer and removing the rest. The goal is not the smallest possible context. It is the smallest context that still holds the quality.
That distinction is the whole discipline. A compression that cuts cost but drops the passage the answer depended on has not saved anything. It has moved the loss from the invoice, where it was visible, to the answer, where it is not. Every reduction is a hypothesis about what the task does not need, and a hypothesis is only worth shipping once it has been tested against examples where the right answer is known.
This paper is the engineering counterpart to the cost argument. It assumes the savings are real and asks the harder question: how do you take them without paying in quality you cannot see. The answer is a measured loop. Baseline against a held-out evaluation set, propose compression profiles, score each one, promote only the profiles that hold quality, and watch for drift after they ship. Compression that runs through that loop is a managed reduction. Compression that skips it is deferred risk wearing the costume of a cost win.
I. The compression problem
Start with an honest description of what a model receives. A grounded answer of two hundred tokens rides on a context of several thousand: a system prompt, a few examples, a handful of retrieved passages, the conversation so far, and the schemas of every tool the agent might call. The task at hand, answering one question, needs a fraction of that. The surplus is paid for on every call and, more quietly, it dilutes the signal the model has to find.
Surplus is the default, not the exception
Context grows by accretion. Retrieval is tuned for recall during development and frozen. System prompts gain a sentence per incident and lose none. History replays in full because that was the simplest thing to build. None of these choices was wrong when made. Together they produce a prompt where the share that matters to the current question is often a minority of the tokens sent.
The smallest context is the wrong target
It is tempting to frame compression as minimization: send as few tokens as the model will tolerate. That target is wrong because it has no floor on quality. Pushed far enough it removes the passage the answer needed, the constraint that kept the response safe, or the example that taught the format. The correct target is constrained minimization. Find the smallest context for which a held-out evaluation set still passes at the baseline rate. The constraint is what makes the reduction safe.
Compression is a hypothesis, not an edit
Each thing you remove encodes a claim: the task does not need this. Some of those claims are obviously true, such as removing trailing whitespace. Some are plausible but testable, such as keeping the top four reranked passages instead of eight. Treating every removal as a hypothesis to be tested, rather than an edit to be made, is the shift in stance that this paper is built around.
II. A taxonomy of methods
Compression is not one technique. It is a family, and the members differ sharply in how much they save and how much they can hurt. The useful first cut is whether a method changes the information available to the model or only its packaging.
Conversation history summarization
Replace verbatim replay of every prior turn with a running summary that preserves the facts later turns need and drops the turns they do not. This stops history from growing without bound on long sessions. The risk is that a summary can quietly drop a detail a much later turn depended on, so the summary must be tested on multi-turn examples, not single-turn ones.
Retrieval pruning and reranking
Retrieve generously, then rerank the candidates by relevance to the current question and keep only those that carry signal, trimming each to the span that matters. This is usually the largest single source of savings because retrieved passages are usually the largest single share of context. The risk is dropping the one passage that held the answer, which a good reranker minimizes but never fully removes.
Instruction deduplication and normalization
The same guardrail often appears in the system prompt, a middleware wrapper, and a retrieval template, each added by a different person solving a different bug. Collapse the duplicates into one canonical instruction set and remove guidance that no longer applies. Done carefully this is close to lossless, because identical instructions carry no extra information.
Tool and function schema trimming
Agents are commonly handed the full tool catalog with complete JSON schemas on every reasoning step. Scope the tools to the ones the current step can actually use, and trim each schema to the fields the step needs. The risk is removing a tool the agent would have selected, so scoping must be driven by the plan, not by a static guess.
Semantic deduplication
Retrieved passages frequently overlap, restating the same fact from two near-identical sources. Detect passages that are semantically redundant and keep one representative. This preserves the information while removing the repetition. The risk is treating two passages as duplicates when they differ in a detail that mattered, so the similarity threshold has to be conservative.
Format and whitespace normalization
Collapse redundant whitespace, normalize markdown and JSON formatting, and strip boilerplate that carries no signal. This is the safest family because it changes packaging, not content. The savings are modest but the risk is close to zero, which makes it the right place to start.
The lossless versus lossy distinction
Normalization and exact deduplication are lossless: the information the model can act on is unchanged, only its encoding is smaller. Summarization, retrieval pruning, semantic deduplication, and schema trimming are lossy: they remove information on the bet that it was not needed. Lossless methods can ship on inspection. Lossy methods must pass the evaluation gate. Keeping the two categories separate is what lets a team move quickly on the safe reductions and carefully on the risky ones.
III. A methods comparison
The methods differ along four axes that decide how to treat each one: the mechanism, the reduction it typically yields, the primary risk it carries, and whether it can be reversed if it misfires. The table below is the field guide.
| Method | Mechanism | Typical reduction | Primary risk | Reversibility |
|---|---|---|---|---|
| History summarization | Running summary replaces verbatim turns | 30–60% | Drops a detail a later turn needs | Full (keep raw turns) |
| Retrieval pruning + rerank | Score relevance, keep top spans | 40–70% | Drops the answer-bearing passage | Full (re-retrieve) |
| Instruction dedupe | Collapse to one canonical set | 10–25% | Removes a still-needed guardrail | Full (versioned prompt) |
| Schema trimming | Scope tools to the current step | 20–50% | Hides a tool the agent would pick | Full (re-expand) |
| Semantic dedupe | Keep one of near-duplicate passages | 10–30% | Merges passages that differed | Full (re-retrieve) |
| Format normalization | Collapse whitespace and boilerplate | 5–15% | Minimal | Full (lossless) |
Reversibility is the column teams undervalue. Because every method here can be reversed by configuration rather than code, a compression profile is a setting that can be rolled back the moment a metric moves, not a rewrite that has to be undone by hand. That property is what makes aggressive experimentation safe.
IV. The evaluation gate
Compression is only as trustworthy as the gate it has to pass. Without an evaluation set, a reduction is a guess that looks like a saving until a user finds the regression. The gate turns compression from a hope into a measurement.
Held-out evaluation sets
Build a set of representative examples for each workload where the correct outcome is known: the right answer, the right verdict, the right citation, or the right refusal. Hold it out from any tuning so it cannot be fitted to. A candidate compression profile passes only if the workload scores at or above its uncompressed baseline on this set. The set is the contract: compression may change anything except the score.
Metrics that protect different things
A single accuracy number hides the failures that matter most. A compression that drops a citation can leave overall accuracy flat while quietly breaking groundedness. The gate therefore tracks several metrics, each guarding a distinct property.
| Metric | What it measures | What it protects against | Target |
|---|---|---|---|
| Task success / verdict accuracy | Correct final answer or decision | Pruning away the answer-bearing context | >= baseline |
| Citation / groundedness | Claims trace to provided sources | Keeping the answer but losing its support | >= baseline |
| Refusal correctness | Refuses when it should, answers when it should | Summarizing away a safety constraint | >= baseline |
| Format adherence | Output matches required structure | Trimming the example that taught the format | >= baseline |
The gate is conjunctive. A profile must hold every metric, not the average. A change that lifts task success while dropping refusal correctness has not made the system better. It has traded a visible win for an invisible liability, which is exactly the trade the gate exists to refuse.
V. Measuring compression honestly
An honest compression report puts the savings and the quality on the same page and lets the quality decide. Three numbers describe the saving: input tokens removed, cost saved, latency change. One number describes the cost of the saving: the eval pass rate. The decision follows from reading them together.
| Metric | Baseline | Candidate | Change |
|---|---|---|---|
| Input tokens / request | 8,410 | 3,050 | -64% |
| Cost / request | $0.0226 | $0.0093 | -59% |
| p95 latency | 1,980 ms | 1,420 ms | -28% |
| Eval pass rate (held-out) | 96.0% | 96.1% | +0.1 pts |
| Citation accuracy | 94.0% | 94.0% | 0.0 pts |
| Refusal correctness | 99.0% | 99.0% | 0.0 pts |
| Decision | n/a | n/a | Promote |
The decision row is the point
The table earns its keep on the last row. This candidate promotes because it removed nearly two thirds of the input while every quality metric held within noise. A second candidate that cut tokens by eighty percent but moved citation accuracy down two points would hold, not promote, no matter how attractive the cost line looked. The decision is a function of the quality metrics first and the savings second, never the other way around.
Report holds, not just promotions
A program that only publishes its wins is not measuring honestly. The profiles that failed the gate are evidence too: they mark where the workload's real floor is and stop the team from re-proposing the same over-compression next quarter. Keeping the holds on record is what makes the next round of tuning faster and the whole program credible.
VI. Failure modes and guardrails
The failures that matter are the ones that do not announce themselves. A crash gets fixed. A compression that returns a confident, well-formatted, wrong answer ships and survives, because nothing in the response signals that the context it needed was missing.
Dropping the passage that held the answer: aggressive pruning removes the one chunk the response depended on, and the model answers from prior knowledge or hallucinates. Summarizing away a hard constraint: a history or instruction summary smooths over a 'must not' that governed the task, and the model stops honoring it. Over-pruning multi-hop questions: a question that needs three passages to chain an answer is given one, and the chain silently breaks. Reranker bias: the reranker systematically favors one document style, source, or recency, and quietly starves a whole class of questions of the context they need. None of these throws an error. Only the evaluation set catches them.
Guardrails that catch each one
Each silent failure has a specific defense. The dropped passage is caught by citation and groundedness metrics on a set that includes hard-to-retrieve answers. The lost constraint is caught by refusal-correctness examples that depend on the constraint being present. Broken multi-hop is caught by including genuinely multi-hop questions in the eval set rather than only single-fact lookups. Reranker bias is caught by stratifying the eval set across source types and recency and watching for a metric that drops on one stratum while the average holds.
Stratify, do not average
The connective tissue across all four is the same warning the metrics table made: averages hide the harm. A profile that holds the mean while collapsing on a minority of questions has a bias problem that an aggregate number will never surface. The eval set has to be stratified along the axes a failure would travel, and each stratum has to clear the gate on its own.
VII. The tuning loop
Compression is not a one-time edit. Documents change, usage shifts, and a profile that was safe last quarter can drift out of bounds. The work is a loop, and the loop is what keeps the reduction safe over time.
- 1Baseline. Run the workload uncompressed against the held-out evaluation set and record the pass rate for every metric, plus tokens, cost, and latency. This is the floor every candidate is measured against.
- 2Generate candidate compression profiles. Combine methods into named profiles at a few intensities, for example a conservative profile that only normalizes and dedupes, and a moderate profile that adds retrieval pruning and history summarization.
- 3Score against the eval set. Run each profile through the conjunctive gate. Record both the savings and every quality metric, including the strata, so a profile that fails on one class of question is caught.
- 4Promote winners. Promote only profiles that hold every metric at or above baseline. Keep the holds on record with the reason they failed, so the floor is documented.
- 5Monitor for drift. Re-score promoted profiles on a schedule and on triggers such as a corpus update or a retrieval change. When a metric drifts below the gate, demote the profile automatically and fall back to a safer one.
The loop has no end state, and that is by design. A promoted profile is a current best, not a final answer, and the monitoring step is what turns compression from a project that finishes into a property the system maintains.
VIII. Per-workload strategy
There is no single correct compression profile, because workloads differ in how much surplus they carry and how badly a wrong removal hurts. A factual FAQ tolerates aggressive pruning. A contract-review pipeline does not. Matching the profile to the workload is most of the skill.
| Workload type | Safe methods | Caution |
|---|---|---|
| FAQ / factual lookup | Pruning, dedupe, normalization | Low risk, compress freely behind the gate |
| Multi-turn assistant | History summarization, normalization | Test on long sessions, not single turns |
| RAG over large corpus | Rerank, prune, semantic dedupe | Stratify eval by source and recency |
| Multi-hop research | Light pruning, normalization | Keep enough passages to chain; do not over-prune |
| Tool-using agent | Schema trimming, history summary | Scope tools from the plan, not a static guess |
| Contract / spec review | Normalization, exact dedupe only | Lossy selection risks dropping a binding clause |
Read the table as a starting posture, not a verdict. The eval gate still decides what ships for each workload. What the table provides is the prior: where to begin, and where to keep a hand on the brake. A profile that is safe for the FAQ assistant is not authorized for the contract reviewer until it has cleared the contract reviewer's own evaluation set.
IX. When not to compress
The discipline of compression includes knowing when the right setting is off. Three cases recur where the cost of a wrong removal outweighs any saving, and where restraint is the correct engineering choice rather than a missed opportunity.
Low volume
Compression has a fixed cost: building the eval set, tuning the profiles, and monitoring for drift. For a workflow that runs a few hundred times a month, that cost will never be repaid by the token savings. Spend the effort where the volume is, and leave the long tail uncompressed.
Legally required verbatim context
Some context must reach the model exactly as written: regulated disclosures, contractual language, statutory text, or evidence that has to be reproduced word for word. Summarizing or pruning it does not just risk quality. It can break a compliance requirement. These passages are marked as verbatim and excluded from every lossy method by policy.
High-stakes single-shot decisions
When a single call drives a consequential, hard-to-reverse decision and there is no second turn to recover a missing detail, the value of the surplus context is asymmetric. The few cents saved by trimming it are trivial next to the cost of the one decision that the trimmed passage would have changed. Here the correct profile is conservative or none, and the gate is set tighter than for routine traffic.
Restraint is a setting, not a failure
Declining to compress a workload is not the program falling short. It is the program working. A compression discipline that can articulate where it does not apply is more trustworthy than one that applies everywhere, because it shows the team is optimizing for quality first and savings second, which is the only ordering that holds up.
X. Conclusion
Compression is the difference between a prompt that carries the task and a prompt that carries the task plus everything no one trimmed. Taken carelessly it trades a visible cost for an invisible one. Taken as a measured loop, baseline, propose, score, promote, monitor, it removes the surplus and keeps the quality, and it can prove that it did. The methods are well understood. What separates a reduction that holds from one that quietly regresses is the gate in front of it and the honesty of the report behind it. Build those, treat every removal as a hypothesis, and compression becomes what it should be: the smallest context that still holds the answer, and not one token smaller.
References
- [1]Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020.
- [2]Jiang et al., LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, EMNLP 2023.
- [3]Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020.
- [4]Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023.
- [5]Ryshe, The Quanta Spec-Review Benchmark v1: Methodology and Scoring.
