Most enterprises now have a token dashboard. Very few have AI FinOps. A dashboard reports what was spent after it was spent, in an aggregate that no team can act on. AI FinOps is a discipline: it attributes spend to the workflows and customers that caused it, forecasts where the curve is heading, applies reduction levers behind a quality gate, and assigns ownership across Finance, Engineering, Governance, and Product. This paper adapts the FinOps inform, optimize, and operate cycle to large language model spend, defines cost per successful outcome as the metric that matters, and gives a 90-day rollout. It is written for the CIO and CFO who have seen the AI line item and now need a way to govern it.
1. A token dashboard reports cost; it does not attribute or control it, and reporting is the easy part.
2. AI FinOps adapts the FinOps inform, optimize, and operate cycle to per-request LLM spend.
3. Attribution replaces an aggregate invoice with an itemized ledger tied to workflow, team, and customer.
4. Spend forecasts well as calls times context size times rate, which makes growth scenarios testable before they arrive.
5. Reduction levers (compression, routing, caching, and policy) only count when gated on evaluation and measured in cost per successful task.
6. Unit economics, meaning cost per successful outcome, is the metric that ties AI spend to business value and the only one a CFO can defend.
7. AI spend needs a named owner and a RACI, or it stays everyone's concern and no one's responsibility.
Executive summary
A spend chart is not a control. It explains the past, it does not change it, and it almost never tells you which workflow, team, or customer caused the number. Enterprises have responded to rising AI bills by buying visibility, then discovering that visibility alone moves nothing. The chart goes up, everyone agrees it should go down, and no one owns the lever that would bend it.
AI FinOps is the practice that turns a dashboard into a decision. It borrows the FinOps cycle the industry already uses for cloud (inform, optimize, and operate) and applies it to the one cost center that grows per request rather than per server: tokens sent to and received from a model. The unit of account is not the token. It is the successful business outcome the tokens were spent to produce.
This paper draws a hard line between observation and control. It shows how to replace an aggregate invoice with an itemized ledger, how to forecast spend as a product of calls, context size, and rate, and how to manage AI cost as a quality-constrained optimization rather than a race to the cheapest prompt. The result is an operating model in which AI spend has an owner, a forecast, a set of levers, and a definition of success that Finance and Engineering can both sign.
I. The dashboard trap
The first thing an enterprise builds when the AI bill arrives is a dashboard. It is the right instinct and the wrong stopping point. A dashboard answers one question, how much did we spend, and it answers it well. It does not answer the three questions that would let anyone act: who caused it, where is it heading, and what do we do about it.
Spend charts explain the past
A monthly spend chart is a record of decisions already made and tokens already billed. By the time a number appears on it, the workflows that produced the number have run, the context they sent is gone, and the only thing left is the invoice. You can study the chart as long as you like. It describes a world you can no longer change.
Aggregation hides the cause
Provider and cloud billing report spend in aggregate: total tokens, total dollars, perhaps a split by model. They do not tell you that one customer's nightly batch job is forty percent of the bill, or that a single agent workflow fanned a thousand user actions into thirty thousand model calls. The cost that matters is always a particular workflow serving a particular user, and aggregation is designed to erase exactly that detail.
Visibility is not control
The gap between seeing a cost and changing it is the whole problem. A dashboard is a thermometer. AI FinOps is the thermostat. The rest of this paper is about the difference: attribution that names the cause, forecasting that anticipates the curve, levers that bend it, and an owner who is accountable for the result.
II. What AI FinOps actually is
FinOps is a discipline the industry built for cloud, where elastic resources made spend a real-time engineering decision rather than a procurement event. Its cycle is three phases that repeat: inform, optimize, and operate. AI spend has the same shape as cloud spend: it is variable, demand-driven, and generated by engineering choices, so the same cycle applies. What changes is the unit being managed.
Inform
Make spend visible and attributable. For AI this means decomposing every request into its components (system prompt, retrieval, history, tool schemas, and output) and tagging it to a workflow, team, and customer. Inform is where a dashboard stops and AI FinOps begins, because inform demands attribution, not just a total.
Optimize
Reduce cost without trading away the outcome. The levers are compression, model routing, caching, and policy, covered in Section V. The constraint that separates optimization from cost-cutting is quality: every reduction is a hypothesis tested against an evaluation set before it ships.
Operate
Run the practice continuously. Set budgets and forecasts, assign ownership, define KPIs and SLOs for cost, and review them on a cadence. Operate is what keeps the savings from drifting back, because nothing about AI spend stays fixed once usage and documents change.
Cloud FinOps manages cost per unit of work served. AI FinOps manages cost per successful outcome: the answer a user accepted, the ticket resolved, the document correctly extracted. Token count is an input to that number, not the number itself.
III. Attribution: from invoice to ledger
Attribution is the foundation, because every other capability depends on it. You cannot forecast a workflow you cannot isolate, cannot apply a lever to a cost you cannot name, and cannot assign ownership for a number no one can decompose. The move is to replace one aggregate invoice with a ledger that itemizes spend along the dimensions a business actually manages: workflow, team, customer, and request.
Tag at the point of the call
Attribution has to happen where the request is made, because that is the only place the context still exists. A gateway between applications and models sees every call with its full composition and its calling identity, and can stamp each one with the workflow that issued it, the team that owns that workflow, and the customer it served. Stamped at the source, the same data rolls up into any view Finance or Engineering needs.
| Workflow | Team | Calls | Avg input tokens | Share of spend | Monthly cost |
|---|---|---|---|---|---|
| Support copilot | CX | 1,240,000 | 6,800 | 38% | $41,800 |
| Contract extraction | Legal Ops | 210,000 | 12,400 | 22% | $24,200 |
| Sales research agent | Revenue | 96,000 | 9,100 | 16% | $17,600 |
| Internal knowledge bot | IT | 880,000 | 3,200 | 14% | $15,400 |
| Marketing drafting | Brand | 140,000 | 2,400 | 6% | $6,600 |
| Everything else | Various | 320,000 | 1,900 | 4% | $4,400 |
| Total | n/a | 2,886,000 | n/a | 100% | $110,000 |
The aggregate version of this table is a single cell: one hundred ten thousand dollars. The ledger version is a set of decisions. The support copilot is the obvious target for compression, contract extraction carries the largest average input per call and is worth a routing review, and marketing drafting is too small to spend an engineer's week on. None of that is visible on an invoice.
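The roll-up from tagged calls to ledger rows can be sketched in a few lines. This is a minimal illustration, not a gateway implementation: the `CallRecord` fields, the `build_ledger` helper, and the three sample records are all hypothetical, and a real ledger would read from gateway logs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One model call as stamped at the gateway (all field names are hypothetical)."""
    workflow: str
    team: str
    customer: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def build_ledger(records):
    """Roll tagged calls up into per-workflow rows with share of spend."""
    totals = defaultdict(lambda: {"calls": 0, "input_tokens": 0, "cost_usd": 0.0})
    for r in records:
        row = totals[(r.workflow, r.team)]
        row["calls"] += 1
        row["input_tokens"] += r.input_tokens
        row["cost_usd"] += r.cost_usd

    grand_total = sum(row["cost_usd"] for row in totals.values())
    ledger = []
    for (workflow, team), row in totals.items():
        ledger.append({
            "workflow": workflow,
            "team": team,
            "calls": row["calls"],
            "avg_input_tokens": row["input_tokens"] / row["calls"],
            "cost_usd": row["cost_usd"],
            "share_of_spend": row["cost_usd"] / grand_total if grand_total else 0.0,
        })
    # Largest spender first, so the top row is the first candidate for a lever.
    return sorted(ledger, key=lambda x: x["cost_usd"], reverse=True)


# Toy data, invented for illustration only.
records = [
    CallRecord("support-copilot", "CX", "acme", 6800, 200, 0.03),
    CallRecord("support-copilot", "CX", "globex", 7000, 220, 0.03),
    CallRecord("contract-extraction", "Legal Ops", "acme", 12400, 300, 0.04),
]
ledger = build_ledger(records)
for row in ledger:
    print(f'{row["workflow"]:<20} {row["calls"]:>3} calls  {row["share_of_spend"]:.0%}')
```

Because each record carries workflow, team, and customer, the same records can be regrouped by customer or team without re-instrumenting anything.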
IV. Forecasting: modeling the curve before it arrives
A forecast is what turns AI spend from a monthly surprise into a planned budget line. AI spend forecasts unusually well because its drivers are explicit. For any workflow, monthly cost is approximately calls times average context size times rate, summed over input and output. Each of the three is something the business can estimate and the gateway can measure.
The three drivers
- Calls: the number of model invocations, which for agentic workflows is user actions multiplied by fan-out, not user actions alone.
- Context size: average input plus output tokens per call, the lever most under engineering control and the one that grows silently.
- Rate: provider price per thousand tokens, which differs by model and by input versus output, and which routing decisions can change.
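Written as one expression, the driver model looks roughly like the following. The symbols are introduced here for illustration and do not come from the original text: $N$ is monthly calls, $A$ is user actions, $F$ is fan-out, $\bar{T}_{\text{in}}$ and $\bar{T}_{\text{out}}$ are average input and output tokens per call, and $r_{\text{in}}$ and $r_{\text{out}}$ are the prices per thousand tokens.

$$
C_{\text{month}} \;\approx\; N \cdot \frac{\bar{T}_{\text{in}}\, r_{\text{in}} + \bar{T}_{\text{out}}\, r_{\text{out}}}{1000},
\qquad N = A \cdot F
$$

When a single blended rate $r$ is used, defined as the token-weighted average of $r_{\text{in}}$ and $r_{\text{out}}$, this collapses to $C_{\text{month}} \approx N \cdot \bar{T} \cdot r / 1000$ with $\bar{T} = \bar{T}_{\text{in}} + \bar{T}_{\text{out}}$.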
A worked projection
Take the support copilot from Section III at its current run-rate and project a quarter forward under three growth scenarios. The point is not the precise dollar figure. It is that a model built from drivers lets you see the curve, and the effect of a lever, before the invoice confirms it.
| Scenario | Monthly calls | Avg tokens / call | Blended rate / 1K | Projected monthly cost | vs today |
|---|---|---|---|---|---|
| Today (baseline) | 1,240,000 | 7,010 | $0.0048 | $41,800 | 0% |
| Flat usage, no action | 1,240,000 | 7,010 | $0.0048 | $41,800 | 0% |
| Adoption +40% | 1,736,000 | 7,010 | $0.0048 | $58,500 | +40% |
| Adoption +40%, context unmanaged | 1,736,000 | 9,100 | $0.0048 | $75,900 | +82% |
| Adoption +40%, levers applied | 1,736,000 | 3,900 | $0.0042 | $28,400 | -32% |
| Total addressable swing | n/a | n/a | n/a | $47,500 | n/a |
The two scenarios that matter sit next to each other. Adoption rises forty percent either way. In one the context drifts up unmanaged and spend rises by roughly eighty percent. In the other the same adoption is met with levers and spend falls by about a third. The difference between those two outcomes, roughly forty-seven thousand dollars a month for one workflow, is the entire value of running AI FinOps instead of watching a chart.
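The projection can be reproduced from the three drivers alone. The sketch below assumes a single blended rate per thousand tokens, as the table does; the function name `monthly_cost` and the scenario keys are illustrative.

```python
def monthly_cost(calls, avg_tokens_per_call, blended_rate_per_1k):
    """Driver model: calls x (tokens per call / 1000) x blended price per 1K tokens."""
    return calls * (avg_tokens_per_call / 1000) * blended_rate_per_1k


# Inputs copied from the scenario table above.
scenarios = {
    "baseline": (1_240_000, 7_010, 0.0048),
    "adoption_plus_40": (1_736_000, 7_010, 0.0048),
    "adoption_unmanaged": (1_736_000, 9_100, 0.0048),
    "adoption_with_levers": (1_736_000, 3_900, 0.0042),
}

costs = {name: monthly_cost(*drivers) for name, drivers in scenarios.items()}
baseline = costs["baseline"]

for name, cost in costs.items():
    change = cost / baseline - 1
    print(f"{name:<22} ${cost:>9,.0f}  {change:+.0%}")

swing = costs["adoption_unmanaged"] - costs["adoption_with_levers"]
print(f"swing between unmanaged and levers: ${swing:,.0f}")
```

The computed values land within about 0.3 percent of the table's figures, which are rounded to the nearest hundred dollars. Changing any one driver and rerunning the model shows the effect of a lever or a growth assumption before the invoice does.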
V. Active reduction levers
Forecasting tells you where the curve is heading. Levers change it. There are four that matter, and they compose: each addresses a different part of the cost equation, and a mature practice runs all four behind a single evaluation gate.
| Lever | Mechanism | Typical saving | Primary risk |
|---|---|---|---|
| Compression | Summarize history, prune and rerank retrieval, dedupe instructions, trim tool schemas | 30–60% of input | Dropping context the task needed |
| Model routing | Send easy requests to a cheaper model, reserve the frontier model for hard ones | 20–50% of spend | Quality regression on misrouted requests |
| Caching | Reuse responses or prefixes for repeated or near-repeated inputs | 10–40% where inputs repeat | Stale answers, low hit rate on open-ended traffic |
| Policy | Enforce per-workflow token budgets, block known-wasteful patterns at the gateway | 5–20% of spend | Over-restriction that breaks a legitimate workflow |
Levers compose, savings do not simply add
Applied together the levers interact. Compression shrinks the context that routing then prices at a lower rate, and policy caps the outliers that caching cannot help. The combined effect is usually less than the naive sum of each lever's headline number and far more than any one of them alone. The discipline is to measure the combination, not to claim the sum.
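A small calculation makes the gap between "sum" and "combination" concrete. The sketch assumes each lever removes a fraction of whatever spend the previous levers left, which is a simplification; the per-lever fractions are hypothetical and are not measured results.

```python
def combined_saving(savings):
    """Combine fractional savings that each apply to what the previous lever left.

    Assumes the levers act independently and multiplicatively on the same
    spend base, which is a simplification; real levers overlap and interact.
    """
    remaining = 1.0
    for s in savings:
        if not 0.0 <= s < 1.0:
            raise ValueError(f"saving must be in [0, 1): {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical per-lever savings on one workflow, not measured results.
lever_savings = {"compression": 0.40, "routing": 0.25, "caching": 0.10, "policy": 0.05}

naive_sum = sum(lever_savings.values())
combined = combined_saving(lever_savings.values())
print(f"naive sum: {naive_sum:.0%}, combined: {combined:.0%}")
```

Under these assumptions the combined saving sits well below the naive sum and well above the largest single lever. Only a measured before-and-after on the real workflow can confirm the actual figure.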
Every lever is a hypothesis
Each lever is a bet that the workflow does not need what the lever removes or reroutes. That bet can be wrong, and the only way to know is to test it, which is the subject of the next section.
VI. Quality as a budget constraint
The fastest way to cut AI spend is to cut quality, and it is almost always the wrong trade. A prompt stripped to nothing is cheap and useless. The number that protects against this is cost per successful task, not cost per token, and the mechanism that enforces it is the evaluation gate.
The eval gate
Before any lever ships, it runs against a held-out evaluation set that scores the outcome the workflow exists to produce: the correct extraction, the accepted answer, the resolved ticket. A candidate configuration promotes only if it preserves the pass rate. A change that saves money and lowers the pass rate does not promote, however large the saving. Cost control is run as a quality-constrained optimization, with quality as the constraint and cost as the objective.
| Configuration | Cost / call | Eval pass rate | Cost / successful task | Decision |
|---|---|---|---|---|
| Baseline | $0.0336 | 95.0% | $0.0354 | n/a |
| Aggressive compression | $0.0121 | 88.0% | $0.0138 | Reject |
| Gated compression + routing | $0.0149 | 95.4% | $0.0156 | Promote |
The aggressive configuration is the cheapest per call and looks like the winner on a token dashboard. It is also slightly cheaper per successful task than the gated option, but it sheds seven points of pass rate to get there, so it fails the gate and is rejected. The gated configuration holds the pass rate while cutting cost per successful task by more than half against the baseline, which makes it the cheapest configuration that satisfies the quality constraint.
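The gate logic can be sketched as a constraint check followed by a cost comparison. The `Config` type and `gate` function are illustrative names. The second rejection rule, that a candidate must also lower cost per successful task, is an assumption added here rather than something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    cost_per_call: float  # dollars
    pass_rate: float      # fraction of the held-out eval set passed


def cost_per_successful_task(cfg):
    """Spread the cost of every call over only the calls that succeeded."""
    return cfg.cost_per_call / cfg.pass_rate


def gate(candidate, baseline):
    """Quality is the constraint; cost is compared only after it holds."""
    if candidate.pass_rate < baseline.pass_rate:
        return "Reject"
    # Assumption: a change that does not lower unit cost is not worth shipping.
    if cost_per_successful_task(candidate) >= cost_per_successful_task(baseline):
        return "Reject"
    return "Promote"


# Figures copied from the table above.
baseline = Config("Baseline", 0.0336, 0.950)
candidates = [
    Config("Aggressive compression", 0.0121, 0.880),
    Config("Gated compression + routing", 0.0149, 0.954),
]

decisions = {c.name: gate(c, baseline) for c in candidates}
for c in candidates:
    print(f"{c.name:<28} ${cost_per_successful_task(c):.5f}  {decisions[c.name]}")
```

The pass-rate check runs first, so a configuration that is cheaper per successful task can still be rejected. Ranking by cost alone would not give that result.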
VII. The unit economics of AI
Cloud taught Finance to ask for cost per unit: per order, per active user, per gigabyte served. AI needs the same discipline, and its natural unit is the successful outcome. Cost per successful outcome is total AI spend for a workflow divided by the number of successful results it produced, where success is defined by the business, not by the model.
Defining the outcome
The outcome has to be something the business already values and can count. For a support copilot it is a resolved conversation. For contract extraction it is a document processed without human correction. For a sales agent it is a qualified research brief a rep used. The definition is a joint decision of Product and Finance, and it is the most important number in the practice because everything else is measured against it.
A workflow whose token spend doubled but whose cost per successful outcome fell is succeeding: it is doing more valuable work more efficiently. A workflow whose spend held flat while its cost per successful outcome rose is failing quietly. Volume metrics cannot tell these two apart. Unit economics can.
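The two cases can be told apart with one division. The figures below are hypothetical, chosen only to show the pattern: each pair is quarterly spend in dollars and the count of business-defined successes.

```python
def cost_per_successful_outcome(spend_usd, successful_outcomes):
    """Workflow spend divided by business-defined successes."""
    if successful_outcomes <= 0:
        raise ValueError("no successful outcomes to divide by")
    return spend_usd / successful_outcomes


# Hypothetical quarter-over-quarter figures, invented for illustration.
workflows = {
    "spend doubled": {"before": (20_000, 10_000), "after": (40_000, 25_000)},
    "spend flat": {"before": (20_000, 10_000), "after": (20_000, 8_000)},
}

unit_costs = {
    name: {period: cost_per_successful_outcome(*figures) for period, figures in q.items()}
    for name, q in workflows.items()
}

for name, periods in unit_costs.items():
    trend = "improving" if periods["after"] < periods["before"] else "degrading"
    print(f"{name:<14} ${periods['before']:.2f} -> ${periods['after']:.2f}  {trend}")
```

In this toy data the workflow whose spend doubled is the healthy one, and the flat-spend workflow is the one that needs attention.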
Once cost per successful outcome exists, the AI line item stops being an unexplained expense and becomes an investment with a return. A CFO can compare it to the cost of the human process it replaces, set a target, and hold the workflow to it. That comparison is the conversation AI FinOps exists to enable.
VIII. The operating model: who owns AI spend
A cost with no owner is a cost no one controls. The most common failure in enterprise AI is not technical: it is that AI spend belongs to everyone and therefore to no one. Engineering builds the workflows, Finance receives the bill, Governance worries about the risk, and Product sets the roadmap, and the spend falls through the gap between them. AI FinOps closes that gap with an explicit operating model.
| Activity | Finance | Engineering | Governance | Product |
|---|---|---|---|---|
| Set AI budget and forecast | A | C | I | C |
| Attribute spend to workflows | C | A | I | I |
| Apply reduction levers | I | A | C | C |
| Define successful outcome | C | C | I | A |
| Run the evaluation gate | I | A | C | C |
| Set cost policy and limits | C | R | A | C |
| Review KPIs and SLOs | A | R | C | C |
The pattern that matters is single accountability per row. Finance is accountable for the budget, Engineering for the levers and the gate, Governance for policy, and Product for the definition of success. Everyone is consulted or informed where they have a stake, but exactly one function answers for each outcome. A practice without this table tends to relitigate ownership every time the bill moves.
IX. KPIs and SLOs for AI cost
What gets reviewed gets managed. AI FinOps runs on a small set of metrics with target ranges, reviewed on a cadence by the owners in the RACI. The targets below are starting points for a maturing practice, not measured results, and each enterprise should set its own from its baseline.
| Metric | Definition | Target range | Review cadence |
|---|---|---|---|
| Cost per successful outcome | Workflow spend / successful results | Flat or down quarter on quarter | Monthly |
| Attribution coverage | Share of spend tagged to a workflow | Above 95% | Monthly |
| Forecast accuracy | Actual vs projected monthly spend | Within 10% | Monthly |
| Input share of spend | Input cost / total cost per workflow | Tracked, no hard cap | Monthly |
| Eval pass rate at promotion | Quality on held-out set when a lever ships | At or above baseline | Per change |
| Budget variance | Actual spend vs approved budget | Within 5% | Monthly |
| Cache hit rate | Served from cache / cacheable requests | Workload dependent | Monthly |
The two non-negotiable rows are cost per successful outcome and eval pass rate at promotion. Together they guarantee that the practice never buys a lower cost by quietly spending quality. The rest tell you whether the machine that produces those two numbers is healthy.
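A monthly review can be reduced to a handful of threshold checks. The sketch below uses the target ranges from the table. The `relative_gap` definition, which divides by the forecast or the budget, is an assumption, as are the field names and the sample month.

```python
def relative_gap(actual, reference):
    """Absolute gap as a fraction of the reference value (an assumed definition)."""
    return abs(actual - reference) / reference


def review_month(m):
    """Return pass/fail per KPI using the target ranges from the table above."""
    return {
        "attribution_coverage": m["tagged_spend"] / m["total_spend"] > 0.95,
        "forecast_accuracy": relative_gap(m["total_spend"], m["forecast"]) <= 0.10,
        "budget_variance": relative_gap(m["total_spend"], m["budget"]) <= 0.05,
        # Every promoted change must score at or above its baseline pass rate.
        "eval_pass_rate_at_promotion": all(
            candidate >= baseline for candidate, baseline in m["promotions"]
        ),
    }


# Hypothetical month, invented for illustration.
month = {
    "total_spend": 110_000,
    "tagged_spend": 106_000,
    "forecast": 104_000,
    "budget": 100_000,
    "promotions": [(0.954, 0.950)],
}

results = review_month(month)
for kpi, ok in results.items():
    print(f"{kpi:<30} {'ok' if ok else 'MISS'}")
```

In this sample month only budget variance misses its target, which is the kind of signal the review is meant to hand to the accountable owner.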
X. A 90-day FinOps rollout
AI FinOps is a practice, not a purchase, and it stands up in a quarter if it is sequenced so each step earns the trust to take the next. The path below assumes traffic can be routed through a gateway, because attribution and levers both need the place where context passes.
1. Days 1 to 15: Route and observe. Point workflows at the gateway and change nothing. Establish a per-workflow baseline of calls, tokens, cost, latency, and prompt composition. The output is an honest starting picture.
2. Days 16 to 30: Attribute. Tag every request to a workflow, team, and customer. Replace the aggregate invoice with the itemized ledger, and reach attribution coverage above ninety-five percent.
3. Days 31 to 45: Define success and forecast. With Product and Finance, define the successful outcome for the top workflows and start measuring cost per successful outcome. Build the calls times context times rate forecast for next quarter.
4. Days 46 to 70: Apply levers behind the gate. For the highest-spend workflows, test compression, routing, caching, and policy against a held-out evaluation set. Promote only configurations that hold the pass rate.
5. Days 71 to 90: Operationalize. Assign the RACI, set budgets and KPI targets, and schedule the monthly review. Put the before-and-after of every promoted change on record so savings are defensible and drift is visible.
At the end of ninety days the enterprise has an itemized ledger, a driver-based forecast, a set of proven levers, a definition of success, and a named owner for each. That is the difference between having a dashboard and running AI FinOps.
XI. Conclusion
Counting tokens was the right first step and it is a poor last one. A dashboard tells you the bill went up. AI FinOps tells you which workflow caused it, where it is heading next quarter, which lever bends it, what the trade against quality is, and who is accountable for the answer. The work is not buying more visibility. It is building the discipline that turns visibility into control: attribution at the source, forecasts from drivers, levers behind a gate, unit economics tied to business value, and an operating model with a single owner per outcome. The enterprises that do this will not just spend less on AI. They will be able to say, line by line, what every dollar of it bought.

