QUANTA
by Ryshe
White Paper III · June 2026 · 17 min read

AI FinOps Beyond Token Dashboards

From counting tokens to controlling the cost of enterprise AI

Ryshe · AI, Cloud & Security

Abstract

Most enterprises now have a token dashboard. Very few have AI FinOps. A dashboard reports what was spent after it was spent, in an aggregate that no team can act on. AI FinOps is a discipline: it attributes spend to the workflows and customers that caused it, forecasts where the curve is heading, applies reduction levers behind a quality gate, and assigns ownership across Finance, Engineering, Governance, and Product. This paper adapts the FinOps inform, optimize, and operate cycle to large language model spend, defines cost per successful outcome as the metric that matters, and gives a 90-day rollout. It is written for the CIO and CFO who have seen the AI line item and now need a way to govern it.

Key takeaways
  1. A token dashboard reports cost; it does not attribute or control it, and reporting is the easy part.
  2. AI FinOps adapts the FinOps inform, optimize, and operate cycle to per-request LLM spend.
  3. Attribution replaces an aggregate invoice with an itemized ledger tied to workflow, team, and customer.
  4. Spend forecasts well as calls times context size times rate, which makes growth scenarios testable before they arrive.
  5. Reduction levers (compression, routing, caching, and policy) only count when gated on evaluation and measured in cost per successful task.
  6. Unit economics, meaning cost per successful outcome, is the metric that ties AI spend to business value and the only one a CFO can defend.
  7. AI spend needs a named owner and a RACI, or it stays everyone's concern and no one's responsibility.

Executive summary

A spend chart is not a control. It explains the past, it does not change it, and it almost never tells you which workflow, team, or customer caused the number. Enterprises have responded to rising AI bills by buying visibility, then discovering that visibility alone moves nothing. The chart goes up, everyone agrees it should go down, and no one owns the lever that would bend it.

AI FinOps is the practice that turns a dashboard into a decision. It borrows the FinOps cycle the industry already uses for cloud (inform, optimize, and operate) and applies it to the one cost center that grows per request rather than per server: tokens sent to and received from a model. The unit of account is not the token. It is the successful business outcome the tokens were spent to produce.

  • 60–80% of LLM spend is input context in RAG and agent workloads (typical range)
  • 2–4× quarterly spend growth is common as a workload moves from pilot to platform (illustrative)
  • 30–60% cost reduction is available from quality-gated levers without changing the model (illustrative)

This paper draws a hard line between observation and control. It shows how to replace an aggregate invoice with an itemized ledger, how to forecast spend as a product of calls, context size, and rate, and how to manage AI cost as a quality-constrained optimization rather than a race to the cheapest prompt. The result is an operating model in which AI spend has an owner, a forecast, a set of levers, and a definition of success that Finance and Engineering can both sign.

I. The dashboard trap

The first thing an enterprise builds when the AI bill arrives is a dashboard. It is the right instinct and the wrong stopping point. A dashboard answers one question, how much did we spend, and it answers it well. It does not answer the three questions that would let anyone act: who caused it, where is it heading, and what do we do about it.

Spend charts explain the past

A monthly spend chart is a record of decisions already made and tokens already billed. By the time a number appears on it, the workflows that produced the number have run, the context they sent is gone, and the only thing left is the invoice. You can study the chart as long as you like. It describes a world you can no longer change.

Aggregation hides the cause

Provider and cloud billing report spend in aggregate: total tokens, total dollars, perhaps a split by model. They do not tell you that one customer's nightly batch job is forty percent of the bill, or that a single agent workflow fanned a thousand user actions into thirty thousand model calls. The cost that matters is always a particular workflow serving a particular user, and aggregation is designed to erase exactly that detail.

Visibility is not control

The gap between seeing a cost and changing it is the whole problem. A dashboard is a thermometer. AI FinOps is the thermostat. The rest of this paper is about the difference: attribution that names the cause, forecasting that anticipates the curve, levers that bend it, and an owner who is accountable for the result.

II. What AI FinOps actually is

FinOps is a discipline the industry built for cloud, where elastic resources made spend a real-time engineering decision rather than a procurement event. Its cycle is three phases that repeat: inform, optimize, and operate. AI spend has the same shape as cloud spend (variable, demand-driven, and generated by engineering choices), so the same cycle applies. What changes is the unit being managed.

Inform

Make spend visible and attributable. For AI this means decomposing every request into its components (system prompt, retrieval, history, tool schemas, and output) and tagging it to a workflow, team, and customer. Inform is where a dashboard stops and AI FinOps begins, because inform demands attribution, not just a total.

Optimize

Reduce cost without trading away the outcome. The levers are compression, model routing, caching, and policy, covered in Section V. The constraint that separates optimization from cost-cutting is quality: every reduction is a hypothesis tested against an evaluation set before it ships.

Operate

Run the practice continuously. Set budgets and forecasts, assign ownership, define KPIs and SLOs for cost, and review them on a cadence. Operate is what keeps the savings from drifting back, because nothing about AI spend stays fixed once usage and documents change.

The unit of account

Cloud FinOps manages cost per unit of work served. AI FinOps manages cost per successful outcome: the answer a user accepted, the ticket resolved, the document correctly extracted. Token count is an input to that number, not the number itself.

III. Attribution: from invoice to ledger

Attribution is the foundation, because every other capability depends on it. You cannot forecast a workflow you cannot isolate, cannot apply a lever to a cost you cannot name, and cannot assign ownership for a number no one can decompose. The move is to replace one aggregate invoice with a ledger that itemizes spend along the dimensions a business actually manages: workflow, team, customer, and request.

Tag at the point of the call

Attribution has to happen where the request is made, because that is the only place the context still exists. A gateway between applications and models sees every call with its full composition and its calling identity, and can stamp each one with the workflow that issued it, the team that owns that workflow, and the customer it served. Stamped at the source, the same data rolls up into any view Finance or Engineering needs.
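A minimal sketch of stamping at the source and rolling the same records up along different dimensions. The `CallRecord` schema, the `stamp` and `roll_up` helpers, the customer names, and the rates are all hypothetical, chosen only for illustration; a real gateway would take identity from its own authentication layer and prices from the provider's rate card.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One model call, stamped by the gateway at request time (hypothetical schema)."""
    workflow: str
    team: str
    customer: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def stamp(workflow, team, customer, input_tokens, output_tokens,
          input_rate_per_1k, output_rate_per_1k):
    """Price a call from its token counts and attach the attribution tags."""
    cost = (input_tokens * input_rate_per_1k
            + output_tokens * output_rate_per_1k) / 1000
    return CallRecord(workflow, team, customer, input_tokens, output_tokens, cost)


def roll_up(records, dimension):
    """Sum cost along one tag: 'workflow', 'team', or 'customer'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += record.cost_usd
    return dict(totals)


# Illustrative calls; the rates are assumed, not real provider prices.
records = [
    stamp("support-copilot", "CX", "acme", 6800, 200, 0.004, 0.012),
    stamp("support-copilot", "CX", "globex", 7000, 250, 0.004, 0.012),
    stamp("contract-extraction", "Legal Ops", "acme", 12400, 600, 0.004, 0.012),
]

by_workflow = roll_up(records, "workflow")
by_customer = roll_up(records, "customer")
```

Because every record carries all three tags, the workflow view and the customer view are two groupings of the same rows and always reconcile to the same total.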

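| Workflow | Team | Calls | Avg input tokens | Share of spend | Monthly cost |
| --- | --- | --- | --- | --- | --- |
| Support copilot | CX | 1,240,000 | 6,800 | 38% | $41,800 |
| Contract extraction | Legal Ops | 210,000 | 12,400 | 22% | $24,200 |
| Sales research agent | Revenue | 96,000 | 9,100 | 16% | $17,600 |
| Internal knowledge bot | IT | 880,000 | 3,200 | 14% | $15,400 |
| Marketing drafting | Brand | 140,000 | 2,400 | 6% | $6,600 |
| Everything else | Various | 320,000 | 1,900 | 4% | $4,400 |
| Total | n/a | 2,886,000 | n/a | 100% | $110,000 |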
An aggregate invoice replaced by an itemized ledger for one month (illustrative figures).

The aggregate version of this table is a single cell: one hundred ten thousand dollars. The ledger version is a set of decisions. The support copilot is the obvious target for compression, contract extraction sends the largest average context per call and is worth a routing review, and marketing drafting is too small to spend an engineer's week on. None of that is visible on an invoice.

IV. Forecasting: modeling the curve before it arrives

A forecast is what turns AI spend from a monthly surprise into a planned budget line. AI spend forecasts unusually well because its drivers are explicit. For any workflow, monthly cost is approximately calls times average context size times rate, summed over input and output. Each of the three is something the business can estimate and the gateway can measure.

The three drivers

  • Calls: the number of model invocations, which for agentic workflows is user actions multiplied by fan-out, not user actions alone.
  • Context size: average input plus output tokens per call, the lever most under engineering control and the one that grows silently.
  • Rate: provider price per thousand tokens, which differs by model and by input versus output, and which routing decisions can change.
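Written out, and assuming input and output tokens are each priced per thousand tokens, the driver model for one workflow can be sketched as follows. The symbols are introduced here for illustration only.

```latex
\[
\text{monthly cost} \;\approx\; N \left( \frac{T_{\text{in}}}{1000}\, r_{\text{in}} + \frac{T_{\text{out}}}{1000}\, r_{\text{out}} \right),
\qquad N = A \times F
\]
```

Here N is monthly calls, A is user actions, F is fan-out (model calls per user action), T_in and T_out are average input and output tokens per call, and r_in and r_out are the rates per thousand tokens. With a single blended rate r, the expression reduces to N × (T_in + T_out) / 1000 × r. The earlier example of a thousand user actions fanning into thirty thousand model calls corresponds to A = 1,000 and F = 30.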

A worked projection

Take the support copilot from Section III at its current run-rate and project a quarter forward under three growth scenarios. The point is not the precise dollar figure. It is that a model built from drivers lets you see the curve, and the effect of a lever, before the invoice confirms it.

| Scenario | Monthly calls | Avg tokens / call | Blended rate / 1K | Projected monthly cost | vs today |
| --- | --- | --- | --- | --- | --- |
| Today (baseline) | 1,240,000 | 7,010 | $0.0048 | $41,800 | 0% |
| Flat usage, no action | 1,240,000 | 7,010 | $0.0048 | $41,800 | 0% |
| Adoption +40% | 1,736,000 | 7,010 | $0.0048 | $58,500 | +40% |
| Adoption +40%, context unmanaged | 1,736,000 | 9,100 | $0.0048 | $75,900 | +82% |
| Adoption +40%, levers applied | 1,736,000 | 3,900 | $0.0042 | $28,400 | -32% |
| Total addressable swing | n/a | n/a | n/a | $47,500 | n/a |
Next-quarter projection for one workflow under usage growth scenarios (illustrative).

The two scenarios that matter sit next to each other. Adoption rises forty percent either way. In one the context drifts up unmanaged and spend rises 82 percent. In the other the same adoption is met with levers and spend falls by about a third. The difference between those two outcomes, roughly forty-seven thousand dollars a month for one workflow, is the entire value of running AI FinOps instead of watching a chart.
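The projection can be reproduced from its drivers in a few lines. This is a sketch that assumes a single blended rate per thousand tokens, as the table does; the function and scenario names are illustrative.

```python
def monthly_cost(calls, avg_tokens_per_call, blended_rate_per_1k):
    """Driver model: calls x average tokens per call x blended rate per 1K tokens."""
    return calls * avg_tokens_per_call / 1000 * blended_rate_per_1k


# Drivers copied from the projection table (illustrative figures).
scenarios = {
    "baseline": (1_240_000, 7_010, 0.0048),
    "adoption_unmanaged": (1_736_000, 9_100, 0.0048),
    "adoption_with_levers": (1_736_000, 3_900, 0.0042),
}

costs = {name: monthly_cost(*drivers) for name, drivers in scenarios.items()}
baseline = costs["baseline"]

for name, cost in costs.items():
    change = (cost - baseline) / baseline
    print(f"{name:22s} ${cost:>10,.0f}  {change:+.0%}")

swing = costs["adoption_unmanaged"] - costs["adoption_with_levers"]
print(f"addressable swing      ${swing:>10,.0f}")
```

The computed costs land within a fraction of a percent of the table's rounded figures. Changing one driver at a time shows how much of the swing comes from context size and how much from rate.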

V. Active reduction levers

Forecasting tells you where the curve is heading. Levers change it. There are four that matter, and they compose: each addresses a different part of the cost equation, and a mature practice runs all four behind a single evaluation gate.

| Lever | Mechanism | Typical saving | Primary risk |
| --- | --- | --- | --- |
| Compression | Summarize history, prune and rerank retrieval, dedupe instructions, trim tool schemas | 30–60% of input | Dropping context the task needed |
| Model routing | Send easy requests to a cheaper model, reserve the frontier model for hard ones | 20–50% of spend | Quality regression on misrouted requests |
| Caching | Reuse responses or prefixes for repeated or near-repeated inputs | 10–40% where inputs repeat | Stale answers, low hit rate on open-ended traffic |
| Policy | Enforce per-workflow token budgets, block known-wasteful patterns at the gateway | 5–20% of spend | Over-restriction that breaks a legitimate workflow |
The four reduction levers, with typical savings and the risk each carries (illustrative ranges).

Levers compose, savings do not simply add

Applied together the levers interact. Compression shrinks the context that routing then prices at a lower rate, and policy caps the outliers that caching cannot help. The combined effect is usually less than the naive sum of each lever's headline number and far more than any one of them alone. The discipline is to measure the combination, not to claim the sum.
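One way to see why headline numbers cannot be summed is to model each lever as acting on whatever spend the previous levers left behind. The sketch below assumes, as a simplification, that each lever removes a fixed fraction of the remaining spend and that the levers are independent. Real levers overlap and act on different parts of the bill, which is exactly why the combination has to be measured. The per-lever fractions are hypothetical.

```python
from math import prod


def combined_saving(fractions):
    """Saving when each lever removes its fraction of whatever spend remains."""
    remaining = prod(1 - f for f in fractions)
    return 1 - remaining


# Hypothetical per-lever savings, each as a fraction of total spend.
levers = {"compression": 0.30, "routing": 0.25, "caching": 0.15, "policy": 0.10}

naive_sum = sum(levers.values())
composed = combined_saving(levers.values())
best_single = max(levers.values())

print(f"naive sum: {naive_sum:.0%}, composed: {composed:.1%}, best single: {best_single:.0%}")
```

Under these assumptions the composed saving sits between the best single lever and the naive sum, which matches the pattern described above.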

Every lever is a hypothesis

Each lever is a bet that the workflow does not need what the lever removes or reroutes. That bet can be wrong, and the only way to know is to test it, which is the subject of the next section.

VI. Quality as a budget constraint

The fastest way to cut AI spend is to cut quality, and it is almost always the wrong trade. A prompt stripped to nothing is cheap and useless. The number that protects against this is cost per successful task, not cost per token, and the mechanism that enforces it is the evaluation gate.

The eval gate

Before any lever ships, it runs against a held-out evaluation set that scores the outcome the workflow exists to produce: the correct extraction, the accepted answer, the resolved ticket. A candidate configuration promotes only if it preserves the pass rate. A change that saves money and lowers the pass rate does not promote, however large the saving. Cost control is run as a quality-constrained optimization, with quality as the constraint and cost as the objective.

| Configuration | Cost / call | Eval pass rate | Cost / successful task | Decision |
| --- | --- | --- | --- | --- |
| Baseline | $0.0336 | 95.0% | $0.0354 | n/a |
| Aggressive compression | $0.0121 | 88.0% | $0.0138 | Reject |
| Gated compression + routing | $0.0149 | 95.4% | $0.0156 | Promote |
Two candidate configurations, judged on cost per successful task (illustrative).

The aggressive configuration is cheaper per call and looks like the winner on a token dashboard. Judged on cost per successful task it is only a little cheaper than the gated option, and it sheds seven points of quality to get there, so it fails the gate. The gated configuration wins because it is the cheapest option that holds the pass rate: quality is the constraint, and cost per successful task is compared only among configurations that clear it.
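The promotion rule can be sketched as a filter followed by a minimum. This assumes the gate is "pass rate at or above baseline" with no tolerance band, and that eligible candidates are ranked by cost per successful task. The `Config` class and `gate` function are illustrative names, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    cost_per_call: float
    pass_rate: float  # fraction of the held-out eval set passed

    @property
    def cost_per_successful_task(self):
        return self.cost_per_call / self.pass_rate


def gate(baseline, candidates):
    """Return the cheapest candidate that holds the baseline pass rate, else None."""
    eligible = [c for c in candidates if c.pass_rate >= baseline.pass_rate]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_successful_task)


# Figures from the table above (illustrative).
baseline = Config("Baseline", 0.0336, 0.950)
aggressive = Config("Aggressive compression", 0.0121, 0.880)
gated = Config("Gated compression + routing", 0.0149, 0.954)

promoted = gate(baseline, [aggressive, gated])
print(promoted.name, round(promoted.cost_per_successful_task, 4))
```

Quality is applied first as a hard filter, so a cheaper configuration that drops the pass rate is never compared on cost at all.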

VII. The unit economics of AI

Cloud taught Finance to ask for cost per unit: per order, per active user, per gigabyte served. AI needs the same discipline, and its natural unit is the successful outcome. Cost per successful outcome is total AI spend for a workflow divided by the number of successful results it produced, where success is defined by the business, not by the model.

Defining the outcome

The outcome has to be something the business already values and can count. For a support copilot it is a resolved conversation. For contract extraction it is a document processed without human correction. For a sales agent it is a qualified research brief a rep used. The definition is a joint decision of Product and Finance, and it is the most important number in the practice because everything else is measured against it.

Tie spend to value, not to volume

A workflow whose token spend doubled but whose cost per successful outcome fell is succeeding: it is doing more valuable work more efficiently. A workflow whose spend held flat while its cost per successful outcome rose is failing quietly. Volume metrics cannot tell these two apart. Unit economics can.
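A small numeric illustration of the two cases, using hypothetical quarter-over-quarter figures. The function name and all numbers are assumptions made for the example.

```python
def cost_per_successful_outcome(spend_usd, successful_outcomes):
    """Workflow spend divided by the number of business-defined successes."""
    if successful_outcomes <= 0:
        raise ValueError("no successful outcomes to divide by")
    return spend_usd / successful_outcomes


# Workflow A: spend doubles, but successful outcomes grow faster.
a_before = cost_per_successful_outcome(20_000, 40_000)
a_after = cost_per_successful_outcome(40_000, 100_000)

# Workflow B: spend holds flat, but successful outcomes fall.
b_before = cost_per_successful_outcome(20_000, 40_000)
b_after = cost_per_successful_outcome(20_000, 32_000)

print(f"A: {a_before:.3f} -> {a_after:.3f}")
print(f"B: {b_before:.3f} -> {b_after:.3f}")
```

On a spend chart, A looks like the problem and B looks healthy. On cost per successful outcome the reading reverses.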

Once cost per successful outcome exists, the AI line item stops being an unexplained expense and becomes an investment with a return. A CFO can compare it to the cost of the human process it replaces, set a target, and hold the workflow to it. That comparison is the conversation AI FinOps exists to enable.

VIII. The operating model: who owns AI spend

A cost with no owner is a cost no one controls. The most common failure in enterprise AI is not technical: it is that AI spend belongs to everyone and therefore to no one. Engineering builds the workflows, Finance receives the bill, Governance worries about the risk, and Product sets the roadmap, and the spend falls through the gap between them. AI FinOps closes that gap with an explicit operating model.

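| Activity | Finance | Engineering | Governance | Product |
| --- | --- | --- | --- | --- |
| Set AI budget and forecast | A | C | I | C |
| Attribute spend to workflows | C | R | I | I |
| Apply reduction levers | I | A | C | C |
| Define successful outcome | C | C | I | A |
| Run the evaluation gate | I | A | C | C |
| Set cost policy and limits | C | R | A | C |
| Review KPIs and SLOs | A | R | C | C |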
RACI for AI cost across the four functions (R responsible, A accountable, C consulted, I informed).

The pattern that matters is a single answerable function per row. Finance is accountable for the budget, Engineering for the levers and the gate, Governance for policy, and Product for the definition of success; where a row names no separate accountable party, as with attribution, the responsible function answers for it. Everyone is consulted or informed where they have a stake, but exactly one function answers for each outcome. A practice without this table tends to relitigate ownership every time the bill moves.

IX. KPIs and SLOs for AI cost

What gets reviewed gets managed. AI FinOps runs on a small set of metrics with target ranges, reviewed on a cadence by the owners in the RACI. The targets below are starting points for a maturing practice, not measured results, and each enterprise should set its own from its baseline.

| Metric | Definition | Target range | Review cadence |
| --- | --- | --- | --- |
| Cost per successful outcome | Workflow spend / successful results | Flat or down quarter on quarter | Monthly |
| Attribution coverage | Share of spend tagged to a workflow | Above 95% | Monthly |
| Forecast accuracy | Actual vs projected monthly spend | Within 10% | Monthly |
| Input share of spend | Input cost / total cost per workflow | Tracked, no hard cap | Monthly |
| Eval pass rate at promotion | Quality on held-out set when a lever ships | At or above baseline | Per change |
| Budget variance | Actual spend vs approved budget | Within 5% | Monthly |
| Cache hit rate | Served from cache / cacheable requests | Workload dependent | Monthly |
Core KPIs and SLOs for AI cost, with starting target ranges (illustrative).

The two non-negotiable rows are cost per successful outcome and eval pass rate at promotion. Together they guarantee that the practice never buys a lower cost by quietly spending quality. The rest tell you whether the machine that produces those two numbers is healthy.
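The monthly review can be sketched as a set of threshold checks against the starting targets in the table. The field names and the sample month are hypothetical, and the thresholds should be replaced with whatever each enterprise sets from its own baseline.

```python
def within(actual, target, tolerance):
    """True when actual is within +/- tolerance (a fraction) of target."""
    return abs(actual - target) <= tolerance * target


def review(month):
    """Check one month's figures against the starting targets in the KPI table."""
    return {
        "attribution_coverage": month["tagged_spend"] / month["total_spend"] > 0.95,
        "forecast_accuracy": within(month["total_spend"], month["forecast"], 0.10),
        "budget_variance": within(month["total_spend"], month["budget"], 0.05),
        "cost_per_outcome_trend": month["cost_per_outcome"] <= month["prior_cost_per_outcome"],
    }


# Hypothetical month for one workflow.
month = {
    "total_spend": 43_000, "tagged_spend": 41_500,
    "forecast": 41_800, "budget": 40_000,
    "cost_per_outcome": 0.34, "prior_cost_per_outcome": 0.36,
}

results = review(month)
breaches = [name for name, ok in results.items() if not ok]
print(breaches)
```

In this sample month only the budget variance check fails: spend is within the forecast band but more than five percent over the approved budget, which is the kind of gap the monthly review exists to surface.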

X. A 90-day FinOps rollout

AI FinOps is a practice, not a purchase, and it stands up in a quarter if it is sequenced so each step earns the trust to take the next. The path below assumes traffic can be routed through a gateway, because attribution and levers both need the place where context passes.

  1. Days 1 to 15: Route and observe. Point workflows at the gateway and change nothing. Establish a per-workflow baseline of calls, tokens, cost, latency, and prompt composition. The output is an honest starting picture.
  2. Days 16 to 30: Attribute. Tag every request to a workflow, team, and customer. Replace the aggregate invoice with the itemized ledger, and reach attribution coverage above ninety-five percent.
  3. Days 31 to 45: Define success and forecast. With Product and Finance, define the successful outcome for the top workflows and start measuring cost per successful outcome. Build the calls times context times rate forecast for next quarter.
  4. Days 46 to 70: Apply levers behind the gate. For the highest-spend workflows, test compression, routing, caching, and policy against a held-out evaluation set. Promote only configurations that hold the pass rate.
  5. Days 71 to 90: Operationalize. Assign the RACI, set budgets and KPI targets, and schedule the monthly review. Put the before-and-after of every promoted change on record so savings are defensible and drift is visible.

At the end of ninety days the enterprise has an itemized ledger, a driver-based forecast, a set of proven levers, a definition of success, and a named owner for each. That is the difference between having a dashboard and running AI FinOps.

XI. Conclusion

Counting tokens was the right first step and it is a poor last one. A dashboard tells you the bill went up. AI FinOps tells you which workflow caused it, where it is heading next quarter, which lever bends it, what the trade against quality is, and who is accountable for the answer. The work is not buying more visibility. It is building the discipline that turns visibility into control: attribution at the source, forecasts from drivers, levers behind a gate, unit economics tied to business value, and an operating model with a single owner per outcome. The enterprises that do this will not just spend less on AI. They will be able to say, line by line, what every dollar of it bought.


Artwork: Bronze statuette of Hermes, public domain (CC0), The Met, Open Access.