The Enterprise LLM Observability Maturity Model

Abstract

Most enterprises monitor their language model usage the way they monitored their first cloud bill: a single aggregate number that arrives late, explains nothing, and prompts a meeting. This paper presents a five-level maturity model for LLM observability, from no visibility at all to a governed practice that attributes cost and quality to the workflow that caused them and forecasts both. It defines what is visible at each level, what remains hidden, the action each level makes possible, and the concrete steps to advance. It treats observability not as charts but as the precondition for control.

Key takeaways

1. LLM observability differs in kind from service monitoring because output is non-deterministic, quality is not a status code, and the most dangerous failure is a wrong answer that looks right.
2. Maturity advances through five levels: blind, aggregate dashboards, request-level traces, attribution with quality signals, and governed with forecasting.
3. Cost visibility and quality visibility advance on separate tracks, and most organizations instrument cost long before they instrument quality.
4. Attribution is the hinge: turning one aggregate number into a per-workflow, per-team, per-customer ledger is what makes every later capability possible.
5. Request-level telemetry must capture tokens, cost, latency, model and version, policy decisions, a groundedness signal, and a workflow tag, or later attribution is impossible.
6. The loop that matters is detect, decide, change, verify; observability without that loop is just charts.
7. You advance one level at a time, and each level de-risks the next, so the goal is the next rung rather than the top of the ladder.

Executive summary

Observability for language models is not a smaller version of observability for services. A service either returns a result or it does not, and a status code tells you which. A language model almost always returns something. The hard question is not whether it answered but whether the answer was right, grounded, and worth what it cost. That question is invisible to every tool built to watch request rates and error codes.

Enterprises move through predictable stages as they learn to see their AI operations. At the start they have a monthly invoice and a guess. At the end they have a per-workflow ledger of cost and quality, a budget they can forecast, and policy that acts before spend or risk lands. Most organizations sit at the second or third stage (Level 1 or Level 2 in the model that follows) and do not know it, because the gap between an aggregate dashboard and a governed practice is not obvious until something goes wrong.

70–90% of enterprises can report total AI spend but not spend per workflow (illustrative)

3 of 5 maturity levels reached by the typical adopter before quality is measured at all (illustrative)

<10% of teams forecast next quarter's AI cost from instrumented telemetry rather than extrapolation (illustrative)

This paper defines the five levels, gives a matrix that locates any organization in minutes, specifies the telemetry that each level requires, and describes the loop that turns observation into change. The argument throughout is simple: observability that does not change a decision is decoration. The point of seeing is to act.

I. Why observability is different for language models

Two decades of operational practice taught engineering teams to watch a small set of signals: request rate, error rate, latency, saturation. Those signals work because a conventional service is deterministic in the way that matters. The same input produces the same output, success and failure are distinguishable by a status code, and a 200 means the system did its job. Language models break all three assumptions, and the tools built on them inherit blind spots that no amount of tuning closes.

Output is non-deterministic

The same prompt sent twice can return different text. Sampling, temperature, model updates on the provider side, and changing retrieved context all mean there is no fixed expected output to compare against. A diff is not a test. Observability has to describe a distribution of behavior rather than confirm a single correct response, and that changes what is worth recording: not just what came back, but enough about the inputs to explain why.

Quality is not a status code

A model can return 200 OK while confidently stating something false. The transport succeeded; the answer failed. None of the signals that an HTTP monitor watches can tell the difference, because the failure lives in the content, not the envelope. Measuring quality requires a second system that reads the output and scores it, whether by checking groundedness against the provided context, running an evaluation suite, or collecting human judgment. Quality is a signal you have to build, not one the protocol hands you.

Cost is per token and per call

A conventional request costs roughly the same whether the payload is short or long. A model request is billed by the token, on input and output, so two calls to the same endpoint can differ in cost by an order of magnitude depending on how much context rode along. Cost is therefore a per-request property that has to be computed and attributed at the moment of the call, not a flat rate you can assume. An aggregate spend number hides exactly the variation that explains it.
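The arithmetic behind that claim can be sketched in a few lines. The prices below are hypothetical placeholders, not figures from this paper or from any provider; the point is only that the same endpoint produces very different costs as context grows.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens. Real prices vary by
# provider and model; these values are assumptions for illustration only.
INPUT_PRICE_PER_MTOK = Decimal("2.00")
OUTPUT_PRICE_PER_MTOK = Decimal("8.00")


def request_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of one call, billed per token on input and output."""
    million = Decimal(1_000_000)
    return (
        Decimal(input_tokens) * INPUT_PRICE_PER_MTOK
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_MTOK
    ) / million


# Two calls to the same endpoint with the same answer length,
# differing only in how much context rode along.
short_call = request_cost(input_tokens=500, output_tokens=200)
long_call = request_cost(input_tokens=20_000, output_tokens=200)
ratio = long_call / short_call  # how many times more the long call costs
```

Under these assumed prices the long-context call costs sixteen times the short one, which is the variation an aggregate spend number averages away.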

Failure often looks like success

The most expensive failure mode is the plausible wrong answer. A service that returns a 500 raises an alert and someone investigates. A model that returns a fluent, well-formatted, incorrect answer raises nothing. It flows downstream, gets trusted, and is discovered only when a human notices or a customer complains. Observability for language models has to surface the failures that do not announce themselves, which means watching content and grounding rather than waiting for an error.

II. The five levels

Organizations do not jump from blindness to governance. They climb. Each level adds a class of visibility, removes a class of guesswork, and makes a new kind of action possible. Naming the levels gives a team a shared vocabulary for where it is and what the next rung requires. For each level below, three questions matter: what can you see, what can you still not see, and what does an organization at this level typically look like.

Level 0: Blind

What you can see: almost nothing. There is a provider invoice at the end of the month and, perhaps, a sense that usage is growing. Calls go from applications straight to the model with no shared path, so there is no place to observe them even in principle.

What you still cannot see: which application spent what, how many calls were made, how large the prompts were, whether any answer was correct, or whether spend is about to spike. Every question of cost or quality is answered with a guess.

Typical organization: early adoption, several teams experimenting independently, each holding its own API key. AI is a line of code in a product, not a system anyone operates. The first surprise invoice is usually what ends Level 0.

Level 1: Aggregate dashboards

What you can see: totals. Spend this month, total tokens, total calls, maybe a daily trend line, usually taken straight from the provider's own console. The number is real and it is finally on a screen someone watches.

What you still cannot see: any decomposition. The total cannot be split by workflow, team, model, or customer. You know spend rose twenty percent and cannot say why, which feature drove it, or whether it was worth it. Quality is entirely absent from the picture.

Typical organization: a central platform or finance function has noticed AI spend and wants a number to report. The number exists. It explains nothing, and the first time leadership asks 'what is driving the increase' the dashboard has no answer.

Level 2: Request-level traces

What you can see: individual requests. Each call is recorded with its tokens, latency, model, and cost, and you can open a single request and inspect it. The data exists at the grain where cost is actually generated.

What you still cannot see: the aggregate meaning of those requests, unless they carry tags. A million traces with no workflow label is a haystack, not a ledger. And quality is still missing: you can see that a request happened and what it cost, but not whether the answer was any good.

Typical organization: an engineering team has adopted tracing, often through OpenTelemetry, and instruments model calls because it instruments everything. The data is rich and largely unused for cost or governance decisions because no one has tagged it to the things leadership cares about.

Level 3: Attribution and quality signals

What you can see: cost and quality mapped to the things that own them. Every request is tagged to a workflow, team, and customer, so the aggregate becomes a ledger you can sort. Alongside cost, a quality signal travels with each request: a groundedness score, an evaluation result, or collected human feedback. For the first time you can say which workflow is expensive and whether it is also accurate.

What you still cannot see: the future. You can explain the present and the past in detail, but spend and risk still land before policy reacts. Governance is observational, not active. You learn that a workflow blew its budget after it did.

Typical organization: a deliberate AI platform team that treats LLM operations as a discipline. They have a gateway or equivalent chokepoint, they tag traffic, and they have stood up at least one quality measurement. This is a strong position and a minority of organizations reach it.

Level 4: Governed and forecasted

What you can see: everything Level 3 sees, plus the future and the guardrails. Cost and quality per workflow feed budgets you can forecast, and policy acts at the moment of the request rather than in a monthly review. A workflow that exceeds its budget is throttled or routed to a cheaper model by rule; a prompt that fails a groundedness threshold is flagged or blocked; an attempt to send a regulated data class is stopped before it reaches the model.

What you still cannot see: nothing that matters remains structurally hidden. The remaining work is refinement (better forecasts, better evaluations, tighter policy) rather than a new class of visibility.

Typical organization: a mature AI operations practice, usually in a regulated or cost-sensitive enterprise, where AI is a governed system with owners, budgets, and policy. Observability here is wired to action: the same telemetry that explains the past enforces the present and predicts the next quarter.
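As a rough illustration of policy acting at the moment of the request, the three rules described for Level 4 might look like the sketch below. The function names, the decision strings, the "regulated" data-class label, and the 0.7 threshold are all assumptions made for this example. The checks are split in two because a groundedness score only exists once an answer has been generated.

```python
from typing import Iterable

# Hypothetical label for data classes that must never reach the model.
BLOCKED_DATA_CLASSES = frozenset({"regulated"})


def pre_request_policy(
    spent_usd: float,
    budget_usd: float,
    data_classes: Iterable[str],
) -> str:
    """Decide what to do with a request before it reaches the model."""
    if BLOCKED_DATA_CLASSES & set(data_classes):
        return "block"  # regulated data is stopped before the call
    if spent_usd >= budget_usd:
        return "route_cheaper"  # over budget: send to a cheaper model by rule
    return "allow"


def post_response_policy(groundedness: float, threshold: float = 0.7) -> str:
    """Decide what to do with an answer before it ships downstream."""
    return "flag" if groundedness < threshold else "allow"


decision_ok = pre_request_policy(40.0, 100.0, ["public"])
decision_over = pre_request_policy(120.0, 100.0, ["public"])
decision_regulated = pre_request_policy(40.0, 100.0, ["public", "regulated"])
decision_answer = post_response_policy(groundedness=0.55)
```

Blocking is checked before budget so that a regulated payload is never rerouted to another model instead of being stopped.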

III. The maturity matrix

The five levels become a diagnostic when laid side by side. The matrix below lets a team locate itself by reading across a row and finding the one that matches what it can actually do today, not what it intends to build.

Level	Cost visibility	Quality visibility	Governance	Action available	Usually here
0 Blind	Invoice only	None	None	Guess and react	Early experimenters
1 Aggregate	Total spend	None	None	Report a number	Finance or central IT
2 Traces	Per request	None	Manual review	Inspect one call	Engineering teams
3 Attribution	Per workflow, team, customer	Per-request quality signal	Observational	Explain and rank spend	AI platform teams
4 Governed	Per workflow plus forecast	Quality gates and drift	Active policy at request time	Forecast, throttle, route, block	Mature AI operations

The LLM observability maturity matrix. Read across to find the row that matches current capability.

Two patterns are worth naming. First, cost visibility and quality visibility advance on separate tracks, and cost almost always leads. A team can reach detailed cost attribution while still scoring zero on quality, which is why a high cost-maturity organization can ship confident wrong answers it never measures. Second, the jump that changes everything is Level 2 to Level 3: it is the move from data to attribution, and it is the one most teams underestimate because the traces already exist and the work feels done.

IV. Request-level telemetry

Everything above Level 2 depends on what you capture at the moment of the call. Telemetry not recorded at request time cannot be reconstructed later: you cannot tag a request to a workflow after the fact if you did not carry the workflow identity into the call, and you cannot score groundedness against a context you did not store. The schema below is the minimum that makes Levels 3 and 4 reachable.

Field	Example	Why it is required
Input tokens	7,420	Largest and fastest-growing cost component
Output tokens	240	Completes the cost picture; typically priced higher per token than input
Cost	$0.0231	Computed per request, not assumed from a flat rate
Latency	1,840 ms	User experience and a proxy for prompt size
Model	gpt-5.5-class	Cost and quality both depend on which model ran
Model version	2026-04	Provider updates change behavior; drift is invisible without it
Policy decisions	allow, redact PII	What governance did, for audit and tuning
Groundedness	0.91	Whether the answer was supported by its context
Workflow tag	claims-assistant	The key that turns traces into an attributable ledger

Minimum request-level telemetry. Capture all of it at the moment of the call.

The workflow tag is the field teams most often omit and most regret omitting. Without it, a million traces are a pile of individually inspectable calls that cannot be summed into anything a budget owner can use. Capturing it costs almost nothing at the call site and is the single highest-return field in the schema. The model version field runs a close second: when a provider silently updates a model and quality shifts, the only way to correlate the change is to have recorded which version answered.
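One way to make the schema hard to get wrong is to encode it as a record type that refuses to exist without a workflow tag. The sketch below uses the example values from the table; the class name, field names, and the choice of JSON as the output format are assumptions for illustration, not a published schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RequestTelemetry:
    """One record per model call, captured at the moment of the call."""

    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: int
    model: str
    model_version: str
    policy_decisions: Tuple[str, ...]
    groundedness: Optional[float]  # None when the request was not scored
    workflow_tag: str

    def __post_init__(self) -> None:
        # Reject untagged records at the call site, where fixing it is cheap.
        if not self.workflow_tag:
            raise ValueError("workflow_tag is required for attribution")


record = RequestTelemetry(
    input_tokens=7420,
    output_tokens=240,
    cost_usd=0.0231,
    latency_ms=1840,
    model="gpt-5.5-class",
    model_version="2026-04",
    policy_decisions=("allow", "redact PII"),
    groundedness=0.91,
    workflow_tag="claims-assistant",
)

# Serialize as one JSON line, ready for whatever log or trace sink is in use.
line = json.dumps(asdict(record))
```

Failing fast on a missing tag is a design choice: it trades a loud error at integration time for the quiet loss of attribution later.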

V. Quality signals

Cost telemetry is mechanical: tokens are countable and price is published. Quality telemetry has to be constructed, and a mature practice builds several signals rather than relying on one, because each sees a different failure.

Groundedness scoring

For any workflow that answers from provided context, the first quality question is whether the answer was actually supported by that context. Groundedness scoring compares the generated answer against the retrieved passages and flags claims that have no support. It is the most direct check on the plausible wrong answer, and because it runs per request it produces a continuous signal rather than a periodic audit.
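To make the idea concrete, here is a deliberately crude sketch that scores an answer by the share of its sentences whose content words appear in the context. This lexical overlap is a stand-in chosen for simplicity, not the scoring method this paper prescribes; production scorers typically rely on an entailment model or a model-based judge. The function names, the 0.6 threshold, and the rule of ignoring words of three letters or fewer are all assumptions.

```python
import re


def _content_words(text: str) -> set:
    """Lowercased alphanumeric words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def groundedness(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences lexically supported by the context."""
    context_words = _content_words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _content_words(sentence)
        if not words:
            supported += 1  # nothing checkable in this sentence
            continue
        overlap = len(words & context_words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)


context = (
    "The policy covers water damage from burst pipes. "
    "Claims must be filed within 30 days."
)
answer = (
    "The policy covers water damage from burst pipes. "
    "Flood damage from rivers is reimbursed fully."
)
score = groundedness(answer, context)
```

Here the first sentence is fully supported and the second is not, so the answer scores 0.5. A lexical check like this misses paraphrase and negation, which is why it should be read as a shape for the signal rather than a usable scorer.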

Evaluation pass rate

A held-out set of representative cases with known good answers, scored on a schedule or on every change, gives a stable quality number that does not depend on live traffic. Evaluation pass rate is what lets a team change a prompt, swap a model, or compress context and know whether quality held. It is the signal that turns cost optimization into a quality-constrained decision rather than a gamble.
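A minimal sketch of that gate follows, assuming an exact-match check and a tolerance of five percentage points; both are illustrative choices, and the evaluation cases and the two stand-in "models" are invented for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical held-out cases: (input, known good answer).
EVAL_SET: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("color of a clear daytime sky", "blue"),
]


def pass_rate(cases: List[Tuple[str, str]], answer_fn: Callable[[str], str]) -> float:
    """Share of cases where the system's answer matches the reference."""
    passed = sum(1 for question, expected in cases if answer_fn(question) == expected)
    return passed / len(cases)


def quality_held(candidate: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True if the candidate did not regress beyond the tolerance."""
    return candidate >= baseline - tolerance


# Stand-ins for the current configuration and a cheaper candidate.
reference = dict(EVAL_SET)


def baseline_system(question: str) -> str:
    return reference[question]


def cheaper_system(question: str) -> str:
    return "Lyon" if question == "capital of France" else reference[question]


baseline_rate = pass_rate(EVAL_SET, baseline_system)
candidate_rate = pass_rate(EVAL_SET, cheaper_system)
ship_it = quality_held(candidate_rate, baseline_rate)
```

The cheaper configuration fails one case in four, so the gate rejects it even though it would have reduced cost.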

Human feedback

Thumbs up and down, corrections, escalations, and abandonment are the judgments of the people the system actually serves. Human feedback is sparse and noisy, but it catches failures the automated signals miss because it reflects whether the answer was useful, not just whether it was grounded or matched a reference. Routed back to the workflow that produced it, feedback becomes a per-workflow quality trend.

Drift detection

Quality is not static. A provider updates a model, a document corpus changes, usage shifts toward cases the system handles poorly, and quality degrades without any code change. Drift detection watches the quality signals over time and alerts when they move, so the question 'is the system still as good as it was' has a continuous answer rather than a quarterly surprise.
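One simple form of drift detection compares a recent window of a quality signal against an earlier baseline window and alerts when the mean falls by more than a set amount. The window sizes and the 0.05 threshold below are assumptions, and a real detector would probably use a statistical test rather than a fixed difference.

```python
from statistics import mean
from typing import Sequence


def drift_alert(
    scores: Sequence[float],
    baseline_n: int,
    recent_n: int,
    max_drop: float = 0.05,
) -> bool:
    """True if the recent mean fell more than max_drop below the baseline mean."""
    if len(scores) < baseline_n + recent_n:
        return False  # not enough history to compare two separate windows
    baseline = mean(scores[:baseline_n])
    recent = mean(scores[-recent_n:])
    return (baseline - recent) > max_drop


# Daily mean groundedness for one workflow (illustrative values).
degrading = [0.92, 0.91, 0.93, 0.92, 0.90, 0.84, 0.83, 0.82]
stable = [0.92, 0.91, 0.93, 0.92, 0.92, 0.91, 0.93, 0.92]

alert_degrading = drift_alert(degrading, baseline_n=4, recent_n=3)
alert_stable = drift_alert(stable, baseline_n=4, recent_n=3)
```

Run per workflow and per model version, the same comparison also shows whether a drop coincides with a provider update.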

VI. Attribution: from an aggregate to a ledger

Attribution is the hinge of the whole model. Below it, observability describes a system in the abstract. Above it, observability describes who is responsible for what. The mechanism is simple and the consequence is large: tag every request with the workflow, team, and customer it serves, and the single aggregate number becomes a table that can be sorted, ranked, charged back, and held to a budget.

Cost per workflow, team, and customer

Once requests carry tags, total spend decomposes. You can rank workflows by cost and find that, as is typical, a few of them account for most of the bill. You can show a team its own consumption. You can compute the cost to serve a given customer, which for usage-based pricing is the difference between a margin you can defend and one you are guessing at. The aggregate was a wall; the ledger is a map.

Quality per workflow, team, and customer

The same tags route quality signals to the same owners. Now the two pictures sit side by side: this workflow is expensive and accurate, that one is cheap and ungrounded, this customer's traffic is both costly and low quality. Decisions that were impossible become obvious. You invest in the workflows that earn it, fix the ones that are cheap but wrong, and reprice the customers whose usage does not pay for itself.
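Mechanically, the ledger is a group-by over tagged requests. The sketch below assumes each request is a dictionary with "workflow", "cost_usd", and "groundedness" keys; those names and the sample values are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative tagged requests; in practice these come from the trace store.
requests = [
    {"workflow": "claims-assistant", "cost_usd": 0.030, "groundedness": 0.90},
    {"workflow": "search-summary", "cost_usd": 0.002, "groundedness": 0.50},
    {"workflow": "claims-assistant", "cost_usd": 0.040, "groundedness": 0.80},
    {"workflow": "search-summary", "cost_usd": 0.003, "groundedness": 0.60},
    {"workflow": "search-summary", "cost_usd": 0.001, "groundedness": 0.40},
]


def build_ledger(rows: List[Dict], key: str = "workflow") -> List[Dict]:
    """Aggregate cost and mean groundedness per tag, most expensive first."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "grounded_sum": 0.0})
    for row in rows:
        bucket = totals[row[key]]
        bucket["calls"] += 1
        bucket["cost_usd"] += row["cost_usd"]
        bucket["grounded_sum"] += row["groundedness"]
    ledger = [
        {
            key: tag,
            "calls": b["calls"],
            "cost_usd": b["cost_usd"],
            "mean_groundedness": b["grounded_sum"] / b["calls"],
        }
        for tag, b in totals.items()
    ]
    return sorted(ledger, key=lambda entry: entry["cost_usd"], reverse=True)


ledger = build_ledger(requests)
```

Passing a different key, such as a team or customer tag, yields the other views from the same data, which is why the tag has to be present on every request.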

Attribution is what makes every later capability possible. A budget needs something to budget; a forecast needs a unit to forecast; a chargeback needs a tag to charge. None of it exists without the discipline of carrying identity into every request and recording it. This is why the Level 2 to Level 3 jump is the one that matters most: the traces were already there, but until they are attributed they cannot be governed.

VII. From observability to action

A dashboard that no one acts on is a cost, not a capability. The value of seeing is the decision it changes, and a mature practice wires its telemetry into a closed loop rather than a wall of charts. The loop has four steps, and a practice is only as mature as the slowest one it closes.

1. Detect. A signal crosses a threshold: a workflow's cost per request jumps, a groundedness score falls, an evaluation set regresses, a budget nears its limit. Detection is automatic and continuous, not a human noticing a trend in a monthly meeting.
2. Decide. The signal maps to a choice: compress the context, route to a cheaper model, tighten retrieval, roll back a prompt change, or throttle a runaway workflow. Attribution makes the decision specific because it points at the workflow responsible rather than the aggregate.
3. Change. The decision is applied, ideally as configuration at the gateway rather than a code change in an application, so the action is fast and reversible.
4. Verify. The same telemetry that detected the problem confirms the fix: cost fell, quality held, the budget is back in range. Verification closes the loop and guards against a change that cut cost while quietly degrading the answer.

The loop is what separates Level 4 from Level 3. A Level 3 organization can perform detect and decide but executes change and verify by hand, slowly, after the spend or the bad answer has already landed. A Level 4 organization closes the loop at request time: the budget rule throttles before the overrun, the groundedness gate flags before the answer ships, and verification is continuous. Observability that does not change anything is just charts; the loop is the difference between watching and operating.
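The four steps can be sketched as a small control loop. Everything named here is hypothetical: the gateway configuration is a plain dictionary, the only available decision is routing to a cheaper model, and the post-change measurement is supplied by the caller rather than read from live telemetry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class WorkflowStats:
    cost_per_request: float
    groundedness: float


def detect(stats: WorkflowStats, cost_limit: float) -> bool:
    """Detect: a signal crosses a threshold."""
    return stats.cost_per_request > cost_limit


def decide(stats: WorkflowStats) -> str:
    """Decide: map the signal to a choice (only one choice in this sketch)."""
    return "route_to_cheaper_model"


def change(config: Dict[str, str], workflow: str, decision: str) -> Dict[str, str]:
    """Change: apply the decision as configuration, leaving the original intact."""
    updated = dict(config)
    if decision == "route_to_cheaper_model":
        updated[workflow] = "small-model"
    return updated


def verify(after: WorkflowStats, cost_limit: float, quality_floor: float) -> bool:
    """Verify: cost fell and quality held."""
    return after.cost_per_request <= cost_limit and after.groundedness >= quality_floor


def run_loop(
    workflow: str,
    before: WorkflowStats,
    measure_after: Callable[[Dict[str, str]], WorkflowStats],
    config: Dict[str, str],
    cost_limit: float,
    quality_floor: float,
) -> Tuple[Dict[str, str], str]:
    if not detect(before, cost_limit):
        return config, "no_action"
    candidate = change(config, workflow, decide(before))
    if verify(measure_after(candidate), cost_limit, quality_floor):
        return candidate, "applied"
    return config, "rolled_back"  # the change cut cost but degraded the answer


config = {"claims-assistant": "large-model"}
before = WorkflowStats(cost_per_request=0.05, groundedness=0.90)

applied_config, outcome_ok = run_loop(
    "claims-assistant", before, lambda c: WorkflowStats(0.01, 0.89),
    config, cost_limit=0.03, quality_floor=0.85,
)
kept_config, outcome_bad = run_loop(
    "claims-assistant", before, lambda c: WorkflowStats(0.01, 0.60),
    config, cost_limit=0.03, quality_floor=0.85,
)
```

The second run shows why verification matters: the cheaper route meets the cost limit, but groundedness falls below the floor, so the original configuration is kept.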

VIII. How to advance one level

Maturity is reached one rung at a time, and trying to skip rungs usually means building governance on telemetry that is not there yet. The sequence below moves an organization from the bottom toward the top, and each step de-risks the next.

1. From 0 to 1: route AI traffic through a shared path. Put a gateway or equivalent chokepoint between applications and models so that calls become observable in principle, then publish total spend, tokens, and calls on a screen someone owns.
2. From 1 to 2: record every request. Capture tokens, cost, latency, model, and version per call, using a standard such as OpenTelemetry so the data lands in tools the organization already runs.
3. From 2 to 3: tag and attribute. Carry a workflow, team, and customer identity into every request, and decompose spend by those tags. In parallel, stand up at least one quality signal, usually groundedness scoring or a small evaluation set.
4. From 3 to 4: govern and forecast. Set budgets per workflow, forecast from the attributed history, and turn the strongest signals into request-time policy that throttles, routes, gates, or blocks. Close the detect, decide, change, verify loop so action no longer waits for a human review.
5. Sustain: watch for drift and revisit the lower rungs. Provider updates, corpus changes, and new workflows erode quality and inflate cost silently, so the practice keeps measuring rather than treating maturity as a finished project.

The order is not arbitrary. Attribution before governance, telemetry before attribution, a shared path before telemetry. An organization that tries to set budgets before it can attribute spend is governing a number it cannot decompose, and the budget will be ignored the first time someone asks which workflow to cut.
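As an illustration of why forecasting needs attribution first, the sketch below fits a least-squares line to one workflow's monthly cost and projects the next period. A straight-line trend is an assumption chosen for brevity, not a recommended forecasting method; a real forecast would likely also account for planned volume, pricing changes, and seasonality. The monthly figures are invented.

```python
from typing import Sequence


def linear_forecast(history: Sequence[float], periods_ahead: int = 1) -> float:
    """Project a value periods_ahead past the end of history with a fitted line."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two periods of history")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in enumerate(history)
    ) / sum((x - x_mean) ** 2 for x in range(n))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + periods_ahead)


# Monthly attributed cost in USD for a single workflow (illustrative values).
claims_assistant_history = [1000.0, 1200.0, 1400.0, 1600.0]
next_month = linear_forecast(claims_assistant_history)
```

The function only makes sense per workflow: fitting the same line to an unattributed total would predict a number without saying which workflow to act on.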

IX. A self-assessment

The fastest way to find your level is to answer a short set of questions honestly, counting only what you can do today rather than what is on a roadmap.

Locate your level in five questions

1) Can you report total AI spend without opening a provider console? If no, you are at Level 0.
2) Can you split that spend by workflow, team, or customer? If no, you are at Level 1 or 2.
3) Can you open a single request and see its tokens, cost, model, and version? If yes, you are at least Level 2.
4) Does a quality signal such as groundedness or an evaluation pass rate travel with your requests? If yes, you are reaching Level 3.
5) Does policy act at request time, by rule, before spend or risk lands, and can you forecast next quarter's cost from your own telemetry? If yes, you are at Level 4.

Your level is the highest one for which every prior answer also holds.
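The five questions reduce to a short decision function. The sketch below is one reading of the rule that a level counts only if every prior answer also holds; in particular, it treats Level 3 as requiring both attribution and a quality signal, which follows the maturity matrix but is an interpretation rather than something the questions state outright. The parameter names are invented for the example.

```python
def locate_level(
    total_spend_visible: bool,
    request_traces: bool,
    spend_split_by_tag: bool,
    quality_signal: bool,
    request_time_policy_and_forecast: bool,
) -> int:
    """Return the highest maturity level for which every prior answer holds."""
    if not total_spend_visible:
        return 0
    if not request_traces:
        return 1
    if not (spend_split_by_tag and quality_signal):
        return 2
    if not request_time_policy_and_forecast:
        return 3
    return 4


# A team with full cost attribution but no quality measurement.
split_team = locate_level(
    total_spend_visible=True,
    request_traces=True,
    spend_split_by_tag=True,
    quality_signal=False,
    request_time_policy_and_forecast=False,
)
```

Under this reading a team with detailed cost attribution and no quality signal lands at Level 2, because the missing quality track holds the overall level back.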

The common result is a split: cost answers that reach Level 3 and quality answers stuck at Level 0. That is not a failure of effort. It is the predictable shape of a practice that instrumented the easy signal first. Naming the split is the value of the assessment, because it points at the specific next rung rather than a vague ambition to do better.

X. Conclusion

Language models broke the assumptions that conventional monitoring was built on. Output is non-deterministic, quality is content rather than a status code, cost varies per request, and the worst failure is the one that looks like success. An organization that watches its AI the way it watched its services will see request rates and error codes and miss every question that matters: which workflow is expensive, which one is wrong, and what next quarter will cost.

The maturity model is a way to stop guessing. It names the rungs, shows that cost and quality climb on separate tracks, and identifies attribution as the hinge that turns one aggregate number into a ledger anyone can act on. The destination is not more dashboards. It is a practice where the same telemetry that explains the past enforces the present and forecasts the future, wired into a loop that detects, decides, changes, and verifies. Observability that does not change a decision is decoration. The enterprises that wire seeing to action will spend less, ship fewer wrong answers, and be able to say, on any request, exactly what their AI did and why.

References

[1] Beyer, Jones, Petoff, Murphy (eds.), Site Reliability Engineering, O'Reilly, 2016 (Service Level Objectives and error budgets).
[2] OpenTelemetry, Specification and Semantic Conventions, Cloud Native Computing Foundation.
[3] FinOps Foundation, FinOps Framework: Principles, Domains, and Capabilities.
[4] NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, 2023.
[5] Liang et al., Holistic Evaluation of Language Models (HELM), Stanford CRFM, 2022.
[6] Ryshe, The Quanta Telemetry Schema v1: Fields, Tags, and Quality Signals.