Enterprises already place gateways in front of every other critical dependency. APIs pass through an API gateway, traffic through a load balancer and firewall, identity through a single sign-on broker. Large language model traffic is the exception. Applications call models directly, with no shared point where policy can be enforced, cost attributed, context inspected, or behavior recorded. This paper defines the context gateway as the control plane that closes that gap: a single policy-enforcing path that every application points at instead of the model. It explains the four capabilities a gateway consolidates, why those capabilities cannot live in application code, the SDK, or a provider dashboard, and how an OpenAI-compatible gateway is adopted as a configuration change rather than a migration. It closes with a staged rollout, a point-solution comparison, and honest answers to the objections an architect should raise.
- 1.Enterprises gate every critical dependency except LLM traffic, which still flows app-to-model with no chokepoint.
- 2.A context gateway is a single policy-enforcing path that applications target instead of the model, and it is model-agnostic by design.
- 3.It consolidates four capabilities into one layer: control of what is sent, compression of context, governance of policy and audit, and observability of cost and behavior.
- 4.These capabilities cannot be enforced from application code, an SDK convention, or a provider dashboard, because none of them sees all traffic or can modify a request in flight.
- 5.An OpenAI-compatible gateway is adopted by swapping a base URL, so it is a configuration change rather than a rewrite.
- 6.A staged rollout that begins in observe-only mode de-risks adoption: value arrives before any policy is enforced.
- 7.Consolidating into a gateway replaces a sprawl of point solutions whose combined maintenance and blind spots cost more than the gateway they could have shared.
Executive summary
Every mature enterprise architecture has chokepoints by design. An API gateway is where rate limits, authentication, and request shaping are enforced. A firewall is where network policy lives. An identity provider is where access decisions are made. These layers exist because policy that is scattered across every caller is policy that cannot be enforced. LLM traffic has no such layer. Applications, agents, and copilots call models directly, and the most expensive, least predictable, and least governed dependency in the modern stack is the one with no control plane in front of it.
A context gateway is that missing layer. It is a single path that sits between enterprise applications and the models they call, speaks the same API the model speaks, and applies policy to every request that passes. Because it is model-agnostic, it decouples the application from any one provider. Because it sees every request, it is the one place where cost, governance, and observability can be made consistent rather than reimplemented per team.
This paper is written for the CIO, CISO, and VP of Engineering who already accepts that AI is now production infrastructure and is asking the next question: where is the control plane. It defines the gateway precisely, contrasts it with the alternatives teams reach for first, places it in an Azure-native reference architecture, and lays out a rollout that starts in observe-only mode and earns the right to enforce policy one stage at a time.
I. The control gap
Ask an enterprise architect to draw the path of an API call and they will draw a gateway. Ask for the path of a database query and they will draw a connection pool, a proxy, a set of credentials scoped by role. Ask for the path of an LLM call and most will draw a straight line from the application to the model. That straight line is the problem.
Every other dependency already has a chokepoint
Chokepoints are not an accident of legacy design. They are where an enterprise expresses intent that must hold regardless of which team wrote the calling code. Rate limiting belongs at the API gateway because no individual service can be trusted to throttle itself on behalf of the whole. Network egress rules belong at the firewall because per-host configuration drifts. Access decisions belong at the identity provider because authorization scattered across applications is authorization no one can audit. The pattern is consistent: when a concern must be enforced uniformly, it is moved to a layer every request must cross.
LLM traffic is the exception, and it is the worst one to leave open
LLM calls are the one dependency that escaped this pattern, and they are arguably the dependency that needs it most. They are billed per token, so cost is variable and unbounded. They are non-deterministic, so behavior cannot be fully predicted from the code. They carry whatever context the application chose to assemble, including, potentially, regulated data. They are the newest entry in the stack, added quickly, often by application teams under delivery pressure, with no shared standard for how a request should be shaped or what it is allowed to contain.
Direct application-to-model calls mean there is no point at which the enterprise can say what may be sent, record what was sent, or change what is sent without redeploying the application that sent it. Policy that cannot be enforced at a single point is policy that exists only as documentation. The control gap is not that enterprises lack opinions about LLM usage. It is that they have no place to apply them.
II. What a context gateway is
A context gateway is a single, policy-enforcing path between enterprise applications and the language models they use. Applications are configured to send their model requests to the gateway. The gateway applies policy, optionally transforms the request, forwards it to the appropriate model, and returns the response, recording what happened along the way. From the application's point of view it is talking to a model. From the enterprise's point of view it now has a control plane.
One path, enforced for everyone
The defining property is singularity. There is one path that every LLM request crosses, and policy applied there applies to all of it, regardless of which team, language, or framework produced the request. A new rule about what data may leave the boundary does not require ten application teams to each ship a change. It is enforced once, at the gateway, for everyone, immediately.
Model-agnostic by design
A context gateway is not bound to one provider. It presents a stable interface to applications and routes to whichever model is appropriate behind that interface: a frontier model for hard reasoning, a smaller model for classification, a regional deployment for a data-residency requirement. The application is decoupled from the model. Swapping providers, adding a fallback, or splitting traffic by task becomes a gateway decision rather than a code change in every caller.
A dumb proxy forwards bytes. A context gateway understands that the payload is a model request: it can decompose the prompt, attribute the cost, redact a field, choose a model, enforce a policy, and emit an audit record. The difference between the two is the difference between a pipe and a control plane.
III. The four capabilities, one layer
A context gateway earns its place by consolidating four capabilities that would otherwise be built, badly and repeatedly, inside every application. Each is valuable alone. The reason to put them in one layer is that they share the same precondition: a single point where every request can be seen and shaped.
Control: decide what may be sent and where it may go
Control is policy applied to the request before it reaches a model. Which applications may call which models. What a request may contain, and what must be stripped or redacted before it leaves the boundary. How many tokens or calls a workflow may consume in a window. Which provider or region a class of data is allowed to reach. Control is the capability that turns informal guidance into enforced rules, because the gateway can refuse, rewrite, or reroute a request that violates them.
Compress: send the context the task needs, not the context it accumulated
Compression reduces the input a request carries without changing the model or, when gated by evaluation, the answer. The gateway can summarize replayed history, rerank and prune retrieved passages, dedupe repeated instructions, and trim tool schemas to the step at hand. Applied centrally, compression is consistent across every workflow and reversible by configuration. This is the subject of the companion paper on context bloat; here it is one of four capabilities the gateway provides, not the whole story.
Govern: prove what happened after the fact
Governance is the record. For every request the gateway can capture which application made it, which model received it, what the prompt was composed of, what policy was applied, and what came back. That record is what lets a regulated enterprise answer the question an auditor or incident reviewer will eventually ask: exactly what data reached which model, on which request, and under what policy. Governance is not a report generated monthly. It is a property of every call passing through one accountable layer.
Observe: see cost and behavior per workflow, in real time
Observability is attribution and measurement. The gateway can tag each request to a workflow, team, and customer, decompose its token usage, track latency and error rates, and watch for drift in behavior or spend. Provider dashboards report aggregates. The gateway reports the breakdown that lets an owner act: which workflow grew, which is failing, which costs what.
IV. Why not the app, the SDK, or the provider
The natural objection is that these capabilities could live somewhere cheaper. In the application code. In a shared SDK. In the provider's own dashboard. Each of those places fails on a property the gateway has by construction, and the failure is the same in every case: none of them is a single point that sees and can modify all traffic.
Application code cannot enforce what it can choose to skip
Policy written into application code is policy each application can implement differently, partially, or not at all. A rule lives in ten codebases in three languages, drifts as each evolves, and is one missed code review away from being silently absent. There is no enterprise-wide enforcement, only a convention that holds until it does not.
An SDK is a convention, not a boundary
A shared client library is better than nothing, but it is opt-in by construction. A team can pin an old version, call the provider directly to debug, or stand up a new service that never imports it. An SDK cannot see traffic that does not flow through it, and it runs inside the application's trust boundary, so it cannot be an independent control point. It improves the common case and enforces nothing.
A provider dashboard reports the past for one provider
Provider consoles show aggregate usage after the fact, for that provider only. They cannot modify a request in flight, cannot attribute spend to your internal workflows, cannot apply your data policy, and cannot give a consolidated view across two providers. They are a billing and quota surface, not a control plane.
| Dimension | Application code | SDK convention | Provider dashboard | Context gateway |
|---|---|---|---|---|
| Enforceable for all traffic | No, per app | No, opt-in | No, read-only | Yes, single path |
| Model-agnostic | Per app effort | Per SDK effort | No, one provider | Yes, by design |
| Can modify context in flight | Yes, per app | Yes, if used | No | Yes, centrally |
| Audit record of what was sent | Fragmented | Partial | Aggregate only | Complete, per request |
| Cost control and attribution | Per app, manual | Per app, manual | Aggregate, after the fact | Per workflow, real time |
| Independent of app trust boundary | No | No | Yes | Yes |
The pattern in the table is the point. Every alternative is strong on one column and absent on the rest. Only a layer that all traffic crosses, that is independent of the application, and that understands the request can hold every column at once.
V. Reference placement and data flow
A context gateway sits on the path between the things that originate model requests and the models that serve them. On one side are the request sources: business applications, autonomous agents, and embedded copilots. On the other are the models: an Azure OpenAI deployment, a second provider held in reserve, a smaller regional model for residency-constrained data. The gateway is the waist of the hourglass, the one segment every request narrows through.
The request lifecycle
A request crosses the gateway in a defined sequence rather than as an opaque forward. The lifecycle is what makes the four capabilities concrete.
- 1Authenticate and identify. The gateway verifies the calling application or agent and tags the request with its workflow, team, and tenant.
- 2Apply policy. Rules decide whether the request is permitted, which models it may reach, and what it may contain. Disallowed or regulated fields are redacted before the request leaves the boundary.
- 3Shape the context. The prompt is decomposed into its parts. Where a compression policy applies and has passed evaluation, history is summarized, retrieval is pruned, and schemas are trimmed.
- 4Route and forward. The gateway selects the appropriate model deployment, by task, cost, or residency, and forwards the shaped request, applying fallback if the primary is unavailable.
- 5Record and return. Cost, token composition, latency, policy applied, and outcome are written to the audit and observability stores, and the response is returned to the caller.
Nothing in this lifecycle is visible to the application beyond a small, bounded latency. The caller sent a model request and received a model response. The enterprise gained a controlled, recorded, attributable transaction.
VI. The OpenAI-compatible adoption path
The reason a gateway can be adopted without a migration is that it speaks the API the application already speaks. Most enterprise LLM code is written against the OpenAI-compatible chat-completions interface, the same shape Azure OpenAI exposes. A gateway that presents that same interface is, to the application, indistinguishable from the model.
A base-URL swap, not a rewrite
Adoption is a configuration change. The application's client is already pointed at a base URL and carries a key. Repoint the base URL at the gateway, issue the gateway's credential, and the next request flows through the control plane. No request or response schema changes. No SDK is replaced. No business logic is touched. The same call that went to the provider now goes to the gateway, which goes to the provider.
# before: application points directly at the provider OPENAI_BASE_URL=https://my-resource.openai.azure.com/ OPENAI_API_KEY=sk-provider-key # after: application points at the context gateway # request and response shapes are unchanged OPENAI_BASE_URL=https://gateway.internal.ryshe.com/v1 OPENAI_API_KEY=gw-scoped-key
Speaking the model API keeps the door open
Because the gateway speaks the model's API rather than inventing its own, the enterprise retains its exits. An application can be repointed back at a provider in the same one-line change that adopted the gateway. There is no proprietary protocol to rip out later. Compatibility is what makes adoption reversible, and reversibility is what makes adoption safe to start.
VII. A staged rollout that de-risks itself
A gateway that begins by enforcing policy asks teams to trust it before it has earned trust. A better rollout inverts that order. It starts by observing, proves its measurements are correct, and enforces only after the data has made the case. Each stage delivers value and de-risks the next.
- 1Observe only. Repoint traffic at the gateway in pass-through mode. Change nothing about the request. Establish a per-workflow baseline of tokens, cost, latency, and prompt composition. The gateway proves it is transparent before it is given any authority.
- 2Attribute. Tag every request to a workflow, team, and tenant, and decompose each prompt into its system, retrieval, history, and schema shares. The enterprise now knows where its spend and its behavior come from.
- 3Apply policy. With the baseline established, turn on enforcement where it is uncontroversial first: provider and model allow-lists, rate limits, and redaction of disallowed fields. Policy is now enforced uniformly, at one point.
- 4Compress behind a gate. For the heaviest workflows, introduce compression as candidate configurations scored against a held-out evaluation set, and promote only those that preserve quality. Savings arrive with proof, not on faith.
- 5Route. With confidence in the path, enable model routing and fallback: cheaper models for simple tasks, regional deployments for residency, automatic failover for resilience. The gateway is now a full control plane.
The first stage changes nothing and exists to prove the gateway is faithful. A team that can see, in its own baseline, that the gateway forwarded its requests unaltered and measured them accurately is a team that will accept enforcement in the next stage. Start by earning trust, then spend it.
VIII. Point solutions versus a gateway
An enterprise that does not consolidate will accumulate. A cost-tracking tool here, a prompt-logging library there, a separate redaction service, a homegrown router, a spreadsheet that reconciles provider invoices. Each was a reasonable purchase for one problem. Together they are a sprawl that no one owns end to end, with seams between every pair where traffic and policy fall through.
| Concern | Point solutions | Consolidated gateway |
|---|---|---|
| Cost control | Per-tool view, reconciled by hand, no enforcement | Attributed per workflow, enforced with limits, one view |
| Governance and audit | Logs scattered across tools, gaps at the seams | One complete record per request, under one policy |
| Observability | Each tool sees its slice only | All traffic, decomposed, in one place |
| Policy enforcement | Where each tool happens to sit | Single path, applied to every request |
| Maintenance | N integrations to own, version, and reconcile | One layer to operate and upgrade |
| Coverage of new apps | Each must be wired into each tool | Repoint one base URL, fully covered |
The hidden cost of point solutions is not their license fees. It is the integration surface between them, the traffic that slips through the gaps, and the standing question of which tool is authoritative when two disagree. A gateway is not one more tool in that set. It is the layer that makes most of them unnecessary, because the place they each wanted to sit is the place the gateway already is.
IX. Objections and honest answers
An architect who takes this seriously will raise three objections immediately. Each is legitimate, and each has a real answer rather than a slogan.
Latency: does another hop slow every call?
It adds a hop, and the honest answer is that the hop is small relative to model inference and is frequently outweighed by what the gateway removes. Model generation dominates end-to-end time; a co-located gateway adds low single-digit-millisecond overhead to authenticate, apply policy, and record. When the gateway also compresses context, the smaller prompt lowers time-to-first-token, so the net effect on a heavy workflow is often faster, not slower. The gateway should be deployed close to the application and the model, and its own overhead should be measured and published as part of the observe-only baseline.
Single point of failure: does centralizing create fragility?
Centralizing a path does concentrate it, and that is a real engineering responsibility, not a reason to avoid the pattern. The same objection was raised, and answered, for API gateways, load balancers, and identity providers, all of which an enterprise already runs as critical infrastructure. The answer is the same: run the gateway as stateless, horizontally scaled replicas behind a load balancer across availability zones, with health checks and an optional pass-through fail-open mode for non-policy-critical traffic. A gateway built to the standard of the API gateway beside it raises availability, because it is also where fallback and failover between model providers now live.
Vendor lock-in: does the gateway trap us?
This is the objection the design answers most directly. Because the gateway speaks the OpenAI-compatible model API rather than a proprietary protocol, the application is coupled to a standard interface, not to the gateway. Removing it is the same one-line base-URL change that adopted it. The lock-in worth worrying about is the opposite one: applications hard-wired to a single provider's endpoint and key across dozens of codebases. The gateway is what removes that lock-in, by making the provider a routing decision rather than a dependency compiled into every caller.
X. Conclusion
Enterprises did not put gateways in front of APIs, networks, and identity out of habit. They did it because uniform policy requires a single place to enforce it, and uniform records require a single place to capture them. LLM traffic is now production infrastructure carrying real cost, real data, and real risk, and it is the one critical dependency still flowing directly from application to model with nothing in between. A context gateway closes that gap. It is the model-agnostic control plane where control, compression, governance, and observability become properties of the architecture rather than projects in every application. It is adopted as a configuration change, rolled out without enforcing anything until it has earned trust, and removed as easily as it was added. The enterprises that place it early will not just spend less on AI. They will be the ones able to say, on any request, exactly what their systems sent, to which model, and under what policy.
References
- [1]NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023.
- [2]OWASP, Top 10 for Large Language Model Applications, 2025.
- [3]Microsoft, Azure Architecture Center: Azure OpenAI and gateway implementation guidance for generative AI workloads.
- [4]FinOps Foundation, FinOps Framework: Principles, Domains, and Capabilities.
- [5]Richardson and Ruby, RESTful Web Services, and the API gateway and control-plane pattern as described in the Microsoft Cloud Design Patterns catalog.

