Documentation
Introduction
Quanta is an Azure-native AI context gateway with a long-term memory API. You point your application's model base URL at the gateway to compress context and cut token cost with no code changes, and you use the memory API to recall across millions of tokens of history while sending only a small, relevant slice to the model.
Two surfaces, one key:
| Surface | Base URL | Description |
|---|---|---|
| Gateway | https://quanta.ryshe.com/api/v1 | OpenAI-compatible. Compresses + forwards to your model. |
| Memory | https://quanta.ryshe.com/api/memory | Ingest, recall, and evaluate long-term memory. |
Quickstart
1. Create an account at /signup and verify your email.
2. Your API key (sk-rsh-…) is shown after verification and emailed to you.
3. Start calling the API. It works immediately on the free plan.
# 1. Ingest a conversation
curl -s https://quanta.ryshe.com/api/memory/ingest \
-H "authorization: Bearer $QUANTA_KEY" \
-H "content-type: application/json" \
-d '{"title":"Project thread","messages":[
{"role":"user","content":"The launch date is March 14, 2027."}
]}'
# -> { "conversationId": "conv_…", "chunks": 1, "sourceTokens": 12, ... }
# 2. Ask about it later — answered from memory, not the full transcript
curl -s https://quanta.ryshe.com/api/memory/query \
-H "authorization: Bearer $QUANTA_KEY" \
-H "content-type: application/json" \
-d '{"conversationId":"conv_…","question":"When is the launch?"}'
Authentication
Every request authenticates with your account API key as a Bearer token. Keys start with sk-rsh- and are issued after email verification. Keep them secret; treat a key like a password.
authorization: Bearer sk-rsh-xxxxxxxxxxxxxxxxxxxxxxxx
Requests without a valid, verified key return 401. Each account's data is fully isolated: a key can only read and write its own conversations.
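As a minimal sketch of the header format above, the helper below builds an authenticated request with the Python standard library and does not send it. The function name `quanta_request` and the `BASE_URL` constant are illustrative choices, not part of Quanta's API.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://quanta.ryshe.com/api"


def quanta_request(path: str, api_key: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for a Quanta endpoint."""
    # Keys are documented to start with "sk-rsh-"; catch obvious mix-ups early.
    if not api_key.startswith("sk-rsh-"):
        raise ValueError("expected a Quanta key starting with 'sk-rsh-'")
    headers = {"authorization": f"Bearer {api_key}"}
    data = None
    if body is not None:
        headers["content-type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    # urllib uses POST when data is present and GET otherwise.
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers)
```

Pass the result to `urllib.request.urlopen` to send it. With urllib, a 401 surfaces as `urllib.error.HTTPError`.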
Gateway API
The gateway is OpenAI-compatible. Point your existing client at it by changing only the base URL; request and response shapes are unchanged. It compresses the prompt, forwards it to your model provider, streams the real response back, and reports the savings in response headers.
Chat completions
/api/v1/chat/completions
Send a standard chat-completions body. Provide your model-provider key in the x-provider-key header (it is forwarded, never stored). Omit the provider key to get a dry run: the gateway returns only the compression report and forwards nothing.
import os
from openai import OpenAI

# The SDK sends api_key as "authorization: Bearer <key>", so it carries the Quanta key.
# PROVIDER_KEY is an assumed environment variable holding your own OpenAI/Azure key.
client = OpenAI(
    base_url="https://quanta.ryshe.com/api/v1",
    api_key=os.environ["QUANTA_KEY"],
    default_headers={"x-provider-key": os.environ["PROVIDER_KEY"]},
)
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
)
Savings headers
Every gateway response includes the token accounting for that call:
| Field | Type | Description |
|---|---|---|
| x-quanta-tokens-before | int | Estimated tokens before compression. |
| x-quanta-tokens-after | int | Tokens after compression (what the model saw). |
| x-quanta-tokens-saved | int | Tokens saved (before minus after). |
| x-quanta-saved-pct | float | Percent reduction. |
| x-quanta-engine | string | Compression engine used (inline / headroom). |
| x-quanta-mode | string | dry-run, inline, or metered. |
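The headers in the table can be read into a typed record. This is a sketch only: the `Savings` class and `parse_savings` function are hypothetical names, and it assumes every header is present on the response.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Savings:
    tokens_before: int
    tokens_after: int
    tokens_saved: int
    saved_pct: float
    engine: str
    mode: str


def parse_savings(headers: Mapping[str, str]) -> Savings:
    """Read the x-quanta-* accounting headers from a response header mapping."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return Savings(
        tokens_before=int(h["x-quanta-tokens-before"]),
        tokens_after=int(h["x-quanta-tokens-after"]),
        tokens_saved=int(h["x-quanta-tokens-saved"]),
        saved_pct=float(h["x-quanta-saved-pct"]),
        engine=h["x-quanta-engine"],
        mode=h["x-quanta-mode"],
    )
```

With the OpenAI Python SDK, the raw headers are available through `client.chat.completions.with_raw_response.create(...)`, whose result exposes a `headers` attribute.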
Memory API
The memory API stores conversations in a vector-backed store and reconstructs only the relevant context for each question, within a fixed token budget. See How memory works.
Ingest a conversation
/api/memory/ingest
Commit a transcript to memory synchronously. Accepts up to ~375k tokens (1.5M characters); for larger transcripts use the async endpoint below.
| Field | Type | Description |
|---|---|---|
| title | string | A label for the conversation. |
| messages | array? | [{ role, content }] turns. Provide this or transcript. |
| transcript | string? | Raw transcript text. Provide this or messages. |
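The request fields above can be assembled and routed with a small helper. This is a sketch under two assumptions: "Provide this or transcript" is read as exactly one of the two, and the 1.5M-character cap is measured on the message contents or raw transcript. The function names are illustrative.

```python
from typing import Mapping, Optional, Sequence

SYNC_MAX_CHARS = 1_500_000  # documented sync cap (~375k tokens)


def build_ingest_body(
    title: str,
    messages: Optional[Sequence[Mapping[str, str]]] = None,
    transcript: Optional[str] = None,
) -> dict:
    """Build an ingest body holding either messages or a raw transcript."""
    # Assumption: the two inputs are mutually exclusive.
    if (messages is None) == (transcript is None):
        raise ValueError("provide exactly one of messages or transcript")
    body: dict = {"title": title}
    if messages is not None:
        body["messages"] = [{"role": m["role"], "content": m["content"]} for m in messages]
    else:
        body["transcript"] = transcript
    return body


def ingest_path(body: Mapping) -> str:
    """Pick the sync endpoint when the text fits the cap, else the async one."""
    if "transcript" in body:
        size = len(body["transcript"])
    else:
        size = sum(len(m["content"]) for m in body["messages"])
    return "/api/memory/ingest" if size <= SYNC_MAX_CHARS else "/api/memory/ingest-job"
```

Checking the size client-side avoids a round trip that would otherwise end in a 413.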
{
"conversationId": "conv_ab12cd34",
"chunks": 80,
"sourceTokens": 1736,
"facts": 12,
"embedderLive": true
}
Ingest asynchronously (large transcripts)
/api/memory/ingest-job
/api/memory/ingest-job?id={jobId}
For transcripts up to ~10M tokens. The POST returns a job id immediately and processing continues in the background; poll the GET endpoint for progress.
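The polling step can be sketched as a loop around a caller-supplied fetch function. The only thing it relies on is the "processing" status from the sample responses in this section; the function name, the 2-second interval, and the 10-minute timeout are illustrative assumptions, not documented values.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job leaves the "processing" state, then return its last status."""
    deadline = time.monotonic() + timeout_s
    while True:
        # fetch_status is expected to GET /api/memory/ingest-job?id=<job_id>
        # and return the decoded JSON body.
        status = fetch_status(job_id)
        if status.get("status") != "processing":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still processing after {timeout_s}s")
        sleep(interval_s)
```

Returning on any status other than "processing" keeps the loop from spinning on states this page does not list.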
# start
{ "jobId": "job_…", "conversationId": "conv_…", "totalChunks": 3094, "status": "processing" }
# poll GET ?id=job_…
{ "status": "processing", "progress": 58.2, "processedChunks": 1800, "totalChunks": 3094 }
{ "status": "complete", "progress": 100, "facts": 41 }
Query memory
/api/memory/query
| Field | Type | Description |
|---|---|---|
| conversationId | string | The conversation to query. |
| question | string | The question to answer from memory. |
| budgetTokens | int? | Max tokens for the assembled context (default 1800). |
| episodicK | int? | Number of episodic chunks to retrieve (default 12). |
| factK | int? | Number of facts to retrieve (default 8). |
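A sketch of building the query body from the fields above, leaving out optional fields so the documented defaults apply. It also reproduces the savings arithmetic from the sample response in this section; rounding to one decimal is an assumption that happens to match that sample. Both function names are illustrative.

```python
from typing import Optional


def build_query_body(
    conversation_id: str,
    question: str,
    budget_tokens: Optional[int] = None,
    episodic_k: Optional[int] = None,
    fact_k: Optional[int] = None,
) -> dict:
    """Build a /api/memory/query body; unset options fall back to server defaults."""
    body: dict = {"conversationId": conversation_id, "question": question}
    optional = {"budgetTokens": budget_tokens, "episodicK": episodic_k, "factK": fact_k}
    body.update({name: value for name, value in optional.items() if value is not None})
    return body


def saved_pct(naive_tokens: int, prompt_tokens: int) -> float:
    """Percent of tokens avoided versus sending the full transcript."""
    return round((naive_tokens - prompt_tokens) / naive_tokens * 100, 1)
```

For the sample response, `saved_pct(50886, 790)` gives 98.4, matching its `savedPct` field.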
{
"answer": "The launch date is March 14, 2027.",
"grounded": true,
"mode": "live",
"naiveTokens": 50886, // cost of stuffing the full transcript
"promptTokens": 790, // what the model actually saw
"savedTokens": 50096,
"savedPct": 98.4,
"episodesUsed": 12, "factsUsed": 8, "recentUsed": 4
}
Retention eval
/api/memory/eval
Measures recall: auto-generates needle questions from across the conversation, asks memory, and scores how many it recalled. Use it as your fidelity gate.
| Field | Type | Description |
|---|---|---|
| conversationId | string | The conversation to evaluate. |
| probes | int? | Number of probe questions (1–20, default 6). |
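One way to use the eval as a gate, for example in CI, is sketched below. The 1 to 20 probe range comes from the table above and the response field names from the sample response in this section; the 95% threshold and all function names are hypothetical.

```python
from typing import List, Optional


def build_eval_body(conversation_id: str, probes: Optional[int] = None) -> dict:
    """Build a /api/memory/eval body, enforcing the documented 1-20 probe range."""
    body: dict = {"conversationId": conversation_id}
    if probes is not None:
        if not 1 <= probes <= 20:
            raise ValueError("probes must be between 1 and 20")
        body["probes"] = probes
    return body


def passes_fidelity_gate(eval_result: dict, min_recall_pct: float = 95.0) -> bool:
    """True when recall meets the threshold (95.0 is an example value, not a Quanta default)."""
    return eval_result["recallPct"] >= min_recall_pct


def missed_questions(eval_result: dict) -> List[str]:
    """Questions memory answered incorrectly, for debugging a failed gate."""
    return [r["question"] for r in eval_result.get("results", []) if not r["correct"]]
```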
{
"recallPct": 100,
"recalled": 6, "total": 6,
"avgAssembledTokens": 801, "avgSavedPct": 98.7,
"results": [ { "question": "…", "expected": "…", "correct": true } ]
}
List & delete conversations
/api/memory/conversations
/api/memory/conversations?id={id}
/api/memory/conversations?all=true
List your conversations with their footprint, delete one by id, or erase all of your memory (right to erasure). Deletes cascade to chunks, facts, and summaries.
Accounts & keys
Signup is self-serve and email-verified.
/api/account/signup
/api/account/verify
/api/account/usage
signup takes { email } and sends a verification link. Clicking it calls verify with the token, which mints and returns your key. usage (Bearer auth) returns your plan, keys, and month-to-date token usage and savings.
Plans, quotas & limits
Monthly quotas are enforced on tokens actually consumed (post-compression). When you exceed your plan's quota, the API returns 402 until you upgrade or the month resets.
| Plan | Monthly quota | Best for |
|---|---|---|
| Free | 1,000,000 tokens / mo | Get started, evaluate, low-volume. |
| Starter | 25,000,000 tokens / mo | Production apps. |
| Pro | 100,000,000 tokens / mo | High volume. |
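Because quotas count post-compression tokens, client-side budget tracking should add the "after" figure of each call, not the "before" figure. A small sketch of that arithmetic follows; the lowercase plan keys and function names are illustrative.

```python
# Monthly quotas from the table above, keyed by illustrative lowercase plan names.
PLAN_QUOTAS = {
    "free": 1_000_000,
    "starter": 25_000_000,
    "pro": 100_000_000,
}


def usage_after_call(used_tokens: int, tokens_before: int, tokens_after: int) -> int:
    """Running monthly total; only post-compression tokens count toward the quota."""
    # tokens_before is accepted to make the contrast explicit, but it is not counted.
    return used_tokens + tokens_after


def remaining_tokens(plan: str, used_tokens: int) -> int:
    """Tokens left this month on the given plan, never below zero."""
    return max(0, PLAN_QUOTAS[plan] - used_tokens)
```

The authoritative figure is the month-to-date usage returned by the usage endpoint; this only mirrors it locally.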
Rate limits
Rate limits are per account and use a token bucket. Exceeding them returns 429 with a retry-after header.
| Endpoint | Limit | Applies to |
|---|---|---|
| Query | 120 / min | Memory queries. |
| Ingest | 12 / min | Ingestion (sync or async start). |
| Eval | 4 / min | Retention evaluations. |
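A client can honor the retry-after header with a small retry wrapper. This sketch assumes the header carries a number of seconds, which this page does not specify; a non-numeric value falls back to a default wait. The `send` callable, the attempt limit, and the function names are hypothetical.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def _retry_after_seconds(headers: Mapping[str, str], default_s: float) -> float:
    """Read retry-after as seconds, falling back to default_s if absent or non-numeric."""
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return max(0.0, float(lowered["retry-after"]))
    except (KeyError, ValueError):
        return default_s


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_wait_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            break
        sleep(_retry_after_seconds(headers, default_wait_s))
    return status, headers, body
```

If every attempt is rate limited, the final 429 response is returned so the caller can decide what to do next.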
Errors
Errors return JSON: { "error": "message" }.
| Status | Meaning | Description |
|---|---|---|
| 401 | Unauthorized | Missing, invalid, or unverified API key. |
| 402 | Quota reached | Monthly token quota for your plan is exhausted. |
| 404 | Not found | Conversation does not exist or is not yours. |
| 413 | Too large | Transcript exceeds the endpoint size cap. |
| 429 | Rate limited | Slow down; honor the retry-after header. |
| 503 | Not configured | A required backend (DB/model) is unavailable. |
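The error shape and status table above can be turned into a single exception type. This is a sketch: `QuantaError` and `error_from_response` are hypothetical names, and falling back to the table's meaning for a non-JSON body is an illustrative choice.

```python
import json


class QuantaError(Exception):
    """An error response from the API, carrying the HTTP status and message."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


# Short meanings copied from the status table above.
ERROR_MEANINGS = {
    401: "Unauthorized",
    402: "Quota reached",
    404: "Not found",
    413: "Too large",
    429: "Rate limited",
    503: "Not configured",
}


def error_from_response(status: int, body: bytes) -> QuantaError:
    """Build a QuantaError from a status code and a raw response body."""
    try:
        message = json.loads(body)["error"]
    except (ValueError, KeyError, TypeError):
        # Body was not the documented {"error": "..."} JSON; use the table's meaning.
        message = ERROR_MEANINGS.get(status, "Unknown error")
    return QuantaError(status, message)
```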
Data & security
Quanta runs entirely on Microsoft Azure. Each account's data is tenant-isolated, and that isolation is enforced on every request. Secrets are held in Azure Key Vault via managed identity; data is encrypted at rest and in transit (TLS). You can delete any conversation or wipe your account's memory at any time. The full stack can also be deployed inside your own Azure subscription so data never leaves your tenant. Contact us for the private-deployment option.
How memory works
You can't hold millions of tokens in a model's context window, so Quanta stores everything and reconstructs the relevant slice per question from four tiers, within your token budget:
| Tier | Mechanism | Description |
|---|---|---|
| Working set | verbatim | The most recent turns, kept exact. |
| Episodic | vector search | Every turn embedded; the relevant chunks are retrieved per question. |
| Semantic | facts | Durable extracted facts (decisions, numbers, names) survive summarization. |
| Digest | rolling summary | An always-current summary for global coherence. |
An assembler ranks these by relevance and recency, de-duplicates, and trims to your budget — so the model sees a small, faithful context instead of the whole history. The retention eval lets you measure that the right things were kept.
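To make the rank, de-duplicate, and trim steps concrete, here is a toy assembler. It is not Quanta's implementation: the scoring formula, the recency weight, the `Candidate` type, and the 4-characters-per-token estimate (borrowed from the 1.5M characters to ~375k tokens ratio quoted for ingest) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    relevance: float  # 0..1, for example a vector-similarity score
    turn_index: int   # position in the conversation; higher is more recent


def estimate_tokens(text: str) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def assemble(candidates: List[Candidate], budget_tokens: int = 1800,
             recency_weight: float = 0.2) -> List[Candidate]:
    """Rank by blended relevance and recency, drop duplicates, fit the budget."""
    if not candidates:
        return []
    newest = max(c.turn_index for c in candidates)

    def score(c: Candidate) -> float:
        recency = c.turn_index / newest if newest > 0 else 1.0
        return (1 - recency_weight) * c.relevance + recency_weight * recency

    seen = set()
    picked: List[Candidate] = []
    used = 0
    for c in sorted(candidates, key=score, reverse=True):
        key = " ".join(c.text.lower().split())  # case- and whitespace-insensitive
        if key in seen:
            continue
        cost = estimate_tokens(c.text)
        if used + cost > budget_tokens:
            continue  # too big for the remaining budget; try smaller items
        seen.add(key)
        picked.append(c)
        used += cost
    return picked
```

Skipping an oversized item instead of stopping lets smaller, lower-ranked items still use the remaining budget, which is one of several reasonable trimming policies.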
