Quanta by Ryshe

Documentation

Introduction

Quanta is an Azure-native AI context gateway with a long-term memory API. You point your application's model base URL at the gateway to compress context and cut token cost with no code changes, and you use the memory API to recall across millions of tokens of history while sending only a small, relevant slice to the model.

Two surfaces, one key:

Surface	Base URL	Description
Gateway	https://quanta.ryshe.com/api/v1	OpenAI-compatible. Compresses + forwards to your model.
Memory	https://quanta.ryshe.com/api/memory	Ingest, recall, and evaluate long-term memory.

Quickstart

1. Create an account at /signup and verify your email.
2. Your API key (sk-rsh-…) is shown after verification and emailed to you.
3. Start calling the API. It works immediately on the free plan.

Your first memory call (curl)

# 1. Ingest a conversation
curl -s https://quanta.ryshe.com/api/memory/ingest \
  -H "authorization: Bearer $QUANTA_KEY" \
  -H "content-type: application/json" \
  -d '{"title":"Project thread","messages":[
    {"role":"user","content":"The launch date is March 14, 2027."}
  ]}'
# -> { "conversationId": "conv_…", "chunks": 1, "sourceTokens": 12, ... }

# 2. Ask about it later — answered from memory, not the full transcript
curl -s https://quanta.ryshe.com/api/memory/query \
  -H "authorization: Bearer $QUANTA_KEY" \
  -H "content-type: application/json" \
  -d '{"conversationId":"conv_…","question":"When is the launch?"}'

Authentication

Every request authenticates with your account API key as a Bearer token. Keys start with sk-rsh- and are issued after email verification. Keep them secret; treat a key like a password.

authorization: Bearer sk-rsh-xxxxxxxxxxxxxxxxxxxxxxxx

Requests without a valid, verified key return 401. Each account's data is fully isolated — a key can only read and write its own conversations.
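
As a minimal sketch of the header described above, the helper below builds the Bearer header in Python. The function name `auth_headers` is hypothetical, and the prefix check is only a local sanity check based on the stated sk-rsh- prefix; the server remains the authority on whether a key is valid.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Return the Bearer header for a Quanta account key."""
    key = api_key.strip()
    # Local sanity check only: the docs say keys start with "sk-rsh-".
    if not key.startswith("sk-rsh-"):
        raise ValueError("Quanta keys are expected to start with 'sk-rsh-'")
    return {"authorization": f"Bearer {key}"}
```

Reading the key from an environment variable rather than hard-coding it keeps it out of source control.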

Gateway API

The gateway is OpenAI-compatible. Point your existing client at it by changing only the base URL; request and response shapes are unchanged. It compresses the prompt, forwards it to your model provider, streams the real response back, and reports the savings in response headers.

Chat completions

POST /api/v1/chat/completions

Send a standard chat-completions body. Provide your model-provider key in the x-provider-key header (it is forwarded, never stored). Omit a provider key to get a dry run — the compression report only, with nothing forwarded.

Python — only the base_url changes

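import os

from openai import OpenAI

# The Quanta key authenticates the request; the provider key is forwarded upstream.
# PROVIDER_KEY is an illustrative environment variable name.
client = OpenAI(
    base_url="https://quanta.ryshe.com/api/v1",
    api_key=os.environ["QUANTA_KEY"],  # sent as the "authorization: Bearer" header
    default_headers={"x-provider-key": os.environ["PROVIDER_KEY"]},  # your own OpenAI/Azure key
)
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
)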
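
The dry-run behavior described above can be sketched with the standard library alone. This is an illustration, not an official client: the helper name `build_chat_request` is hypothetical, and it assumes the header layout from the prose (Quanta key as the Bearer token, provider key in x-provider-key). Leaving `provider_key` unset omits the header, which the text says yields a compression report with nothing forwarded.

```python
import json
import urllib.request

GATEWAY_URL = "https://quanta.ryshe.com/api/v1/chat/completions"


def build_chat_request(quanta_key, messages, model, provider_key=None):
    """Build (but do not send) a chat-completions request for the gateway."""
    headers = {
        "authorization": f"Bearer {quanta_key}",
        "content-type": "application/json",
    }
    if provider_key is not None:
        # Forwarded to the model provider; omit it to request a dry run.
        headers["x-provider-key"] = provider_key
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(GATEWAY_URL, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it; building and sending are kept separate here so the request can be inspected first.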

Savings headers

Every gateway response includes the token accounting for that call:

Header	Type	Description
x-quanta-tokens-before	int	Estimated tokens before compression.
x-quanta-tokens-after	int	Tokens after compression (what the model saw).
x-quanta-tokens-saved	int	Difference.
x-quanta-saved-pct	float	Percent reduction.
x-quanta-engine	string	Compression engine used (inline / headroom).
x-quanta-mode	string	dry-run, inline, or metered.
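
One way to consume these headers is to parse them into a typed record. The sketch below assumes the headers arrive as a plain string mapping; the names `Savings` and `parse_savings` are illustrative, not part of any Quanta SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Savings:
    tokens_before: int
    tokens_after: int
    tokens_saved: int
    saved_pct: float
    engine: str
    mode: str


def parse_savings(headers) -> Savings:
    """Parse the x-quanta-* accounting headers from a response header mapping."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return Savings(
        tokens_before=int(h["x-quanta-tokens-before"]),
        tokens_after=int(h["x-quanta-tokens-after"]),
        tokens_saved=int(h["x-quanta-tokens-saved"]),
        saved_pct=float(h["x-quanta-saved-pct"]),
        engine=h["x-quanta-engine"],
        mode=h["x-quanta-mode"],
    )
```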

Memory API

The memory API stores conversations in a vector-backed store and reconstructs only the relevant context for each question, within a fixed token budget. See How memory works.

Ingest a conversation

POST /api/memory/ingest

Commit a transcript to memory synchronously. Accepts up to ~375k tokens (1.5M characters); for larger transcripts use the async endpoint below.

Field	Type	Description
title	string	A label for the conversation.
messages	array?	[{ role, content }] turns. Provide this or transcript.
transcript	string?	Raw transcript text. Provide this or messages.

Response

{
  "conversationId": "conv_ab12cd34",
  "chunks": 80,
  "sourceTokens": 1736,
  "facts": 12,
  "embedderLive": true
}
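
A small sketch of preparing an ingest body and deciding which endpoint fits. Two things here are assumptions rather than documented behavior: that exactly one of messages or transcript should be sent, and that the 1.5M-character cap is measured over the message content or transcript text. The helper names are hypothetical.

```python
SYNC_CHAR_CAP = 1_500_000  # sync cap quoted above (~375k tokens)


def build_ingest_body(title, messages=None, transcript=None):
    """Return the JSON-serializable body for an ingest call."""
    # Assumption: exactly one of the two inputs is provided.
    if (messages is None) == (transcript is None):
        raise ValueError("provide exactly one of messages or transcript")
    body = {"title": title}
    if messages is not None:
        body["messages"] = [{"role": m["role"], "content": m["content"]} for m in messages]
    else:
        body["transcript"] = transcript
    return body


def ingest_path(body):
    """Pick the sync endpoint when under the cap, else the async job endpoint."""
    if "transcript" in body:
        chars = len(body["transcript"])
    else:
        chars = sum(len(m["content"]) for m in body["messages"])
    return "/api/memory/ingest" if chars <= SYNC_CHAR_CAP else "/api/memory/ingest-job"
```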

Ingest asynchronously (large transcripts)

POST /api/memory/ingest-job

GET /api/memory/ingest-job?id={jobId}

For transcripts up to ~10M tokens. The POST returns a job id immediately and processing continues in the background; poll the GET endpoint for progress.

Start, then poll

# start
{ "jobId": "job_…", "conversationId": "conv_…", "totalChunks": 3094, "status": "processing" }
# poll GET ?id=job_…
{ "status": "processing", "progress": 58.2, "processedChunks": 1800, "totalChunks": 3094 }
{ "status": "complete", "progress": 100, "facts": 41 }
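
The start-then-poll pattern can be sketched as a loop around any function that fetches the job status. In this sketch `fetch_status` is a caller-supplied callable (hypothetical) that performs the GET and returns the parsed JSON. Only the "processing" and "complete" status values appear in the samples above, so treating every other status as terminal is an assumption.

```python
import time


def wait_for_job(fetch_status, interval_s=2.0, max_polls=1000, sleep=time.sleep):
    """Poll until the job leaves the 'processing' state, then return the last status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") != "processing":
            return status  # "complete", or any other terminal state
        sleep(interval_s)
    raise TimeoutError(f"job still processing after {max_polls} polls")
```

Injecting `sleep` keeps the loop easy to exercise without real delays. The polling interval and poll limit are illustrative defaults, not documented values.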

Query memory

POST /api/memory/query

Field	Type	Description
conversationId	string	The conversation to query.
question	string	The question to answer from memory.
budgetTokens	int?	Max tokens for the assembled context (default 1800).
episodicK	int?	Number of episodic chunks to retrieve (default 12).
factK	int?	Number of facts to retrieve (default 8).

Response — note the token accounting

{
  "answer": "The launch date is March 14, 2027.",
  "grounded": true,
  "mode": "live",
  "naiveTokens": 50886,     // cost of stuffing the full transcript
  "promptTokens": 790,      // what the model actually saw
  "savedTokens": 50096,
  "savedPct": 98.4,
  "episodesUsed": 12, "factsUsed": 8, "recentUsed": 4
}
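
The accounting fields in the sample response are related by simple arithmetic, which the sketch below reproduces. It also shows one way to build the request body so that optional fields are sent only when set. Rounding to one decimal place is an assumption that happens to match the sample, and both helper names are hypothetical.

```python
def savings(naive_tokens, prompt_tokens):
    """Return (savedTokens, savedPct) from the naive and actual prompt sizes."""
    saved = naive_tokens - prompt_tokens
    pct = round(100 * saved / naive_tokens, 1) if naive_tokens else 0.0
    return saved, pct


def build_query_body(conversation_id, question, budget_tokens=None, episodic_k=None, fact_k=None):
    """Build a query body, leaving unset options to the server defaults."""
    body = {"conversationId": conversation_id, "question": question}
    optional = {"budgetTokens": budget_tokens, "episodicK": episodic_k, "factK": fact_k}
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

With the sample values, `savings(50886, 790)` gives 50096 tokens saved, which is 98.4 percent of the naive cost.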

Retention eval

POST /api/memory/eval

Measures recall: auto-generates needle questions from across the conversation, asks memory, and scores how many it recalled — your fidelity gate.

Field	Type	Description
conversationId	string	The conversation to evaluate.
probes	int?	Number of probe questions (1–20, default 6).

Response

{
  "recallPct": 100,
  "recalled": 6, "total": 6,
  "avgAssembledTokens": 801, "avgSavedPct": 98.7,
  "results": [ { "question": "…", "expected": "…", "correct": true } ]
}
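
To use the eval as a gate in a test suite or deploy check, something like the following sketch could work. The 95 percent threshold and the function names are illustrative choices, not values recommended by Quanta; the field names come from the sample response above.

```python
def passes_gate(eval_response, min_recall_pct=95.0):
    """Return True when measured recall meets the chosen threshold."""
    total = eval_response["total"]
    if total == 0:
        return False  # nothing was probed, so nothing was demonstrated
    recall = 100 * eval_response["recalled"] / total
    return recall >= min_recall_pct


def missed_probes(eval_response):
    """Return the probe results that memory answered incorrectly."""
    return [r for r in eval_response.get("results", []) if not r["correct"]]
```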

List & delete conversations

GET /api/memory/conversations

DELETE /api/memory/conversations?id={id}

DELETE /api/memory/conversations?all=true

List your conversations with their footprint, delete one by id, or erase all of your memory (right to erasure). Deletes cascade to chunks, facts, and summaries.
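
Because deleting one conversation and erasing everything differ only in the query string, a client may want to make the destructive case explicit. The sketch below builds the two DELETE URLs; the helper name and the requirement to pass `erase_all=True` deliberately are design assumptions, not part of the API.

```python
from urllib.parse import urlencode

CONVERSATIONS_URL = "https://quanta.ryshe.com/api/memory/conversations"


def delete_url(conversation_id=None, erase_all=False):
    """Return the DELETE URL for one conversation, or for all memory."""
    if erase_all and conversation_id is not None:
        raise ValueError("pass either conversation_id or erase_all, not both")
    if erase_all:
        return f"{CONVERSATIONS_URL}?{urlencode({'all': 'true'})}"
    if conversation_id is None:
        raise ValueError("conversation_id is required unless erase_all=True")
    return f"{CONVERSATIONS_URL}?{urlencode({'id': conversation_id})}"
```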

Accounts & keys

Signup is self-serve and email-verified.

POST /api/account/signup

POST /api/account/verify

GET /api/account/usage

signup takes { email } and sends a verification link. Clicking it calls verify with the token, which mints and returns your key. usage (Bearer auth) returns your plan, keys, and month-to-date token usage and savings.

Plans, quotas & limits

Monthly quotas are enforced on tokens actually consumed (post-compression). When you exceed your plan, the API returns 402 until you upgrade or the month resets.

Plan	Monthly quota	Intended use
Free	1,000,000 tokens / mo	Get started, evaluate, low-volume.
Starter	25,000,000 tokens / mo	Production apps.
Pro	100,000,000 tokens / mo	High volume.

Rate limits

Rate limits are enforced per account with a token bucket. Exceeding a limit returns 429 with a retry-after header.

Endpoint	Limit	Applies to
Query	120 / min	Memory queries.
Ingest	12 / min	Ingestion (sync or async start).
Eval	4 / min	Retention evaluations.
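
A client can honor the retry-after header with a small retry wrapper. In this sketch `send` is a caller-supplied callable (hypothetical) that performs one request and returns `(status, headers, body)`. It assumes retry-after carries a number of seconds; the HTTP specification also allows a date, in which case this sketch falls back to a one-second wait.

```python
import time


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() and retry on 429, waiting as the retry-after header asks."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            delay = float(lowered.get("retry-after", 1))
        except ValueError:
            delay = 1.0  # assumption: fall back when the value is not in seconds
        sleep(delay)
```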

Errors

Errors return JSON: { "error": "message" }.

Status	Meaning	Description
401	Unauthorized	Missing, invalid, or unverified API key.
402	Quota reached	Monthly token quota for your plan is exhausted.
404	Not found	Conversation does not exist or is not yours.
413	Too large	Transcript exceeds the endpoint size cap.
429	Rate limited	Slow down; honor the retry-after header.
503	Not configured	A required backend (DB/model) is unavailable.
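
Since every error shares the same JSON shape, a client can turn failures into a single exception type. The sketch below is one possible approach; `QuantaError` and `raise_for_error` are hypothetical names, and marking only 429 as retryable reflects the retry-after guidance above rather than any documented retry policy.

```python
import json


class QuantaError(Exception):
    """An error response from the API, with its HTTP status and message."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message
        self.retryable = status == 429  # assumption: only rate limits are retried


def raise_for_error(status, body_text):
    """Raise QuantaError for 4xx/5xx responses; do nothing otherwise."""
    if status < 400:
        return
    try:
        parsed = json.loads(body_text)
    except json.JSONDecodeError:
        parsed = None
    message = parsed.get("error", "") if isinstance(parsed, dict) else ""
    raise QuantaError(status, message or "unknown error")
```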

Data & security

Quanta runs entirely on Microsoft Azure. Each account's data is tenant-isolated, and that isolation is enforced on every request. Secrets are held in Azure Key Vault and accessed via managed identity; data is encrypted at rest and in transit (TLS). You can delete any conversation or wipe your account's memory at any time. The full stack can also be deployed inside your own Azure subscription so data never leaves your tenant. Contact us for the private-deployment option.

How memory works

You can't hold millions of tokens in a model's context window, so Quanta stores everything and reconstructs the relevant slice per question from four tiers, within your token budget:

Tier	Mechanism	Description
Working set	verbatim	The most recent turns, kept exact.
Episodic	vector search	Every turn embedded; the relevant chunks are retrieved per question.
Semantic	facts	Durable extracted facts (decisions, numbers, names) survive summarization.
Digest	rolling summary	An always-current summary for global coherence.

An assembler ranks these by relevance and recency, de-duplicates, and trims to your budget — so the model sees a small, faithful context instead of the whole history. The retention eval lets you measure that the right things were kept.
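
To make the assembly step concrete, here is a toy sketch of ranking, de-duplicating, and trimming to a budget. It is not Quanta's implementation: the scoring formula, the `recency_weight` parameter, the exact-text de-duplication, and the greedy fill are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    tier: str         # e.g. "working", "episodic", "semantic", "digest"
    text: str
    relevance: float  # 0..1, e.g. from vector similarity
    turn: int         # higher means more recent
    tokens: int


def assemble(items, budget_tokens, recency_weight=0.2):
    """Pick the highest-scoring distinct items that fit within the budget."""
    if not items:
        return []
    newest = max(i.turn for i in items) or 1

    def score(i):
        return i.relevance + recency_weight * (i.turn / newest)

    seen, chosen, used = set(), [], 0
    for item in sorted(items, key=score, reverse=True):
        key = " ".join(item.text.lower().split())  # normalize case and whitespace
        if key in seen or used + item.tokens > budget_tokens:
            continue
        seen.add(key)
        chosen.append(item)
        used += item.tokens
    return chosen
```

A real assembler would likely de-duplicate semantically rather than by normalized text, but the shape is the same: score, filter, and stop at the budget.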

Need something not covered here? Reach us at hello@ryshe.com or book a briefing.