QUANTA
by Ryshe

Documentation

Introduction

Quanta is an Azure-native AI context gateway with a long-term memory API. You point your application's model base URL at the gateway to compress context and cut token cost with no code changes, and you use the memory API to recall across millions of tokens of history while sending only a small, relevant slice to the model.

Two surfaces, one key:

| Surface | Base URL | Description |
|---|---|---|
| Gateway | https://quanta.ryshe.com/api/v1 | OpenAI-compatible. Compresses and forwards to your model. |
| Memory | https://quanta.ryshe.com/api/memory | Ingest, recall, and evaluate long-term memory. |

Quickstart

1. Create an account at /signup and verify your email.
2. Your API key (sk-rsh-…) is shown after verification and emailed to you.
3. Start calling the API; it works immediately on the free plan.

Your first memory call (curl)
# 1. Ingest a conversation
curl -s https://quanta.ryshe.com/api/memory/ingest \
  -H "authorization: Bearer $QUANTA_KEY" \
  -H "content-type: application/json" \
  -d '{"title":"Project thread","messages":[
    {"role":"user","content":"The launch date is March 14, 2027."}
  ]}'
# -> { "conversationId": "conv_…", "chunks": 1, "sourceTokens": 12, ... }

# 2. Ask about it later — answered from memory, not the full transcript
curl -s https://quanta.ryshe.com/api/memory/query \
  -H "authorization: Bearer $QUANTA_KEY" \
  -H "content-type: application/json" \
  -d '{"conversationId":"conv_…","question":"When is the launch?"}'

Authentication

Every request authenticates with your account API key as a Bearer token. Keys start with sk-rsh- and are issued after email verification. Keep them secret; treat a key like a password.

authorization: Bearer sk-rsh-xxxxxxxxxxxxxxxxxxxxxxxx

Requests without a valid, verified key return 401. Each account's data is fully isolated — a key can only read and write its own conversations.
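
As a minimal sketch of the header layout described above, the helper below builds (but does not send) an authenticated JSON POST using only the Python standard library. The function name build_request and the early prefix check are illustrative assumptions, not part of the Quanta API.

```python
import json
import urllib.request

BASE_URL = "https://quanta.ryshe.com"  # host taken from the documented endpoints


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    if not api_key.startswith("sk-rsh-"):
        # Documented key prefix; failing early avoids a guaranteed 401.
        raise ValueError("Quanta keys start with 'sk-rsh-'")
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "authorization": f"Bearer {api_key}",
            "content-type": "application/json",
        },
        method="POST",
    )
```

Pass the returned object to urllib.request.urlopen to send it. Read the key from an environment variable or a secret store rather than hard-coding it.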

Gateway API

The gateway is OpenAI-compatible. Point your existing client at it by changing only the base URL; request and response shapes are unchanged. It compresses the prompt, forwards it to your model provider, streams the real response back, and reports the savings in response headers.

Chat completions

POST /api/v1/chat/completions

Send a standard chat-completions body. Provide your model-provider key in the x-provider-key header (it is forwarded, never stored). Omit a provider key to get a dry run — the compression report only, with nothing forwarded.

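Python: point the OpenAI client at the gateway
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://quanta.ryshe.com/api/v1",
    # Quanta account key; the SDK sends it as "authorization: Bearer <key>".
    api_key=os.environ["QUANTA_KEY"],
    # Your own OpenAI/Azure key (forwarded, never stored). PROVIDER_KEY is an
    # assumed variable name; omit this header to get a dry run.
    default_headers={"x-provider-key": os.environ["PROVIDER_KEY"]},
)
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
)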

Savings headers

Every gateway response includes the token accounting for that call:

| Header | Type | Description |
|---|---|---|
| x-quanta-tokens-before | int | Estimated tokens before compression. |
| x-quanta-tokens-after | int | Tokens after compression (what the model saw). |
| x-quanta-tokens-saved | int | Difference (before minus after). |
| x-quanta-saved-pct | float | Percent reduction. |
| x-quanta-engine | string | Compression engine used (inline / headroom). |
| x-quanta-mode | string | dry-run, inline, or metered. |
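
One way to consume these headers is to parse them into a typed record. The sketch below assumes only the header names and types listed above; the Savings class and parse_savings function are hypothetical names, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Savings:
    tokens_before: int
    tokens_after: int
    tokens_saved: int
    saved_pct: float
    engine: str
    mode: str


def parse_savings(headers: Mapping[str, str]) -> Savings:
    """Read the documented x-quanta-* headers from a gateway response."""
    # HTTP header names are case-insensitive, so lower-case them before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return Savings(
        tokens_before=int(h["x-quanta-tokens-before"]),
        tokens_after=int(h["x-quanta-tokens-after"]),
        tokens_saved=int(h["x-quanta-tokens-saved"]),
        saved_pct=float(h["x-quanta-saved-pct"]),
        engine=h["x-quanta-engine"],
        mode=h["x-quanta-mode"],
    )


# Made-up sample values, for illustration only.
sample_headers = {
    "X-Quanta-Tokens-Before": "1000",
    "X-Quanta-Tokens-After": "400",
    "X-Quanta-Tokens-Saved": "600",
    "X-Quanta-Saved-Pct": "60.0",
    "X-Quanta-Engine": "inline",
    "X-Quanta-Mode": "metered",
}
savings = parse_savings(sample_headers)
```

With the official openai Python SDK, calling client.chat.completions.with_raw_response.create(...) returns an object whose headers attribute can be passed to parse_savings.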

Memory API

The memory API stores conversations in a vector-backed store and reconstructs only the relevant context for each question, within a fixed token budget. See How memory works.

Ingest a conversation

POST /api/memory/ingest

Commit a transcript to memory synchronously. Accepts up to ~375k tokens (1.5M characters); for larger transcripts use the async endpoint below.

| Field | Type | Description |
|---|---|---|
| title | string | A label for the conversation. |
| messages | array? | [{ role, content }] turns. Provide this or transcript. |
| transcript | string? | Raw transcript text. Provide this or messages. |

Response
{
  "conversationId": "conv_ab12cd34",
  "chunks": 80,
  "sourceTokens": 1736,
  "facts": 12,
  "embedderLive": true
}
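
Because the synchronous endpoint is capped at about 1.5M characters, a client can choose between it and the asynchronous /api/memory/ingest-job endpoint before sending anything. The sketch below is one way to do that; the helper names are hypothetical, and counting characters as the sum of message content lengths is an assumption about how the cap is measured.

```python
from typing import Iterable, Mapping

SYNC_CHAR_CAP = 1_500_000  # documented sync limit (~375k tokens)
SYNC_PATH = "/api/memory/ingest"
ASYNC_PATH = "/api/memory/ingest-job"


def transcript_chars(messages: Iterable[Mapping[str, str]]) -> int:
    """Total characters across message contents (assumed counting rule)."""
    return sum(len(m["content"]) for m in messages)


def pick_ingest_path(char_count: int) -> str:
    """Return the sync path when the transcript fits, else the async one."""
    return SYNC_PATH if char_count <= SYNC_CHAR_CAP else ASYNC_PATH
```

Checking the size client-side avoids a round trip that would otherwise end in a 413.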

Ingest asynchronously (large transcripts)

POST /api/memory/ingest-job
GET /api/memory/ingest-job?id={jobId}

For transcripts up to ~10M tokens. The POST returns a job id immediately and processing continues in the background; poll the GET endpoint for progress.

Start, then poll
# start
{ "jobId": "job_…", "conversationId": "conv_…", "totalChunks": 3094, "status": "processing" }
# poll GET ?id=job_…
{ "status": "processing", "progress": 58.2, "processedChunks": 1800, "totalChunks": 3094 }
{ "status": "complete", "progress": 100, "facts": 41 }
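
The start-then-poll flow above can be wrapped in a small loop. This is a sketch, not a reference client: wait_for_ingest is a hypothetical helper, fetch_status stands in for whatever function performs the GET and returns the parsed JSON, and treating any status other than "processing" or "complete" as an error is an assumption, since only those two values appear in this documentation.

```python
import time
from typing import Callable


def wait_for_ingest(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports "complete", then return that final payload."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("status") == "complete":
            return status
        if status.get("status") != "processing":
            # Assumption: any other status value means the job failed.
            raise RuntimeError(f"unexpected job status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("ingest job did not complete in time")
        sleep(poll_seconds)
```

Injecting fetch_status and sleep keeps the loop independent of any particular HTTP library and makes it easy to test without network access. Keep the polling interval well inside your rate limits.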

Query memory

POST /api/memory/query

| Field | Type | Description |
|---|---|---|
| conversationId | string | The conversation to query. |
| question | string | The question to answer from memory. |
| budgetTokens | int? | Max tokens for the assembled context (default 1800). |
| episodicK | int? | Number of episodic chunks to retrieve (default 12). |
| factK | int? | Number of facts to retrieve (default 8). |

Response (note the token accounting)
{
  "answer": "The launch date is March 14, 2027.",
  "grounded": true,
  "mode": "live",
  "naiveTokens": 50886,     // cost of stuffing the full transcript
  "promptTokens": 790,      // what the model actually saw
  "savedTokens": 50096,
  "savedPct": 98.4,
  "episodesUsed": 12, "factsUsed": 8, "recentUsed": 4
}
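
The accounting fields in this response are related by simple arithmetic, which the sketch below reproduces. The function name token_savings is hypothetical, and rounding the percentage to one decimal place is an assumption that happens to match the sample values.

```python
from typing import Tuple


def token_savings(naive_tokens: int, prompt_tokens: int) -> Tuple[int, float]:
    """Return (saved tokens, percent saved) for one query."""
    saved = naive_tokens - prompt_tokens
    # Guard against an empty conversation to avoid dividing by zero.
    pct = round(100.0 * saved / naive_tokens, 1) if naive_tokens else 0.0
    return saved, pct


saved_tokens, saved_pct = token_savings(50886, 790)
```

For the sample response, 50886 - 790 gives 50096 saved tokens, and 50096 / 50886 is about 98.4 percent.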

Retention eval

POST /api/memory/eval

Measures recall: auto-generates needle questions from across the conversation, asks memory, and scores how many it recalled — your fidelity gate.

| Field | Type | Description |
|---|---|---|
| conversationId | string | The conversation to evaluate. |
| probes | int? | Number of probe questions (1–20, default 6). |

Response
{
  "recallPct": 100,
  "recalled": 6, "total": 6,
  "avgAssembledTokens": 801, "avgSavedPct": 98.7,
  "results": [ { "question": "…", "expected": "…", "correct": true } ]
}
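
To use the eval as a gate in a pipeline, a caller can compare recallPct against a threshold and list the probes that failed. The sketch below assumes only the response fields shown above; the helper names and the 95 percent default threshold are hypothetical choices, not Quanta recommendations.

```python
from typing import List


def recall_pct(recalled: int, total: int) -> float:
    """Recompute the recall percentage from the raw counts."""
    return 100.0 * recalled / total if total else 0.0


def passes_fidelity_gate(eval_result: dict, min_recall_pct: float = 95.0) -> bool:
    """True when the reported recall meets the caller's threshold."""
    return eval_result["recallPct"] >= min_recall_pct


def missed_questions(eval_result: dict) -> List[str]:
    """Probe questions that memory answered incorrectly."""
    return [r["question"] for r in eval_result["results"] if not r["correct"]]
```

A deployment script could, for example, refuse to switch traffic to memory-backed answers when passes_fidelity_gate returns False, and log missed_questions for review.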

List & delete conversations

GET /api/memory/conversations
DELETE /api/memory/conversations?id={id}
DELETE /api/memory/conversations?all=true

List your conversations with their footprint, delete one by id, or erase all of your memory (right to erasure). Deletes cascade to chunks, facts, and summaries.

Accounts & keys

Signup is self-serve and email-verified.

POST /api/account/signup
POST /api/account/verify
GET /api/account/usage

signup takes { email } and sends a verification link. Clicking it calls verify with the token, which mints and returns your key. usage (Bearer auth) returns your plan, keys, and month-to-date token usage and savings.

Plans, quotas & limits

Monthly quotas are enforced on tokens actually consumed (post-compression). When you exceed your plan, the API returns 402 until you upgrade or the month resets.

| Plan | Monthly quota | Intended use |
|---|---|---|
| Free | 1,000,000 tokens / mo | Get started, evaluate, low-volume. |
| Starter | 25,000,000 tokens / mo | Production apps. |
| Pro | 100,000,000 tokens / mo | High volume. |

Rate limits

Per-account rate limits (token-bucket). Exceeding them returns 429 with a retry-after header.

| Endpoint | Limit | Applies to |
|---|---|---|
| Query | 120 / min | Memory queries. |
| Ingest | 12 / min | Ingestion (sync or async start). |
| Eval | 4 / min | Retention evaluations. |
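
A client can stay under these limits by throttling itself with its own token bucket. The sketch below is a generic client-side limiter, not Quanta's server implementation; it assumes the bucket capacity equals the per-minute limit and refills continuously, which the documentation does not specify.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side limiter: at most per_minute calls, refilled continuously."""

    def __init__(self, per_minute: int, clock: Callable[[], float] = time.monotonic):
        self.capacity = float(per_minute)  # assumed burst size
        self.rate = per_minute / 60.0      # tokens added per second
        self.tokens = self.capacity
        self.clock = clock
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_ready(self) -> float:
        """How long to wait before the next token becomes available."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)
```

For example, TokenBucket(120) would pace memory queries and TokenBucket(4) would pace eval calls. A 429 can still occur if several processes share one account, so keep honoring retry-after as well.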

Errors

Errors return JSON: { "error": "message" }.

| Status | Meaning | Description |
|---|---|---|
| 401 | Unauthorized | Missing, invalid, or unverified API key. |
| 402 | Quota reached | Monthly token quota for your plan is exhausted. |
| 404 | Not found | Conversation does not exist or is not yours. |
| 413 | Too large | Transcript exceeds the endpoint size cap. |
| 429 | Rate limited | Slow down; honor the retry-after header. |
| 503 | Not configured | A required backend (DB/model) is unavailable. |
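
A retry policy built on these status codes might look like the sketch below. The function retry_delay is hypothetical; retrying 503 with exponential backoff is an assumption (the documentation only says the backend is unavailable), and the other listed errors are treated as non-retryable because repeating the same request would not change the outcome.

```python
from typing import Mapping, Optional


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 30.0,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the error is not retryable."""
    backoff = min(cap_seconds, base_seconds * 2 ** attempt)
    if status == 429:
        lowered = {name.lower(): value for name, value in headers.items()}
        try:
            # Prefer the server's hint when it is a plain number of seconds.
            return float(lowered["retry-after"])
        except (KeyError, ValueError):
            return backoff
    if status == 503:
        return backoff  # assumption: backend outages may be transient
    return None  # 401, 402, 404, 413: fix the request or the account instead
```

Here attempt is the zero-based retry count, so the fallback delays grow as 1, 2, 4 seconds and so on up to the cap.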

Data & security

Quanta runs entirely on Microsoft Azure. Each account's data is tenant-isolated, and that isolation is enforced on every request. Secrets are held in Azure Key Vault via managed identity; data is encrypted at rest and in transit (TLS). You can delete any conversation or wipe your account's memory at any time. The full stack can also be deployed inside your own Azure subscription so data never leaves your tenant; contact us for the private-deployment option.

How memory works

You can't hold millions of tokens in a model's context window, so Quanta stores everything and reconstructs the relevant slice per question from four tiers, within your token budget:

| Tier | Mechanism | Description |
|---|---|---|
| Working set | verbatim | The most recent turns, kept exact. |
| Episodic | vector search | Every turn embedded; the relevant chunks are retrieved per question. |
| Semantic | facts | Durable extracted facts (decisions, numbers, names) survive summarization. |
| Digest | rolling summary | An always-current summary for global coherence. |

An assembler ranks these by relevance and recency, de-duplicates, and trims to your budget — so the model sees a small, faithful context instead of the whole history. The retention eval lets you measure that the right things were kept.
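
To make the rank, de-duplicate, and trim steps concrete, here is a toy assembler. It is an illustration of the general idea only and not Quanta's implementation: the Candidate type, the linear blend of relevance and recency, the recency_weight value, and the whitespace-and-case normalization used for de-duplication are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    tokens: int
    relevance: float  # 0..1, for example a vector-similarity score
    turn: int         # position in the conversation; higher is more recent


def assemble(
    candidates: Sequence[Candidate],
    budget_tokens: int = 1800,
    recency_weight: float = 0.2,
) -> List[Candidate]:
    """Pick the highest-scoring, non-duplicate candidates that fit the budget."""
    if not candidates:
        return []
    latest = max(c.turn for c in candidates)

    def score(c: Candidate) -> float:
        recency = c.turn / latest if latest > 0 else 1.0
        return (1.0 - recency_weight) * c.relevance + recency_weight * recency

    chosen: List[Candidate] = []
    seen = set()
    used = 0
    for cand in sorted(candidates, key=score, reverse=True):
        key = " ".join(cand.text.lower().split())  # crude duplicate detection
        if key in seen or used + cand.tokens > budget_tokens:
            continue  # skip duplicates and anything that would exceed the budget
        seen.add(key)
        chosen.append(cand)
        used += cand.tokens
    return chosen
```

The default budget of 1800 mirrors the documented budgetTokens default. A real assembler would likely also reserve room for the working set and digest before filling the rest with episodic chunks and facts.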

Need something not covered here? Reach us at hello@ryshe.com or book a briefing.