Product

One control plane for all your inference ops

Retanu sits between your apps and the inference providers — picking the right model, enforcing budgets, and tracking every request automatically.

Automatic model selection

Every request is classified as simple or complex. Simple tasks go to cost-efficient models. Complex tasks go to stronger ones. The system picks the cheapest provider in the quality tier you set, and fails over automatically if a provider is down or slow.

3 quality tiers — quality-first, balanced, cost-optimized
Fully automatic — classify, route, and failover on every request
~2 s median response — measured end-to-end
4 providers — OpenAI, Anthropic, DeepInfra, Google

# Incoming request

classify(prompt) → simple task

tier: cost-optimized

route → DeepInfra / Qwen3 32B

# If DeepInfra is down

failover → Google / Gemini 2.5 Flash

✓ 200 OK — ~2 s

Per-workspace configuration

# acme-corp workspace config

tier: balanced

providers: [deepinfra, openai, anthropic, google]

spend_cap: $500/mo

rate_limit: 100 req/min

data_retention: none

model_allowlist: [Qwen3 32B, GPT-4o mini, Claude Haiku, Gemini 2.5 Flash]

Each workspace gets its own configuration — quality tier, provider preferences, budget cap, and data policy. Change a setting and the next request uses the new rules. No redeploy, no code change.

Per-workspace controls — tier, providers, budget, limits, model lists
Instant changes — new settings apply on the next request
Isolated credentials — one endpoint and key per workspace

Budget and safety controls

Spend caps

Set a monthly budget per workspace. When it's hit, requests are blocked — not just logged. 98% budget accuracy.

Rate limits

Per-workspace request limits enforced before any inference call is made. Prevents runaway usage from a single app or team.

Fail-closed design

If the budget check can't run (database issue, network problem), the request is rejected. No silent overspend.

Usage tracking and reporting

Every request is tracked with token counts and cost. Usage rolls up into per-workspace reports you can generate with your own rates applied. You use your own provider keys — providers bill you, and you can re-bill workspaces at your own rates.

Real-time spend

See costs per workspace as requests flow through — not after the monthly bill.

Custom reports

Generate usage reports with your own rate card and markup applied.

Your keys, your pricing

You connect your own provider keys. Provider charges land on your account. You decide how to price it for your teams or clients.

Data isolation

Every organization's data is separated at the database level using row-level security. One organization cannot access another's data, keys, or logs. This isolation is verified by 16 automated tests that run continuously.

Database-enforced isolation

Security policies on every data table. The application cannot bypass them — it's enforced by the database itself.

Separate credentials

Each workspace gets its own API key and endpoint. Keys, configuration, and logs never cross organization boundaries.

Drop-in API

Retanu uses the same API format as OpenAI. If your app already calls OpenAI, it works with Retanu — just change the URL and key. No code rewrite needed.

OpenAI-compatible — same request/response format
Request tagging — add metadata to track usage by department or project
Health checks — monitoring endpoints included
Standard auth — Bearer token scoped to each workspace

# Drop-in replacement

curl your-company.api.retanu.com/v1/workspace/chat/completions \

-H "Authorization: Bearer sk-..." \

-d '{"messages": [...],

"x_retanu": {"dept": "support"}}'