The enterprise AI gateway that speaks LLM, MCP, and A2A — unified control plane with a policy engine that learns, a semantic cache that thinks, and observability that tells you exactly what happened and what it cost.
Semantic similarity caching using embeddings. Not exact-match — conceptually equivalent prior answers are served from cache, collapsing redundant LLM calls across rephrased queries.
Guardrail decisions made by an AI reading your actual compliance policies — not a keyword blocklist. Thresholds adapt over time from your organisation's real compliance outcomes.
One gateway. One audit trail. One policy engine. All three protocols share identity, budget enforcement, compliance logging, and semantic caching infrastructure.
`POST /v1/chat/completions` works with zero code changes. Streaming (`text/event-stream`), function/tool calling, vision, and extended context are all supported.
```python
from openai import OpenAI

client = OpenAI(base_url="https://your-smartflow/v1", api_key="sk-sf-...")
```
Every existing OpenAI SDK integration gains audit logging, policy enforcement, cost tracking, and 4-phase semantic caching without a single code change.
Set `ANTHROPIC_BASE_URL` to your Smartflow instance and use the Anthropic Python/TypeScript SDK unchanged. Requests to `POST /anthropic/v1/messages` are forwarded natively to `api.anthropic.com/v1/messages` with all headers preserved — `x-api-key`, `anthropic-version`, `anthropic-beta`. All Smartflow features apply transparently.
```python
from anthropic import Anthropic

client = Anthropic(base_url="https://your-smartflow/anthropic", api_key="sk-sf-...")
```
Set `ANTHROPIC_BASE_URL`; nothing else changes. The `[1m]` extended-context suffix Claude Code appends to model names is stripped automatically before forwarding.
Every Anthropic SDK integration — Claude Code, Claude Desktop, custom agents — gets centralised audit logging, policy enforcement, spend budgeting, and semantic caching with no client changes. This is the same zero-change story as the OpenAI drop-in, applied to the native Anthropic Messages API.
Model names route automatically to the right provider:

- `gpt-*`, `o1/o3/o4-*`, `dall-e-*`, `whisper-*` → OpenAI
- `claude-*` → Anthropic
- `gemini-*` → Google
- `grok-*` → xAI
- `mistral-*`, `mixtral-*` → Mistral
- `command-*` → Cohere
- `ollama/*` → local Ollama instance
- `deepseek/*`, `groq/*`, `openrouter/*` → respective providers

Provider prefixes also work: send `model: "anthropic/claude-sonnet-4-6"` to `/v1/chat/completions` and Smartflow strips the prefix and routes to Anthropic correctly.
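A minimal sketch of this prefix-based dispatch, assuming a hypothetical in-process routing table (Smartflow's real table, provider names, and resolution order may differ):

```python
# Illustrative routing table only; not Smartflow's actual configuration.
ROUTES = {
    "gpt-": "openai", "o1-": "openai", "dall-e-": "openai", "whisper-": "openai",
    "claude-": "anthropic", "gemini-": "google", "grok-": "xai",
    "mistral-": "mistral", "mixtral-": "mistral", "command-": "cohere",
}

def route(model: str) -> tuple[str, str]:
    """Return (provider, model); an explicit 'provider/model' prefix wins."""
    if "/" in model:
        provider, bare = model.split("/", 1)
        return provider, bare
    for prefix, provider in ROUTES.items():
        if model.startswith(prefix):
            return provider, model
    raise ValueError(f"no provider for model {model!r}")
```

The explicit `provider/model` form is checked first, so the stripped prefix always overrides name-pattern matching.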
Point `OLLAMA_BASE_URL` at your Ollama instance, use any model tag, zero API key required; self-hosted DeepSeek models are reached via the `deepseek/*` model prefix on `/v1/chat/completions`. Ideal for air-gapped or privacy-first deployments where all traffic stays on-premises. Organisations with on-prem GPU infrastructure route sensitive workloads to local models while lower-sensitivity traffic routes to cloud providers — all through one proxy with unified policy enforcement.
Lower provider costs on every cache miss, not just cache hits. A chat application with a 2,000-token system prompt pays for ~700 tokens after compression, on every forwarded request, with no client changes.
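As an illustration of the idea only (these are stand-in patterns, not Smartflow's actual 39-pattern set), a verbose-phrase reducer can be sketched with a few regex rules:

```python
import re

# Stand-in verbose-phrase reductions; the real pattern set is larger.
VERBOSE_PATTERNS = [
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\bdue to the fact that\b", re.I), "because"),
    (re.compile(r"\bat this point in time\b", re.I), "now"),
    (re.compile(r"\s{2,}"), " "),  # collapse runs of whitespace last
]

def compress_prompt(text: str) -> str:
    """Apply verbose-phrase reductions before forwarding upstream."""
    for pattern, replacement in VERBOSE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Because the rewrite happens in-flight at the gateway, every forwarded request benefits, whether or not it later hits the cache.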
Virtual keys (`sk-sf-...`) carry hard spend limits scoped to teams, models, or use cases. Budget periods: daily, weekly, monthly, lifetime. Budget is checked before every request — if exceeded, the request returns `429` before any provider cost is incurred. Spend is recorded after each response using actual token cost.
Per-user, per-team, and per-application cost caps with zero-tolerance enforcement. Issue keys to internal users or external partners with guaranteed spend control.
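The check-before, record-after flow can be sketched as follows; the `VirtualKey` model and function names here are hypothetical stand-ins for Smartflow's Redis-backed accounting:

```python
from dataclasses import dataclass

# Hypothetical in-memory model; Smartflow tracks spend in Redis
# and returns HTTP 429 from the proxy itself.
@dataclass
class VirtualKey:
    key: str
    budget_usd: float      # hard cap for the current period
    spent_usd: float = 0.0

def check_budget(vk: VirtualKey) -> None:
    """Pre-flight check: reject before any provider cost is incurred."""
    if vk.spent_usd >= vk.budget_usd:
        raise PermissionError(f"429: budget exceeded for key {vk.key}")

def record_spend(vk: VirtualKey, cost_usd: float) -> None:
    """Post-response accounting using actual token cost."""
    vk.spent_usd += cost_usd
```

The ordering is the point: the budget gate runs before the provider call, so an exhausted key never generates upstream cost.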
Phase 4 embeds each query with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, local inference), then runs a K-nearest-neighbours search against all stored response embeddings in Redis Stack using cosine similarity (threshold: 0.90). This catches paraphrased questions — the most common real-world pattern missed by all prior phases. A user asking "what are the side effects of ibuprofen?" and another asking "can ibuprofen cause stomach problems?" resolve to the same cached response at Phase 4 (similarity ≈ 0.92). Exact-match caches and fingerprint caches miss this entirely. With real-world query traffic, Phase 4 hits represent the largest portion of cache savings after initial warm-up.
Dramatically lower provider costs on repeated-topic workloads — support bots, internal Q&A, knowledge base tools. Phase 4 hit latency is 5–20ms (local BERT + Redis KNN), versus hundreds of milliseconds for a live LLM call. No application changes required.
Per-request cache controls:

- `Cache-Control: no-cache` — bypass cache read
- `Cache-Control: no-store` — bypass cache write
- `x-smartflow-cache-ttl: 3600` — override TTL
- `x-smartflow-cache-namespace: team-a` — scope to a logical partition

Responses include `x-smartflow-cache-hit: true` and `x-smartflow-cache-key` for client-side correlation.
Mix cacheable and non-cacheable calls in the same integration. Real-time lookups that must never be stale coexist with deterministic queries that benefit from caching — controlled per-request, not per-route.
Static keyword blocklists block "kill process" in a DevOps context but miss subtle policy violations in legal or medical text. Maestro reads the policy the same way a compliance officer would and makes contextual judgements. It gets better as your team reviews its decisions.
Compliance enforcement that improves over time. Zero false positives from keyword collisions. A policy engine that matches the nuance of your actual compliance requirements rather than a generic vendor baseline.
Guardrails (`pii`, `toxicity`, `prompt_injection`, `compliance`, `custom`) are grouped into named policies. Policies inherit from a parent and override specific guardrails — no duplication. Policies attach to scopes: team, virtual-key alias, model pattern, or tag wildcard (e.g. `hipaa-*`). Every proxied request returns headers listing which policies matched and why.
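Parent inheritance with child overrides can be sketched as a dictionary merge up the parent chain; the policy records below are hypothetical examples, not Smartflow's actual schema:

```python
# Hypothetical policy store: each policy names a parent and the
# guardrails it sets or overrides.
POLICIES = {
    "base": {"parent": None,
             "guardrails": {"pii": "block", "toxicity": "warn"}},
    "hipaa-strict": {"parent": "base",
                     "guardrails": {"toxicity": "block", "compliance": "block"}},
}

def resolve(name: str) -> dict:
    """Walk the parent chain; child guardrails override inherited ones."""
    policy = POLICIES[name]
    inherited = resolve(policy["parent"]) if policy["parent"] else {}
    return {**inherited, **policy["guardrails"]}
```

`hipaa-strict` thus inherits `pii: block` from `base` while tightening `toxicity` and adding `compliance`, without restating the whole parent policy.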
Real-time event-driven MCP servers, local CLI tools, and standard community servers (GitHub CLI, filesystem access, local databases) can all be registered and used through the same gateway.
Each registered MCP server supports `allowed_tools` (whitelist), `disallowed_tools` (blacklist, overrides whitelist), and `allowed_params` (per-tool parameter allow-lists). Requests that call disallowed tools or pass disallowed parameters are rejected at the gateway before reaching the MCP server. Setting `available_on_public_internet: false` blocks requests from non-RFC-1918 IPs entirely.
Fine-grained least-privilege enforcement for every MCP server — no trust required at the MCP server level. Public-facing deployments with private MCP infrastructure are safely isolated.
Sending `x-mcp-query: summarize a PDF` with a `tools/list` request returns only semantically relevant tools — not the full catalogue. Agents discover capabilities by intent, not by knowing server names.
Per-request credentials travel in `x-mcp-{alias}-{header-name}` headers. Smartflow extracts and forwards them only to the intended server and strips them before forwarding to the end LLM. User-specific or request-specific credentials (session tokens, scoped API keys) can be forwarded to MCP servers without storing them centrally. No credential leakage between servers.
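The extract-and-strip step can be sketched as follows (a hypothetical helper, using the `x-mcp-{alias}-{header-name}` convention above):

```python
def route_mcp_headers(headers: dict[str, str], alias: str) -> tuple[dict, dict]:
    """Split per-server credentials out of the incoming request headers.

    Returns (for_server, for_llm): credential headers for the aliased MCP
    server, and the remaining headers with all x-mcp-* entries stripped
    so nothing credential-bearing reaches the LLM provider.
    """
    prefix = f"x-mcp-{alias}-"
    for_server = {k[len(prefix):]: v for k, v in headers.items()
                  if k.lower().startswith(prefix)}
    for_llm = {k: v for k, v in headers.items()
               if not k.lower().startswith("x-mcp-")}
    return for_server, for_llm
```

Headers addressed to other aliases are dropped from both outputs, which is what prevents credential leakage between servers.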
Compliance coverage for both inbound tool requests and outbound tool responses — catching violations at both ends of the MCP call without requiring changes to the MCP server itself.
Cross-framework agent collaboration. LangGraph agents talk to Pydantic AI agents through Smartflow. Task chains span services with full traceability via X-A2A-Trace-Id. Task history is persisted in Redis for replay and audit.
High-availability LLM routing with no single point of failure. Multi-provider redundancy configurable per model or use case without application-level changes.
Routing strategies:

- Latency-based (`SMARTFLOW_ROUTING_STRATEGY=latency`) — rolling p95 EMA tracked per provider in Redis; requests route to the fastest live provider.
- Tag-based (`strategy=tag`) — the `x-smartflow-tags` header is matched against per-provider capability tags in Redis.
- Budget-capped (`SMARTFLOW_PROVIDER_BUDGETS=openai:100,anthropic:50`) — a provider is skipped when its daily spend cap is reached; the fallback chain takes over automatically.

On Entra ID sign-in, Smartflow validates the `id_token`, extracts Entra group memberships and App Role claims, and automatically creates or updates Smartflow teams in Redis. Users are added to teams they belong to and removed from teams they have left. App Role values map to internal roles: `proxy_admin`, `org_admin`, `proxy_admin_viewer`, `internal_user`. Access controls, budgets, and guardrail policies attached to teams take effect immediately when membership changes in Entra.
Zero-touch team provisioning from Entra ID. No manual group-to-team mapping. Spend limits and compliance policies follow group membership automatically.
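The latency-based routing strategy described earlier can be sketched with a per-provider exponential moving average (a stand-in for the rolling p95 Smartflow tracks in Redis; `ALPHA` is an assumed smoothing factor):

```python
# Per-provider EMA of observed latency; a simplified stand-in for
# the rolling p95 Smartflow keeps in Redis.
ALPHA = 0.2  # assumed smoothing factor

latency_ema: dict[str, float] = {}

def observe(provider: str, latency_ms: float) -> None:
    """Fold a new latency sample into the provider's moving average."""
    prev = latency_ema.get(provider, latency_ms)
    latency_ema[provider] = (1 - ALPHA) * prev + ALPHA * latency_ms

def pick_provider(candidates: list[str]) -> str:
    """Route to the fastest provider seen so far (unseen providers try first)."""
    return min(candidates, key=lambda p: latency_ema.get(p, 0.0))
```

Each completed response feeds `observe`, so the routing decision continuously tracks real provider performance instead of a static preference list.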
GET /metrics exposes text-format Prometheus metrics: per-provider daily spend, per-provider rolling p95 latency, MCP call counts and costs by server, vector store count, and version info. Scrape directly into any Prometheus + Grafana stack.
Every response carries standardised headers:

- `x-smartflow-call-id` — unique trace ID
- `x-smartflow-response-cost` — USD cost
- `x-smartflow-cache-hit` — true/false
- `x-smartflow-duration-ms` — end-to-end latency
- `x-smartflow-provider` — which provider served the response

Alerting is configured via `SLACK_WEBHOOK_URL`, `TEAMS_WEBHOOK_URL`, and `DISCORD_WEBHOOK_URL`. Alerts are non-blocking and do not add latency to the request path.
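A hypothetical client-side helper showing how these headers support cost correlation without touching the response body:

```python
def summarise(headers: dict[str, str]) -> str:
    """One-line summary of a call from Smartflow's standardised headers."""
    cost = float(headers.get("x-smartflow-response-cost", "0"))
    cached = headers.get("x-smartflow-cache-hit") == "true"
    provider = headers.get("x-smartflow-provider", "unknown")
    call_id = headers.get("x-smartflow-call-id", "-")
    line = f"[{call_id}] {provider} ${cost:.4f}"
    return line + (" (cache hit)" if cached else "")
```

The same `x-smartflow-call-id` can be logged client-side to join application logs with the gateway's audit trail.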
Complete per-user, per-request audit trail for compliance reporting. Every LLM call, every MCP tool invocation, and every A2A task is logged with the identity of who triggered it and which policies applied.
| # | Feature | Area |
|---|---|---|
| 1 | OpenAI drop-in — zero-change replacement for any OpenAI SDK client (`/v1/chat/completions`) | LLM Proxy |
| 2 | Anthropic drop-in — native Messages API passthrough (`/anthropic/v1/messages`); set `ANTHROPIC_BASE_URL` and go | LLM Proxy |
| 3 | Provider auto-routing by model name (gpt, claude, gemini, grok, mistral, command, ollama, deepseek…) | LLM Proxy |
| 4 | Cursor IDE + Claude Code + Claude Desktop passthrough (zero client changes) | LLM Proxy |
| 5 | Local model support — Ollama, GGUF/ONNX, DeepSeek self-hosted, vLLM, LM Studio | LLM Proxy |
| 6 | In-flight prompt compression — 39-pattern verbose reduction + semantic dedup (~65% ratio) | Caching |
| 7 | 4-Phase MetaCache — intent fingerprint → near-miss → exact key → VectorLite BERT KNN (`all-MiniLM-L6-v2`, 384-dim, ≥ 0.90) | MetaCache |
| 8 | Per-request cache controls (no-cache, no-store, ttl, namespace) | Caching |
| 9 | Transparent LLM-side prompt cache injection (Anthropic ephemeral, OpenAI prefix caching) | Caching |
| 10 | Virtual keys with spend budgets (daily / weekly / monthly / lifetime) | Key Mgmt |
| 11 | Provider key vault — raw credentials stored server-side, never exposed to clients | Key Mgmt |
| 12 | Fallback chains with per-step retry and exponential backoff | Routing |
| 13 | Latency-based (p95 EMA), tag-based, and cost-based provider routing | Routing |
| 14 | Per-provider daily budget caps with automatic failover | Routing |
| 15 | AI Policy Engine (Maestro) — learning-based guardrails evaluated by AI, not regex | Policy |
| 16 | Policy groups with parent inheritance + tag-wildcard scoping | Policy |
| 17 | Guardrail policy response headers on every request | Policy |
| 18 | MCP HTTP / SSE / STDIO transports | MCP |
| 19 | Per-server tool allow/deny lists + parameter allow-lists | MCP |
| 20 | Semantic tool filtering via embedding index | MCP |
| 21 | Built-in vector stores + RAG pipeline (no external vector DB) | RAG |
| 22 | MCP server aliases + per-alias routing | MCP |
| 23 | OAuth Client Credentials auto-refresh for MCP | MCP Auth |
| 24 | OAuth PKCE per-user browser consent | MCP Auth |
| 25 | Per-request auth header forwarding to MCP servers | MCP Auth |
| 26 | Public internet IP gating per MCP server | MCP |
| 27 | MCP guardrail modes (PreCall / DuringCall / Disabled) | MCP Policy |
| 28 | MCP cost tracking by server / user / tool | MCP Obs |
| 29 | A2A agent gateway (Google A2A open protocol) | A2A |
| 30 | Agent Cards + task streaming via SSE | A2A |
| 31 | Cross-agent tracing (X-A2A-Trace-Id) | A2A |
| 32 | Microsoft Entra ID SSO + zero-touch group sync | Identity |
| 33 | Prometheus /metrics endpoint | Observability |
| 34 | Standardised response headers (cost, trace, cache, latency, provider) | Observability |
| 35 | Slack / Teams / Discord alerting webhooks | Alerting |
| 36 | VAS log audit trail with compliance dashboard + Q&A replay | Compliance |
| 37 | Kubernetes / Helm deployment (cert-manager TLS, NGINX ingress, horizontal scaling) | Deployment |