Version 1.4

Smartflow Platform Capabilities

The enterprise AI gateway that speaks LLM, MCP, and A2A — unified control plane with a policy engine that learns, a semantic cache that thinks, and observability that tells you exactly what happened and what it cost.

LLM Proxy · MCP Gateway · A2A Agent Gateway · MetaCache · AI Policy Engine

MetaCache

Semantic similarity caching using embeddings. Not exact-match — conceptually equivalent prior answers are served from cache, collapsing redundant LLM calls across rephrased queries.

AI Policy Engine

Guardrail decisions made by an AI reading your actual compliance policies — not a keyword blocklist. Thresholds adapt over time from your organisation's real compliance outcomes.

Unified LLM + MCP + A2A

One gateway. One audit trail. One policy engine. All three protocols share identity, budget enforcement, compliance logging, and semantic caching infrastructure.

LLM Proxy
1
OpenAI Drop-In — Universal Endpoint ZERO CHANGES
LLM Proxy
Complete drop-in replacement for the OpenAI API. Any client SDK, LangChain app, LlamaIndex pipeline, or tool that targets POST /v1/chat/completions works with zero code changes. Streaming (text/event-stream), function/tool calling, vision, and extended context are all supported.

client = OpenAI(base_url="https://your-smartflow/v1", api_key="sk-sf-...")
Unlocks

Every existing OpenAI SDK integration gains audit logging, policy enforcement, cost tracking, and 4-phase semantic caching without a single code change.

2
Anthropic Drop-In — Native Messages API ZERO CHANGES
LLM Proxy
Complete drop-in replacement for the Anthropic SDK. Set ANTHROPIC_BASE_URL to your Smartflow instance and use the Anthropic Python/TypeScript SDK unchanged. Requests to POST /anthropic/v1/messages are forwarded natively to api.anthropic.com/v1/messages with all headers preserved — x-api-key, anthropic-version, anthropic-beta. All Smartflow features apply transparently.

client = Anthropic(base_url="https://your-smartflow/anthropic", api_key="sk-sf-...")

Claude Code and Claude Desktop connect this way — set ANTHROPIC_BASE_URL, nothing else changes. The [1m] extended-context suffix Claude Code appends to model names is stripped automatically before forwarding.
Unlocks

Every Anthropic SDK integration — Claude Code, Claude Desktop, custom agents — gets centralised audit logging, policy enforcement, spend budgeting, and semantic caching with no client changes. This is the same zero-change story as the OpenAI drop-in, applied to the native Anthropic Messages API.

3
Provider Auto-Routing by Model Name
LLM Proxy
Model name drives provider selection automatically — no per-client configuration:
  • gpt-*, o1/o3/o4-*, dall-e-*, whisper-* → OpenAI
  • claude-* → Anthropic
  • gemini-* → Google
  • grok-* → xAI
  • mistral-*, mixtral-* → Mistral
  • command-* → Cohere
  • ollama/* → local Ollama instance
  • deepseek/*, groq/*, openrouter/* → respective providers

Model prefix syntax works on any endpoint: send model: "anthropic/claude-sonnet-4-6" to /v1/chat/completions and Smartflow strips the prefix and routes to Anthropic correctly.
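The dispatch rules above can be sketched as a small routing function — an illustrative reconstruction of the behaviour described, not Smartflow's actual code:

```python
def route_model(model: str) -> tuple[str, str]:
    """Return (provider, model) for a request, stripping any provider prefix."""
    # An explicit "provider/model" prefix wins on any endpoint.
    if "/" in model:
        prefix, _, rest = model.partition("/")
        if prefix in {"anthropic", "openai", "ollama", "deepseek", "groq", "openrouter"}:
            return prefix, rest
    # Otherwise the model-name family selects the provider.
    prefix_map = [
        (("gpt-", "o1", "o3", "o4", "dall-e-", "whisper-"), "openai"),
        (("claude-",), "anthropic"),
        (("gemini-",), "google"),
        (("grok-",), "xai"),
        (("mistral-", "mixtral-"), "mistral"),
        (("command-",), "cohere"),
    ]
    for prefixes, provider in prefix_map:
        if model.startswith(prefixes):
            return provider, model
    raise ValueError(f"no provider mapping for model {model!r}")
```

So `route_model("anthropic/claude-sonnet-4-6")` strips the prefix and routes to Anthropic, exactly as the prefix-syntax example describes.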
4
Local Model Support
LLM Proxy
Route to local inference without sending data to any cloud provider. All Smartflow features — compliance, caching, VAS logging — apply identically to local model traffic.
  • Ollama — set OLLAMA_BASE_URL, use any model tag, zero API key required
  • Local GGUF / ONNX — any OpenAI-compatible local inference server
  • DeepSeek self-hosted — via the deepseek/* model prefix
  • HuggingFace / vLLM / LM Studio — any server exposing /v1/chat/completions
Unlocks

Air-gapped or privacy-first deployments where all traffic stays on-premises. Organisations with on-prem GPU infrastructure route sensitive workloads to local models while lower-sensitivity traffic routes to cloud providers — all through one proxy with unified policy enforcement.

5
In-Flight Prompt Compression
Caching
On every cache miss, outgoing request bodies are compressed before forwarding. Completely transparent — clients send and receive standard API payloads unchanged.
  • Verbose phrase reduction — 39 patterns replaced automatically ("In order to" → "To"). 20–30% token reduction on prose-heavy system prompts.
  • Semantic deduplication — repeated concepts across message history detected via embeddings and replaced with compact references, resolved transparently on response.
  • Provider-aware — aggressiveness calibrated per provider. Token savings tracked per request in VAS logs.
Production result: ~65% average compression ratio.
Unlocks

Lower provider costs on every cache miss, not just cache hits. A chat application with a 2,000-token system prompt pays for ~700 tokens after compression, on every forwarded request, with no client changes.
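The verbose-phrase-reduction idea can be illustrated with a toy pattern table — these few patterns are examples of the kind of rewrites involved, not Smartflow's actual 39-pattern table:

```python
import re

# Illustrative verbose-phrase patterns (the real gateway ships 39 of these).
PHRASE_PATTERNS = [
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
    (re.compile(r"\bdue to the fact that\b", re.IGNORECASE), "because"),
    (re.compile(r"\bat this point in time\b", re.IGNORECASE), "now"),
    (re.compile(r"\bplease note that\b", re.IGNORECASE), ""),
]

def compress_prompt(text: str) -> str:
    """Apply each pattern, then tidy any doubled whitespace left behind."""
    for pattern, replacement in PHRASE_PATTERNS:
        text = pattern.sub(replacement, text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

Because the rewrite happens on the forwarded request body, the client still sends and receives standard payloads unchanged.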

6
Virtual Keys with Spend Budgets
Key Management
Smartflow-issued virtual keys (sk-sf-...) carry hard spend limits scoped to teams, models, or use cases. Budget periods: daily, weekly, monthly, lifetime. Budget is checked before every request — if exceeded, the request returns 429 before any provider cost is incurred. Spend is recorded after each response using actual token cost.
Unlocks

Per-user, per-team, and per-application cost caps with zero-tolerance enforcement. Issue keys to internal users or external partners with guaranteed spend control.
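The check-before, record-after lifecycle can be sketched with an in-memory ledger — a minimal sketch assuming a local dict, where the real gateway persists spend in Redis and uses actual token cost from the provider response:

```python
import time

class BudgetExceeded(Exception):
    status_code = 429  # returned before any provider cost is incurred

class BudgetLedger:
    def __init__(self):
        self._spend: dict[tuple[str, str], float] = {}  # (key, period) -> USD

    def _period(self, granularity: str) -> str:
        t = time.gmtime()
        return {"daily": f"{t.tm_year}-{t.tm_yday}",
                "monthly": f"{t.tm_year}-{t.tm_mon}",
                "lifetime": "all"}[granularity]

    def check(self, key: str, granularity: str, limit_usd: float) -> None:
        """Pre-request gate: raise if this period's budget is already spent."""
        if self._spend.get((key, self._period(granularity)), 0.0) >= limit_usd:
            raise BudgetExceeded(f"{key} over {granularity} budget of ${limit_usd}")

    def record(self, key: str, granularity: str, cost_usd: float) -> None:
        """Post-response accounting with actual token cost."""
        slot = (key, self._period(granularity))
        self._spend[slot] = self._spend.get(slot, 0.0) + cost_usd
```

A request flow is then: `check()` → forward to provider → `record()` with the real cost.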

MetaCache — 4-Phase Semantic Cache
7
4-Phase Semantic Cache with VectorLite BERT KNN UNIQUE
MetaCache
Every request passes through four lookup phases before any token is sent to a provider. The first hit wins.

Phase 1 — Intent fingerprint exact match (in-process, <1ms). Catches repeated questions regardless of minor wording variation.
Phase 2 — Intent fingerprint near-miss via token-overlap scoring. Catches reformulations of the same core question.
Phase 3 — SHA-256 exact key lookup in Redis (1–3ms). Reliable for structured payloads, multimodal inputs, and code prompts.
Phase 4 — VectorLite BERT KNN: request embedded with sentence-transformers/all-MiniLM-L6-v2 (384-dim, local inference), then K-nearest-neighbours search against all stored response embeddings in Redis Stack using cosine similarity. Threshold: 0.90. Catches paraphrased questions — the most common real-world pattern missed by all prior phases.

Responses are stored with their BERT embeddings at write time. No external API call, no Qdrant, no Weaviate — BERT inference runs natively inside the proxy process.
Why This Matters

A user asking "what are the side effects of ibuprofen?" and another asking "can ibuprofen cause stomach problems?" resolve to the same cached response at Phase 4 (similarity ≈ 0.92). Exact-match caches and fingerprint caches miss this entirely. With real-world query traffic, Phase 4 hits represent the largest portion of cache savings after initial warm-up.

Unlocks

Dramatically lower provider costs on repeated-topic workloads — support bots, internal Q&A, knowledge base tools. Phase 4 hit latency is 5–20ms (local BERT + Redis KNN), versus hundreds of milliseconds for a live LLM call. No application changes required.

8
Per-Request Cache Controls
Caching
Callers control caching behaviour on individual requests without changing server configuration:
Cache-Control: no-cache — bypass read  |  no-store — bypass write
x-smartflow-cache-ttl: 3600 — override TTL  |  x-smartflow-cache-namespace: team-a — scope to a logical partition

Every cached response returns x-smartflow-cache-hit: true and x-smartflow-cache-key for client-side correlation.
Unlocks

Mix cacheable and non-cacheable calls in the same integration. Real-time lookups that must never be stale coexist with deterministic queries that benefit from caching — controlled per-request, not per-route.

AI Policy Engine — Maestro
9
Learning-Based Guardrails UNIQUE
Policy Engine
Maestro runs as a pre-call and post-call validation pass. It reads your organisation's compliance policies (stored as documents in the Policy Perfect API), embeds them, and evaluates each request against the semantic intent of those policies — not surface-level keyword matching.

The Policy Perfect API maintains a living corpus of compliance decisions. Every flagged or blocked request outcome is fed back into the policy model. Guardrail thresholds adapt over time based on your organisation's actual decisions — not a vendor's preset calibration.
Why This Matters

Static keyword blocklists block "kill process" in a DevOps context but miss subtle policy violations in legal or medical text. Maestro reads the policy the same way a compliance officer would and makes contextual judgements. It gets better as your team reviews its decisions.

Unlocks

Compliance enforcement that improves over time, without the false positives that keyword collisions produce. A policy engine that matches the nuance of your actual compliance requirements rather than a generic vendor baseline.

10
Policy Groups with Inheritance + Tag-Wildcard Scoping
Policy
Named guardrails (pii, toxicity, prompt_injection, compliance, custom) are grouped into named policies. Policies inherit from a parent and override specific guardrails — no duplication. Policies attach to scopes: team, virtual-key alias, model pattern, or tag wildcard (e.g. hipaa-*). Every proxied request returns headers listing which policies matched and why.
Endpoints
  • POST /api/guardrails, GET /api/guardrails
  • POST /api/policies, GET /api/policies/{name}
  • POST /api/policies/attachments
  • POST /api/policies/resolve — preview guardrails for a context
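The inheritance and wildcard-scoping semantics can be sketched in a few lines — policy names, guardrail actions, and attachments here are invented for illustration, not Smartflow's schema:

```python
from fnmatch import fnmatch

# Child policies start from the parent's guardrail map and replace only
# the entries they name (illustrative data).
POLICIES = {
    "base": {"parent": None,
             "guardrails": {"pii": "block", "toxicity": "flag"}},
    "hipaa-strict": {"parent": "base",
                     "guardrails": {"toxicity": "block", "compliance": "block"}},
}

def resolve_guardrails(name: str) -> dict[str, str]:
    policy = POLICIES[name]
    merged = resolve_guardrails(policy["parent"]) if policy["parent"] else {}
    merged.update(policy["guardrails"])  # child overrides win, no duplication
    return merged

# Tag-wildcard scoping: attachments match tags like "hipaa-phi" via hipaa-*.
ATTACHMENTS = [("hipaa-strict", "hipaa-*")]

def policies_for_tag(tag: str) -> list[str]:
    return [policy for policy, pattern in ATTACHMENTS if fnmatch(tag, pattern)]
```

This mirrors what POST /api/policies/resolve previews: the effective guardrail set for a given context after inheritance and scoping are applied.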
MCP Gateway
11
HTTP, SSE, and STDIO Transports
MCP Gateway
Every MCP transport type is supported.
  • HTTP — standard JSON-RPC 2.0 over HTTPS.
  • SSE — the proxy opens and maintains the Server-Sent Events stream; events are parsed and routed transparently.
  • STDIO — Smartflow spawns the local child process, communicates over stdin/stdout using JSON-RPC 2.0, and manages the process lifecycle automatically.
Unlocks

Real-time event-driven MCP servers, local CLI tools, and standard community servers (GitHub CLI, filesystem access, local databases) can all be registered and used through the same gateway.

12
Per-Server Tool Access Control
MCP Gateway
Each MCP server carries a fine-grained access policy: allowed_tools (whitelist), disallowed_tools (blacklist, overrides whitelist), and allowed_params (per-tool parameter allow-lists). Requests that call disallowed tools or pass disallowed parameters are rejected at the gateway before reaching the MCP server. available_on_public_internet: false blocks requests from non-RFC-1918 IPs entirely.
Unlocks

Fine-grained least-privilege enforcement for every MCP server — no trust required at the MCP server level. Public-facing deployments with private MCP infrastructure are safely isolated.
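The precedence rules (blacklist over whitelist, then per-tool parameter allow-lists) can be sketched as a single gateway-side check — the policy dict shape follows the fields named above, with illustrative tool names:

```python
# Illustrative per-server policy; field names follow the description above.
SERVER_POLICY = {
    "allowed_tools": ["read_file", "list_dir"],
    "disallowed_tools": ["read_file"],          # blacklist overrides whitelist
    "allowed_params": {"list_dir": {"path"}},   # per-tool parameter allow-list
}

def authorize(tool: str, params: dict, policy: dict = SERVER_POLICY) -> bool:
    """Reject at the gateway, before the request reaches the MCP server."""
    if tool in policy.get("disallowed_tools", []):
        return False
    allowed = policy.get("allowed_tools")
    if allowed is not None and tool not in allowed:
        return False
    allowed_params = policy.get("allowed_params", {}).get(tool)
    if allowed_params is not None and not set(params) <= allowed_params:
        return False
    return True
```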

13
Semantic Tool Filtering
MCP Gateway
All tools across all registered servers are indexed with embeddings of their name, description, and parameter signatures. Passing x-mcp-query: summarize a PDF on a tools/list request returns only semantically relevant tools — not the full catalogue. Agents discover capabilities by intent, not by knowing server names.
Endpoints
  • GET /api/mcp/tools/search?q=read+a+file&k=5
  • POST /api/mcp/tools/reindex
Vector Stores & RAG
14
Built-In Vector Store API UNIQUE
Vector / RAG
OpenAI-compatible vector store API backed by Redis and EmbeddingService — no external vector database required. CRUD for stores, automatic chunk-and-embed on file ingest, and top-K semantic search. The RAG pipeline endpoints compose retrieval-augmented generation end-to-end: ingest a document once, query it with a natural language question, and receive assembled context chunks ready to inject into a prompt.
Vector Store Endpoints
  • POST /v1/vector_stores — create store
  • GET /v1/vector_stores — list stores
  • POST /v1/vector_stores/{id}/files — ingest text (chunked + embedded)
  • POST /v1/vector_stores/{id}/search — top-K semantic search
RAG Endpoints
  • POST /v1/rag/ingest — chunk, embed, and store a document
  • POST /v1/rag/query — embed question, retrieve context, return chunks
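The ingest half of the pipeline starts with chunking. A minimal sketch of fixed-size overlapping word chunks — the chunk size, overlap, and strategy here are assumptions for illustration; the gateway's actual chunker may differ:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows ready for embedding."""
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                      # last window already covers the tail
    return chunks
```

Each chunk would then be embedded and stored (POST /v1/rag/ingest), and at query time the question is embedded and the top-K nearest chunks returned (POST /v1/rag/query) as context ready to inject into a prompt.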
MCP Authentication
15
OAuth Client Credentials Auto-Refresh
MCP Auth
MCP servers secured with OAuth 2.0 client credentials flow are fully supported. Smartflow automatically obtains, caches, and refreshes tokens on behalf of the caller. Compatible with Azure AD, Okta, Auth0, and any standards-compliant OAuth provider.
16
OAuth PKCE Per-User Browser Consent
MCP Auth
MCP servers requiring individual user consent use the PKCE flow. The user is redirected to the provider's consent screen; after authorisation, Smartflow exchanges the code and stores a user-scoped token in Redis. Tokens are scoped per user and server and expire independently.
Endpoints
  • GET /api/mcp/auth/initiate?server_id=...&user_id=...
  • GET /api/mcp/auth/callback
  • GET /.well-known/oauth-protected-resource
  • GET /.well-known/oauth-authorization-server
17
Per-Request Server Auth Header Forwarding
MCP Auth
Callers pass server-specific credentials using x-mcp-{alias}-{header-name} headers. Smartflow extracts and forwards them only to the intended server and strips them before forwarding to the end LLM. User-specific or request-specific credentials (session tokens, scoped API keys) can be forwarded to MCP servers without storing them centrally. No credential leakage between servers.
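The extract-and-strip behaviour can be sketched for a single server alias — alias and header names here are illustrative:

```python
def split_mcp_headers(headers: dict[str, str], alias: str) -> tuple[dict[str, str], dict[str, str]]:
    """Return (headers for that server, remaining headers safe to forward on)."""
    prefix = f"x-mcp-{alias}-"
    server_headers, remaining = {}, {}
    for name, value in headers.items():
        if name.lower().startswith(prefix):
            server_headers[name[len(prefix):]] = value  # forward only to this server
        elif not name.lower().startswith("x-mcp-"):
            remaining[name] = value   # strip every x-mcp-* header before the LLM hop
        # x-mcp-* headers addressed to other aliases are not forwarded here,
        # so credentials never leak between servers.
    return server_headers, remaining
```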
MCP Guardrail Modes
18
PreCall / DuringCall / Disabled Guardrail Modes
MCP Policy
Each registered MCP server declares its guardrail mode.
  • PreCall — compliance is scanned before the tool is called; violations block the request.
  • DuringCall — the MCP server response is scanned; violations in the response are caught before the result reaches the LLM.
  • Disabled — no scanning, for performance-sensitive internal tools.
Unlocks

Compliance coverage for both inbound tool requests and outbound tool responses — catching violations at both ends of the MCP call without requiring changes to the MCP server itself.

A2A Agent Gateway
19
Google A2A Open Protocol Gateway UNIQUE
Agent Gateway
Smartflow implements the Google A2A open protocol, making it interoperable with any A2A-compatible agent runtime: LangGraph, Vertex AI, Azure AI Foundry, Amazon Bedrock AgentCore, Pydantic AI. Agents are registered as named profiles in Redis with their own Agent Cards and skill declarations. External agent systems connect without custom integration code.
Unlocks

Cross-framework agent collaboration. LangGraph agents talk to Pydantic AI agents through Smartflow. Task chains span services with full traceability via X-A2A-Trace-Id. Task history is persisted in Redis for replay and audit.

Endpoints
  • GET /.well-known/agent.json — gateway Agent Card
  • GET /a2a/{id}/.well-known/agent.json — per-agent card
  • POST /a2a/{id} — tasks/send, tasks/sendSubscribe, tasks/get, tasks/cancel
  • GET/POST /api/a2a/agents — agent management
Routing
20
Fallback Chains with Retry and Backoff
Routing
Named fallback chains define an ordered list of provider targets. Retryable errors (429, 5xx) trigger exponential backoff before trying the next target. Non-retryable errors (4xx) move immediately to the next step. Chains are stored in Redis and manageable via API.
Unlocks

High-availability LLM routing with no single point of failure. Multi-provider redundancy configurable per model or use case without application-level changes.
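The chain semantics (backoff on retryable statuses, immediate skip on non-retryable ones) reduce to a short loop — a minimal sketch in which `call` stands in for the actual provider request:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def run_chain(targets, call, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Try each target in order; back off on retryable errors, skip on 4xx."""
    for target in targets:
        for attempt in range(max_retries + 1):
            status, response = call(target)
            if status == 200:
                return target, response
            if status not in RETRYABLE:
                break                               # 4xx: next target immediately
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)    # exponential backoff
    raise RuntimeError("all fallback targets failed")
```

Injecting `sleep` keeps the sketch testable; the real gateway stores chains in Redis and manages them via API.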

21
Latency-Based + Tag-Based + Budget-Capped Routing
Routing
Latency-based (SMARTFLOW_ROUTING_STRATEGY=latency) — rolling p95 EMA tracked per provider in Redis; requests route to the fastest live provider.
Tag-based (strategy=tag) — x-smartflow-tags header matched against per-provider capability tags in Redis.
Budget caps (SMARTFLOW_PROVIDER_BUDGETS=openai:100,anthropic:50) — provider is skipped when its daily spend cap is reached; fallback chain takes over automatically.
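The latency strategy can be sketched as an EMA tracker with a lowest-wins pick — here the EMA runs over observed latencies directly (the real gateway tracks a rolling p95 in Redis), and the smoothing factor is an illustrative choice:

```python
class LatencyRouter:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.ema: dict[str, float] = {}

    def observe(self, provider: str, latency_ms: float) -> None:
        """Fold a new latency sample into the provider's EMA."""
        prev = self.ema.get(provider)
        self.ema[provider] = (latency_ms if prev is None
                              else (1 - self.alpha) * prev + self.alpha * latency_ms)

    def pick(self, candidates: list[str]) -> str:
        """Route to the fastest known provider; unseen providers are tried first."""
        return min(candidates, key=lambda p: self.ema.get(p, 0.0))
```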
Enterprise Identity
22
Microsoft Entra ID SSO + Group Sync
Identity
On every SSO sign-in, Smartflow decodes the OIDC id_token, extracts Entra group memberships and App Role claims, and automatically creates or updates Smartflow teams in Redis. Users are added to teams they belong to and removed from teams they have left. App Role values map to internal roles: proxy_admin, org_admin, proxy_admin_viewer, internal_user. Access controls, budgets, and guardrail policies attached to teams take effect immediately when membership changes in Entra.
Unlocks

Zero-touch team provisioning from Entra ID. No manual group-to-team mapping. Spend limits and compliance policies follow group membership automatically.

Endpoints
  • POST /api/auth/sso/config
  • POST /api/auth/sso/signin
  • GET /api/auth/sso/teams
  • GET /api/auth/sso/users/{id}/teams
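The sync step can be sketched as a pure diff of decoded token claims against current membership — the "groups" and "roles" claim names are standard Entra ID token claims, while the App Role strings in the map are hypothetical examples:

```python
# Hypothetical App Role names mapped to the internal roles listed above.
APP_ROLE_MAP = {"Smartflow.Admin": "proxy_admin",
                "Smartflow.OrgAdmin": "org_admin",
                "Smartflow.Viewer": "proxy_admin_viewer"}

def sync_user(claims: dict, current_teams: set[str]) -> dict:
    """Diff Entra group claims against current teams; map App Role to a role."""
    desired = set(claims.get("groups", []))
    role = next((APP_ROLE_MAP[r] for r in claims.get("roles", []) if r in APP_ROLE_MAP),
                "internal_user")   # default when no App Role claim matches
    return {"add": sorted(desired - current_teams),     # joined in Entra
            "remove": sorted(current_teams - desired),  # left in Entra
            "role": role}
```

Running this on every sign-in is what makes provisioning zero-touch: team membership and role follow the id_token, with no manual mapping.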
Observability
23
Prometheus /metrics Endpoint
Observability
GET /metrics exposes text-format Prometheus metrics: per-provider daily spend, per-provider rolling p95 latency, MCP call counts and costs by server, vector store count, and version info. Scrape directly into any Prometheus + Grafana stack.
24
Standardised Response Headers
Observability
Every response from the proxy carries a complete observability header set:
x-smartflow-call-id — unique trace ID  |  x-smartflow-response-cost — USD cost  |  x-smartflow-cache-hit — true/false  |  x-smartflow-duration-ms — end-to-end latency  |  x-smartflow-provider — which provider served the response
25
Alerting Webhooks — Slack, Teams, Discord
Alerting
Fire-and-forget webhook alerts on: provider budget threshold breach, provider failure spike, and slow/hanging API calls. Configure via environment variables: SLACK_WEBHOOK_URL, TEAMS_WEBHOOK_URL, DISCORD_WEBHOOK_URL. Alerts are non-blocking and do not add latency to the request path.
26
VAS Log Audit Trail + Compliance Dashboard
Compliance
Every proxied request writes a structured VAS log entry capturing: user identity, model, provider, token counts, cost estimate, cache outcome, end-to-end latency, matched policy names, and guardrail decisions. Logs are persisted in Redis (hot) and MongoDB (archive). The compliance dashboard surfaces every entry with named users, full Q&A replay in a modal panel, and filterable policy-match views.
Unlocks

Complete per-user, per-request audit trail for compliance reporting. Every LLM call, every MCP tool invocation, and every A2A task is logged with the identity of who triggered it and which policies applied.

Complete Feature Summary
#  | Feature | Area
1  | OpenAI drop-in — zero-change replacement for any OpenAI SDK client (/v1/chat/completions) | LLM Proxy
2  | Anthropic drop-in — native Messages API passthrough (/anthropic/v1/messages); set ANTHROPIC_BASE_URL and go | LLM Proxy
3  | Provider auto-routing by model name (gpt, claude, gemini, grok, mistral, command, ollama, deepseek…) | LLM Proxy
4  | Cursor IDE + Claude Code + Claude Desktop passthrough (zero client changes) | LLM Proxy
5  | Local model support — Ollama, GGUF/ONNX, DeepSeek self-hosted, vLLM, LM Studio | LLM Proxy
6  | In-flight prompt compression — 39-pattern verbose reduction + semantic dedup (~65% ratio) | Caching
7  | 4-Phase MetaCache — intent fingerprint → near-miss → exact key → VectorLite BERT KNN (all-MiniLM-L6-v2, 384-dim, ≥ 0.90) | MetaCache
8  | Per-request cache controls (no-cache, no-store, ttl, namespace) | Caching
9  | Transparent LLM-side prompt cache injection (Anthropic ephemeral, OpenAI prefix caching) | Caching
10 | Virtual keys with spend budgets (daily / weekly / monthly / lifetime) | Key Mgmt
11 | Provider key vault — raw credentials stored server-side, never exposed to clients | Key Mgmt
12 | Fallback chains with per-step retry and exponential backoff | Routing
13 | Latency-based (p95 EMA), tag-based, and cost-based provider routing | Routing
14 | Per-provider daily budget caps with automatic failover | Routing
15 | AI Policy Engine (Maestro) — learning-based guardrails evaluated by AI, not regex | Policy
16 | Policy groups with parent inheritance + tag-wildcard scoping | Policy
17 | Guardrail policy response headers on every request | Policy
18 | MCP HTTP / SSE / STDIO transports | MCP
19 | Per-server tool allow/deny lists + parameter allow-lists | MCP
20 | Semantic tool filtering via embedding index | MCP
21 | Built-in vector stores + RAG pipeline (no external vector DB) | RAG
22 | MCP server aliases + per-alias routing | MCP
23 | OAuth Client Credentials auto-refresh for MCP | MCP Auth
24 | OAuth PKCE per-user browser consent | MCP Auth
25 | Per-request auth header forwarding to MCP servers | MCP Auth
26 | Public internet IP gating per MCP server | MCP
27 | MCP guardrail modes (PreCall / DuringCall / Disabled) | MCP Policy
28 | MCP cost tracking by server / user / tool | MCP Obs
29 | A2A agent gateway (Google A2A open protocol) | A2A
30 | Agent Cards + task streaming via SSE | A2A
31 | Cross-agent tracing (X-A2A-Trace-Id) | A2A
32 | Microsoft Entra ID SSO + zero-touch group sync | Identity
33 | Prometheus /metrics endpoint | Observability
34 | Standardised response headers (cost, trace, cache, latency, provider) | Observability
35 | Slack / Teams / Discord alerting webhooks | Alerting
36 | VAS log audit trail with compliance dashboard + Q&A replay | Compliance
37 | Kubernetes / Helm deployment (cert-manager TLS, NGINX ingress, horizontal scaling) | Deployment