The enterprise AI gateway that speaks LLM, MCP, and A2A — unified control plane with a policy engine that learns, a semantic cache that thinks, and observability that tells you exactly what happened and what it cost.
Semantic similarity caching using embeddings. Not exact-match — conceptually equivalent prior answers are served from cache, collapsing redundant LLM calls across rephrased queries.
Guardrail decisions made by an AI reading your actual compliance policies — not a keyword blocklist. Thresholds adapt over time from your organisation's real compliance outcomes.
One gateway. One audit trail. One policy engine. All three protocols share identity, budget enforcement, compliance logging, and semantic caching infrastructure.
`POST /v1/chat/completions` works with zero code changes. Streaming (`text/event-stream`), function/tool calling, vision, and extended context are all supported.
```python
from openai import OpenAI

client = OpenAI(base_url="https://your-smartflow/v1", api_key="sk-sf-...")
```
Every existing OpenAI SDK integration gains audit logging, policy enforcement, cost tracking, and 4-phase semantic caching without a single code change.
Set `ANTHROPIC_BASE_URL` to your Smartflow instance and use the Anthropic Python/TypeScript SDK unchanged. Requests to `POST /anthropic/v1/messages` are forwarded natively to `api.anthropic.com/v1/messages` with all headers preserved — `x-api-key`, `anthropic-version`, `anthropic-beta`. All Smartflow features apply transparently.
```python
from anthropic import Anthropic

client = Anthropic(base_url="https://your-smartflow/anthropic", api_key="sk-sf-...")
```
Set `ANTHROPIC_BASE_URL`; nothing else changes. The `[1m]` extended-context suffix Claude Code appends to model names is stripped automatically before forwarding.
Every Anthropic SDK integration — Claude Code, Claude Desktop, custom agents — gets centralised audit logging, policy enforcement, spend budgeting, and semantic caching with no client changes. This is the same zero-change story as the OpenAI drop-in, applied to the native Anthropic Messages API.
Model names route automatically to the right provider:

- `gpt-*`, `o1/o3/o4-*`, `dall-e-*`, `whisper-*` → OpenAI
- `claude-*` → Anthropic
- `gemini-*` → Google
- `grok-*` → xAI
- `mistral-*`, `mixtral-*` → Mistral
- `command-*` → Cohere
- `ollama/*` → local Ollama instance
- `deepseek/*`, `groq/*`, `openrouter/*` → respective providers

Provider prefixes also work: send `model: "anthropic/claude-sonnet-4-6"` to `/v1/chat/completions` and Smartflow strips the prefix and routes to Anthropic correctly.
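A minimal sketch of this prefix-based dispatch, assuming a hypothetical in-process routing table (Smartflow's real table, provider names, and resolution order may differ):

```python
# Illustrative routing table only; not Smartflow's actual configuration.
ROUTES = {
    "gpt-": "openai", "o1-": "openai", "dall-e-": "openai", "whisper-": "openai",
    "claude-": "anthropic", "gemini-": "google", "grok-": "xai",
    "mistral-": "mistral", "mixtral-": "mistral", "command-": "cohere",
}

def route(model: str) -> tuple[str, str]:
    """Return (provider, model); an explicit 'provider/model' prefix wins."""
    if "/" in model:
        provider, bare = model.split("/", 1)
        return provider, bare
    for prefix, provider in ROUTES.items():
        if model.startswith(prefix):
            return provider, model
    raise ValueError(f"no provider for model {model!r}")
```

The explicit `provider/model` form is checked first, so the stripped prefix always overrides name-pattern matching.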
Point `OLLAMA_BASE_URL` at your Ollama instance, use any model tag, zero API key required; self-hosted DeepSeek models are reached via the `deepseek/*` model prefix on `/v1/chat/completions`. Ideal for air-gapped or privacy-first deployments where all traffic stays on-premises. Organisations with on-prem GPU infrastructure route sensitive workloads to local models while lower-sensitivity traffic routes to cloud providers — all through one proxy with unified policy enforcement.
Lower provider costs on every cache miss, not just cache hits. A chat application with a 2,000-token system prompt pays for ~700 tokens after compression, on every forwarded request, with no client changes.
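As an illustration of the idea only (these are stand-in patterns, not Smartflow's actual 39-pattern set), a verbose-phrase reducer can be sketched with a few regex rules:

```python
import re

# Stand-in verbose-phrase reductions; the real pattern set is larger.
VERBOSE_PATTERNS = [
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\bdue to the fact that\b", re.I), "because"),
    (re.compile(r"\bat this point in time\b", re.I), "now"),
    (re.compile(r"\s{2,}"), " "),  # collapse runs of whitespace last
]

def compress_prompt(text: str) -> str:
    """Apply verbose-phrase reductions before forwarding upstream."""
    for pattern, replacement in VERBOSE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Because the rewrite happens in-flight at the gateway, every forwarded request benefits, whether or not it later hits the cache.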
Virtual keys (`sk-sf-...`) carry hard spend limits scoped to teams, models, or use cases. Budget periods: daily, weekly, monthly, lifetime. Budget is checked before every request — if exceeded, the request returns `429` before any provider cost is incurred. Spend is recorded after each response using actual token cost.
Per-user, per-team, and per-application cost caps with zero-tolerance enforcement. Issue keys to internal users or external partners with guaranteed spend control.
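The check-before, record-after flow can be sketched as follows; the `VirtualKey` model and function names here are hypothetical stand-ins for Smartflow's Redis-backed accounting:

```python
from dataclasses import dataclass

# Hypothetical in-memory model; Smartflow tracks spend in Redis
# and returns HTTP 429 from the proxy itself.
@dataclass
class VirtualKey:
    key: str
    budget_usd: float      # hard cap for the current period
    spent_usd: float = 0.0

def check_budget(vk: VirtualKey) -> None:
    """Pre-flight check: reject before any provider cost is incurred."""
    if vk.spent_usd >= vk.budget_usd:
        raise PermissionError(f"429: budget exceeded for key {vk.key}")

def record_spend(vk: VirtualKey, cost_usd: float) -> None:
    """Post-response accounting using actual token cost."""
    vk.spent_usd += cost_usd
```

The ordering is the point: the budget gate runs before the provider call, so an exhausted key never generates upstream cost.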
Phase 4 embeds each query with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, local inference), then runs a K-nearest-neighbours search against all stored response embeddings in Redis Stack using cosine similarity (threshold: 0.90). This catches paraphrased questions — the most common real-world pattern missed by all prior phases. A user asking "what are the side effects of ibuprofen?" and another asking "can ibuprofen cause stomach problems?" resolve to the same cached response at Phase 4 (similarity ≈ 0.92). Exact-match caches and fingerprint caches miss this entirely. With real-world query traffic, Phase 4 hits represent the largest portion of cache savings after initial warm-up.
Dramatically lower provider costs on repeated-topic workloads — support bots, internal Q&A, knowledge base tools. Phase 4 hit latency is 5–20ms (local BERT + Redis KNN), versus hundreds of milliseconds for a live LLM call. No application changes required.
Per-request cache controls:

- `Cache-Control: no-cache` — bypass cache read
- `Cache-Control: no-store` — bypass cache write
- `x-smartflow-cache-ttl: 3600` — override TTL
- `x-smartflow-cache-namespace: team-a` — scope to a logical partition

Responses include `x-smartflow-cache-hit: true` and `x-smartflow-cache-key` for client-side correlation.
Mix cacheable and non-cacheable calls in the same integration. Real-time lookups that must never be stale coexist with deterministic queries that benefit from caching — controlled per-request, not per-route.
Static keyword blocklists block "kill process" in a DevOps context but miss subtle policy violations in legal or medical text. Maestro reads the policy the same way a compliance officer would and makes contextual judgements. It gets better as your team reviews its decisions.
Compliance enforcement that improves over time. Zero false positives from keyword collisions. A policy engine that matches the nuance of your actual compliance requirements rather than a generic vendor baseline.
Guardrails (`pii`, `toxicity`, `prompt_injection`, `compliance`, `custom`) are grouped into named policies. Policies inherit from a parent and override specific guardrails — no duplication. Policies attach to scopes: team, virtual-key alias, model pattern, or tag wildcard (e.g. `hipaa-*`). Every proxied request returns headers listing which policies matched and why.
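Parent inheritance with child overrides can be sketched as a dictionary merge up the parent chain; the policy records below are hypothetical examples, not Smartflow's actual schema:

```python
# Hypothetical policy store: each policy names a parent and the
# guardrails it sets or overrides.
POLICIES = {
    "base": {"parent": None,
             "guardrails": {"pii": "block", "toxicity": "warn"}},
    "hipaa-strict": {"parent": "base",
                     "guardrails": {"toxicity": "block", "compliance": "block"}},
}

def resolve(name: str) -> dict:
    """Walk the parent chain; child guardrails override inherited ones."""
    policy = POLICIES[name]
    inherited = resolve(policy["parent"]) if policy["parent"] else {}
    return {**inherited, **policy["guardrails"]}
```

`hipaa-strict` thus inherits `pii: block` from `base` while tightening `toxicity` and adding `compliance`, without restating the whole parent policy.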
Real-time event-driven MCP servers, local CLI tools, and standard community servers (GitHub CLI, filesystem access, local databases) can all be registered and used through the same gateway.
Each registered MCP server supports `allowed_tools` (whitelist), `disallowed_tools` (blacklist, overrides whitelist), and `allowed_params` (per-tool parameter allow-lists). Requests that call disallowed tools or pass disallowed parameters are rejected at the gateway before reaching the MCP server. Setting `available_on_public_internet: false` blocks requests from non-RFC-1918 IPs entirely.
Fine-grained least-privilege enforcement for every MCP server — no trust required at the MCP server level. Public-facing deployments with private MCP infrastructure are safely isolated.
Sending `x-mcp-query: summarize a PDF` with a `tools/list` request returns only semantically relevant tools — not the full catalogue. Agents discover capabilities by intent, not by knowing server names.
Per-request credentials travel in `x-mcp-{alias}-{header-name}` headers. Smartflow extracts and forwards them only to the intended server and strips them before forwarding to the end LLM. User-specific or request-specific credentials (session tokens, scoped API keys) can be forwarded to MCP servers without storing them centrally. No credential leakage between servers.
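The extract-and-strip step can be sketched as follows (a hypothetical helper, using the `x-mcp-{alias}-{header-name}` convention above):

```python
def route_mcp_headers(headers: dict[str, str], alias: str) -> tuple[dict, dict]:
    """Split per-server credentials out of the incoming request headers.

    Returns (for_server, for_llm): credential headers for the aliased MCP
    server, and the remaining headers with all x-mcp-* entries stripped
    so nothing credential-bearing reaches the LLM provider.
    """
    prefix = f"x-mcp-{alias}-"
    for_server = {k[len(prefix):]: v for k, v in headers.items()
                  if k.lower().startswith(prefix)}
    for_llm = {k: v for k, v in headers.items()
               if not k.lower().startswith("x-mcp-")}
    return for_server, for_llm
```

Headers addressed to other aliases are dropped from both outputs, which is what prevents credential leakage between servers.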
Compliance coverage for both inbound tool requests and outbound tool responses — catching violations at both ends of the MCP call without requiring changes to the MCP server itself.
Cross-framework agent collaboration. LangGraph agents talk to Pydantic AI agents through Smartflow. Task chains span services with full traceability via X-A2A-Trace-Id. Task history is persisted in Redis for replay and audit.
High-availability LLM routing with no single point of failure. Multi-provider redundancy configurable per model or use case without application-level changes.
Routing strategies:

- Latency-based (`SMARTFLOW_ROUTING_STRATEGY=latency`) — rolling p95 EMA tracked per provider in Redis; requests route to the fastest live provider.
- Tag-based (`strategy=tag`) — the `x-smartflow-tags` header is matched against per-provider capability tags in Redis.
- Budget-capped (`SMARTFLOW_PROVIDER_BUDGETS=openai:100,anthropic:50`) — a provider is skipped when its daily spend cap is reached; the fallback chain takes over automatically.

On Entra ID sign-in, Smartflow validates the `id_token`, extracts Entra group memberships and App Role claims, and automatically creates or updates Smartflow teams in Redis. Users are added to teams they belong to and removed from teams they have left. App Role values map to internal roles: `proxy_admin`, `org_admin`, `proxy_admin_viewer`, `internal_user`. Access controls, budgets, and guardrail policies attached to teams take effect immediately when membership changes in Entra.
Zero-touch team provisioning from Entra ID. No manual group-to-team mapping. Spend limits and compliance policies follow group membership automatically.
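The latency-based routing strategy described earlier can be sketched with a per-provider exponential moving average (a stand-in for the rolling p95 Smartflow tracks in Redis; `ALPHA` is an assumed smoothing factor):

```python
# Per-provider EMA of observed latency; a simplified stand-in for
# the rolling p95 Smartflow keeps in Redis.
ALPHA = 0.2  # assumed smoothing factor

latency_ema: dict[str, float] = {}

def observe(provider: str, latency_ms: float) -> None:
    """Fold a new latency sample into the provider's moving average."""
    prev = latency_ema.get(provider, latency_ms)
    latency_ema[provider] = (1 - ALPHA) * prev + ALPHA * latency_ms

def pick_provider(candidates: list[str]) -> str:
    """Route to the fastest provider seen so far (unseen providers try first)."""
    return min(candidates, key=lambda p: latency_ema.get(p, 0.0))
```

Each completed response feeds `observe`, so the routing decision continuously tracks real provider performance instead of a static preference list.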
GET /metrics exposes text-format Prometheus metrics: per-provider daily spend, per-provider rolling p95 latency, MCP call counts and costs by server, vector store count, and version info. Scrape directly into any Prometheus + Grafana stack.
Every response carries standardised headers:

- `x-smartflow-call-id` — unique trace ID
- `x-smartflow-response-cost` — USD cost
- `x-smartflow-cache-hit` — true/false
- `x-smartflow-duration-ms` — end-to-end latency
- `x-smartflow-provider` — which provider served the response

Alerting is configured via `SLACK_WEBHOOK_URL`, `TEAMS_WEBHOOK_URL`, and `DISCORD_WEBHOOK_URL`. Alerts are non-blocking and do not add latency to the request path.
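A hypothetical client-side helper showing how these headers support cost correlation without touching the response body:

```python
def summarise(headers: dict[str, str]) -> str:
    """One-line summary of a call from Smartflow's standardised headers."""
    cost = float(headers.get("x-smartflow-response-cost", "0"))
    cached = headers.get("x-smartflow-cache-hit") == "true"
    provider = headers.get("x-smartflow-provider", "unknown")
    call_id = headers.get("x-smartflow-call-id", "-")
    line = f"[{call_id}] {provider} ${cost:.4f}"
    return line + (" (cache hit)" if cached else "")
```

The same `x-smartflow-call-id` can be logged client-side to join application logs with the gateway's audit trail.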
Complete per-user, per-request audit trail for compliance reporting. Every LLM call, every MCP tool invocation, and every A2A task is logged with the identity of who triggered it and which policies applied.
| # | Feature | Area |
|---|---|---|
| 1 | OpenAI drop-in — zero-change replacement for any OpenAI SDK client (`/v1/chat/completions`) | LLM Proxy |
| 2 | Anthropic drop-in — native Messages API passthrough (`/anthropic/v1/messages`); set `ANTHROPIC_BASE_URL` and go | LLM Proxy |
| 3 | Provider auto-routing by model name (gpt, claude, gemini, grok, mistral, command, ollama, deepseek…) | LLM Proxy |
| 4 | Cursor IDE + Claude Code + Claude Desktop passthrough (zero client changes) | LLM Proxy |
| 5 | Local model support — Ollama, GGUF/ONNX, DeepSeek self-hosted, vLLM, LM Studio | LLM Proxy |
| 6 | In-flight prompt compression — 39-pattern verbose reduction + semantic dedup (~65% ratio) | Caching |
| 7 | 4-Phase MetaCache — intent fingerprint → near-miss → exact key → VectorLite BERT KNN (`all-MiniLM-L6-v2`, 384-dim, ≥ 0.90) | MetaCache |
| 8 | Per-request cache controls (no-cache, no-store, ttl, namespace) | Caching |
| 9 | Transparent LLM-side prompt cache injection (Anthropic ephemeral, OpenAI prefix caching) | Caching |
| 10 | Virtual keys with spend budgets (daily / weekly / monthly / lifetime) | Key Mgmt |
| 11 | Provider key vault — raw credentials stored server-side, never exposed to clients | Key Mgmt |
| 12 | Fallback chains with per-step retry and exponential backoff | Routing |
| 13 | Latency-based (p95 EMA), tag-based, and cost-based provider routing | Routing |
| 14 | Per-provider daily budget caps with automatic failover | Routing |
| 15 | AI Policy Engine (Maestro) — learning-based guardrails evaluated by AI, not regex | Policy |
| 16 | Policy groups with parent inheritance + tag-wildcard scoping | Policy |
| 17 | Guardrail policy response headers on every request | Policy |
| 18 | MCP HTTP / SSE / STDIO transports | MCP |
| 19 | Per-server tool allow/deny lists + parameter allow-lists | MCP |
| 20 | Semantic tool filtering via embedding index | MCP |
| 21 | Built-in vector stores + RAG pipeline (no external vector DB) | RAG |
| 22 | MCP server aliases + per-alias routing | MCP |
| 23 | OAuth Client Credentials auto-refresh for MCP | MCP Auth |
| 24 | OAuth PKCE per-user browser consent | MCP Auth |
| 25 | Per-request auth header forwarding to MCP servers | MCP Auth |
| 26 | Public internet IP gating per MCP server | MCP |
| 27 | MCP guardrail modes (PreCall / DuringCall / Disabled) | MCP Policy |
| 28 | MCP cost tracking by server / user / tool | MCP Obs |
| 29 | A2A agent gateway (Google A2A open protocol) | A2A |
| 30 | Agent Cards + task streaming via SSE | A2A |
| 31 | Cross-agent tracing (X-A2A-Trace-Id) | A2A |
| 32 | Microsoft Entra ID SSO + zero-touch group sync | Identity |
| 33 | Prometheus /metrics endpoint | Observability |
| 34 | Standardised response headers (cost, trace, cache, latency, provider) | Observability |
| 35 | Slack / Teams / Discord alerting webhooks | Alerting |
| 36 | VAS log audit trail with compliance dashboard + Q&A replay | Compliance |
| 37 | Kubernetes / Helm deployment (cert-manager TLS, NGINX ingress, horizontal scaling) | Deployment |