Request flow: Client Request (any SDK / app) → Layer 1: 4-Phase MetaCache (Fingerprint · Key · BERT) → Layer 2: Prompt Compression (~30% token reduction) → Layer 3: Provider Cache Inject (Anthropic · OpenAI) → Provider — reached only on a cache miss, compressed + cached.
Layer 1 — Four-Phase Local Cache (MetaCache)
Every request is checked against four lookup phases in order before any token is sent to a provider.
The first phase to return a match wins — remaining phases are skipped entirely.
A cache hit at any phase means zero provider tokens consumed and zero LLM latency.
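The waterfall itself is simple: try each phase in order and stop at the first hit. A minimal sketch of that control flow, with a toy dictionary standing in for Phase 1's fingerprint store (all helper names here are hypothetical, not Smartflow's actual API):

```python
from typing import Callable, Optional

def lookup(request: str, phases: list) -> Optional[str]:
    """Try each lookup phase in order; the first hit wins and the
    remaining phases are skipped entirely."""
    for phase in phases:
        hit = phase(request)
        if hit is not None:
            return hit
    return None  # full miss: the request goes on to Layers 2 and 3

# Toy in-process store standing in for Phase 1's fingerprint cache
memory_cache = {"what is the speed of light?": "299,792,458 m/s"}

def phase1_exact(req: str) -> Optional[str]:
    # Crude normalisation in place of real intent fingerprinting
    return memory_cache.get(req.strip().lower())

hit = lookup("What is the speed of light?  ", [phase1_exact])
miss = lookup("Summarise this PDF", [phase1_exact])
```

A hit short-circuits before any provider traffic; a miss falls through with `None`.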
Phase 1 — Intent Fingerprint Exact Match
Request normalised and intent-fingerprinted. An exact match returns from in-process memory in <1ms. Covers repeated questions regardless of minor phrasing variation.
Phase 2 — Intent Fingerprint Near-Miss
Token-overlap scoring checks for close-enough intent signatures when the exact fingerprint is absent. Catches reformulations of the same core question.
Phase 3 — SHA-256 Exact Key (Redis)
Reliable fallback for structured payloads, multimodal inputs, and code prompts where intent fingerprinting is less applicable. 1–3ms Redis lookup.
Phase 4 (new) — VectorLite BERT Semantic KNN
Request embedded with sentence-transformers/all-MiniLM-L6-v2 (384-dim, local BERT inference). KNN search against all stored response embeddings in Redis Stack using cosine similarity, threshold 0.90. Catches paraphrased questions — the most common real-world cache-miss pattern. No external API, no Qdrant, no Weaviate.
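Stripped of the Redis Stack machinery, the Phase 4 decision reduces to a nearest-neighbour search plus a threshold check. A minimal sketch, using hypothetical 3-dim vectors in place of MiniLM's 384-dim embeddings:

```python
import numpy as np

THRESHOLD = 0.90   # Phase 4's cosine-similarity cutoff

def normalise(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def best_match(query_vec, stored):
    """Brute-force KNN over (vector, response) pairs — a stand-in for the
    Redis Stack vector search. With L2-normalised vectors, the dot
    product equals cosine similarity."""
    best_sim, best_resp = -1.0, None
    for vec, resp in stored:
        sim = float(np.dot(query_vec, vec))
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    return (best_resp if best_sim >= THRESHOLD else None), best_sim

# Hypothetical stored embedding for one cached response
stored = [(normalise([0.9, 0.1, 0.0]), "cached answer")]
hit, sim = best_match(normalise([0.88, 0.12, 0.01]), stored)
```

A near-duplicate query clears the 0.90 bar and returns the cached response; an unrelated query falls below it and proceeds as a miss.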
Why Phase 4 Changes Everything
"How fast does light travel?" and "What is the speed of light?" are lexically different but semantically identical. Phases 1–3 miss this. Phase 4 catches it at similarity ≥ 0.90 — returning the cached response in 5–20ms instead of a 500ms+ live LLM call. In real-world traffic, Phase 4 represents the largest hit category after warm-up.
Persistent archive (MongoDB) runs asynchronously in parallel with every response — not a lookup phase, adds zero latency.
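The zero-latency claim follows from scheduling the archive write without awaiting it. A sketch of that fire-and-forget pattern with asyncio (a plain list and a short sleep stand in for the MongoDB client):

```python
import asyncio

archive: list = []   # stands in for the MongoDB collection

async def archive_write(record: dict) -> None:
    await asyncio.sleep(0.01)          # simulated MongoDB round-trip
    archive.append(record)

async def handle(prompt: str, response: str) -> str:
    # Schedule the archive write without awaiting it: the response
    # returns immediately, so archiving adds zero client latency.
    asyncio.create_task(archive_write({"prompt": prompt, "response": response}))
    return response

async def main() -> None:
    await handle("q1", "a1")
    # At this point the response has been returned but the archive
    # write is still pending in the background.
    await asyncio.sleep(0.05)          # let the background task finish

asyncio.run(main())
```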
Layer 2 — Prompt Compression (In-Flight)
On every cache miss, outgoing request bodies are compressed before forwarding to the provider.
Completely transparent — clients send and receive standard API payloads unchanged. Compression is invisible to both the client and the provider.
Verbose phrase reduction — 39 common verbose patterns are replaced with concise equivalents automatically. "In order to" → "To". "Due to the fact that" → "Because". Average 20–30% token reduction on prose-heavy system prompts.
Semantic deduplication — repeated concepts across the message history are detected via embedding similarity and replaced with compact references. References are resolved transparently before the response reaches the client.
Model-aware tuning — compression aggressiveness is calibrated per provider. Anthropic models tolerate tighter compression than GPT-3.5-class models. Token savings are tracked per request in VAS logs.
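Phrase reduction is essentially a table of regex substitutions. A sketch with three patterns — the first two come from the examples above, the third ("at this point in time" → "now") is an assumed illustration; the full table of 39 is internal to Smartflow:

```python
import re

# Illustrative subset of the verbose→concise substitution table
PATTERNS = [
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
    (re.compile(r"\bdue to the fact that\b", re.IGNORECASE), "because"),
    (re.compile(r"\bat this point in time\b", re.IGNORECASE), "now"),
]

def compress_phrases(text: str) -> str:
    for pattern, concise in PATTERNS:
        text = pattern.sub(concise, text)
    return text

before = "In order to proceed, retry due to the fact that the call failed."
after = compress_phrases(before)   # 13 words in, 7 words out
```

A production version would also preserve sentence-initial capitalisation after substitution; that is omitted here for brevity.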
65% avg compression ratio (production)
1.25M tokens saved in production
39 verbose patterns auto-optimised
0ms client-visible overhead
Unlocks
Lower provider costs on every cache miss — not just cache hits. A chat application with a 2,000-token system prompt pays for ~700 tokens after compression, on every forwarded request, with no client changes.
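The ~700-token figure follows from reading the 65% production compression ratio as the fraction of tokens removed (an assumption; the per-prompt figure will vary with how prose-heavy the prompt is):

```python
system_prompt_tokens = 2_000
percent_removed = 65   # production-average compression ratio, read as % removed

# Tokens actually forwarded (and billed) per request after compression
forwarded = system_prompt_tokens * (100 - percent_removed) // 100
print(forwarded)   # 700
```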
Layer 3 — Transparent Provider-Side Cache Injection
Anthropic (Claude 3+) caches repeated input prefixes on their infrastructure at ~$3/MTok vs ~$15/MTok for full input — an 80% reduction on cached tokens.
OpenAI caches automatically for prompts over 1,024 tokens.
The problem: clients must opt in by attaching cache markers to requests. Most never do.
Smartflow's prompt cache injector adds the signal transparently on every proxied request — zero client changes required.
Large system messages (≥ 4,000 chars / ~1,024 tokens)
cache_control: {type: ephemeral} injected on every request. Primes the provider cache on first call; all subsequent calls within the 5-minute window read from cache at the reduced rate.
Repetitive medium messages (≥ 2,000 chars, seen ≥ 3 times)
Injected automatically once the repetition threshold is reached via Redis hash tracking (SHA-256, 15-min TTL). Covers apps with shorter but heavily reused system prompts.
Response-side tracking
cache_read_input_tokens (Anthropic) and prompt_tokens_details.cached_tokens (OpenAI) parsed from every provider response. Savings logged in VAS logs with estimated dollar value.
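The injection rules above can be sketched as a single function over the outgoing request body. This is a simplified sketch, not Smartflow's implementation: it assumes an Anthropic-style `system` field, and an in-process dict simulates the Redis hash tracking (SHA-256 key, 15-min TTL):

```python
import hashlib

LARGE_CHARS = 4_000    # always inject above this size
MEDIUM_CHARS = 2_000   # inject once seen >= 3 times
REPEAT_THRESHOLD = 3

seen: dict = {}        # stands in for the Redis hash (15-min TTL not simulated)

def inject_cache_control(body: dict) -> dict:
    """Add Anthropic's cache_control marker to a qualifying system message."""
    system = body.get("system")
    if not isinstance(system, str):
        return body
    digest = hashlib.sha256(system.encode()).hexdigest()
    count = seen[digest] = seen.get(digest, 0) + 1
    if len(system) >= LARGE_CHARS or (
        len(system) >= MEDIUM_CHARS and count >= REPEAT_THRESHOLD
    ):
        # Rewrite the string form into Anthropic's block form with the marker
        body["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    return body
```

A ≥4,000-char system message is marked on its first appearance; a 2,500-char one passes through unmarked twice and is marked on its third repetition.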
Cost Impact
For a high-volume deployment with a 4,000-char system prompt sending 10,000 requests/day to Claude: without Smartflow, every request pays full $15/MTok input. With injection, after the first call each day, the prefix is cached — saving ~80% on that portion. Zero client changes required.
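Back-of-envelope arithmetic for that scenario, assuming ~4 chars/token (so the 4,000-char prefix ≈ 1,000 tokens) and ignoring Anthropic's small cache-write premium and the 5-minute TTL, per the simplified framing above:

```python
requests_per_day = 10_000
prefix_tokens = 1_000
full_rate, cached_rate = 15.0, 3.0   # $/MTok, Anthropic input pricing

# Without injection: every request pays the full input rate on the prefix
without = requests_per_day * prefix_tokens / 1e6 * full_rate

# With injection: the first call primes the cache at the full rate,
# all subsequent calls read the prefix at the cached rate
with_injection = (prefix_tokens / 1e6 * full_rate
                  + (requests_per_day - 1) * prefix_tokens / 1e6 * cached_rate)

savings = without - with_injection   # ~80% of the prefix spend
```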
Combined Savings — Request Scenarios
| Request Type | Layer 1 — MetaCache | Layer 2 — Compression | Layer 3 — Provider Cache | Net Saving |
|---|---|---|---|---|
| Repeated query — exact wording | Phase 1 hit | — | — | ~100% |
| Same intent, different phrasing | Phase 2 hit | — | — | ~100% |
| Structured / multimodal exact repeat | Phase 3 hit (SHA-256) | — | — | ~100% |
| Paraphrased question (≥ 0.90 similarity) | Phase 4 VectorLite BERT hit | — | — | ~100% |
| New query — large system prompt | Miss | 20–30% tokens saved | 60–90% on cached prefix | 70–95% |
| New query — medium repetitive prompt | Miss | 20–30% tokens saved | After 3rd call: 60–80% | Up to 85% |
| New query — novel content | Miss | 20–30% tokens saved | None | 20–30% |
| Typical AI gateway (no Smartflow) | Exact match only | None | None | 0–5% |
What Other Solutions Do
| Capability | Smartflow | LiteLLM | Basic AI Proxy |
|---|---|---|---|
| Exact-match cache | ✓ | ✓ | ✓ |
| Semantic cache | ✓ | Qdrant required | ✗ |
| BERT semantic KNN (local, no ext. DB) | ✓ | ✗ | ✗ |
| In-flight prompt compression | ✓ | ✗ | ✗ |
| Provider-side cache injection | ✓ | ✗ | ✗ |
| Per-request cache controls | ✓ | Partial | ✗ |
| Cost savings tracked & reported | ✓ | ✗ | ✗ |