Request flow: Client Request (any SDK / app) → Layer 1: 4-Phase MetaCache (Fingerprint · Key · BERT) → Layer 2: Prompt Compression (~30% token reduction) → Layer 3: Provider Cache Inject (Anthropic · OpenAI) → Provider — reached only on a cache miss, compressed + cached.
Layer 1 — Four-Phase Local Cache (MetaCache)
Every request is checked against four lookup phases in order before any token is sent to a provider.
The first phase to return a match wins — remaining phases are skipped entirely.
A cache hit at any phase means zero provider tokens consumed and zero LLM latency.
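The waterfall itself is simple: try each phase in order and stop at the first hit. A minimal sketch of that control flow, with a toy dictionary standing in for Phase 1's fingerprint store (all helper names here are hypothetical, not Smartflow's actual API):

```python
from typing import Callable, Optional

def lookup(request: str, phases: list) -> Optional[str]:
    """Try each lookup phase in order; the first hit wins and the
    remaining phases are skipped entirely."""
    for phase in phases:
        hit = phase(request)
        if hit is not None:
            return hit
    return None  # full miss: the request goes on to Layers 2 and 3

# Toy in-process store standing in for Phase 1's fingerprint cache
memory_cache = {"what is the speed of light?": "299,792,458 m/s"}

def phase1_exact(req: str) -> Optional[str]:
    # Crude normalisation in place of real intent fingerprinting
    return memory_cache.get(req.strip().lower())

hit = lookup("What is the speed of light?  ", [phase1_exact])
miss = lookup("Summarise this PDF", [phase1_exact])
```

A hit short-circuits before any provider traffic; a miss falls through with `None`.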
Phase 1 — Intent Fingerprint Exact Match
Request normalised and intent-fingerprinted. An exact match returns from in-process memory in <1ms. Covers repeated questions regardless of minor phrasing variation.
Phase 2 — Intent Fingerprint Near-Miss
Token-overlap scoring checks for close-enough intent signatures when the exact fingerprint is absent. Catches reformulations of the same core question.
Phase 3 — SHA-256 Exact Key (Redis)
Reliable fallback for structured payloads, multimodal inputs, and code prompts where intent fingerprinting is less applicable. 1–3ms Redis lookup.
Phase 4 (new) — VectorLite BERT Semantic KNN
Request embedded with sentence-transformers/all-MiniLM-L6-v2 (384-dim, local BERT inference). KNN search against all stored response embeddings in Redis Stack using cosine similarity, threshold 0.90. Catches paraphrased questions — the most common real-world cache-miss pattern. No external API, no Qdrant, no Weaviate.
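Stripped of the Redis Stack machinery, the Phase 4 decision reduces to a nearest-neighbour search plus a threshold check. A minimal sketch, using hypothetical 3-dim vectors in place of MiniLM's 384-dim embeddings:

```python
import numpy as np

THRESHOLD = 0.90   # Phase 4's cosine-similarity cutoff

def normalise(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def best_match(query_vec, stored):
    """Brute-force KNN over (vector, response) pairs — a stand-in for the
    Redis Stack vector search. With L2-normalised vectors, the dot
    product equals cosine similarity."""
    best_sim, best_resp = -1.0, None
    for vec, resp in stored:
        sim = float(np.dot(query_vec, vec))
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    return (best_resp if best_sim >= THRESHOLD else None), best_sim

# Hypothetical stored embedding for one cached response
stored = [(normalise([0.9, 0.1, 0.0]), "cached answer")]
hit, sim = best_match(normalise([0.88, 0.12, 0.01]), stored)
```

A near-duplicate query clears the 0.90 bar and returns the cached response; an unrelated query falls below it and proceeds as a miss.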
Why Phase 4 Changes Everything
"How fast does light travel?" and "What is the speed of light?" are lexically different but semantically identical. Phases 1–3 miss this. Phase 4 catches it at similarity ≥ 0.90 — returning the cached response in 5–20ms instead of a 500ms+ live LLM call. In real-world traffic, Phase 4 represents the largest hit category after warm-up.
Persistent archive (MongoDB) runs asynchronously in parallel with every response — not a lookup phase, adds zero latency.
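The zero-latency claim follows from scheduling the archive write without awaiting it. A sketch of that fire-and-forget pattern with asyncio (a plain list and a short sleep stand in for the MongoDB client):

```python
import asyncio

archive: list = []   # stands in for the MongoDB collection

async def archive_write(record: dict) -> None:
    await asyncio.sleep(0.01)          # simulated MongoDB round-trip
    archive.append(record)

async def handle(prompt: str, response: str) -> str:
    # Schedule the archive write without awaiting it: the response
    # returns immediately, so archiving adds zero client latency.
    asyncio.create_task(archive_write({"prompt": prompt, "response": response}))
    return response

async def main() -> None:
    await handle("q1", "a1")
    # At this point the response has been returned but the archive
    # write is still pending in the background.
    await asyncio.sleep(0.05)          # let the background task finish

asyncio.run(main())
```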
Layer 2 — Prompt Compression (In-Flight)
On every cache miss, outgoing request bodies are compressed before forwarding to the provider.
Completely transparent — clients send and receive standard API payloads unchanged. Compression is invisible to both the client and the provider.
Verbose phrase reduction — 39 common verbose patterns are replaced with concise equivalents automatically. "In order to" → "To". "Due to the fact that" → "Because". Average 20–30% token reduction on prose-heavy system prompts.
Semantic deduplication — repeated concepts across the message history are detected via embedding similarity and replaced with compact references. References are resolved transparently before the response reaches the client.
Model-aware tuning — compression aggressiveness is calibrated per provider. Anthropic models tolerate tighter compression than GPT-3.5-class models. Token savings are tracked per request in VAS logs.
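Phrase reduction is essentially a table of regex substitutions. A sketch with three patterns — the first two come from the examples above, the third ("at this point in time" → "now") is an assumed illustration; the full table of 39 is internal to Smartflow:

```python
import re

# Illustrative subset of the verbose→concise substitution table
PATTERNS = [
    (re.compile(r"\bin order to\b", re.IGNORECASE), "to"),
    (re.compile(r"\bdue to the fact that\b", re.IGNORECASE), "because"),
    (re.compile(r"\bat this point in time\b", re.IGNORECASE), "now"),
]

def compress_phrases(text: str) -> str:
    for pattern, concise in PATTERNS:
        text = pattern.sub(concise, text)
    return text

before = "In order to proceed, retry due to the fact that the call failed."
after = compress_phrases(before)   # 13 words in, 7 words out
```

A production version would also preserve sentence-initial capitalisation after substitution; that is omitted here for brevity.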
65% avg compression ratio (production)
1.25M tokens saved in production
39 verbose patterns auto-optimised
0ms client-visible overhead
Unlocks
Lower provider costs on every cache miss — not just cache hits. A chat application with a 2,000-token system prompt pays for ~700 tokens after compression, on every forwarded request, with no client changes.
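The ~700-token figure follows from reading the 65% production compression ratio as the fraction of tokens removed (an assumption; the per-prompt figure will vary with how prose-heavy the prompt is):

```python
system_prompt_tokens = 2_000
percent_removed = 65   # production-average compression ratio, read as % removed

# Tokens actually forwarded (and billed) per request after compression
forwarded = system_prompt_tokens * (100 - percent_removed) // 100
print(forwarded)   # 700
```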
Layer 3 — Transparent Provider-Side Cache Injection
Anthropic (Claude 3+) caches repeated input prefixes on their infrastructure at ~$3/MTok vs ~$15/MTok for full input — an 80% reduction on cached tokens.
OpenAI caches automatically for prompts over 1,024 tokens.
The problem: clients must opt in by attaching cache markers to requests. Most never do.
Smartflow's prompt cache injector adds the signal transparently on every proxied request — zero client changes required.
Large system messages (≥ 4,000 chars / ~1,024 tokens)
cache_control: {type: ephemeral} injected on every request. Primes the provider cache on first call; all subsequent calls within the 5-minute window read from cache at the reduced rate.
Repetitive medium messages (≥ 2,000 chars, seen ≥ 3 times)
Injected automatically once the repetition threshold is reached via Redis hash tracking (SHA-256, 15-min TTL). Covers apps with shorter but heavily reused system prompts.
Response-side tracking
cache_read_input_tokens (Anthropic) and prompt_tokens_details.cached_tokens (OpenAI) parsed from every provider response. Savings logged in VAS logs with estimated dollar value.
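The injection rules above can be sketched as a single function over the outgoing request body. This is a simplified sketch, not Smartflow's implementation: it assumes an Anthropic-style `system` field, and an in-process dict simulates the Redis hash tracking (SHA-256 key, 15-min TTL):

```python
import hashlib

LARGE_CHARS = 4_000    # always inject above this size
MEDIUM_CHARS = 2_000   # inject once seen >= 3 times
REPEAT_THRESHOLD = 3

seen: dict = {}        # stands in for the Redis hash (15-min TTL not simulated)

def inject_cache_control(body: dict) -> dict:
    """Add Anthropic's cache_control marker to a qualifying system message."""
    system = body.get("system")
    if not isinstance(system, str):
        return body
    digest = hashlib.sha256(system.encode()).hexdigest()
    count = seen[digest] = seen.get(digest, 0) + 1
    if len(system) >= LARGE_CHARS or (
        len(system) >= MEDIUM_CHARS and count >= REPEAT_THRESHOLD
    ):
        # Rewrite the string form into Anthropic's block form with the marker
        body["system"] = [{
            "type": "text",
            "text": system,
            "cache_control": {"type": "ephemeral"},
        }]
    return body
```

A ≥4,000-char system message is marked on its first appearance; a 2,500-char one passes through unmarked twice and is marked on its third repetition.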
Cost Impact
For a high-volume deployment with a 4,000-char system prompt sending 10,000 requests/day to Claude: without Smartflow, every request pays full $15/MTok input. With injection, after the first call each day, the prefix is cached — saving ~80% on that portion. Zero client changes required.
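Back-of-envelope arithmetic for that scenario, assuming ~4 chars/token (so the 4,000-char prefix ≈ 1,000 tokens) and ignoring Anthropic's small cache-write premium and the 5-minute TTL, per the simplified framing above:

```python
requests_per_day = 10_000
prefix_tokens = 1_000
full_rate, cached_rate = 15.0, 3.0   # $/MTok, Anthropic input pricing

# Without injection: every request pays the full input rate on the prefix
without = requests_per_day * prefix_tokens / 1e6 * full_rate

# With injection: the first call primes the cache at the full rate,
# all subsequent calls read the prefix at the cached rate
with_injection = (prefix_tokens / 1e6 * full_rate
                  + (requests_per_day - 1) * prefix_tokens / 1e6 * cached_rate)

savings = without - with_injection   # ~80% of the prefix spend
```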
Combined Savings — Request Scenarios
| Request Type | Layer 1 — MetaCache | Layer 2 — Compression | Layer 3 — Provider Cache | Net Saving |
|---|---|---|---|---|
| Repeated query — exact wording | Phase 1 hit | — | — | ~100% |
| Same intent, different phrasing | Phase 2 hit | — | — | ~100% |
| Structured / multimodal exact repeat | Phase 3 hit (SHA-256) | — | — | ~100% |
| Paraphrased question (≥ 0.90 similarity) | Phase 4 VectorLite BERT hit | — | — | ~100% |
| New query — large system prompt | Miss | 20–30% tokens saved | 60–90% on cached prefix | 70–95% |
| New query — medium repetitive prompt | Miss | 20–30% tokens saved | After 3rd call: 60–80% | Up to 85% |
| New query — novel content | Miss | 20–30% tokens saved | None | 20–30% |
| Typical AI gateway (no Smartflow) | Exact match only | None | None | 0–5% |
What Other Solutions Do
| Capability | Smartflow | LiteLLM | Basic AI Proxy |
|---|---|---|---|
| Exact-match cache | ✓ | ✓ | ✓ |
| Semantic cache | ✓ | Qdrant required | ✗ |
| BERT semantic KNN (local, no ext. DB) | ✓ | ✗ | ✗ |
| In-flight prompt compression | ✓ | ✗ | ✗ |
| Provider-side cache injection | ✓ | ✗ | ✗ |
| Per-request cache controls | ✓ | Partial | ✗ |
| Cost savings tracked & reported | ✓ | ✗ | ✗ |