The MetaCache retrieval pipeline has been extended with a fourth phase: full BERT semantic similarity search using sentence-transformers/all-MiniLM-L6-v2 (384-dimensional vectors, local inference) against a K-nearest-neighbours index in Redis Stack.
When Phases 1–3 miss (intent fingerprint, near-miss, exact key), Phase 4 embeds the request text locally and searches all stored response embeddings for cosine similarity ≥0.90. A paraphrased question with different wording but the same meaning returns the cached response — no LLM call, no external API, no additional network hop.
**Key specs:** all-MiniLM-L6-v2, 384 dimensions, cosine similarity ≥ 0.90, Redis Stack vector index, 5–20 ms lookup, local BERT inference with no external vector DB.

Support bots, knowledge-base tools, and internal Q&A systems see dramatically higher cache hit rates. Phase 4 captures the most common real-world cache-miss pattern: same question, different wording. Validated: similarity 0.91–0.95 for genuine paraphrases, correct misses below the threshold.
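The Phase 4 decision itself reduces to one comparison: cosine similarity between the request embedding and a stored embedding, gated at the 0.90 threshold. A minimal sketch (function names are illustrative; the real lookup runs inside the Redis Stack KNN index):

```rust
// Cosine similarity between two embedding vectors (e.g. 384-dim MiniLM output).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Phase 4 hit criterion from the release notes: similarity >= 0.90.
fn is_semantic_hit(request: &[f32], cached: &[f32]) -> bool {
    cosine_similarity(request, cached) >= 0.90
}
```

In production the index performs this comparison across all stored embeddings at once; the sketch only shows the per-pair criterion.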
The helm/smartflow chart is now validated for production use on managed Kubernetes. Tested on DigitalOcean Kubernetes (v1.35) with NGINX Ingress Controller, cert-manager TLS (Let's Encrypt), TimescaleDB StatefulSet, Redis Stack with PVC, and horizontal pod scaling across all services.
New Helm values:
- `proxy.image.pullPolicy: Always`: force a fresh image pull on pod restart
- `compliance.replicas: 3`: horizontal scale for concurrent validation
- `policyPerfect.replicas: 2`: horizontal scale for policy checks
- `proxy.env.RATE_LIMIT_REQUESTS_PER_HOUR`: per-IP rate limit tuning
- `proxy.env.TOKIO_WORKER_THREADS: 16`: async worker pool sizing

Single-command Kubernetes deployment on DigitalOcean, AWS EKS, GKE, and AKS. Automatic TLS, health probes, persistent storage, and service-mesh-ready inter-pod routing out of the box.
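Put together, the new values could be supplied as a `values.yaml` override; the key names come from the release notes, while the surrounding structure and the example rate-limit value are assumptions:

```yaml
# Illustrative values.yaml override for the helm/smartflow chart.
proxy:
  image:
    pullPolicy: Always            # force fresh pull on pod restart
  env:
    RATE_LIMIT_REQUESTS_PER_HOUR: "1000"   # example value; tune per deployment
    TOKIO_WORKER_THREADS: "16"             # async worker pool sizing
compliance:
  replicas: 3                     # horizontal scale for concurrent validation
policyPerfect:
  replicas: 2                     # horizontal scale for policy checks
```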
**Problem:** The key store cache was guarded by a `std::sync::RwLock`. On Linux, RwLock write-lock acquisition blocks all readers while pending. Under concurrent load with an empty Redis key store, every request triggered a refresh: all refresh threads competed for the write lock, blocked all readers, and exhausted the Tokio worker pool, freezing the proxy completely.

**Fix:** Replaced the `RwLock` with a `std::sync::Mutex` for write-fairness. Introduced an `AtomicU64` for lock-free cache-expiry checking, so the hot read path never acquires any lock. Background refresh is debounced with an `AtomicBool`: only one refresh runs at a time, always on a detached `std::thread::spawn`, never on a Tokio worker thread.

**Impact:** The proxy would completely freeze under 20+ concurrent requests when the Redis key store was empty or stale. All Tokio workers blocked; the HTTP server stopped accepting connections.
**Problem:** `determine_compliance_info()` used `tokio::task::block_in_place(|| Handle::current().block_on(...))` with a 60-second timeout. This occupied a Tokio worker thread for up to 60 seconds per request; under concurrent load, all worker threads could be held simultaneously, producing a deadlock.

**Fix:** The function is now an `async fn`. The compliance HTTP call is directly `.await`ed inside `tokio::time::timeout(Duration::from_secs(8), ...)`, so no worker thread is blocked. On timeout, the check fails open: the request passes through with a warning log. All four call sites were updated to `.await`.
**Impact:** Requests with slow compliance-service responses would hold a Tokio worker for 60 seconds. Four concurrent requests with slow compliance responses could exhaust a 4-worker pool entirely.
**Problem:** `vas_log.policies_applied` was an empty string. Splitting on commas produced `vec![""]`, and passing an empty string as a PostgreSQL UUID parameter caused `invalid input syntax for type uuid: ""` errors, returned to the proxy as HTTP 503 compliance failures.
**Fix:** Empty entries are now dropped when the policy list is split (`filter(|s| !s.is_empty())`). The `load_policy()` function in the storage layer also guards against empty IDs as defence-in-depth: `if id.is_empty() { return Ok(None); }`.
**Impact:** Every request from an anonymous (no-policy) user would fail with a 503 compliance error when `POLICY_FAIL_OPEN=false`, blocking all unauthenticated API access.
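The parsing side of the fix is a one-line filter. A sketch (the function name is illustrative; the `filter` predicate is the one quoted above):

```rust
// An empty `policies_applied` string must yield zero policy IDs,
// not vec![""] -- Postgres rejects "" as an invalid UUID.
fn parse_policy_ids(policies_applied: &str) -> Vec<String> {
    policies_applied
        .split(',')
        .filter(|s| !s.is_empty()) // drop the empty fragments split() produces
        .map(str::to_string)
        .collect()
}
```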
| Metric | Bare Metal | K8s (2×s-4vcpu-8gb) |
|---|---|---|
| 20 concurrent cache hits — p50 | 0.14s | 0.57s |
| 20 concurrent cache hits — wall time | 0.22s | 0.59s |
| 15 concurrent semantic variants — p50 | 0.18s | 0.41s |
| HTTP 200 rate | 100% | 100% |
| Errors / deadlocks | 0 | 0 |
| # | Feature / Fix | Area |
|---|---|---|
| 1 | Phase 4 VectorLite BERT semantic KNN cache — all-MiniLM-L6-v2, 384-dim, cosine ≥ 0.90, Redis Stack | Caching |
| 2 | Four-phase MetaCache pipeline (intent fingerprint → near-miss → exact key → VectorLite) | Caching |
| 3 | Default similarity threshold raised 0.85 → 0.90 | Caching |
| 4 | Kubernetes / Helm chart production validation (DigitalOcean, NGINX ingress, cert-manager TLS) | Deployment |
| 5 | Compliance & policy-perfect horizontal scaling (replicas: 3/2) in Helm values | Deployment |
| 6 | Fix: Key store reader deadlock — Mutex + AtomicU64 expiry + background refresh | Bug Fix |
| 7 | Fix: Compliance check async refactor — 8s timeout, zero block_in_place | Bug Fix |
| 8 | Fix: MAESTRO UUID empty-string guard — filter + storage layer defence-in-depth | Bug Fix |