The MetaCache retrieval pipeline has been extended with a fourth phase: full BERT semantic similarity search using sentence-transformers/all-MiniLM-L6-v2 (384-dimensional vectors, local inference) against a K-nearest-neighbours index in Redis Stack.
When Phases 1–3 miss (intent fingerprint, near-miss, exact key), Phase 4 embeds the request text locally and searches all stored response embeddings for cosine similarity ≥0.90. A paraphrased question with different wording but the same meaning returns the cached response — no LLM call, no external API, no additional network hop.
**Key specs:** all-MiniLM-L6-v2, 384 dimensions, cosine similarity ≥ 0.90, Redis Stack vector index, 5–20 ms lookup, local BERT inference with no external vector DB.

Support bots, knowledge-base tools, and internal Q&A systems see dramatically higher cache hit rates. Phase 4 captures the most common real-world cache-miss pattern: same question, different wording. Validated: similarity 0.91–0.95 for genuine paraphrases, correct misses below the threshold.
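The Phase 4 decision itself reduces to one comparison: cosine similarity between the request embedding and a stored embedding, gated at the 0.90 threshold. A minimal sketch (function names are illustrative; the real lookup runs inside the Redis Stack KNN index):

```rust
// Cosine similarity between two embedding vectors (e.g. 384-dim MiniLM output).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Phase 4 hit criterion from the release notes: similarity >= 0.90.
fn is_semantic_hit(request: &[f32], cached: &[f32]) -> bool {
    cosine_similarity(request, cached) >= 0.90
}
```

In production the index performs this comparison across all stored embeddings at once; the sketch only shows the per-pair criterion.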
The helm/smartflow chart is now validated for production use on managed Kubernetes. Tested on DigitalOcean Kubernetes (v1.35) with NGINX Ingress Controller, cert-manager TLS (Let's Encrypt), TimescaleDB StatefulSet, Redis Stack with PVC, and horizontal pod scaling across all services.
New Helm values:
- `proxy.image.pullPolicy: Always`: force a fresh image pull on pod restart
- `compliance.replicas: 3`: horizontal scale for concurrent validation
- `policyPerfect.replicas: 2`: horizontal scale for policy checks
- `proxy.env.RATE_LIMIT_REQUESTS_PER_HOUR`: per-IP rate limit tuning
- `proxy.env.TOKIO_WORKER_THREADS: 16`: async worker pool sizing

Single-command Kubernetes deployment on DigitalOcean, AWS EKS, GKE, and AKS. Automatic TLS, health probes, persistent storage, and service-mesh-ready inter-pod routing out of the box.
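Put together, the new values could be supplied as a `values.yaml` override; the key names come from the release notes, while the surrounding structure and the example rate-limit value are assumptions:

```yaml
# Illustrative values.yaml override for the helm/smartflow chart.
proxy:
  image:
    pullPolicy: Always            # force fresh pull on pod restart
  env:
    RATE_LIMIT_REQUESTS_PER_HOUR: "1000"   # example value; tune per deployment
    TOKIO_WORKER_THREADS: "16"             # async worker pool sizing
compliance:
  replicas: 3                     # horizontal scale for concurrent validation
policyPerfect:
  replicas: 2                     # horizontal scale for policy checks
```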
**Problem:** The key store cache was guarded by a `std::sync::RwLock`. On Linux, RwLock write-lock acquisition blocks all readers while pending. Under concurrent load with an empty Redis key store, every request triggered a refresh: all refresh threads competed for the write lock, blocked all readers, and exhausted the Tokio worker pool, freezing the proxy completely.

**Fix:** Replaced the `RwLock` with a `std::sync::Mutex` for write-fairness. Introduced an `AtomicU64` for lock-free cache-expiry checking, so the hot read path never acquires any lock. Background refresh is debounced with an `AtomicBool`: only one refresh runs at a time, always on a detached `std::thread::spawn`, never on a Tokio worker thread.

**Impact:** The proxy would completely freeze under 20+ concurrent requests when the Redis key store was empty or stale. All Tokio workers blocked; the HTTP server stopped accepting connections.
**Problem:** `determine_compliance_info()` used `tokio::task::block_in_place(|| Handle::current().block_on(...))` with a 60-second timeout. This occupied a Tokio worker thread for up to 60 seconds per request; under concurrent load, all worker threads could be held simultaneously, producing a deadlock.

**Fix:** The function is now an `async fn`. The compliance HTTP call is directly `.await`ed inside `tokio::time::timeout(Duration::from_secs(8), ...)`, so no worker thread is blocked. On timeout, the check fails open: the request passes through with a warning log. All four call sites were updated to `.await`.
**Impact:** Requests with slow compliance-service responses would hold a Tokio worker for 60 seconds. Four concurrent requests with slow compliance responses could exhaust a 4-worker pool entirely.
**Problem:** `vas_log.policies_applied` was an empty string. Splitting on commas produced `vec![""]`, and passing an empty string as a PostgreSQL UUID parameter caused `invalid input syntax for type uuid: ""` errors, returned to the proxy as HTTP 503 compliance failures.
**Fix:** Empty entries are now dropped when the policy list is split (`filter(|s| !s.is_empty())`). The `load_policy()` function in the storage layer also guards against empty IDs as defence-in-depth: `if id.is_empty() { return Ok(None); }`.
**Impact:** Every request from an anonymous (no-policy) user would fail with a 503 compliance error when `POLICY_FAIL_OPEN=false`, blocking all unauthenticated API access.
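The parsing side of the fix is a one-line filter. A sketch (the function name is illustrative; the `filter` predicate is the one quoted above):

```rust
// An empty `policies_applied` string must yield zero policy IDs,
// not vec![""] -- Postgres rejects "" as an invalid UUID.
fn parse_policy_ids(policies_applied: &str) -> Vec<String> {
    policies_applied
        .split(',')
        .filter(|s| !s.is_empty()) // drop the empty fragments split() produces
        .map(str::to_string)
        .collect()
}
```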
| Metric | Bare Metal | K8s (2×s-4vcpu-8gb) |
|---|---|---|
| 20 concurrent cache hits — p50 | 0.14s | 0.57s |
| 20 concurrent cache hits — wall time | 0.22s | 0.59s |
| 15 concurrent semantic variants — p50 | 0.18s | 0.41s |
| HTTP 200 rate | 100% | 100% |
| Errors / deadlocks | 0 | 0 |
| # | Feature / Fix | Area |
|---|---|---|
| 1 | Phase 4 VectorLite BERT semantic KNN cache — all-MiniLM-L6-v2, 384-dim, cosine ≥ 0.90, Redis Stack | Caching |
| 2 | Four-phase MetaCache pipeline (intent fingerprint → near-miss → exact key → VectorLite) | Caching |
| 3 | Default similarity threshold raised 0.85 → 0.90 | Caching |
| 4 | Kubernetes / Helm chart production validation (DigitalOcean, NGINX ingress, cert-manager TLS) | Deployment |
| 5 | Compliance & policy-perfect horizontal scaling (replicas: 3/2) in Helm values | Deployment |
| 6 | Fix: Key store reader deadlock — Mutex + AtomicU64 expiry + background refresh | Bug Fix |
| 7 | Fix: Compliance check async refactor — 8s timeout, zero block_in_place | Bug Fix |
| 8 | Fix: MAESTRO UUID empty-string guard — filter + storage layer defence-in-depth | Bug Fix |