Production-hardened configuration for Smartflow on Docker Compose and Kubernetes — covering resource tuning, ingress optimization, TLS automation, performance configuration, and a complete operational runbook.
| Deployment Model | Best For | HA / Scaling | Latency Profile | Complexity |
|---|---|---|---|---|
| Docker Compose (bare metal) | Single-tenant, pilot, dev | Manual / none | Lowest — direct localhost routing | Low |
| Docker Compose (VM) | Small teams, fixed workload | Manual restart only | Low — Caddy reverse proxy | Low |
| Kubernetes / Helm | Enterprise, multi-tenant, auto-scale | Full — HPA, rolling upgrades | +2–5ms overlay (tunable to ~+1ms) | Medium |
The canonical compose file for Smartflow is `docker-compose.simple.yaml` in the Smartflow_docker repo. All four services share a single Docker image (`langsmartai/safechat-enterprise:latest`); the `SERVICE_TYPE` environment variable selects which binary runs.
| Service | Port | Role |
|---|---|---|
| smartflow-proxy | 7775 | Main LLM gateway. Handles all `/v1/*`, `/anthropic/*`, `/cursor/*`, and MCP routes. |
| smartflow-api-server | 7778 | Management plane — virtual keys, guardrails, routing policies, VAS audit logs, analytics. |
| smartflow-compliance | 7777 | ML-powered compliance engine — intelligent scanning, learning feedback, org baselines. |
| smartflow-policy-perfect | 7782 | Policy evaluation service for guardrail rule sets and advanced decision trees. |
| Chat UI | 3600 | SafeChat Enterprise frontend — served by the chat container. |
Always set `mem_limit` and `cpus` constraints in Compose. Without them a runaway request loop (e.g. a streaming response that never terminates) can exhaust host memory and take down all services.
```yaml
# docker-compose.simple.yaml — recommended resource constraints
services:
  smartflow-proxy:
    mem_limit: 1g
    memswap_limit: 1g   # disable swap for this container
    cpus: 2.0           # allow up to 2 cores; adjust for your host
    restart: unless-stopped
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
  smartflow-api-server:
    mem_limit: 512m
    cpus: 1.0
    restart: unless-stopped
  smartflow-compliance:
    mem_limit: 768m     # compliance ML models need headroom
    cpus: 1.5
    restart: unless-stopped
  smartflow-policy-perfect:
    mem_limit: 256m
    cpus: 0.5
    restart: unless-stopped
```
Caddy handles TLS termination and routes to the correct service port. Its automatic HTTPS via Let's Encrypt requires ports 80 and 443 to be open and the DNS A record to resolve to this host before Caddy starts.
```
# /etc/caddy/Caddyfile — production routing rules
your-host.example.com {
    # Proxy routes → port 7775
    handle /v1/* {
        reverse_proxy localhost:7775
    }
    handle /anthropic/* {
        reverse_proxy localhost:7775
    }
    handle /cursor/* {
        reverse_proxy localhost:7775
    }
    handle /a2a/* {
        reverse_proxy localhost:7775
    }
    handle /api/mcp/* {
        reverse_proxy localhost:7775
    }
    handle /.well-known/* {
        reverse_proxy localhost:7775
    }

    # Management routes → port 7778
    handle /api/guardrails* {
        reverse_proxy localhost:7778
    }
    handle /api/policies* {
        reverse_proxy localhost:7778
    }
    handle /api/auth* {
        reverse_proxy localhost:7778
    }
    handle /api/enterprise* {
        reverse_proxy localhost:7778
    }
    handle /api/routing* {
        reverse_proxy localhost:7778
    }
    handle /api/mcp/tools* {
        reverse_proxy localhost:7778
    }
    handle /api/mcp/auth* {
        reverse_proxy localhost:7778
    }
    handle /api/admin/mcp* {
        reverse_proxy localhost:7778
    }
    handle /api/metacache* {
        reverse_proxy localhost:7778
    }

    # Compliance → port 7777, Policy Perfect → port 7782
    handle /api/compliance* {
        reverse_proxy localhost:7777
    }
    handle /api/policy* {
        reverse_proxy localhost:7782
    }

    # Chat UI
    handle {
        reverse_proxy localhost:3600
    }

    # Performance
    encode gzip
    header Strict-Transport-Security "max-age=31536000; includeSubDomains"
}
```
Note the wildcard form: routes that must match both the bare path and sub-paths use `handle /api/foo*` (no slash before the `*`), not `handle /api/foo/*`. The bare form also matches the path without a trailing slash (e.g. `GET /api/foo`), which several Smartflow endpoints use.
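The difference between the two wildcard forms can be sketched with a toy model of trailing-`*` prefix matching (a simplified sketch for illustration only — Caddy's real matcher has more features, such as mid-path wildcards):

```python
def caddy_path_matches(pattern: str, path: str) -> bool:
    """Toy model of Caddy's path matcher: a trailing '*' makes the
    pattern a prefix match, including the empty suffix."""
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern

# The bare form matches both the exact path and sub-paths:
assert caddy_path_matches("/api/foo*", "/api/foo")
assert caddy_path_matches("/api/foo*", "/api/foo/bar")
# The slash form misses the exact path:
assert not caddy_path_matches("/api/foo/*", "/api/foo")
assert caddy_path_matches("/api/foo/*", "/api/foo/bar")
```

A route defined as `/api/foo/*` would therefore return the Caddy fallback (the chat UI here) for `GET /api/foo`, which is easy to misdiagnose as an upstream bug.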
Always configure health checks so Docker can automatically restart unhealthy containers:
```yaml
healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:7775/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 20s   # allow binary startup time
```
Key environment variables:

- `SERVICE_TYPE` — which binary to run (`proxy`, `api-server`, `compliance`, `policy-perfect`)
- `KEYSTORE_REDIS_URL` — Redis connection string for virtual key store
- `DATABASE_URL` — TimescaleDB/PostgreSQL DSN for VAS audit logs
- `ADMIN_API_KEY` — internal management key; keep out of client-facing env
- `MCP_GATEWAY_ENABLED=true` — activate MCP tool call cache routes
- `RATE_LIMIT_REQUESTS_PER_HOUR` — per-key hourly rate cap
- `SEMANTIC_CACHE_THRESHOLD` — VectorLite similarity threshold (default 0.90)
- `COMPLIANCE_TIMEOUT_SECS` — async compliance check timeout (default 8)
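For reference, these variables might be assembled into an env file like the following. Every value here is a placeholder, not a real credential or endpoint — substitute your own hosts, DSNs, and secrets:

```shell
# Example .env for the proxy container — all values are placeholders
SERVICE_TYPE=proxy
KEYSTORE_REDIS_URL=redis://redis:6379/0
DATABASE_URL=postgres://smartflow:change-me@timescaledb:5432/smartflow
ADMIN_API_KEY=change-me-to-a-long-random-string
MCP_GATEWAY_ENABLED=true
RATE_LIMIT_REQUESTS_PER_HOUR=10000
SEMANTIC_CACHE_THRESHOLD=0.90
COMPLIANCE_TIMEOUT_SECS=8
```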
The Smartflow Helm chart (helm/smartflow) deploys all four services plus
TimescaleDB, Redis, and the SafeChat frontend. The chart is designed for a 3-node cluster
with at least 4 vCPU and 8 GB RAM per node (DigitalOcean s-4vcpu-8gb or
equivalent) for comfortable production workloads.
On smaller nodes (e.g. `s-2vcpu-4gb`), system daemons, cert-manager, NGINX ingress, and Smartflow pods all compete for the same 2 cores. Under load this causes Linux CFS throttling — visible as latency spikes of 100ms or more. Always use at least 4 vCPU nodes for production.
```shell
# Install cert-manager first (required for TLS)
helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.4 \
  --set crds.enabled=true \
  --set resources.requests.cpu=10m \
  --set resources.requests.memory=64Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi \
  --set webhook.resources.requests.cpu=10m \
  --set webhook.resources.requests.memory=32Mi \
  --set webhook.resources.limits.cpu=50m \
  --set webhook.resources.limits.memory=64Mi \
  --set cainjector.resources.requests.cpu=10m \
  --set cainjector.resources.requests.memory=32Mi \
  --set cainjector.resources.limits.cpu=50m \
  --set cainjector.resources.limits.memory=64Mi \
  --set webhook.timeoutSeconds=29

# Install NGINX ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Deploy Smartflow
helm upgrade --install smartflow ./helm/smartflow \
  --namespace smartflow --create-namespace \
  -f values.yaml
```
```yaml
# helm/smartflow/values.yaml — production settings
proxy:
  replicas: 2
  image:
    repository: langsmartai/safechat-enterprise
    tag: latest
    pullPolicy: Always
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      # No CPU limit — avoids CFS throttle jitter (see Perf section)
      memory: 512Mi

apiServer:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi

compliance:
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi

policyPerfect:
  replicas: 1
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      memory: 256Mi

ingress:
  enabled: true
  host: smartflow.your-domain.com
  tls: true
  clusterIssuer: letsencrypt-prod
  annotations:
    # Disable buffering — critical for streaming LLM responses
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    # Large body support for file/image uploads
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    # Long timeouts for streaming responses
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```
Smartflow uses cert-manager for automatic TLS certificate issuance and renewal via Let's Encrypt. The ClusterIssuer must be configured before enabling TLS on the ingress.
```yaml
# Apply after cert-manager is running
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@your-domain.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
```
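After applying the ClusterIssuer, it is worth confirming that the ACME account registered and the issuer is ready before enabling TLS on the ingress. One way to check the Ready condition:

```shell
kubectl get clusterissuer letsencrypt-prod \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# Expected: True
```

If this prints anything other than `True`, `kubectl describe clusterissuer letsencrypt-prod` shows the ACME registration error.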
DigitalOcean's cluster upgrade process requires all admission webhook timeouts to be
between 1 and 29 seconds. The default cert-manager install sets this to 30s, which
blocks node upgrades. The Helm install command above sets webhook.timeoutSeconds=29
to handle this automatically. Verify with:
```shell
kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Should output: 29
```
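If an existing cert-manager install reports 30, it can be patched in place rather than reinstalled. This is a sketch: the `/webhooks/0` index assumes a single webhook entry, which is the cert-manager default — verify with `kubectl get validatingwebhookconfiguration cert-manager-webhook -o yaml` first:

```shell
kubectl patch validatingwebhookconfiguration cert-manager-webhook \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/timeoutSeconds","value":29}]'
```

A later `helm upgrade` will reapply the chart's value, so also pass `--set webhook.timeoutSeconds=29` (or set it in your values file) to make the fix stick.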
Out of the box, a Kubernetes deployment adds 2–5ms of per-request overhead compared to bare metal, primarily from the overlay network, kube-proxy iptables chains, and NGINX ingress buffering. The configuration in this guide reduces that to approximately 1–2ms by applying the following tunings.
- **Overlay network hops.** Every request traverses: ingress → kube-proxy iptables → ClusterIP → pod overlay. That's 2–4 extra network stack transitions vs. localhost on bare metal. Each adds ~0.5–2ms. Mitigated by: `externalTrafficPolicy: Local` + node affinity.
- **Ingress buffering.** By default NGINX buffers the full request body before forwarding it to the upstream pod. For large AI payloads or streaming responses this introduces measurable latency. Mitigated by: `proxy-buffering: off` annotation.
- **CFS throttling.** When a pod exceeds its CPU limit, Linux's CFS scheduler throttles it for up to 100ms per scheduling period — even for brief bursts during token generation. Mitigated by: removing CPU limits on proxy pods.
- **Per-call DNS lookups.** Service-to-service calls (proxy → api-server → compliance) each perform a kube-dns lookup. On bare metal these are localhost calls with zero DNS overhead. Mitigated by: `ndots:2` + `dnsConfig` tuning.
1. externalTrafficPolicy: Local
Eliminates the SNAT hop for external traffic. Requires pods to be scheduled on the same node as the ingress — pair with node affinity.
```yaml
service:
  externalTrafficPolicy: Local
```
2. DNS ndots tuning
Reduces unnecessary DNS search-path lookups on every service call by lowering
ndots from the default of 5 to 2.
```yaml
dnsConfig:
  options:
    - name: ndots
      value: "2"
```
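Why this helps can be seen with a toy model of the resolver's search-list expansion (a simplified sketch of glibc/musl behaviour; the search domains below are what a pod in a `smartflow` namespace would typically receive from kube-dns):

```python
def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Toy model of resolv.conf search-list expansion: a name with fewer
    than `ndots` dots tries every search domain before the literal name."""
    if name.endswith("."):               # fully qualified: exactly one query
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{d}." for d in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded     # literal name tried first
    return expanded + [absolute]         # search domains tried first

search = ["smartflow.svc.cluster.local", "svc.cluster.local", "cluster.local"]

# With the Kubernetes default ndots:5, an external provider name like
# api.openai.com (2 dots) burns three doomed cluster lookups first:
assert candidate_queries("api.openai.com", search, 5)[0] == \
    "api.openai.com.smartflow.svc.cluster.local."
# With ndots:2 the same name resolves on the first query:
assert candidate_queries("api.openai.com", search, 2)[0] == "api.openai.com."
```

Short in-cluster names like `api-server` still expand through the search list under `ndots:2`, so service discovery keeps working; only multi-dot external names skip the wasted round trips.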
3. Streaming-safe ingress annotations
Disabling NGINX request and response buffering is essential for SSE / streaming chat completions — tokens arrive incrementally and must not be held in a buffer.
```yaml
proxy-buffering: "off"
proxy-request-buffering: "off"
proxy-read-timeout: "300"
```
4. Node affinity for Smartflow pods
Pin Smartflow pods to dedicated nodes so cert-manager and ingress do not compete for CPU on the same node.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values: [smartflow]
```
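This affinity only schedules pods onto nodes carrying the `role=smartflow` label, so the label must exist first (the node name below is a placeholder for one of your worker nodes):

```shell
kubectl label node your-node-name role=smartflow

# Verify which nodes carry the label
kubectl get nodes -l role=smartflow
```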
| Metric | Bare Metal (baseline) | K8s Default | K8s Tuned |
|---|---|---|---|
| p50 overhead (proxy hop) | ~0ms | 3–5ms | ~1ms |
| p99 overhead (loaded) | <2ms | 15–100ms (throttle) | 3–5ms |
| Streaming first-token | Immediate | Buffered until body complete | Immediate (buffering off) |
| Scale-out | Manual | HPA auto-scale | HPA auto-scale |
| Zero-downtime deploy | Service interruption | Rolling update | Rolling update |
Security checklist:

- Treat `ADMIN_API_KEY`, provider keys, and DB passwords as secrets.
- Rotate `ADMIN_API_KEY` quarterly; it gates the management APIs.
- Virtual keys (`sk-sf-*`) are the only credentials clients should ever handle.
- The management API (`:7778`) should not be exposed via the public ingress without an additional auth layer.
- Serve `Strict-Transport-Security: max-age=31536000` on all responses (the Caddyfile above sets this).
- Virtual keys (`sk-sf-*`) isolate clients from real provider API keys.
- Virtual keys are accepted in both `Authorization: Bearer` and `x-api-key` headers.
- Revoke a compromised key via `POST /api/enterprise/vkeys/{id}/revoke`.

Build and release runbook:

1. SSH to the build host: `ssh -i ~/.ssh/dda_deploy_key root@192.81.214.94`
2. Build the release binaries: `/root/.cargo/bin/cargo build --release` in `/opt/smartflow-source/`
3. Copy the binaries into the Docker repo: `smartflow` → `Smartflow_docker/smartflow-bin`, `api_server` → `api-server-binary`, etc.
4. Commit and push: `git add <binary> && git commit && git push` to `SRAGroupTX/Smartflow_docker`
5. Build and push the image: `docker buildx build --platform linux/amd64 -t langsmartai/safechat-enterprise:latest -f Dockerfile.runtime --push .`
6. Redeploy Compose hosts: `docker compose -f docker-compose.simple.yaml pull && docker compose up -d` (but see the note below)
7. Redeploy Kubernetes: `helm upgrade smartflow ./helm/smartflow -n smartflow --reuse-values`

Note: `docker-compose.simple.yaml` uses `build:` directives, not `image:` references. Running `docker compose pull` alone does nothing. You must run `docker compose build` or deploy fresh from the cloned repo so Docker builds the image locally from `Dockerfile.runtime`.
```shell
# Check a specific feature string is present in the running binary
# (never assume the running container matches the source code)
docker exec smartflow-proxy strings /usr/local/bin/smartflow | grep "semantic_cache"

# On Kubernetes
kubectl exec -n smartflow deploy/smartflow-proxy -- \
  strings /usr/local/bin/smartflow | grep "semantic_cache"
```
Before starting a cluster upgrade, run `doctl kubernetes cluster lint <cluster-id>` (or DigitalOcean's clusterlint UI). Resolve all issues before starting the upgrade — especially webhook timeout warnings, which will block node drains. Webhook timeouts must be ≤29s; the Helm install above ensures this for cert-manager.
```shell
# Verify cert-manager webhook timeout before upgrade
kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Expected: 29

# Verify all Smartflow pods are healthy
kubectl get pods -n smartflow
kubectl top pods -n smartflow   # check for memory pressure
```
```shell
# Manual scale — proxy pods
kubectl scale deploy smartflow-proxy -n smartflow --replicas=3

# HPA — autoscale proxy between 2 and 8 replicas based on CPU
kubectl autoscale deploy smartflow-proxy -n smartflow \
  --cpu-percent=60 --min=2 --max=8
```
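For GitOps-managed clusters, the same autoscale policy can be kept in version control as a manifest. This is a sketch equivalent to the `kubectl autoscale` command above, assuming the deployment name `smartflow-proxy` used throughout this guide:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: smartflow-proxy
  namespace: smartflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: smartflow-proxy
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Note that CPU-utilization HPAs compute utilization against the pod's CPU *request* (200m in the values above), which still works with the no-CPU-limit setting recommended earlier.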
```shell
# Bash quick test
export SMARTFLOW_HOST=https://smartflow.your-domain.com
export VIRTUAL_KEY=sk-sf-your-key
bash smartflow_integration_test.sh

# Full Python suite (requires httpx, openai, anthropic)
python3 smartflow_integration_test.py
```
| Endpoint | Port | Returns |
|---|---|---|
| `GET /health` | 7775 (proxy) | Proxy liveness, provider connectivity |
| `GET /api/health/comprehensive` | 7778 (api-server) | All services, Redis, DB |
| `GET /api/providers/perf` | 7778 | Per-provider latency + error rates |
| `GET /api/metacache/stats` | 7778 | 4-phase cache hit rates, savings |
| `GET /api/mcp/cache/stats` | 7775 | MCP tool call cache stats (requires `MCP_GATEWAY_ENABLED=true`) |
| `GET /api/compliance/intelligent/health` | 7777 | ML compliance engine status |
Monitor /api/metacache/stats. Expect Phase 4 semantic hit rate > 40% for repetitive workloads. If hit rate drops, check SEMANTIC_CACHE_THRESHOLD configuration.
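A minimal polling sketch for this check. The response shape assumed here — a JSON object with a `phase4` object carrying a `hit_rate` field — is an illustration only; adjust the extraction to the actual `/api/metacache/stats` payload:

```python
import json
import urllib.request

def phase4_hit_rate(stats: dict) -> float:
    """Extract the Phase 4 semantic hit rate from a stats payload.
    The 'phase4.hit_rate' key path is an assumed shape, not the
    documented API — adapt it to the real response."""
    return float(stats.get("phase4", {}).get("hit_rate", 0.0))

def cache_healthy(base_url: str, threshold: float = 0.40) -> bool:
    """Fetch cache stats and compare against the expected hit rate."""
    with urllib.request.urlopen(f"{base_url}/api/metacache/stats") as resp:
        stats = json.load(resp)
    return phase4_hit_rate(stats) >= threshold

# The extraction logic can be exercised without a network call:
assert phase4_hit_rate({"phase4": {"hit_rate": 0.55}}) == 0.55
assert phase4_hit_rate({}) == 0.0
```

Wire `cache_healthy()` into whatever alerting you already run (cron plus a webhook is enough) so a silent drop in hit rate surfaces before it shows up as a cost increase.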
Poll GET /api/enterprise/vkeys/{id}/budget for per-key spend tracking. Alert if any key approaches its budget ceiling to avoid unexpected 429 responses to clients.
Check /api/providers/perf for latency and error rate per provider. Smartflow's intelligent routing will deprioritize degraded providers automatically, but monitoring gives early warning.
Run kubectl top pods -n smartflow regularly. The compliance pod
holds ML models in memory — if it approaches its limit, increase compliance.resources.limits.memory.
VectorLite semantic cache, virtual key budgets, and rate limiting all depend on Redis. A Redis outage degrades to no caching but should not hard-fail requests. Check KEYSTORE_REDIS_URL if cache stats return zeros.
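A quick way to confirm Redis reachability, using the same connection string the services read (assumes `redis-cli` is installed on the host):

```shell
redis-cli -u "$KEYSTORE_REDIS_URL" ping
# A healthy instance replies: PONG
```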
The VAS audit log writes to TimescaleDB on every request. Monitor DB disk usage — TimescaleDB's automatic chunk compression keeps this manageable but the volume is proportional to traffic.