LangSmart Smartflow · Ops & Infrastructure

Docker & Kubernetes
Deployment Best Practices

Production-hardened configuration for Smartflow on Docker Compose and Kubernetes — covering resource tuning, ingress optimization, TLS automation, performance configuration, and a complete operational runbook.

Production Ready · Docker Compose · Kubernetes / Helm · cert-manager v1.19 · Performance Tuned
Deployment Options at a Glance
Deployment Model            | Best For                             | HA / Scaling                 | Latency Profile                    | Complexity
Docker Compose (bare metal) | Single-tenant, pilot, dev            | Manual / none                | Lowest — direct localhost routing  | Low
Docker Compose (VM)         | Small teams, fixed workload          | Manual restart only          | Low — Caddy reverse proxy          | Low
Kubernetes / Helm           | Enterprise, multi-tenant, auto-scale | Full — HPA, rolling upgrades | +2–5ms overlay (tunable to ~+1ms)  | Medium
Choosing your path
For proof-of-concept or single-customer deployments, Docker Compose is faster to stand up and has lower per-request latency. For production multi-tenant environments, Kubernetes provides automatic failover, rolling upgrades without downtime, and horizontal scaling — well worth the small latency delta after tuning.
Docker Compose — Production Configuration

The canonical compose file for Smartflow is docker-compose.simple.yaml in the Smartflow_docker repo. The four backend services share a single Docker image (langsmartai/safechat-enterprise:latest); the SERVICE_TYPE environment variable selects which binary runs. The SafeChat frontend runs as a separate chat container.

Service Architecture
smartflow-proxy (port 7775)
  Main LLM gateway. Handles all /v1/*, /anthropic/*, /cursor/*, and MCP routes.

smartflow-api-server (port 7778)
  Management plane — virtual keys, guardrails, routing policies, VAS audit logs, analytics.

smartflow-compliance (port 7777)
  ML-powered compliance engine — intelligent scanning, learning feedback, org baselines.

smartflow-policy-perfect (port 7782)
  Policy evaluation service for guardrail rule sets and advanced decision trees.

smartflow-chat (port 3600)
  SafeChat Enterprise frontend — served by the chat container.
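After the stack is up, a quick curl against each published port confirms the containers are serving. The proxy, api-server, and compliance routes below are the health endpoints documented in the Monitoring section; the policy-perfect /health path and the chat check are assumptions — adjust to your build.

```shell
# Smoke test each published service port (run on the Docker host)
curl -sf http://localhost:7775/health                             # proxy liveness
curl -sf http://localhost:7778/api/health/comprehensive           # api-server: services, Redis, DB
curl -sf http://localhost:7777/api/compliance/intelligent/health  # compliance engine status
curl -sf http://localhost:7782/health                             # policy-perfect (assumed route)
curl -sfo /dev/null http://localhost:3600/ && echo "chat UI up"   # frontend responds
```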

Resource Limits for Compose

Always set mem_limit and cpus constraints in Compose. Without them a runaway request loop (e.g. a streaming response that never terminates) can exhaust host memory and take down all services.

# docker-compose.simple.yaml — recommended resource constraints
services:
  smartflow-proxy:
    mem_limit: 1g
    memswap_limit: 1g       # disable swap for this container
    cpus: 2.0              # allow up to 2 cores; adjust for your host
    restart: unless-stopped
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

  smartflow-api-server:
    mem_limit: 512m
    cpus: 1.0
    restart: unless-stopped

  smartflow-compliance:
    mem_limit: 768m         # compliance ML models need headroom
    cpus: 1.5
    restart: unless-stopped

  smartflow-policy-perfect:
    mem_limit: 256m
    cpus: 0.5
    restart: unless-stopped
Caddy Reverse Proxy (bare metal / VM)

Caddy handles TLS termination and routes to the correct service port. Its automatic HTTPS via Let's Encrypt requires ports 80 and 443 to be open and the DNS A record to resolve to this host before Caddy starts.

# /etc/caddy/Caddyfile — production routing rules
your-host.example.com {
    # Proxy routes → port 7775
    handle /v1/*            { reverse_proxy localhost:7775 }
    handle /anthropic/*     { reverse_proxy localhost:7775 }
    handle /cursor/*        { reverse_proxy localhost:7775 }
    handle /a2a/*           { reverse_proxy localhost:7775 }
    handle /api/mcp/*       { reverse_proxy localhost:7775 }
    handle /.well-known/*   { reverse_proxy localhost:7775 }

    # Management routes → port 7778
    handle /api/guardrails*  { reverse_proxy localhost:7778 }
    handle /api/policies*    { reverse_proxy localhost:7778 }
    handle /api/auth*        { reverse_proxy localhost:7778 }
    handle /api/enterprise*  { reverse_proxy localhost:7778 }
    handle /api/routing*     { reverse_proxy localhost:7778 }
    handle /api/mcp/tools*   { reverse_proxy localhost:7778 }
    handle /api/mcp/auth*    { reverse_proxy localhost:7778 }
    handle /api/admin/mcp*   { reverse_proxy localhost:7778 }
    handle /api/metacache*   { reverse_proxy localhost:7778 }

    # Compliance → port 7777, Policy Perfect → port 7782
    handle /api/compliance*  { reverse_proxy localhost:7777 }
    handle /api/policy*      { reverse_proxy localhost:7782 }

    # Chat UI
    handle { reverse_proxy localhost:3600 }

    # Performance
    encode gzip
    header Strict-Transport-Security "max-age=31536000; includeSubDomains"
}
Use bare wildcard paths — not trailing-slash wildcards
Write handle /api/foo* (no slash before *), not handle /api/foo/*. The bare form also matches the path without a trailing slash (e.g. GET /api/foo), which several Smartflow endpoints use.
Health Checks

Always configure health checks so Docker can automatically restart unhealthy containers:

healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:7775/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 20s    # allow binary startup time
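The same pattern applies to the other services. A sketch for the api-server and compliance containers, using the health routes documented in the Monitoring section (the interval/timeout values are suggestions, not requirements):

```yaml
smartflow-api-server:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:7778/api/health/comprehensive"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s

smartflow-compliance:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:7777/api/compliance/intelligent/health"]
    interval: 30s
    timeout: 10s        # ML engine may respond slowly under load
    retries: 3
    start_period: 40s   # model loading takes longer than plain binary startup
```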
Key Environment Variables
Required for all services
  • SERVICE_TYPE — which binary to run (proxy, api-server, compliance, policy-perfect)
  • KEYSTORE_REDIS_URL — Redis connection string for virtual key store
  • DATABASE_URL — TimescaleDB/PostgreSQL DSN for VAS audit logs
  • ADMIN_API_KEY — internal management key; keep out of client-facing env
Feature flags
  • MCP_GATEWAY_ENABLED=true — activate MCP tool call cache routes
  • RATE_LIMIT_REQUESTS_PER_HOUR — per-key hourly rate cap
  • SEMANTIC_CACHE_THRESHOLD — VectorLite similarity threshold (default 0.90)
  • COMPLIANCE_TIMEOUT_SECS — async compliance check timeout (default 8)
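Tying these together, a minimal environment block for the proxy service might look like the following. Hostnames, credentials, and the .env.secrets filename are placeholders — only the variable names come from the list above.

```yaml
smartflow-proxy:
  environment:
    SERVICE_TYPE: proxy
    KEYSTORE_REDIS_URL: redis://redis:6379/0
    DATABASE_URL: postgres://smartflow:CHANGE_ME@timescaledb:5432/smartflow
    MCP_GATEWAY_ENABLED: "true"
    RATE_LIMIT_REQUESTS_PER_HOUR: "1000"
    SEMANTIC_CACHE_THRESHOLD: "0.90"
    COMPLIANCE_TIMEOUT_SECS: "8"
  env_file:
    - .env.secrets   # ADMIN_API_KEY and provider keys live here, not in the compose file
```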
Kubernetes — Helm Deployment

The Smartflow Helm chart (helm/smartflow) deploys all four services plus TimescaleDB, Redis, and the SafeChat frontend. The chart is designed for a 3-node cluster with at least 4 vCPU and 8 GB RAM per node (DigitalOcean s-4vcpu-8gb or equivalent) for comfortable production workloads.

Minimum Cluster Sizing
  • Nodes (minimum): 3
  • vCPU per node (recommended): 4
  • RAM per node: 8 GB
  • SSD per node: 50 GB
  • Kubernetes version: 1.29+
2 vCPU nodes are too small
On 2 vCPU nodes (e.g. DO s-2vcpu-4gb), system daemons, cert-manager, NGINX ingress, and Smartflow pods all compete for the same 2 cores. Under load this causes Linux CFS throttling — visible as latency spikes of 100ms or more. Always use at least 4 vCPU nodes for production.
Helm Install
# Install cert-manager first (required for TLS)
helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.4 \
  --set crds.enabled=true \
  --set resources.requests.cpu=10m \
  --set resources.requests.memory=64Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi \
  --set webhook.resources.requests.cpu=10m \
  --set webhook.resources.requests.memory=32Mi \
  --set webhook.resources.limits.cpu=50m \
  --set webhook.resources.limits.memory=64Mi \
  --set cainjector.resources.requests.cpu=10m \
  --set cainjector.resources.requests.memory=32Mi \
  --set cainjector.resources.limits.cpu=50m \
  --set cainjector.resources.limits.memory=64Mi \
  --set webhook.timeoutSeconds=29

# Install NGINX ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Deploy Smartflow
helm upgrade --install smartflow ./helm/smartflow \
  --namespace smartflow --create-namespace \
  -f values.yaml
Recommended values.yaml
# helm/smartflow/values.yaml — production settings

proxy:
  replicas: 2
  image:
    repository: langsmartai/safechat-enterprise
    tag: latest
    pullPolicy: Always
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:              # No CPU limit — avoids CFS throttle jitter (see Perf section)
      memory: 512Mi

apiServer:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi

compliance:
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi

policyPerfect:
  replicas: 1
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      memory: 256Mi

ingress:
  enabled: true
  host: smartflow.your-domain.com
  tls: true
  clusterIssuer: letsencrypt-prod
  annotations:
    # Disable buffering — critical for streaming LLM responses
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    # Large body support for file/image uploads
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    # Long timeouts for streaming responses
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
No CPU limit on proxy — intentional
CPU limits trigger Linux CFS (Completely Fair Scheduler) throttling. Even if a pod uses its full CPU limit for only 1ms during a token generation burst, the scheduler can throttle it for up to 100ms before the next period. For latency-sensitive AI proxying, set only CPU requests (for scheduler placement) and omit limits for the proxy and api-server pods. Memory limits are still required.
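To confirm whether a pod is actually being throttled, inspect its cgroup CPU statistics; a growing nr_throttled counter means CFS is kicking in. This assumes cgroup v2 (the path differs under cgroup v1, where the file lives under /sys/fs/cgroup/cpu):

```shell
# nr_throttled and throttled_usec increase whenever the pod hits its CPU limit
kubectl exec -n smartflow deploy/smartflow-proxy -- cat /sys/fs/cgroup/cpu.stat
```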
TLS & cert-manager Configuration

Smartflow uses cert-manager for automatic TLS certificate issuance and renewal via Let's Encrypt. The ClusterIssuer must be configured before enabling TLS on the ingress.

ClusterIssuer setup
# Apply after cert-manager is running
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@your-domain.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
Webhook Timeout — DigitalOcean Requirement

DigitalOcean's cluster upgrade process requires all admission webhook timeouts to be between 1 and 29 seconds. The default cert-manager install sets this to 30s, which blocks node upgrades. The Helm install command above sets webhook.timeoutSeconds=29 to handle this automatically. Verify with:

kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Should output: 29
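If the output shows 30 (for example on a pre-existing cert-manager install), the webhook can be patched in place. Note this is a stopgap — a later helm upgrade without webhook.timeoutSeconds=29 will restore the default, so fix the Helm values as well.

```shell
kubectl patch validatingwebhookconfiguration cert-manager-webhook \
  --type json \
  -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 29}]'
```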
Performance Tuning

Out of the box, a Kubernetes deployment adds 2–5ms of per-request overhead compared to bare metal, primarily from the overlay network, kube-proxy iptables chains, and NGINX ingress buffering. The configuration in this guide reduces that to approximately 1–2ms by applying the following tunings.

Why the Gap Exists
Network overlay hops

Every request traverses: ingress → kube-proxy iptables → ClusterIP → pod overlay. That's 2–4 extra network stack transitions vs. localhost on bare metal. Each adds ~0.5–2ms. Mitigated by: externalTrafficPolicy: Local + node affinity

NGINX ingress buffering

By default NGINX buffers the full request body before forwarding it to the upstream pod. For large AI payloads or streaming responses this introduces measurable latency. Mitigated by: proxy-buffering: off annotation

CFS CPU throttling

When a pod exceeds its CPU limit, Linux's CFS scheduler throttles it for up to 100ms per scheduling period — even for brief bursts during token generation. Mitigated by: removing CPU limits on proxy pods

kube-dns latency

Service-to-service calls (proxy → api-server → compliance) each perform a kube-dns lookup. On bare metal these are localhost calls with zero DNS overhead. Mitigated by: ndots:2 + dnsConfig tuning

Applied Optimizations

1. externalTrafficPolicy: Local

Eliminates the SNAT hop for external traffic. Requires pods to be scheduled on the same node as the ingress — pair with node affinity.

service:
  externalTrafficPolicy: Local

2. DNS ndots tuning

Reduces unnecessary DNS search-path lookups on every service call by lowering ndots from the default of 5 to 2.

dnsConfig:
  options:
    - name: ndots
      value: "2"

3. Streaming-safe ingress annotations

Disabling NGINX request and response buffering is essential for SSE / streaming chat completions — tokens arrive incrementally and must not be held in a buffer.

proxy-buffering: "off"
proxy-request-buffering: "off"
proxy-read-timeout: "300"

4. Node affinity for Smartflow pods

Pin Smartflow pods to dedicated nodes so cert-manager and ingress do not compete for CPU on the same node.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values: [smartflow]
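The role=smartflow label this affinity selects on is not set by default; label the dedicated nodes first (the node name below is a placeholder):

```shell
# Label the nodes reserved for Smartflow workloads
kubectl label nodes worker-pool-xyz role=smartflow
kubectl get nodes -L role   # verify the label column
```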
Expected Latency After Tuning
Metric                   | Bare Metal (baseline) | K8s Default                  | K8s Tuned
p50 overhead (proxy hop) | ~0ms                  | 3–5ms                        | ~1ms
p99 overhead (loaded)    | <2ms                  | 15–100ms (throttle)          | 3–5ms
Streaming first-token    | Immediate             | Buffered until body complete | Immediate (buffering off)
Scale-out                | Manual                | HPA auto-scale               | HPA auto-scale
Zero-downtime deploy     | Service interruption  | Rolling update               | Rolling update
Security Hardening
Secrets management
  • Never bake API keys into Docker images or Helm chart defaults
  • Use Kubernetes Secrets (or a secrets manager like Vault / DO Secrets) for ADMIN_API_KEY, provider keys, and DB passwords
  • Rotate the ADMIN_API_KEY quarterly; it gates the management API
  • Virtual keys (sk-sf-*) are the only credentials clients should ever handle
Network policies
  • The proxy pod must reach the api-server, compliance, and policy pods — all within the cluster
  • Only the ingress controller should route external traffic to the proxy
  • TimescaleDB and Redis should have a NetworkPolicy that denies all ingress except from Smartflow pods
  • The management API (:7778) should not be exposed via the public ingress without an additional auth layer
TLS everywhere
  • All external traffic terminates TLS at the ingress (cert-manager + Let's Encrypt)
  • Enforce HSTS: Strict-Transport-Security: max-age=31536000
  • For inter-pod traffic, Kubernetes service meshes (Istio/Linkerd) can add mTLS — optional but recommended for high-compliance deployments
  • Keep cert-manager at a supported release; webhook timeouts must stay ≤29s on DigitalOcean
Virtual key security model
  • Virtual keys (sk-sf-*) isolate clients from real provider API keys
  • Each key carries an optional budget cap, rate limit, and policy binding
  • The proxy enforces budget from both Authorization: Bearer and x-api-key headers
  • Revoke compromised keys immediately via POST /api/enterprise/vkeys/{id}/revoke
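A revocation call might look like the following; the Authorization header and the vk_12345 key ID are assumptions — check your deployment's management-API auth scheme and key identifiers.

```shell
curl -X POST "https://smartflow.your-domain.com/api/enterprise/vkeys/vk_12345/revoke" \
  -H "Authorization: Bearer $ADMIN_API_KEY"
```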
Operational Runbook
Updating the Binary (Docker Compose)
1. Build on the build server
   ssh -i ~/.ssh/dda_deploy_key root@192.81.214.94
   /root/.cargo/bin/cargo build --release in /opt/smartflow-source/
2. SCP binaries to the Docker repo
   Copy smartflow → Smartflow_docker/smartflow-bin, api_server → api-server-binary, etc.
3. Commit & push the binaries
   git add <binary> && git commit && git push to SRAGroupTX/Smartflow_docker
4. Rebuild the Docker image
   docker buildx build --platform linux/amd64 -t langsmartai/safechat-enterprise:latest -f Dockerfile.runtime --push .
5. Deploy
   Compose: docker compose -f docker-compose.simple.yaml build && docker compose -f docker-compose.simple.yaml up -d
   Kubernetes: helm upgrade smartflow ./helm/smartflow -n smartflow --reuse-values

docker pull won't update running containers
The docker-compose.simple.yaml uses build: directives, not image: references, so docker compose pull alone does nothing. Run docker compose build (as in step 5) or deploy fresh from the cloned repo so Docker rebuilds the image locally from Dockerfile.runtime.
Verifying Binary Version in Running Container
# Check a specific feature string is present in the running binary
# (never assume the running container matches the source code)
docker exec smartflow-proxy strings /usr/local/bin/smartflow | grep "semantic_cache"

# On Kubernetes
kubectl exec -n smartflow deploy/smartflow-proxy -- \
  strings /usr/local/bin/smartflow | grep "semantic_cache"
Cluster Upgrade (DigitalOcean)
Before every cluster upgrade
Run doctl kubernetes cluster lint <cluster-id> (or DigitalOcean's clusterlint UI). Resolve all issues before starting the upgrade — especially webhook timeout warnings, which will block node drains. Webhook timeouts must be ≤29s; the Helm install above ensures this for cert-manager.
# Verify cert-manager webhook timeout before upgrade
kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Expected: 29

# Verify all Smartflow pods are healthy
kubectl get pods -n smartflow
kubectl top pods -n smartflow   # check for memory pressure
Scaling
# Manual scale — proxy pods
kubectl scale deploy smartflow-proxy -n smartflow --replicas=3

# HPA — autoscale proxy between 2 and 8 replicas based on CPU
kubectl autoscale deploy smartflow-proxy -n smartflow \
  --cpu-percent=60 --min=2 --max=8
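The kubectl autoscale one-liner is equivalent to this autoscaling/v2 manifest, which is easier to keep in version control. Names and namespace match the commands above; note that utilization-based HPA scales against the pod's CPU requests, so it works even with CPU limits omitted as recommended earlier.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: smartflow-proxy
  namespace: smartflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: smartflow-proxy
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # percent of requested CPU
```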
Running Integration Tests
# Bash quick test
export SMARTFLOW_HOST=https://smartflow.your-domain.com
export VIRTUAL_KEY=sk-sf-your-key
bash smartflow_integration_test.sh

# Full Python suite (requires httpx, openai, anthropic)
python3 smartflow_integration_test.py
Monitoring & Observability
Health Endpoints
Endpoint                               | Port              | Returns
GET /health                            | 7775 (proxy)      | Proxy liveness, provider connectivity
GET /api/health/comprehensive          | 7778 (api-server) | All services, Redis, DB
GET /api/providers/perf                | 7778              | Per-provider latency + error rates
GET /api/metacache/stats               | 7778              | 4-phase cache hit rates, savings
GET /api/mcp/cache/stats               | 7775              | MCP tool call cache stats (requires MCP_GATEWAY_ENABLED=true)
GET /api/compliance/intelligent/health | 7777              | ML compliance engine status
Key Metrics to Watch
Cache efficiency

Monitor /api/metacache/stats. Expect Phase 4 semantic hit rate > 40% for repetitive workloads. If hit rate drops, check SEMANTIC_CACHE_THRESHOLD configuration.

Virtual key spend

Poll GET /api/enterprise/vkeys/{id}/budget for per-key spend tracking. Alert if any key approaches its budget ceiling to avoid unexpected 429 responses to clients.
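A minimal polling sketch for this check, written in Python like the integration suite. The Authorization header and the spend / budget_cap JSON field names are assumptions — adjust them to the actual response shape of your deployment.

```python
import json
from urllib.request import Request, urlopen

ALERT_THRESHOLD = 0.8  # warn at 80% of the budget ceiling


def should_alert(spend: float, budget_cap: float,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True when a key's spend crosses the alert fraction of its cap."""
    if budget_cap <= 0:  # treat uncapped keys as never alerting
        return False
    return spend / budget_cap >= threshold


def check_key_budget(host: str, key_id: str, admin_key: str) -> bool:
    """Poll the budget endpoint for one virtual key and decide whether to alert.

    Auth header and JSON field names are assumptions.
    """
    req = Request(
        f"{host}/api/enterprise/vkeys/{key_id}/budget",
        headers={"Authorization": f"Bearer {admin_key}"},
    )
    with urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return should_alert(data["spend"], data["budget_cap"])


if __name__ == "__main__":
    # Pure-logic check without hitting the network:
    print(should_alert(85.0, 100.0))  # True  — 85% of cap
    print(should_alert(10.0, 100.0))  # False — well under threshold
```

Wire should_alert into whatever pager or Slack hook your team uses; the point is to fire before the proxy starts returning 429s for that key.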

Provider health

Check /api/providers/perf for latency and error rate per provider. Smartflow's intelligent routing will deprioritize degraded providers automatically, but monitoring gives early warning.

Memory headroom

Run kubectl top pods -n smartflow regularly. The compliance pod holds ML models in memory — if it approaches its limit, increase compliance.resources.limits.memory.

Redis connectivity

VectorLite semantic cache, virtual key budgets, and rate limiting all depend on Redis. A Redis outage degrades to no caching but should not hard-fail requests. Check KEYSTORE_REDIS_URL if cache stats return zeros.
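A quick connectivity check against the same Redis instance the services use (requires redis-cli on the host; -u accepts the same connection-string format as KEYSTORE_REDIS_URL):

```shell
redis-cli -u "$KEYSTORE_REDIS_URL" ping   # expect: PONG
```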

Audit log pipeline

The VAS audit log writes to TimescaleDB on every request. Monitor DB disk usage — TimescaleDB's automatic chunk compression keeps this manageable but the volume is proportional to traffic.