LangSmart Smartflow · Ops & Infrastructure

Docker & Kubernetes
Deployment Best Practices

Production-hardened configuration for Smartflow on Docker Compose and Kubernetes — covering resource tuning, ingress optimization, TLS automation, performance configuration, and a complete operational runbook.

Production Ready · Docker Compose · Kubernetes / Helm · cert-manager v1.19 · Performance Tuned
Deployment Options at a Glance
Deployment Model            | Best For                             | HA / Scaling                 | Latency Profile                    | Complexity
Docker Compose (bare metal) | Single-tenant, pilot, dev            | Manual / none                | Lowest — direct localhost routing  | Low
Docker Compose (VM)         | Small teams, fixed workload          | Manual restart only          | Low — Caddy reverse proxy          | Low
Kubernetes / Helm           | Enterprise, multi-tenant, auto-scale | Full — HPA, rolling upgrades | +2–5ms overlay (tunable to ~+1ms)  | Medium
Choosing your path
For proof-of-concept or single-customer deployments, Docker Compose is faster to stand up and has lower per-request latency. For production multi-tenant environments, Kubernetes provides automatic failover, rolling upgrades without downtime, and horizontal scaling — well worth the small latency delta after tuning.
Docker Compose — Production Configuration

The canonical compose file for Smartflow is docker-compose.simple.yaml in the Smartflow_docker repo. The four backend services share a single Docker image (langsmartai/safechat-enterprise:latest); the SERVICE_TYPE environment variable selects which binary runs. The SafeChat frontend runs as a separate chat container.

Service Architecture
smartflow-proxy (port 7775)
  Main LLM gateway. Handles all /v1/*, /anthropic/*, /cursor/*, and MCP routes.

smartflow-api-server (port 7778)
  Management plane — virtual keys, guardrails, routing policies, VAS audit logs, analytics.

smartflow-compliance (port 7777)
  ML-powered compliance engine — intelligent scanning, learning feedback, org baselines.

smartflow-policy-perfect (port 7782)
  Policy evaluation service for guardrail rule sets and advanced decision trees.

smartflow-chat (port 3600)
  SafeChat Enterprise frontend — served by the chat container.
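After the stack is up, a quick curl against each published port confirms the containers are serving. The proxy, api-server, and compliance routes below are the health endpoints documented in the Monitoring section; the policy-perfect /health path and the chat check are assumptions — adjust to your build.

```shell
# Smoke test each published service port (run on the Docker host)
curl -sf http://localhost:7775/health                             # proxy liveness
curl -sf http://localhost:7778/api/health/comprehensive           # api-server: services, Redis, DB
curl -sf http://localhost:7777/api/compliance/intelligent/health  # compliance engine status
curl -sf http://localhost:7782/health                             # policy-perfect (assumed route)
curl -sfo /dev/null http://localhost:3600/ && echo "chat UI up"   # frontend responds
```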

Resource Limits for Compose

Always set mem_limit and cpus constraints in Compose. Without them a runaway request loop (e.g. a streaming response that never terminates) can exhaust host memory and take down all services.

# docker-compose.simple.yaml — recommended resource constraints
services:
  smartflow-proxy:
    mem_limit: 1g
    memswap_limit: 1g       # disable swap for this container
    cpus: 2.0              # allow up to 2 cores; adjust for your host
    restart: unless-stopped
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

  smartflow-api-server:
    mem_limit: 512m
    cpus: 1.0
    restart: unless-stopped

  smartflow-compliance:
    mem_limit: 768m         # compliance ML models need headroom
    cpus: 1.5
    restart: unless-stopped

  smartflow-policy-perfect:
    mem_limit: 256m
    cpus: 0.5
    restart: unless-stopped
Caddy Reverse Proxy (bare metal / VM)

Caddy handles TLS termination and routes to the correct service port. Its automatic HTTPS via Let's Encrypt requires ports 80 and 443 to be open and the DNS A record to resolve to this host before Caddy starts.

# /etc/caddy/Caddyfile — production routing rules
your-host.example.com {
    # Proxy routes → port 7775
    handle /v1/*            { reverse_proxy localhost:7775 }
    handle /anthropic/*     { reverse_proxy localhost:7775 }
    handle /cursor/*        { reverse_proxy localhost:7775 }
    handle /a2a/*           { reverse_proxy localhost:7775 }
    handle /api/mcp/*       { reverse_proxy localhost:7775 }
    handle /.well-known/*   { reverse_proxy localhost:7775 }

    # Management routes → port 7778
    handle /api/guardrails*  { reverse_proxy localhost:7778 }
    handle /api/policies*    { reverse_proxy localhost:7778 }
    handle /api/auth*        { reverse_proxy localhost:7778 }
    handle /api/enterprise*  { reverse_proxy localhost:7778 }
    handle /api/routing*     { reverse_proxy localhost:7778 }
    handle /api/mcp/tools*   { reverse_proxy localhost:7778 }
    handle /api/mcp/auth*    { reverse_proxy localhost:7778 }
    handle /api/admin/mcp*   { reverse_proxy localhost:7778 }
    handle /api/metacache*   { reverse_proxy localhost:7778 }

    # Compliance → port 7777, Policy Perfect → port 7782
    handle /api/compliance*  { reverse_proxy localhost:7777 }
    handle /api/policy*      { reverse_proxy localhost:7782 }

    # Chat UI
    handle { reverse_proxy localhost:3600 }

    # Performance
    encode gzip
    header Strict-Transport-Security "max-age=31536000; includeSubDomains"
}
Use bare wildcard paths — not trailing-slash wildcards
Write handle /api/foo* (no slash before *), not handle /api/foo/*. The bare form also matches the path without a trailing slash (e.g. GET /api/foo), which several Smartflow endpoints use.
Health Checks

Always configure health checks so Docker can automatically restart unhealthy containers:

healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:7775/health"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 20s    # allow binary startup time
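The same pattern applies to the other services. A sketch for the api-server and compliance containers, using the health routes documented in the Monitoring section (the interval/timeout values are suggestions, not requirements):

```yaml
smartflow-api-server:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:7778/api/health/comprehensive"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 20s

smartflow-compliance:
  healthcheck:
    test: ["CMD", "curl", "-sf", "http://localhost:7777/api/compliance/intelligent/health"]
    interval: 30s
    timeout: 10s        # ML engine may respond slowly under load
    retries: 3
    start_period: 40s   # model loading takes longer than plain binary startup
```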
Key Environment Variables
Required for all services
  • SERVICE_TYPE — which binary to run (proxy, api-server, compliance, policy-perfect)
  • KEYSTORE_REDIS_URL — Redis connection string for virtual key store
  • DATABASE_URL — TimescaleDB/PostgreSQL DSN for VAS audit logs
  • ADMIN_API_KEY — internal management key; keep out of client-facing env
Feature flags
  • MCP_GATEWAY_ENABLED=true — activate MCP tool call cache routes
  • RATE_LIMIT_REQUESTS_PER_HOUR — per-key hourly rate cap
  • SEMANTIC_CACHE_THRESHOLD — VectorLite similarity threshold (default 0.90)
  • COMPLIANCE_TIMEOUT_SECS — async compliance check timeout (default 8)
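Tying these together, a minimal environment block for the proxy service might look like the following. Hostnames, credentials, and the .env.secrets filename are placeholders — only the variable names come from the list above.

```yaml
smartflow-proxy:
  environment:
    SERVICE_TYPE: proxy
    KEYSTORE_REDIS_URL: redis://redis:6379/0
    DATABASE_URL: postgres://smartflow:CHANGE_ME@timescaledb:5432/smartflow
    MCP_GATEWAY_ENABLED: "true"
    RATE_LIMIT_REQUESTS_PER_HOUR: "1000"
    SEMANTIC_CACHE_THRESHOLD: "0.90"
    COMPLIANCE_TIMEOUT_SECS: "8"
  env_file:
    - .env.secrets   # ADMIN_API_KEY and provider keys live here, not in the compose file
```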
Kubernetes — Helm Deployment

The Smartflow Helm chart (helm/smartflow) deploys all four services plus TimescaleDB, Redis, and the SafeChat frontend. The chart is designed for a 3-node cluster with at least 4 vCPU and 8 GB RAM per node (DigitalOcean s-4vcpu-8gb or equivalent) for comfortable production workloads.

Minimum Cluster Sizing
  • Nodes (minimum): 3
  • vCPU per node (recommended): 4
  • RAM per node: 8 GB
  • SSD per node: 50 GB
  • Kubernetes version: 1.29+
2 vCPU nodes are too small
On 2 vCPU nodes (e.g. DO s-2vcpu-4gb), system daemons, cert-manager, NGINX ingress, and Smartflow pods all compete for the same 2 cores. Under load this causes Linux CFS throttling — visible as latency spikes of 100ms or more. Always use at least 4 vCPU nodes for production.
Helm Install
# Install cert-manager first (required for TLS)
helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.19.4 \
  --set crds.enabled=true \
  --set resources.requests.cpu=10m \
  --set resources.requests.memory=64Mi \
  --set resources.limits.cpu=100m \
  --set resources.limits.memory=128Mi \
  --set webhook.resources.requests.cpu=10m \
  --set webhook.resources.requests.memory=32Mi \
  --set webhook.resources.limits.cpu=50m \
  --set webhook.resources.limits.memory=64Mi \
  --set cainjector.resources.requests.cpu=10m \
  --set cainjector.resources.requests.memory=32Mi \
  --set cainjector.resources.limits.cpu=50m \
  --set cainjector.resources.limits.memory=64Mi \
  --set webhook.timeoutSeconds=29

# Install NGINX ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Deploy Smartflow
helm upgrade --install smartflow ./helm/smartflow \
  --namespace smartflow --create-namespace \
  -f values.yaml
Recommended values.yaml
# helm/smartflow/values.yaml — production settings

proxy:
  replicas: 2
  image:
    repository: langsmartai/safechat-enterprise
    tag: latest
    pullPolicy: Always
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:              # No CPU limit — avoids CFS throttle jitter (see Perf section)
      memory: 512Mi

apiServer:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      memory: 256Mi

compliance:
  replicas: 1
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi

policyPerfect:
  replicas: 1
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      memory: 256Mi

ingress:
  enabled: true
  host: smartflow.your-domain.com
  tls: true
  clusterIssuer: letsencrypt-prod
  annotations:
    # Disable buffering — critical for streaming LLM responses
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    # Large body support for file/image uploads
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    # Long timeouts for streaming responses
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
No CPU limit on proxy — intentional
CPU limits trigger Linux CFS (Completely Fair Scheduler) throttling. Even if a pod uses its full CPU limit for only 1ms during a token generation burst, the scheduler can throttle it for up to 100ms before the next period. For latency-sensitive AI proxying, set only CPU requests (for scheduler placement) and omit limits for the proxy and api-server pods. Memory limits are still required.
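To confirm whether a pod is actually being throttled, inspect its cgroup CPU statistics; a growing nr_throttled counter means CFS is kicking in. This assumes cgroup v2 (the path differs under cgroup v1, where the file lives under /sys/fs/cgroup/cpu):

```shell
# nr_throttled and throttled_usec increase whenever the pod hits its CPU limit
kubectl exec -n smartflow deploy/smartflow-proxy -- cat /sys/fs/cgroup/cpu.stat
```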
TLS & cert-manager Configuration

Smartflow uses cert-manager for automatic TLS certificate issuance and renewal via Let's Encrypt. The ClusterIssuer must be configured before enabling TLS on the ingress.

ClusterIssuer setup
# Apply after cert-manager is running
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@your-domain.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
Webhook Timeout — DigitalOcean Requirement

DigitalOcean's cluster upgrade process requires all admission webhook timeouts to be between 1 and 29 seconds. The default cert-manager install sets this to 30s, which blocks node upgrades. The Helm install command above sets webhook.timeoutSeconds=29 to handle this automatically. Verify with:

kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Should output: 29
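If the output shows 30 (for example on a pre-existing cert-manager install), the webhook can be patched in place. Note this is a stopgap — a later helm upgrade without webhook.timeoutSeconds=29 will restore the default, so fix the Helm values as well.

```shell
kubectl patch validatingwebhookconfiguration cert-manager-webhook \
  --type json \
  -p '[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 29}]'
```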
Performance Tuning

Out of the box, a Kubernetes deployment adds 2–5ms of per-request overhead compared to bare metal, primarily from the overlay network, kube-proxy iptables chains, and NGINX ingress buffering. The configuration in this guide reduces that to approximately 1–2ms by applying the following tunings.

Why the Gap Exists
Network overlay hops

Every request traverses: ingress → kube-proxy iptables → ClusterIP → pod overlay. That's 2–4 extra network stack transitions vs. localhost on bare metal. Each adds ~0.5–2ms. Mitigated by: externalTrafficPolicy: Local + node affinity

NGINX ingress buffering

By default NGINX buffers the full request body before forwarding it to the upstream pod. For large AI payloads or streaming responses this introduces measurable latency. Mitigated by: proxy-buffering: off annotation

CFS CPU throttling

When a pod exceeds its CPU limit, Linux's CFS scheduler throttles it for up to 100ms per scheduling period — even for brief bursts during token generation. Mitigated by: removing CPU limits on proxy pods

kube-dns latency

Service-to-service calls (proxy → api-server → compliance) each perform a kube-dns lookup. On bare metal these are localhost calls with zero DNS overhead. Mitigated by: ndots:2 + dnsConfig tuning

Applied Optimizations

1. externalTrafficPolicy: Local

Eliminates the SNAT hop for external traffic. Requires pods to be scheduled on the same node as the ingress — pair with node affinity.

service:
  externalTrafficPolicy: Local

2. DNS ndots tuning

Reduces unnecessary DNS search-path lookups on every service call by lowering ndots from the default of 5 to 2.

dnsConfig:
  options:
    - name: ndots
      value: "2"

3. Streaming-safe ingress annotations

Disabling NGINX request and response buffering is essential for SSE / streaming chat completions — tokens arrive incrementally and must not be held in a buffer.

proxy-buffering: "off"
proxy-request-buffering: "off"
proxy-read-timeout: "300"

4. Node affinity for Smartflow pods

Pin Smartflow pods to dedicated nodes so cert-manager and ingress do not compete for CPU on the same node.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values: [smartflow]
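The role=smartflow label this affinity selects on is not set by default; label the dedicated nodes first (the node name below is a placeholder):

```shell
# Label the nodes reserved for Smartflow workloads
kubectl label nodes worker-pool-xyz role=smartflow
kubectl get nodes -L role   # verify the label column
```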
Expected Latency After Tuning
Metric                   | Bare Metal (baseline) | K8s Default                  | K8s Tuned
p50 overhead (proxy hop) | ~0ms                  | 3–5ms                        | ~1ms
p99 overhead (loaded)    | <2ms                  | 15–100ms (throttle)          | 3–5ms
Streaming first-token    | Immediate             | Buffered until body complete | Immediate (buffering off)
Scale-out                | Manual                | HPA auto-scale               | HPA auto-scale
Zero-downtime deploy     | Service interruption  | Rolling update               | Rolling update
Security Hardening
Secrets management
  • Never bake API keys into Docker images or Helm chart defaults
  • Use Kubernetes Secrets (or a secrets manager like Vault / DO Secrets) for ADMIN_API_KEY, provider keys, and DB passwords
  • Rotate the ADMIN_API_KEY quarterly; it gates the management API
  • Virtual keys (sk-sf-*) are the only credentials clients should ever handle
Network policies
  • The proxy pod must reach the api-server, compliance, and policy pods — all within the cluster
  • Only the ingress controller should route external traffic to the proxy
  • TimescaleDB and Redis should have a NetworkPolicy that denies all ingress except from Smartflow pods
  • The management API (:7778) should not be exposed via the public ingress without an additional auth layer
TLS everywhere
  • All external traffic terminates TLS at the ingress (cert-manager + Let's Encrypt)
  • Enforce HSTS: Strict-Transport-Security: max-age=31536000
  • For inter-pod traffic, Kubernetes service meshes (Istio/Linkerd) can add mTLS — optional but recommended for high-compliance deployments
  • Keep cert-manager at a supported release; webhook timeouts must stay ≤29s on DigitalOcean
Virtual key security model
  • Virtual keys (sk-sf-*) isolate clients from real provider API keys
  • Each key carries an optional budget cap, rate limit, and policy binding
  • The proxy enforces budget from both Authorization: Bearer and x-api-key headers
  • Revoke compromised keys immediately via POST /api/enterprise/vkeys/{id}/revoke
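A revocation call might look like the following; the Authorization header and the vk_12345 key ID are assumptions — check your deployment's management-API auth scheme and key identifiers.

```shell
curl -X POST "https://smartflow.your-domain.com/api/enterprise/vkeys/vk_12345/revoke" \
  -H "Authorization: Bearer $ADMIN_API_KEY"
```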
Operational Runbook
Updating the Binary (Docker Compose)
1. Build on the build server
   ssh -i ~/.ssh/dda_deploy_key root@192.81.214.94
   /root/.cargo/bin/cargo build --release in /opt/smartflow-source/
2. SCP binaries to the Docker repo
   Copy smartflow → Smartflow_docker/smartflow-bin, api_server → api-server-binary, etc.
3. Commit & push the binaries
   git add <binary> && git commit && git push to SRAGroupTX/Smartflow_docker
4. Rebuild the Docker image
   docker buildx build --platform linux/amd64 -t langsmartai/safechat-enterprise:latest -f Dockerfile.runtime --push .
5. Deploy
   Compose: docker compose -f docker-compose.simple.yaml build && docker compose -f docker-compose.simple.yaml up -d
   Kubernetes: helm upgrade smartflow ./helm/smartflow -n smartflow --reuse-values

docker pull won't update running containers
The docker-compose.simple.yaml uses build: directives, not image: references, so docker compose pull alone does nothing. Run docker compose build (as in step 5) or deploy fresh from the cloned repo so Docker rebuilds the image locally from Dockerfile.runtime.
Verifying Binary Version in Running Container
# Check a specific feature string is present in the running binary
# (never assume the running container matches the source code)
docker exec smartflow-proxy strings /usr/local/bin/smartflow | grep "semantic_cache"

# On Kubernetes
kubectl exec -n smartflow deploy/smartflow-proxy -- \
  strings /usr/local/bin/smartflow | grep "semantic_cache"
Cluster Upgrade (DigitalOcean)
Before every cluster upgrade
Run doctl kubernetes cluster lint <cluster-id> (or DigitalOcean's clusterlint UI). Resolve all issues before starting the upgrade — especially webhook timeout warnings, which will block node drains. Webhook timeouts must be ≤29s; the Helm install above ensures this for cert-manager.
# Verify cert-manager webhook timeout before upgrade
kubectl get validatingwebhookconfiguration cert-manager-webhook \
  -o jsonpath='{.webhooks[*].timeoutSeconds}'
# Expected: 29

# Verify all Smartflow pods are healthy
kubectl get pods -n smartflow
kubectl top pods -n smartflow   # check for memory pressure
Scaling
# Manual scale — proxy pods
kubectl scale deploy smartflow-proxy -n smartflow --replicas=3

# HPA — autoscale proxy between 2 and 8 replicas based on CPU
kubectl autoscale deploy smartflow-proxy -n smartflow \
  --cpu-percent=60 --min=2 --max=8
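The kubectl autoscale one-liner is equivalent to this autoscaling/v2 manifest, which is easier to keep in version control. Names and namespace match the commands above; note that utilization-based HPA scales against the pod's CPU requests, so it works even with CPU limits omitted as recommended earlier.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: smartflow-proxy
  namespace: smartflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: smartflow-proxy
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # percent of requested CPU
```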
Running Integration Tests
# Bash quick test
export SMARTFLOW_HOST=https://smartflow.your-domain.com
export VIRTUAL_KEY=sk-sf-your-key
bash smartflow_integration_test.sh

# Full Python suite (requires httpx, openai, anthropic)
python3 smartflow_integration_test.py
Monitoring & Observability
Health Endpoints
Endpoint                               | Port              | Returns
GET /health                            | 7775 (proxy)      | Proxy liveness, provider connectivity
GET /api/health/comprehensive          | 7778 (api-server) | All services, Redis, DB
GET /api/providers/perf                | 7778              | Per-provider latency + error rates
GET /api/metacache/stats               | 7778              | 4-phase cache hit rates, savings
GET /api/mcp/cache/stats               | 7775              | MCP tool call cache stats (requires MCP_GATEWAY_ENABLED=true)
GET /api/compliance/intelligent/health | 7777              | ML compliance engine status
Key Metrics to Watch
Cache efficiency

Monitor /api/metacache/stats. Expect Phase 4 semantic hit rate > 40% for repetitive workloads. If hit rate drops, check SEMANTIC_CACHE_THRESHOLD configuration.

Virtual key spend

Poll GET /api/enterprise/vkeys/{id}/budget for per-key spend tracking. Alert if any key approaches its budget ceiling to avoid unexpected 429 responses to clients.
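A minimal polling sketch for this check, written in Python like the integration suite. The Authorization header and the spend / budget_cap JSON field names are assumptions — adjust them to the actual response shape of your deployment.

```python
import json
from urllib.request import Request, urlopen

ALERT_THRESHOLD = 0.8  # warn at 80% of the budget ceiling


def should_alert(spend: float, budget_cap: float,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True when a key's spend crosses the alert fraction of its cap."""
    if budget_cap <= 0:  # treat uncapped keys as never alerting
        return False
    return spend / budget_cap >= threshold


def check_key_budget(host: str, key_id: str, admin_key: str) -> bool:
    """Poll the budget endpoint for one virtual key and decide whether to alert.

    Auth header and JSON field names are assumptions.
    """
    req = Request(
        f"{host}/api/enterprise/vkeys/{key_id}/budget",
        headers={"Authorization": f"Bearer {admin_key}"},
    )
    with urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return should_alert(data["spend"], data["budget_cap"])


if __name__ == "__main__":
    # Pure-logic check without hitting the network:
    print(should_alert(85.0, 100.0))  # True  — 85% of cap
    print(should_alert(10.0, 100.0))  # False — well under threshold
```

Wire should_alert into whatever pager or Slack hook your team uses; the point is to fire before the proxy starts returning 429s for that key.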

Provider health

Check /api/providers/perf for latency and error rate per provider. Smartflow's intelligent routing will deprioritize degraded providers automatically, but monitoring gives early warning.

Memory headroom

Run kubectl top pods -n smartflow regularly. The compliance pod holds ML models in memory — if it approaches its limit, increase compliance.resources.limits.memory.

Redis connectivity

VectorLite semantic cache, virtual key budgets, and rate limiting all depend on Redis. A Redis outage degrades to no caching but should not hard-fail requests. Check KEYSTORE_REDIS_URL if cache stats return zeros.
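A quick connectivity check against the same Redis instance the services use (requires redis-cli on the host; -u accepts the same connection-string format as KEYSTORE_REDIS_URL):

```shell
redis-cli -u "$KEYSTORE_REDIS_URL" ping   # expect: PONG
```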

Audit log pipeline

The VAS audit log writes to TimescaleDB on every request. Monitor DB disk usage — TimescaleDB's automatic chunk compression keeps this manageable but the volume is proportional to traffic.