Operations Guide

Deploying, configuring, and monitoring CellState in production.

Single Binary Deployment

CellState compiles to a single binary (cellstate-api) that includes the HTTP server, all background jobs, and the LMDB cache layer. The only external dependency is PostgreSQL 18 with pgvector.

cargo build --release -p cellstate-api
# Binary: target/release/cellstate-api

Environment Variables

Required

| Variable | Description | Example |
| --- | --- | --- |
| DATABASE_URL | PostgreSQL connection string | postgres://cellstate:pass@localhost/cellstate |
| CELLSTATE_JWT_SECRET | JWT signing secret (must be set in staging/production) | your-secret-key |
| CELLSTATE_ENVIRONMENT | development, staging, or production | production |
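
The required variables can be checked before the server is started. The function below is a minimal sketch of such a preflight check; cellstate_preflight is a hypothetical helper written for this guide, not something shipped with CellState, and it only encodes the rules stated in the table above.

```shell
# Hypothetical preflight helper; not part of CellState itself.
# Returns 0 when the required variables look usable, 1 otherwise.
cellstate_preflight() {
    if [ -z "${DATABASE_URL:-}" ]; then
        echo "DATABASE_URL is not set" >&2
        return 1
    fi
    case "${CELLSTATE_ENVIRONMENT:-}" in
        development)
            # The JWT secret is only mandatory outside development.
            ;;
        staging|production)
            if [ -z "${CELLSTATE_JWT_SECRET:-}" ]; then
                echo "CELLSTATE_JWT_SECRET must be set in ${CELLSTATE_ENVIRONMENT}" >&2
                return 1
            fi
            ;;
        *)
            echo "CELLSTATE_ENVIRONMENT must be development, staging, or production" >&2
            return 1
            ;;
    esac
    return 0
}
```

A wrapper script could call cellstate_preflight and start target/release/cellstate-api only when it returns 0.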

Server

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_HOST | 0.0.0.0 | Bind address |
| CELLSTATE_PORT | 3000 | Bind port |
| RUST_LOG | cellstate_api=info | Log level filter |

Database

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_DB_HOST | localhost | PostgreSQL host |
| CELLSTATE_DB_PORT | 5432 | PostgreSQL port |
| CELLSTATE_DB_NAME | cellstate | Database name |
| CELLSTATE_DB_USER | cellstate | Database user |
| CELLSTATE_DB_MAX_CONNECTIONS | 10 | Connection pool size |

Cache (LMDB)

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_CACHE_PATH | /tmp/cellstate-cache | LMDB storage directory |
| CELLSTATE_CACHE_SIZE_MB | 256 | Maximum cache size |
| CELLSTATE_CACHE_MAX_STALENESS_SECS | 60 | Cache staleness tolerance |
| CELLSTATE_CACHE_TTL_SECS | 3600 | Cache entry TTL |
| CELLSTATE_CACHE_MAX_ENTRIES | 10000 | Maximum cache entries |

Event DAG

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_EVENT_DAG_PATH | /tmp/cellstate-event-dag | LMDB path for event DAG |

Context Assembly

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_CONTEXT_REST_TOKEN_BUDGET | 8000 | Default token budget |
| CELLSTATE_CONTEXT_MAX_NOTES | 10 | Max notes per assembly |
| CELLSTATE_CONTEXT_MAX_ARTIFACTS | 5 | Max artifacts per assembly |
| CELLSTATE_CONTEXT_MAX_TURNS | 20 | Max turns per assembly |
| CELLSTATE_CONTEXT_CACHE_MAX_ENTRIES | 10000 | Context cache size |
| CELLSTATE_CONTEXT_CACHE_TTL_SECS | 300 | Context cache TTL |

CORS & Rate Limiting

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_CORS_ORIGINS | (empty = allow all) | Comma-separated allowed origins |
| CELLSTATE_RATE_LIMIT_ENABLED | true | Enable rate limiting |
| CELLSTATE_RATE_LIMIT_UNAUTHENTICATED | 100 | Requests/min per IP |
| CELLSTATE_RATE_LIMIT_AUTHENTICATED | 1000 | Requests/min per tenant |
| CELLSTATE_RATE_LIMIT_BURST | 10 | Burst capacity |

Idempotency

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_IDEMPOTENCY_REQUIRE_KEY | true | Require idempotency keys on mutations |
| CELLSTATE_IDEMPOTENCY_TTL_SECS | 86400 | Key TTL |

Embedding Providers

| Variable | Description |
| --- | --- |
| CELLSTATE_OPENAI_API_KEY or OPENAI_API_KEY | OpenAI API key |
| CELLSTATE_EMBEDDING_MODEL | Model name (default: text-embedding-3-small) |
| CELLSTATE_OPENROUTER_API_KEY | OpenRouter API key |
| CELLSTATE_OPENROUTER_EMBEDDING_MODEL | OpenRouter model (e.g., openai/text-embedding-3-small) |
| CELLSTATE_OLLAMA_EMBEDDING_MODEL | Ollama model (default: nomic-embed-text) |
| CELLSTATE_EMBEDDING_ROUTING | first, round_robin, least_latency, random |

Multi-Instance

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_EXPECT_MULTI_INSTANCE | false | Enable multi-instance safety checks |
| CELLSTATE_PG_JOURNAL_POLLER_DISABLED | false | Disable cross-instance cache invalidation |
| CELLSTATE_INSTANCE_ID | (auto) | Unique instance identifier |

Security

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_OAUTH_VAULT_ENCRYPTION_KEYS_JSON | (none) | Required in staging/production. JSON array of encryption keys for the OAuth token vault. Server refuses to start without it in non-development environments. |
| CELLSTATE_WS_CAPACITY | 1000 | WebSocket broadcast channel capacity. Max concurrent connections hard-capped at 10,000. |
| LEMONSQUEEZY_WEBHOOK_SECRET | (none) | Webhook HMAC secret for billing events. Empty strings are rejected. |

Observability

| Variable | Default | Description |
| --- | --- | --- |
| CELLSTATE_METRICS_ENABLED | false | Enable Prometheus metrics on /metrics |
| CELLSTATE_METRICS_AUTH_TOKEN | (none) | Bearer token for /metrics endpoint |
| CELLSTATE_OTLP_ENDPOINT | (none) | OTLP exporter endpoint for traces |
| CELLSTATE_TRACE_SAMPLE_RATE | 0.1 | Trace sampling rate (0.0–1.0) |
| SENTRY_DSN | (none) | Sentry error tracking DSN |

Deployment Platforms

Example configurations are provided under examples/deploy/ in the CellState source repository:

| Platform | Path | Notes |
| --- | --- | --- |
| Linode | examples/deploy/linode/ | Bare metal + Cloudflare Tunnel, ~$41/mo |
| Fly.io | examples/deploy/fly.io/ | Managed containers with Postgres |
| Railway | examples/deploy/railway/ | One-click deploy |
| Kubernetes | examples/deploy/helm/ | Helm chart with HPA, PDB, ServiceMonitor |

Systemd (Bare Metal)

The Linode example in the CellState source repository includes a hardened systemd unit with:

  • Filesystem and kernel hardening (ProtectSystem=strict, syscall filtering)
  • Memory limits (MemoryMax=2G)
  • OOM score adjustment (prefer killing app over PostgreSQL)
  • Automatic restart on failure

sudo cp cellstate.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now cellstate

Background Jobs

The server runs 14 supervised background jobs:

| Job | Purpose | Config Env Prefix |
| --- | --- | --- |
| change_relay | Polls DB changes, broadcasts via WebSocket | CELLSTATE_CHANGE_RELAY_* |
| saga_cleanup | Cleans stuck delegations/handoffs | CELLSTATE_SAGA_CLEANUP_* |
| summarization_executor | Processes pending summarization requests | CELLSTATE_SUMMARIZATION_* |
| ttl_cleanup | Enforces memory TTL policies | CELLSTATE_TTL_CLEANUP_* |
| scope_turn_reaper | Closes turns when scopes close | CELLSTATE_SCOPE_TURN_REAPER_* |
| apology_miner | Detects context drift language patterns | CELLSTATE_APOLOGY_MINER_* |
| memory_decay | Warmth decay and cold memory archival | CELLSTATE_MEMORY_DECAY_* |
| hash_chain_audit | Event DAG integrity verification | CELLSTATE_HASH_CHAIN_AUDIT_* |
| agent_deliberation | BDI engine tick loop | CELLSTATE_DELIBERATION_* |
| drift_detection | Agent pair alignment checking | CELLSTATE_DRIFT_* |
| killed_agents_sync | Cross-instance killed agent propagation | CELLSTATE_KILLED_AGENTS_* |
| agent_scheduler | Cron-based agent trigger dispatch | CELLSTATE_SCHEDULER_* |
| oauth_refresh | Token refresh with lock + backoff | CELLSTATE_OAUTH_* |
| metrics_updater | Periodic entity gauge updates | (15s interval, not configurable) |

All jobs run under a supervisor that restarts them automatically after a panic. Job health is reported by /health/ready when the request includes credentials.

Monitoring

Prometheus Metrics

Enable with CELLSTATE_METRICS_ENABLED=true. Metrics are exposed at GET /metrics.

Optionally protect with CELLSTATE_METRICS_AUTH_TOKEN=your-token (requires Authorization: Bearer your-token).
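
To confirm that the endpoint responds, it can be fetched by hand. The sketch below wraps curl in a hypothetical fetch_metrics function (the name is illustrative, not a CellState command) and sends the Authorization header only when a token is supplied. The base URL assumes the default CELLSTATE_PORT of 3000.

```shell
# Hypothetical helper: fetch the Prometheus metrics page.
# $1 = base URL (e.g. http://localhost:3000), $2 = optional bearer token.
fetch_metrics() {
    base_url=$1
    token=${2:-}
    if [ -n "$token" ]; then
        # -f: fail on HTTP errors, -sS: quiet but still print errors
        curl -fsS -H "Authorization: Bearer $token" "$base_url/metrics"
    else
        curl -fsS "$base_url/metrics"
    fi
}
```

For example, fetch_metrics http://localhost:3000 your-token should print the metrics text when the token matches CELLSTATE_METRICS_AUTH_TOKEN.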

OTLP Traces

Set CELLSTATE_OTLP_ENDPOINT to export traces via OpenTelemetry Protocol (e.g., to Jaeger, Grafana Tempo, or Honeycomb).

Health Endpoints

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| GET /health/ping | None | Liveness probe |
| GET /health/live | None | Process alive check |
| GET /health/ready | Optional | Readiness probe (database + job health with credentials) |
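
A deploy script can wait on the readiness probe before sending traffic to a new instance. The following is a minimal sketch, assuming that a server that is not yet ready answers /health/ready with a non-2xx status (this guide does not state the status code, so verify it against your deployment). wait_until_ready is a hypothetical helper name.

```shell
# Hypothetical helper: poll /health/ready until it answers successfully.
# $1 = base URL, $2 = max attempts (default 30), one second apart.
# Assumption: a not-ready server returns a non-2xx status, which makes
# `curl -f` exit non-zero.
wait_until_ready() {
    base_url=$1
    attempts=${2:-30}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl -fsS -o /dev/null "$base_url/health/ready"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}
```

Calling wait_until_ready http://localhost:3000 60 gives the server up to about a minute to finish starting.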

PostgreSQL Setup

CellState requires PostgreSQL 18 with the pgvector extension.

# Install pgvector (Ubuntu/Debian)
sudo apt install postgresql-18-pgvector

# Create database and user
sudo -u postgres createuser cellstate
sudo -u postgres createdb -O cellstate cellstate
sudo -u postgres psql -d cellstate -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Apply schema and migrations
make db-migrate
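
After setup, it can be useful to confirm that the extension is actually installed in the target database. The sketch below queries the pg_extension catalog through psql; pgvector_version is a hypothetical helper, and it assumes psql is run as a role that is allowed to connect to the database.

```shell
# Hypothetical helper: print the installed pgvector version, or fail if the
# extension is not present in the given database (default: cellstate).
pgvector_version() {
    db=${1:-cellstate}
    # -A: unaligned output, -t: rows only (no header or footer)
    version=$(psql -d "$db" -At -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';") || return 1
    [ -n "$version" ] || return 1
    printf '%s\n' "$version"
}
```

An empty result means CREATE EXTENSION has not been run in that database yet.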

Backups

The Linode deployment example includes a WAL backup script at examples/deploy/linode/pg-backup.sh for streaming backups to Object Storage.

Troubleshooting

Server won’t start: “PCP config load failed”

Run make db-migrate to initialize the schema and seed the config table.

“pgvector extension not found”

Install postgresql-18-pgvector and run CREATE EXTENSION vector in the cellstate database.

Cache errors on startup

Ensure the CELLSTATE_CACHE_PATH directory exists and is writable by the user running the server. The default is /tmp/cellstate-cache.
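
The directory check can be scripted. This is a minimal sketch with a hypothetical ensure_cache_dir helper; run it as the same user that runs the server so that the writability test is meaningful.

```shell
# Hypothetical helper: create the LMDB directory if needed and confirm
# the current user can write to it.
# $1 = directory (defaults to CELLSTATE_CACHE_PATH, then /tmp/cellstate-cache)
ensure_cache_dir() {
    dir=${1:-${CELLSTATE_CACHE_PATH:-/tmp/cellstate-cache}}
    mkdir -p "$dir" || return 1
    if [ ! -w "$dir" ]; then
        echo "$dir is not writable by $(id -un)" >&2
        return 1
    fi
}
```

The same check applies to CELLSTATE_EVENT_DAG_PATH by passing that path as the argument.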

Multi-instance cache drift

Set CELLSTATE_EXPECT_MULTI_INSTANCE=true and ensure CELLSTATE_PG_JOURNAL_POLLER_DISABLED is not set. The PostgreSQL change journal poller keeps cross-instance cache staleness to roughly 2 seconds.
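
These two settings can be verified in each instance's environment before rollout. The sketch below uses a hypothetical check_multi_instance_env function; it treats any value other than false for the poller variable as a problem, which is a conservative assumption since this guide does not document how the server parses that value.

```shell
# Hypothetical check: succeed only when the environment matches the
# multi-instance settings described above.
check_multi_instance_env() {
    if [ "${CELLSTATE_EXPECT_MULTI_INSTANCE:-false}" != "true" ]; then
        echo "CELLSTATE_EXPECT_MULTI_INSTANCE is not true" >&2
        return 1
    fi
    # Assumption: anything other than unset/false may disable the poller.
    case "${CELLSTATE_PG_JOURNAL_POLLER_DISABLED:-false}" in
        false) ;;
        *)
            echo "CELLSTATE_PG_JOURNAL_POLLER_DISABLED should not be set" >&2
            return 1
            ;;
    esac
}
```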