Operations Guide
Deploying, configuring, and monitoring CellState in production.
Single Binary Deployment
CellState compiles to a single binary (cellstate-api) that includes the HTTP server, all background jobs, and the LMDB cache layer. The only external dependency is PostgreSQL 18 with pgvector.
cargo build --release -p cellstate-api
# Binary: target/release/cellstate-api
Environment Variables
Required
| Variable | Description | Example |
|---|---|---|
| DATABASE_URL | PostgreSQL connection string | postgres://cellstate:pass@localhost/cellstate |
| CELLSTATE_JWT_SECRET | JWT signing secret (must be set in staging/production) | your-secret-key |
| CELLSTATE_ENVIRONMENT | development, staging, or production | production |
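A pre-flight check can catch a missing variable before the server starts. The sketch below is not part of CellState; check_required_env is a hypothetical helper that treats all three variables in the table above as mandatory and validates CELLSTATE_ENVIRONMENT against the three documented values.

```shell
# Pre-flight sketch (hypothetical helper, not shipped with CellState):
# fail fast if a required variable is missing or malformed.
check_required_env() {
  missing=0
  for name in DATABASE_URL CELLSTATE_JWT_SECRET CELLSTATE_ENVIRONMENT; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  case "${CELLSTATE_ENVIRONMENT:-}" in
    development|staging|production) ;;
    *)
      echo "CELLSTATE_ENVIRONMENT must be development, staging, or production" >&2
      missing=1
      ;;
  esac
  return "$missing"
}
```

One way to use it is to run check_required_env in a wrapper script and launch target/release/cellstate-api only if it succeeds.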
Server
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_HOST | 0.0.0.0 | Bind address |
| CELLSTATE_PORT | 3000 | Bind port |
| RUST_LOG | cellstate_api=info | Log level filter |
Database
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_DB_HOST | localhost | PostgreSQL host |
| CELLSTATE_DB_PORT | 5432 | PostgreSQL port |
| CELLSTATE_DB_NAME | cellstate | Database name |
| CELLSTATE_DB_USER | cellstate | Database user |
| CELLSTATE_DB_MAX_CONNECTIONS | 10 | Connection pool size |
Cache (LMDB)
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_CACHE_PATH | /tmp/cellstate-cache | LMDB storage directory |
| CELLSTATE_CACHE_SIZE_MB | 256 | Maximum cache size |
| CELLSTATE_CACHE_MAX_STALENESS_SECS | 60 | Cache staleness tolerance |
| CELLSTATE_CACHE_TTL_SECS | 3600 | Cache entry TTL |
| CELLSTATE_CACHE_MAX_ENTRIES | 10000 | Maximum cache entries |
Event DAG
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_EVENT_DAG_PATH | /tmp/cellstate-event-dag | LMDB path for event DAG |
Context Assembly
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_CONTEXT_REST_TOKEN_BUDGET | 8000 | Default token budget |
| CELLSTATE_CONTEXT_MAX_NOTES | 10 | Max notes per assembly |
| CELLSTATE_CONTEXT_MAX_ARTIFACTS | 5 | Max artifacts per assembly |
| CELLSTATE_CONTEXT_MAX_TURNS | 20 | Max turns per assembly |
| CELLSTATE_CONTEXT_CACHE_MAX_ENTRIES | 10000 | Context cache size |
| CELLSTATE_CONTEXT_CACHE_TTL_SECS | 300 | Context cache TTL |
CORS & Rate Limiting
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_CORS_ORIGINS | (empty = allow all) | Comma-separated allowed origins |
| CELLSTATE_RATE_LIMIT_ENABLED | true | Enable rate limiting |
| CELLSTATE_RATE_LIMIT_UNAUTHENTICATED | 100 | Requests/min per IP |
| CELLSTATE_RATE_LIMIT_AUTHENTICATED | 1000 | Requests/min per tenant |
| CELLSTATE_RATE_LIMIT_BURST | 10 | Burst capacity |
Idempotency
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_IDEMPOTENCY_REQUIRE_KEY | true | Require idempotency keys on mutations |
| CELLSTATE_IDEMPOTENCY_TTL_SECS | 86400 | Key TTL |
Embedding Providers
| Variable | Description |
|---|---|
| CELLSTATE_OPENAI_API_KEY or OPENAI_API_KEY | OpenAI API key |
| CELLSTATE_EMBEDDING_MODEL | Model name (default: text-embedding-3-small) |
| CELLSTATE_OPENROUTER_API_KEY | OpenRouter API key |
| CELLSTATE_OPENROUTER_EMBEDDING_MODEL | OpenRouter model (e.g., openai/text-embedding-3-small) |
| CELLSTATE_OLLAMA_EMBEDDING_MODEL | Ollama model (default: nomic-embed-text) |
| CELLSTATE_EMBEDDING_ROUTING | first, round_robin, least_latency, random |
Multi-Instance
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_EXPECT_MULTI_INSTANCE | false | Enable multi-instance safety checks |
| CELLSTATE_PG_JOURNAL_POLLER_DISABLED | false | Disable cross-instance cache invalidation |
| CELLSTATE_INSTANCE_ID | (auto) | Unique instance identifier |
Security
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_OAUTH_VAULT_ENCRYPTION_KEYS_JSON | (none) | Required in staging/production. JSON array of encryption keys for the OAuth token vault. Server refuses to start without it in non-development environments. |
| CELLSTATE_WS_CAPACITY | 1000 | WebSocket broadcast channel capacity. Max concurrent connections hard-capped at 10,000. |
| LEMONSQUEEZY_WEBHOOK_SECRET | (none) | Webhook HMAC secret for billing events. Empty strings are rejected. |
Observability
| Variable | Default | Description |
|---|---|---|
| CELLSTATE_METRICS_ENABLED | false | Enable Prometheus metrics on /metrics |
| CELLSTATE_METRICS_AUTH_TOKEN | (none) | Bearer token for /metrics endpoint |
| CELLSTATE_OTLP_ENDPOINT | (none) | OTLP exporter endpoint for traces |
| CELLSTATE_TRACE_SAMPLE_RATE | 0.1 | Trace sampling rate (0.0–1.0) |
| SENTRY_DSN | (none) | Sentry error tracking DSN |
Deployment Platforms
Example configurations live under examples/deploy/ in the CellState source repository:
| Platform | Path | Notes |
|---|---|---|
| Linode | examples/deploy/linode/ | Bare metal + Cloudflare Tunnel, ~$41/mo |
| Fly.io | examples/deploy/fly.io/ | Managed containers with Postgres |
| Railway | examples/deploy/railway/ | One-click deploy |
| Kubernetes | examples/deploy/helm/ | Helm chart with HPA, PDB, ServiceMonitor |
Systemd (Bare Metal)
The Linode example in the CellState source repository includes a hardened systemd unit with:
- Filesystem and kernel hardening (ProtectSystem=strict, syscall filtering)
- Memory limits (MemoryMax=2G)
- OOM score adjustment (prefer killing app over PostgreSQL)
- Automatic restart on failure
sudo cp cellstate.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now cellstate
Background Jobs
The server runs 14 supervised background jobs:
| Job | Purpose | Config Env Prefix |
|---|---|---|
| change_relay | Polls DB changes, broadcasts via WebSocket | CELLSTATE_CHANGE_RELAY_* |
| saga_cleanup | Cleans stuck delegations/handoffs | CELLSTATE_SAGA_CLEANUP_* |
| summarization_executor | Processes pending summarization requests | CELLSTATE_SUMMARIZATION_* |
| ttl_cleanup | Enforces memory TTL policies | CELLSTATE_TTL_CLEANUP_* |
| scope_turn_reaper | Closes turns when scopes close | CELLSTATE_SCOPE_TURN_REAPER_* |
| apology_miner | Detects context drift language patterns | CELLSTATE_APOLOGY_MINER_* |
| memory_decay | Warmth decay and cold memory archival | CELLSTATE_MEMORY_DECAY_* |
| hash_chain_audit | Event DAG integrity verification | CELLSTATE_HASH_CHAIN_AUDIT_* |
| agent_deliberation | BDI engine tick loop | CELLSTATE_DELIBERATION_* |
| drift_detection | Agent pair alignment checking | CELLSTATE_DRIFT_* |
| killed_agents_sync | Cross-instance killed agent propagation | CELLSTATE_KILLED_AGENTS_* |
| agent_scheduler | Cron-based agent trigger dispatch | CELLSTATE_SCHEDULER_* |
| oauth_refresh | Token refresh with lock + backoff | CELLSTATE_OAUTH_* |
| metrics_updater | Periodic entity gauge updates | (15s interval, not configurable) |
All jobs run under a supervisor that restarts them automatically after a panic. Job health is reported by /health/ready when the request includes credentials.
Monitoring
Prometheus Metrics
Enable with CELLSTATE_METRICS_ENABLED=true. Metrics are exposed at GET /metrics.
Optionally protect with CELLSTATE_METRICS_AUTH_TOKEN=your-token (requires Authorization: Bearer your-token).
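To check the endpoint by hand, a scrape with curl might look like the sketch below. CELLSTATE_URL is a hypothetical client-side variable introduced here for illustration (the server does not read it), and the default assumes a local instance on port 3000.

```shell
# Sketch: fetch /metrics, sending the bearer token only when one is configured.
# CELLSTATE_URL is a hypothetical client-side variable, not a server setting.
metrics_url() {
  printf '%s/metrics' "${CELLSTATE_URL:-http://127.0.0.1:3000}"
}

scrape_metrics() {
  if [ -n "${CELLSTATE_METRICS_AUTH_TOKEN:-}" ]; then
    curl -fsS -H "Authorization: Bearer ${CELLSTATE_METRICS_AUTH_TOKEN}" "$(metrics_url)"
  else
    curl -fsS "$(metrics_url)"
  fi
}
```

With -f, curl exits non-zero on an HTTP error status, so a missing or wrong token shows up as a failed command rather than an error page on stdout.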
OTLP Traces
Set CELLSTATE_OTLP_ENDPOINT to export traces via OpenTelemetry Protocol (e.g., to Jaeger, Grafana Tempo, or Honeycomb).
Health Endpoints
| Endpoint | Auth | Purpose |
|---|---|---|
| GET /health/ping | None | Liveness probe |
| GET /health/live | None | Process alive check |
| GET /health/ready | Optional | Readiness probe (database + job health with credentials) |
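Deploy scripts often need to block until the server is ready. The sketch below polls the readiness endpoint with a generic retry helper; wait_until and wait_for_cellstate are hypothetical names, and it assumes /health/ready returns a non-2xx status while the server is not ready, which this guide does not state explicitly.

```shell
# Generic retry helper: run a command until it succeeds or attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECS COMMAND [ARGS...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Poll a local instance (default port 3000) for up to 30 attempts, 2 s apart.
# Assumption: /health/ready answers with a non-2xx status until ready.
wait_for_cellstate() {
  wait_until 30 2 curl -fsS "http://127.0.0.1:${CELLSTATE_PORT:-3000}/health/ready"
}
```

Keeping the retry loop separate from the curl call makes it reusable for other probes, such as /health/live.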
PostgreSQL Setup
CellState requires PostgreSQL 18 with the pgvector extension.
# Install pgvector (Ubuntu/Debian)
sudo apt install postgresql-18-pgvector
# Create database and user
sudo -u postgres createuser cellstate
sudo -u postgres createdb -O cellstate cellstate
sudo -u postgres psql -d cellstate -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Apply schema and migrations
make db-migrate
Backups
The Linode deployment example includes a WAL backup script at examples/deploy/linode/pg-backup.sh for streaming backups to Object Storage.
Troubleshooting
Server won’t start: “PCP config load failed”
Run make db-migrate to initialize the schema and seed the config table.
“pgvector extension not found”
Install postgresql-18-pgvector and run CREATE EXTENSION vector in the cellstate database.
Cache errors on startup
Ensure the CELLSTATE_CACHE_PATH directory exists and is writable. Default is /tmp/cellstate-cache.
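A quick way to rule out directory problems is to create the path and test write access as the user that runs the server. The helper name ensure_lmdb_dir below is hypothetical; the same check would apply to CELLSTATE_EVENT_DAG_PATH.

```shell
# Sketch: make sure an LMDB directory exists and is writable by the current user.
ensure_lmdb_dir() {
  dir=$1
  mkdir -p "$dir" || return 1
  if [ ! -w "$dir" ]; then
    echo "not writable: $dir" >&2
    return 1
  fi
}
```

For the cache, this could be called as ensure_lmdb_dir "${CELLSTATE_CACHE_PATH:-/tmp/cellstate-cache}" before starting the binary.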
Multi-instance cache drift
Set CELLSTATE_EXPECT_MULTI_INSTANCE=true and ensure CELLSTATE_PG_JOURNAL_POLLER_DISABLED is not set. The PG change journal poller handles cross-instance cache invalidation and keeps staleness to roughly 2 seconds.
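As a sketch, the environment for each replica could be prepared as follows. Setting CELLSTATE_INSTANCE_ID is optional because it is generated automatically by default; using the machine's node name is only an assumption about what makes a convenient unique ID.

```shell
# Multi-instance settings for each replica (sketch).
export CELLSTATE_EXPECT_MULTI_INSTANCE=true

# Leave the journal poller enabled so cross-instance invalidation keeps working.
unset CELLSTATE_PG_JOURNAL_POLLER_DISABLED

# Optional: the default is auto-generated. The node name is an illustrative
# choice and must be unique per instance for this to make sense.
export CELLSTATE_INSTANCE_ID="$(uname -n)"
```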