Operate - Comerix Flow

Operator runbook for self-hosted Comerix Flow deployments. Covers the async queue, failed message recovery, webhook replay protection, the service-to-service APIs (engine, training crawler, MCP), the edge document store behind the public read endpoints, and token-usage accounting.

Async queue

FlowEngine routes four message types through the execution_events Doctrine transport:

StartExecution — fires when a trigger matches a published flow.
ResumeExecution — fires when a paused execution has new input.
RunNextStep — the worker loop; re-dispatched after each node.
FireWait — fires when a wait node’s timer elapses.

Each message acquires a per-execution Redis lease before running its step, so two workers can never race on the same execution. The lease releases when the step completes or its TTL expires (default 60 s).

Retry strategy

Configured in src/FlowEngine/Resources/config/packages/messenger.yaml:

Knob	Value	Meaning
`max_retries`	3	Total of 4 attempts (initial + 3 retries).
`delay`	1000 ms	First retry waits 1 s.
`multiplier`	2	Exponential backoff: 1 s → 2 s → 4 s.
`max_delay`	30000 ms	Cap so long-running outages don’t push a single retry past 30 s.

After the last retry exhausts, the message is moved to the global failed transport (failure_transport: failed in config/packages/messenger.yaml). It is NOT lost — operators replay it manually after investigating the root cause.

Failed message workflow

List everything in the dead-letter queue:

bin/console messenger:failed:show

Inspect a specific message (full payload + last failure):

bin/console messenger:failed:show <id> --max-attempts=0

Retry one message (it re-enters the execution_events transport and gets the standard retry policy again):

bin/console messenger:failed:retry <id>

Retry everything in the failed queue (use sparingly — investigate first):

bin/console messenger:failed:retry --force

Drop a message that will never succeed (e.g. flow was deleted, payload is permanently malformed):

bin/console messenger:failed:remove <id> --force

When investigating failures

messenger:failed:show <id> to read the exception trace.
Cross-reference the execution_id in the message payload against fs_executions + fs_execution_events for the full per-step history.
Fix the underlying issue (config, secret rotation, network), then messenger:failed:retry <id>. The execution resumes from the last committed snapshot.

Webhook replay protection

HmacVerifier rejects any signed payload it has already seen, in addition to the standard timestamp-drift check. Two cache backends ship with the platform:

Binding	Adapter	Where it’s used
Production (default)	`RedisReplayCache` — atomic `SET key 1 EX <ttl> NX`	Every non-test env; cluster-safe
`when@test`	`InMemoryReplayCache` — process-local LRU	Unit + integration tests

The Redis pool is flow_engine.replay_cache (defined in src/FlowEngine/Resources/config/packages/cache.yaml); the binding lives in src/FlowEngine/Resources/config/services.yaml.

TTL semantics

HmacVerifier::REPLAY_CACHE_TTL_SECONDS = 600 (2× the 300 s timestamp drift window). A signature stays blocked for ten minutes after first observation, so an attacker can’t sneak a replayed signature back through by skewing their clock a few minutes forward. The cache is keyed by flow_engine.replay.<sha256(signature)> — sharing a single Redis instance with the application cache pool is safe.

What to watch

A spike in Signature replay detected rejections is normal during retry storms but unusual otherwise — investigate if it climbs in steady state.
Webhook secrets are stored encrypted; rotation is in bin/console app:config:rotate-key (see App\Config\Command\ConfigRotateKeyCommand).
WooCommerce and Shopify signature schemes don’t carry timestamps, so the replay cache is their ONLY defense against replays. Keep Redis healthy.

Engine & internal APIs

Three server-to-server surfaces share one deploy-wide bearer credential. They sit on the stateless engine_api firewall (config/packages/security.yaml), which matches:

^/(api/v1/(engine|insights|training)/|metrics(/|$))

A caller authenticates by sending the shared secret as a bearer token:

Authorization: Bearer <ENGINE_API_TOKEN>

Surface	Path prefix	What it does
Engine API	`/api/v1/engine/`	Event ingest, intents, chat triggers, execution resume — see Engine API.
Conversions ingest	`/api/v1/insights/`	Server-side conversion ingest — see Engine API → Conversions.
Training crawler	`/api/v1/training/`	The knowledge-base crawler queue/claim/progress/result (below).
Prometheus metrics	`/metrics`	Cross-tenant operational aggregates; gated by the same bearer rather than left public. Scrapers must send the token too.

ENGINE_API_TOKEN

The token is read from the ENGINE_API_TOKEN environment variable and validated by Comerix\FlowEngine\Security\EngineApiTokenHandler, which compares it with hash_equals(). On a match the caller is identified as the synthetic in-memory engine-service account (granted ROLE_ENGINE_API by the engine_service_provider security provider) — there is no per-user session.

It is a deploy-wide root credentialIn V1 the token is not bound to a tenant: for the mutating calls the tenant_id rides in the request body and is trusted. Treat the token like a database password — store it as a secret, never in source control or client code, and rotate it by redeploying with a new value.If the variable is unset or empty the handler rejects every request (an empty token would otherwise match an empty header and let everyone in), so the three surfaces fail closed until you configure it.

What to watch

Any caller presenting the shared token clears the firewall, so leakage exposes all three surfaces at once. Audit access at the edge (reverse proxy / WAF) and watch for unexpected source IPs.
A 401/403 storm on /api/v1/(engine|insights|training)/ usually means a rotated token was not propagated to every caller (crawler, conversion poster, scrapers).

Knowledge-base crawler

An external website crawler keeps each tenant’s Training Base sources fresh. It talks to the platform over four endpoints on the shared engine_api firewall (Comerix\Training\Controller\Api\TrainingCrawlerApiController), pulling work, claiming a source, streaming page counts, and reporting the outcome:

Method	Route	Purpose
`GET`	`/api/v1/training/queue`	List sources due to be crawled (cross-tenant). `?limit=` caps the page (default 50, hard max 200).
`POST`	`/api/v1/training/sources/{id}/claim`	Move a due source into the `crawling` state.
`POST`	`/api/v1/training/sources/{id}/progress`	Record live `pages_indexed` / `pages_total`.
`POST`	`/api/v1/training/sources/{id}/result`	Finalise to `ready` or `failed`.

Every mutating call carries the target tenant_id (a UUID) in its JSON body — the shared token is not tenant-bound, so the body selects the workspace.

Cross-tenant queue

GET /api/v1/training/queue sweeps every active tenant in turn (CrawlQueueReader): for each tenant it pins the tenant context so the due-source query runs inside row-level security, accumulates matches, and stops as soon as the requested limit is filled. The tenant context is cleared after each tenant and again at the end of the request, so a pooled connection is never left pinned to a stale tenant. A source is “due” when it is freshly scheduled, when a recurring source has reached its next_crawl_at, or when an in-flight crawl has gone stale.

Claim lease (stale recovery)

POST …/claim is the concurrency gate for a crawler fleet. It runs a single guarded UPDATE (TrainingSourceRepository::claimForCrawl) that flips a source into crawling and stamps updated_at = now() only when it is in a claimable state — so two workers can never crawl the same source at once. The claim is a lease, not a lock: a source that has sat in crawling longer than STALE_CRAWLING_MINUTES (30 minutes) is treated as abandoned (the crawler that held it died) and offered for crawl again. There is no explicit release — progress and result keep updated_at moving, so a live crawl never goes stale; a dead one self-recovers after 30 minutes.

Idempotent result reportingA result call that reports the state a source is already in is treated as a no-op, so a report retried after a network blip is safe. On ready the next crawl time is recomputed from the source’s schedule (cleared for manual sources); on failed the error is stored and no re-crawl is scheduled.

Audit trail

Because the surface uses the shared deploy-wide token (with no per-user identity), each mutating call is written to the audit log under the action training.api.call (entity type training_source), recording the endpoint key and target source so use of the shared token is detectable after the fact. The queue read is not audited.

MCP gateway

The Model Context Protocol gateway exposes platform tools (analytics, flows, connections) to AI agents over one JSON-RPC 2.0 endpoint, POST /mcp (Comerix\Mcp\Controller\McpController). It is not on the shared engine token — it has its own stateless firewall, mcp (^/mcp), whose sole credential is a scoped personal access token:

Authorization: Bearer cfp_…

The firewall is kept separate from the interactive main firewall so it never inherits the session, form-login or CSRF exemptions, and it runs the Comerix\Identity\Security\ActiveUserChecker user checker — a deactivated user’s token is refused at authentication. The ^/mcp access-control rule only requires ROLE_USER; real authorization happens per call inside the gateway.

Double-gated authorization

Every tools/call and resource read passes through Comerix\Mcp\Service\Authorization\McpAuthorizer, which enforces two independent layers (tenant isolation is enforced separately by Postgres row-level security):

ACL — the caller’s role must grant the tool’s required ACL resource (the same check the admin UI uses).
Scope — the presented token’s scopes must imply the tool’s required scope.

A denial distinguishes the two in the error data (resource vs required_scope). Tool listings are filtered the same way, so a token never sees a tool it cannot call. Scopes fail closed — an unrecognised scope grants nothing.

Scope	Implies
`mcp`	Root — every other scope below.
`mcp:analytics`	`mcp:analytics:read`, `mcp:analytics:write`
`mcp:connections`	`mcp:connections:read`, `mcp:connections:manage`
`mcp:flows`	`mcp:flows:read`
`mcp:history`	`mcp:history:read`

The tenant a call acts on is taken from the authenticated user’s workspace (McpInvocationContextFactory), never from the request — there is no body tenant_id on this surface.

Audit trail

Every tool call writes one tenant-scoped audit entry under the action mcp.tool.call (entity type mcp_tool), recording the correlation request_id, the required scope, the argument keys only (never their values, so secrets passed to a tool are never persisted) and whether the tool reported a business error.

Idempotency-Key replay for write tools

A caller may send an Idempotency-Key header. For a write tool — one whose annotations carry a destructiveHint or idempotentHint — the first successful result is cached for 24 hours and a retry with the same key replays that result instead of repeating the side effect (McpIdempotencyStore). The cache key is bound to tenant + calling user + tool + key + a canonical fingerprint of the arguments, so a reused key cannot replay another user’s result or a result computed for different arguments. The header is ignored for read-only tools, and errored calls are never cached (so a retry re-attempts the work).

Edge document store

The anonymous public reads — GET /api/public/v1/chat/quick-questions and GET /api/public/v1/config/{section}, documented in the Public read API — are backed by published documents. Whenever the underlying data changes, the platform recomputes the affected JSON document and writes it into a key-value backend, so the hot read path is a single key lookup instead of a set of per-request database queries. When no document is stored (or no backend is configured) the response is computed live with identical output; the X-Edge-Source response header reports which path served it (store or live), and the body’s publishedAt is null for live responses.

EDGE_STORE_DSN

The backend is selected by the EDGE_STORE_DSN environment variable. Schemes:

Scheme	Example	Behavior
`null://`	`null://`	Default. Persists nothing and never serves a read — every request computes live. The system behaves exactly as if no store existed.
`redis://` (or `rediss://` for TLS)	`redis://localhost:6379/2?prefix=edge:&ttl=86400`	Recommended starting backend. Options: `prefix` — storage-key prefix (default `edge:`); `ttl` — per-document expiry in seconds (unset = no expiry). Must resolve to a single-node Redis client.
`cloudflare-kv://`	`cloudflare-kv://<API_TOKEN>@<ACCOUNT_ID>/<NAMESPACE_ID>`	Write-through to Cloudflare Workers KV; origin reads stay live — the stored values are meant to be read by your edge workers, not back by the origin. Percent-encode reserved characters in the token.
`dynamodb://`	`dynamodb://default/<TABLE>?region=eu-central-1` (or `dynamodb://KEY:SECRET@default/<TABLE>?region=…`)	One item per document (`pk = "{namespace}#{key}"`). Speaks the plain DynamoDB HTTP API — no PHP DAX client exists, so reads always hit the base table; size read capacity accordingly, or front the reads with HTTP caching. An `endpoint` option overrides the API URL (e.g. DynamoDB Local).

The store holds two document namespaces: webchat (one {tenantUuid}/quick-questions document per tenant) and config (one {tenantUuid}/{section} document per tenant and publicly readable section).

When documents (re)publish

Republishing is automatic and event-driven — each trigger enqueues one message per affected document on the edge_store Doctrine transport, so a worker must consume it alongside the other queues:

bin/console messenger:consume edge_store

Trigger	What republishes
An admin saves configuration	The affected tenant’s `config` section document(s) and its quick-questions document.
A flow is published or withdrawn	The tenant’s quick-questions document — which prompts carry a triggerable `intentName` changes with the published set.

Failed republish messages land in the same global failed transport as engine messages — inspect and replay them with the failed message workflow.

Manual rebuild

bin/console edge-store:republish [--namespace=<ns>] [--key=<key>]

Without options it queues a rebuild of every document of every producer; --namespace= (webchat or config) limits it to one namespace, and --key= (requires --namespace) to one document. The command only enqueues — the worker does the writes. Run a full republish after configuring a backend for the first time, after a data restore, or after flushing the store.

What to watch

Persistent X-Edge-Source: live on production responses means documents aren’t being published — check the edge_store worker and the backend health.
The store is a cache of published state, never the source of truth. It is safe to flush; reads fall back to live until the next republish (at the cost of per-request computation).

PII redaction in the event log

fs_execution_events retains payloads for 90 days. To keep customer data out of an operator-readable log, App\FlowEngine\Service\Pii\PiiRedactor rewrites each event payload before it is serialised. Strings inside the payload are scanned for known PII shapes and replaced with [REDACTED-<KIND>] markers; non-string leaves (ints, bools, nulls) and the surrounding array shape pass through untouched.

What gets redacted

Kind	Matches
`EMAIL`	RFC 5322-shaped addresses (`[email protected]`)
`PHONE`	E.164 + common formats with separators, 8–15 digits
`AUTH`	`Bearer <token>` strings
`TOKEN`	Provider prefixes (`sk_`, `pk_`, `api_`, `key_`, `tok_`, `secret_`, `access_`, `auth_`) followed by ≥8-char body — covers Stripe, OpenAI, internal keys
`JWT`	Three base64url segments separated by dots, each ≥10 chars
`AWSKEY`	`AKIA[A-Z0-9]{16}` access-key ids

The ruleset is intentionally aggressive — over-redaction (a UUID incorrectly tagged) is cheap; missed PII is not.

Toggling

The redactor reads flow_engine.pii_redaction.enabled (defined in src/FlowEngine/Resources/config/services.yaml). Override per deploy in the deploy’s container config or by binding the parameter to false in a local services.yaml. Recommended only for single-tenant dev environments that accept the GDPR / retention tradeoff.

Failure mode to watch for

If a tenant reports that legitimate node output looks redacted, check whether their data matches one of the patterns above (an internal id with a sk_ prefix is the typical false positive). Add the exception by narrowing the regex in PiiRedactor::PATTERNS and adjust the unit tests in tests/Unit/FlowEngine/Service/Pii/.

Insights (telemetry, goals & analytics)

The admin-facing reporting surface — the report builder, datasets, exports, pinned dashboards and the scheduling UI — is documented in Analytics Hub. This section covers the moving parts an operator runs: the ingest worker, the three Insights cron jobs, and the CLI. Client telemetry posted to /api/public/v1/chat/events[/batch] is written to ins_telemetry_events on the request, then goal matching is queued on the insights Doctrine transport (so widget latency is independent of goal complexity). Run a worker for it alongside the engine queue:

bin/console messenger:consume insights execution_events

Exhausted matching messages land in the same global failed transport as engine messages — inspect/replay them with the workflow above.

Analytics refresh

Materialized views back the dashboards — ins_goal_stats_daily, ins_conversation_stats_daily and ins_token_usage_daily (see Token usage & cost). RefreshInsightsCronJob (insights_analytics_refresh, 9 * * * *) calls ins_refresh_analytics(), which refreshes the conversation-stats and token-usage views CONCURRENTLY (under a single advisory lock) with a non-concurrent fallback for an as-yet-unseeded view. The cron connection must run on a BYPASSRLS role (fs_service) so the cross-tenant aggregation succeeds. Headline conversion totals read the live ledger, so they’re real-time; the daily series/breakdowns lag by at most the refresh interval.

Retention purge

PurgeTelemetryCronJob (23 3 * * *) deletes raw telemetry older than insights.telemetry_retention_days (default 90; override the parameter per deploy). It is a single global DELETE and must run on the BYPASSRLS role — on an RLS-scoped connection with no tenant set the policy matches zero rows and the purge silently does nothing. Goal completions are retained (they’re the conversion record).

Scheduled report delivery

ReportScheduleCronJob (5 * * * *) emails saved reports that carry a schedule. Each hour it reads every scheduled report across all tenants (so it too must run on the BYPASSRLS role), and for each one that is due — matching its UTC hour, and ISO weekday for weekly schedules — it sets the tenant context, runs the report in export mode, and emails an HTML summary with a CSV attachment to the configured recipients. It then stamps last_sent_on so re-running the cron never double-sends within the day; one report’s failure never aborts the batch. Delivery needs MAILER_FROM configured. Admins author schedules from a saved report — see Analytics Hub → Scheduled reports.

Operator commands

bin/console insights:goals:rematch <tenantId> <goalCode> — replay stored telemetry through matching so a newly created/edited goal credits events already captured within its window.
bin/console insights:purge-customer <tenantId> <customerId> — GDPR erasure: deletes a customer’s telemetry events and goal completions, tenant-scoped.
bin/console insights:provision-defaults <tenantId|slug> (or --all) — create the default goals plus the starter saved/pinned reports for a tenant. Idempotent (a tenant that already has goals/reports is left untouched), so it’s safe to re-run and to use when onboarding a new tenant.
bin/console insights:demo-seed <tenantId> [--days=90] — seed demo conversations, telemetry, goals and conversions over N days, then refresh the analytics views, so the Insights dashboards have realistic data. Intended for dev/staging, not production.

Token usage & cost

The dashboard Token spend widget and the Analytics Hub token-usage dataset read the ins_token_usage_daily materialized view. It rolls up the per-message token columns on fs_chat_messages (joined to fs_executions for the flow) by tenant, day and flow — prompt / completion / total tokens, cost_micros, and message / conversation counts — over a rolling 90-day window, so reports never scan the raw message log. The view is refreshed by the same hourly Insights cron as the other analytics MVs — ins_refresh_analytics() (see Analytics refresh) — so token spend lags live activity by at most the refresh interval, and the cron must run on the BYPASSRLS role. If the widget looks stale, check that insights_analytics_refresh is firing (it runs via app:cron:run).

Pricing configuration

Cost is estimated at write time, when each agent reply is stored, not in the MV — the MV only sums cost_micros. Comerix\FlowEngine\Service\Pricing\ModelPricingProvider prices a call from its token counts using prices in USD per 1,000 tokens:

A built-in fallback table covers common models (so cost works before any configuration). These figures are best-effort as of 2026-06 and are not authoritative — provider list prices change.
Admins override or extend the table per model in Settings → AI / LLM → Token pricing → Model prices (the llm/pricing/models config table — a rows field of Model, Prompt $/1K* and *Completion$ /1K). Configured rows always win over the built-in defaults; a model that is neither configured nor in the defaults is costed at zero.

Pricing changes are not retroactiveBecause cost is captured on the message when the reply is written, editing the price table only affects future calls. Historical cost_micros reflect the prices in effect when each message was stored — re-pricing past traffic would require replaying it.

​Async queue

​Retry strategy

​Failed message workflow

​When investigating failures

​Webhook replay protection

​TTL semantics

​What to watch

​Engine & internal APIs

​ENGINE_API_TOKEN

​What to watch

​Knowledge-base crawler

​Cross-tenant queue

​Claim lease (stale recovery)

​Audit trail

​MCP gateway

​Double-gated authorization

​Audit trail

​Idempotency-Key replay for write tools

​Edge document store

​EDGE_STORE_DSN

​When documents (re)publish

​Manual rebuild

​What to watch

​PII redaction in the event log

​What gets redacted

​Toggling

​Failure mode to watch for

​Insights (telemetry, goals & analytics)

​Analytics refresh

​Retention purge

​Scheduled report delivery

​Operator commands

​Token usage & cost

​Pricing configuration

Async queue

Retry strategy

Failed message workflow

When investigating failures

Webhook replay protection

TTL semantics

What to watch

Engine & internal APIs

ENGINE_API_TOKEN

What to watch

Knowledge-base crawler

Cross-tenant queue

Claim lease (stale recovery)

Audit trail

MCP gateway

Double-gated authorization

Audit trail

Idempotency-Key replay for write tools

Edge document store

EDGE_STORE_DSN

When documents (re)publish

Manual rebuild

What to watch

PII redaction in the event log

What gets redacted

Toggling

Failure mode to watch for

Insights (telemetry, goals & analytics)

Analytics refresh

Retention purge

Scheduled report delivery

Operator commands

Token usage & cost

Pricing configuration