Operator runbook for self-hosted Comerix Flow deployments. Covers the async
queue, failed message recovery, webhook replay protection, the service-to-service
APIs (engine, training crawler, MCP), the edge document store behind the public
read endpoints, and token-usage accounting.
Async queue
FlowEngine routes four message types through the execution_events Doctrine
transport:
StartExecution — fires when a trigger matches a published flow.
ResumeExecution — fires when a paused execution has new input.
RunNextStep — the worker loop; re-dispatched after each node.
FireWait — fires when a wait node’s timer elapses.
Each message acquires a per-execution Redis lease before running its step,
so two workers can never race on the same execution. The lease releases
when the step completes or its TTL expires (default 60 s).
Retry strategy
Configured in
src/FlowEngine/Resources/config/packages/messenger.yaml:
| Knob | Value | Meaning |
|---|---|---|
| max_retries | 3 | Total of 4 attempts (initial + 3 retries). |
| delay | 1000 ms | First retry waits 1 s. |
| multiplier | 2 | Exponential backoff: 1 s → 2 s → 4 s. |
| max_delay | 30000 ms | Cap so long-running outages don’t push a single retry past 30 s. |
After the last retry exhausts, the message is moved to the global failed
transport (failure_transport: failed in config/packages/messenger.yaml).
It is NOT lost — operators replay it manually after investigating the
root cause.
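The retry schedule implied by these knobs can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the transport's own code: the function name is hypothetical, and it ignores any jitter the messenger component may apply.

```python
def retry_delays_ms(max_retries=3, delay_ms=1000, multiplier=2, max_delay_ms=30000):
    """Return the wait (in ms) before each retry, capped at max_delay_ms."""
    delays = []
    for attempt in range(max_retries):
        # delay * multiplier^attempt, never exceeding the configured cap
        delays.append(min(delay_ms * multiplier ** attempt, max_delay_ms))
    return delays


print(retry_delays_ms())               # [1000, 2000, 4000]
print(retry_delays_ms(max_retries=6))  # [1000, 2000, 4000, 8000, 16000, 30000]
```

With the shipped values the cap never applies; it only matters if max_retries is raised past five.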
Failed message workflow
List everything in the dead-letter queue:
bin/console messenger:failed:show
Inspect a specific message (full payload + last failure):
bin/console messenger:failed:show <id> --max-attempts=0
Retry one message (it re-enters the execution_events transport and gets
the standard retry policy again):
bin/console messenger:failed:retry <id>
Retry everything in the failed queue (use sparingly — investigate first):
bin/console messenger:failed:retry --force
Drop a message that will never succeed (e.g. flow was deleted, payload is
permanently malformed):
bin/console messenger:failed:remove <id> --force
When investigating failures
- messenger:failed:show <id> to read the exception trace.
- Cross-reference the execution_id in the message payload against
fs_executions + fs_execution_events for the full per-step history.
- Fix the underlying issue (config, secret rotation, network), then
messenger:failed:retry <id>. The execution resumes from the
last committed snapshot.
Webhook replay protection
HmacVerifier rejects any signed payload it has already seen, in addition
to the standard timestamp-drift check. Two cache backends ship with the
platform:
| Binding | Adapter | Where it’s used |
|---|---|---|
| Production (default) | RedisReplayCache: atomic SET key 1 EX <ttl> NX | Every non-test env; cluster-safe |
| when@test | InMemoryReplayCache: process-local LRU | Unit + integration tests |
The Redis pool is flow_engine.replay_cache (defined in
src/FlowEngine/Resources/config/packages/cache.yaml); the binding lives in
src/FlowEngine/Resources/config/services.yaml.
TTL semantics
HmacVerifier::REPLAY_CACHE_TTL_SECONDS = 600 (2× the 300 s timestamp
drift window). A signature stays blocked for ten minutes after first
observation, so an attacker can’t sneak a replayed signature back through
by skewing their clock a few minutes forward.
The cache is keyed by flow_engine.replay.<sha256(signature)> — sharing a
single Redis instance with the application cache pool is safe.
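The interplay of the drift window, the signature check and the replay cache can be sketched as follows. This is an illustration under stated assumptions, not HmacVerifier itself: the signed-string layout (timestamp, a dot, then the body), the function names and the in-memory store are all hypothetical. The store only mimics the "set if absent, with expiry" behavior of SET key 1 EX <ttl> NX.

```python
import hashlib
import hmac
import time

DRIFT_SECONDS = 300        # timestamp drift window
REPLAY_TTL_SECONDS = 600   # 2x the drift window


class InMemoryReplayStore:
    """Process-local stand-in for Redis SET key 1 EX ttl NX (not cluster-safe)."""

    def __init__(self):
        self._expires_at = {}

    def set_if_absent(self, key, ttl, now):
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # key still live: signature already seen
        self._expires_at[key] = now + ttl
        return True


def verify(secret, body, timestamp, signature, store, now=None):
    now = time.time() if now is None else now
    if abs(now - timestamp) > DRIFT_SECONDS:
        return "stale timestamp"
    # Assumed signing scheme: HMAC-SHA256 over "<timestamp>.<body>"
    expected = hmac.new(secret, f"{timestamp}.".encode() + body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return "bad signature"
    key = "flow_engine.replay." + hashlib.sha256(signature.encode()).hexdigest()
    if not store.set_if_absent(key, REPLAY_TTL_SECONDS, now):
        return "replay"
    return "ok"
```

Because the replay TTL is longer than the drift window, a signature is still blocked at the moment its timestamp stops being acceptable, so there is no gap in which a replay could pass both checks.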
What to watch
- A spike in Signature replay detected rejections is normal during retry
storms but unusual otherwise; investigate if it climbs in steady state.
- Webhook secrets are stored encrypted; rotation is in
bin/console app:config:rotate-key (see
App\Config\Command\ConfigRotateKeyCommand).
- WooCommerce and Shopify signature schemes don’t carry timestamps, so the
replay cache is their ONLY defense against replays. Keep Redis healthy.
Engine & internal APIs
Three server-to-server surfaces share one deploy-wide bearer credential. They
sit on the stateless engine_api firewall (config/packages/security.yaml),
which matches:
^/(api/v1/(engine|insights|training)/|metrics(/|$))
A caller authenticates by sending the shared secret as a bearer token:
Authorization: Bearer <ENGINE_API_TOKEN>
| Surface | Path prefix | What it does |
|---|---|---|
| Engine API | /api/v1/engine/ | Event ingest, intents, chat triggers, execution resume — see Engine API. |
| Conversions ingest | /api/v1/insights/ | Server-side conversion ingest — see Engine API → Conversions. |
| Training crawler | /api/v1/training/ | The knowledge-base crawler queue/claim/progress/result (below). |
| Prometheus metrics | /metrics | Cross-tenant operational aggregates; gated by the same bearer rather than left public. Scrapers must send the token too. |
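When debugging an unexpected 401 it helps to confirm whether a path falls under the engine_api firewall at all. The sketch below evaluates the pattern quoted above with Python's re module; the helper name is hypothetical, and the real matching is done by the framework's firewall, whose regex engine may differ in edge cases.

```python
import re

# Pattern copied from the engine_api firewall definition
ENGINE_API_PATTERN = re.compile(r"^/(api/v1/(engine|insights|training)/|metrics(/|$))")


def on_engine_firewall(path):
    """True when the request path would be matched by the pattern."""
    return ENGINE_API_PATTERN.search(path) is not None


for path in ["/api/v1/engine/events", "/metrics", "/metrics/", "/metricsx",
             "/api/v1/engine", "/mcp"]:
    print(path, on_engine_firewall(path))
```

Note that the API prefixes require a trailing slash and that /metrics must be followed by a slash or the end of the path, so /api/v1/engine and /metricsx fall outside the pattern.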
ENGINE_API_TOKEN
The token is read from the ENGINE_API_TOKEN environment variable and validated
by Comerix\FlowEngine\Security\EngineApiTokenHandler, which compares it with
hash_equals(). On a match the caller is identified as the synthetic in-memory
engine-service account (granted ROLE_ENGINE_API by the engine_service_provider
security provider) — there is no per-user session.
It is a deploy-wide root credential
In V1 the token is not bound to a tenant: for the mutating calls the
tenant_id rides in the request body and is trusted. Treat the token like a
database password: store it as a secret, never in source control or client
code, and rotate it by redeploying with a new value.
If the variable is unset or empty the handler rejects every request
(an empty token would otherwise match an empty header and let everyone in),
so the three surfaces fail closed until you configure it.
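The fail-closed rule and the constant-time comparison can be sketched as follows. This is an analogue rather than the handler's code: hmac.compare_digest plays the role of PHP's hash_equals(), and the function name and header parsing are assumptions.

```python
import hmac


def is_authorized(authorization_header, configured_token):
    """Return True only for a non-empty configured token presented as a bearer."""
    if not configured_token:
        return False  # unset or empty: reject everything (fail closed)
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # Constant-time comparison, so timing does not leak how many bytes matched
    return hmac.compare_digest(presented.encode(), configured_token.encode())
```

The empty-token guard comes first on purpose: without it, a request with an empty bearer value would compare equal to an empty configured token.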
What to watch
- Any caller presenting the shared token clears the firewall, so leakage exposes
all three surfaces at once. Audit access at the edge (reverse proxy / WAF) and
watch for unexpected source IPs.
- A 401/403 storm on /api/v1/(engine|insights|training)/ usually means a
rotated token was not propagated to every caller (crawler, conversion poster,
scrapers).
Knowledge-base crawler
An external website crawler keeps each tenant’s Training Base sources fresh.
It talks to the platform over four endpoints on the shared engine_api firewall
(Comerix\Training\Controller\Api\TrainingCrawlerApiController), pulling work,
claiming a source, streaming page counts, and reporting the outcome:
| Method | Route | Purpose |
|---|---|---|
| GET | /api/v1/training/queue | List sources due to be crawled (cross-tenant). ?limit= caps the page (default 50, hard max 200). |
| POST | /api/v1/training/sources/{id}/claim | Move a due source into the crawling state. |
| POST | /api/v1/training/sources/{id}/progress | Record live pages_indexed / pages_total. |
| POST | /api/v1/training/sources/{id}/result | Finalise to ready or failed. |
Every mutating call carries the target tenant_id (a UUID) in its JSON body —
the shared token is not tenant-bound, so the body selects the workspace.
Cross-tenant queue
GET /api/v1/training/queue sweeps every active tenant in turn
(CrawlQueueReader): for each tenant it pins the tenant context so the due-source
query runs inside row-level security, accumulates matches, and stops as soon
as the requested limit is filled. The tenant context is cleared after each tenant
and again at the end of the request, so a pooled connection is never left pinned
to a stale tenant. A source is “due” when it is freshly scheduled, when a
recurring source has reached its next_crawl_at, or when an in-flight crawl has
gone stale.
Claim lease (stale recovery)
POST …/claim is the concurrency gate for a crawler fleet. It runs a single
guarded UPDATE (TrainingSourceRepository::claimForCrawl) that flips a source
into crawling and stamps updated_at = now() only when it is in a claimable
state — so two workers can never crawl the same source at once. The claim is a
lease, not a lock: a source that has sat in crawling longer than
STALE_CRAWLING_MINUTES (30 minutes) is treated as abandoned (the crawler
that held it died) and offered for crawl again. There is no explicit release —
progress and result keep updated_at moving, so a live crawl never goes
stale; a dead one self-recovers after 30 minutes.
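The lease behavior can be illustrated with a single guarded UPDATE. The sketch below uses SQLite so it runs standalone. The table layout, the 'pending' status (standing in for every claimable state) and the function name are assumptions, not the schema behind TrainingSourceRepository::claimForCrawl.

```python
import sqlite3

STALE_CRAWLING_SECONDS = 30 * 60  # STALE_CRAWLING_MINUTES expressed in seconds


def claim_for_crawl(conn, source_id, now):
    """Return True only for the caller whose UPDATE actually changed the row."""
    cur = conn.execute(
        """
        UPDATE training_source
           SET status = 'crawling', updated_at = :now
         WHERE id = :id
           AND (status = 'pending'
                OR (status = 'crawling' AND updated_at < :stale_before))
        """,
        {"now": now, "id": source_id, "stale_before": now - STALE_CRAWLING_SECONDS},
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE training_source (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO training_source VALUES (1, 'pending', 0)")
print(claim_for_crawl(conn, 1, now=1000))            # True: first worker wins
print(claim_for_crawl(conn, 1, now=1060))            # False: lease is live
print(claim_for_crawl(conn, 1, now=1000 + 31 * 60))  # True: lease went stale
```

Because the state check and the write happen in one statement, the database decides the winner; no worker has to read first and write second.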
Idempotent result reporting
A result call that reports the state a source is already in is treated as a
no-op, so a report retried after a network blip is safe. On ready the next
crawl time is recomputed from the source’s schedule (cleared for manual
sources); on failed the error is stored and no re-crawl is scheduled.
Audit trail
Because the surface uses the shared deploy-wide token (with no per-user identity),
each mutating call is written to the audit log under the action
training.api.call (entity type training_source), recording the endpoint
key and target source so use of the shared token is detectable after the fact.
The queue read is not audited.
MCP gateway
The Model Context Protocol gateway exposes platform tools (analytics, flows,
connections) to AI agents over one JSON-RPC 2.0 endpoint, POST /mcp
(Comerix\Mcp\Controller\McpController). It is not on the shared engine
token — it has its own stateless firewall, mcp (^/mcp), whose sole credential
is a scoped personal access token:
Authorization: Bearer cfp_…
The firewall is kept separate from the interactive main firewall so it never
inherits the session, form-login or CSRF exemptions, and it runs the
Comerix\Identity\Security\ActiveUserChecker user checker — a deactivated user’s
token is refused at authentication. The ^/mcp access-control rule only requires
ROLE_USER; real authorization happens per call inside the gateway.
Double-gated authorization
Every tools/call and resource read passes through Comerix\Mcp\Service\Authorization\McpAuthorizer,
which enforces two independent layers (tenant isolation is enforced separately
by Postgres row-level security):
- ACL — the caller’s role must grant the tool’s required ACL resource (the
same check the admin UI uses).
- Scope — the presented token’s scopes must imply the tool’s required scope.
A denial distinguishes the two in the error data (resource vs required_scope).
Tool listings are filtered the same way, so a token never sees a tool it cannot
call. Scopes fail closed — an unrecognised scope grants nothing.
| Scope | Implies |
|---|---|
| mcp | Root: every other scope below. |
| mcp:analytics | mcp:analytics:read, mcp:analytics:write |
| mcp:connections | mcp:connections:read, mcp:connections:manage |
| mcp:flows | mcp:flows:read |
| mcp:history | mcp:history:read |
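The implication table can be read as a small lookup. The sketch below encodes it and applies the fail-closed rule; the function and constant names are hypothetical, and McpAuthorizer may structure the check differently.

```python
# Group scope -> the leaf scopes it implies (taken from the table above)
IMPLIES = {
    "mcp:analytics": {"mcp:analytics:read", "mcp:analytics:write"},
    "mcp:connections": {"mcp:connections:read", "mcp:connections:manage"},
    "mcp:flows": {"mcp:flows:read"},
    "mcp:history": {"mcp:history:read"},
}
KNOWN = {"mcp"} | set(IMPLIES) | set().union(*IMPLIES.values())


def scope_allows(granted, required):
    """True when any granted scope equals or implies the required scope."""
    if required not in KNOWN:
        return False
    for scope in granted:
        if scope not in KNOWN:
            continue  # fail closed: an unrecognised scope grants nothing
        if scope == "mcp" or scope == required or required in IMPLIES.get(scope, ()):
            return True
    return False
```

Implication only runs downward: a token holding mcp:flows:read does not satisfy a tool that requires mcp:flows.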
The tenant a call acts on is taken from the authenticated user’s workspace
(McpInvocationContextFactory), never from the request — there is no body
tenant_id on this surface.
Audit trail
Every tool call writes one tenant-scoped audit entry under the action
mcp.tool.call (entity type mcp_tool), recording the correlation
request_id, the required scope, the argument keys only (never their values,
so secrets passed to a tool are never persisted) and whether the tool reported a
business error.
A caller may send an Idempotency-Key header. For a write tool — one whose
annotations carry a destructiveHint or idempotentHint — the first successful
result is cached for 24 hours and a retry with the same key replays that
result instead of repeating the side effect (McpIdempotencyStore). The cache
key is bound to tenant + calling user + tool + key + a canonical fingerprint of
the arguments, so a reused key cannot replay another user’s result or a result
computed for different arguments. The header is ignored for read-only tools, and
errored calls are never cached (so a retry re-attempts the work).
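The binding of the cache key to tenant, user, tool, key and arguments can be sketched like this. It is an illustration only: the canonicalisation (sorted-key JSON), the separator and the hashing are assumptions, and McpIdempotencyStore may derive its key differently.

```python
import hashlib
import json


def idempotency_cache_key(tenant_id, user_id, tool, idempotency_key, arguments):
    """Derive a cache key that changes if any of the five inputs changes."""
    # Canonical form: sorted keys and fixed separators, so key order is irrelevant
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    fingerprint = hashlib.sha256(canonical.encode()).hexdigest()
    # Unit-separator join avoids ambiguity between adjacent fields
    material = "\x1f".join([tenant_id, user_id, tool, idempotency_key, fingerprint])
    return hashlib.sha256(material.encode()).hexdigest()
```

Two calls with the same Idempotency-Key but different arguments (or different callers) produce different cache keys, so neither can replay the other's stored result.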
Edge document store
The anonymous public reads — GET /api/public/v1/chat/quick-questions and
GET /api/public/v1/config/{section}, documented in the
Public read API — are backed by published documents.
Whenever the underlying data changes, the platform recomputes the affected JSON
document and writes it into a key-value backend, so the hot read path is a
single key lookup instead of a set of per-request database queries. When no
document is stored (or no backend is configured) the response is computed
live with identical output; the X-Edge-Source response header reports
which path served it (store or live), and the body’s publishedAt is
null for live responses.
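The store-or-live decision on the read path can be sketched as a small function. The names are hypothetical, the store is modelled as a plain mapping, and compute_live stands for whatever per-request computation the platform performs.

```python
def serve_document(store, key, compute_live):
    """Return (body, headers) for a public read, preferring the published document."""
    document = store.get(key) if store is not None else None
    if document is not None:
        return document, {"X-Edge-Source": "store"}
    # No backend or no stored document: compute live, with publishedAt set to null
    body = dict(compute_live(), publishedAt=None)
    return body, {"X-Edge-Source": "live"}


published = {"t1/quick-questions": {"items": ["Where is my order?"],
                                    "publishedAt": "2026-01-01T00:00:00Z"}}
print(serve_document(published, "t1/quick-questions", lambda: {"items": []}))
print(serve_document(None, "t1/quick-questions", lambda: {"items": []}))
```

The first call is served from the store and keeps its publishedAt timestamp. The second has no backend, so it is computed live.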
EDGE_STORE_DSN
The backend is selected by the EDGE_STORE_DSN environment variable. Schemes:
| Scheme | Example | Behavior |
|---|---|---|
| null:// | null:// | Default. Persists nothing and never serves a read; every request computes live. The system behaves exactly as if no store existed. |
| redis:// (or rediss:// for TLS) | redis://localhost:6379/2?prefix=edge:&ttl=86400 | Recommended starting backend. Options: prefix, the storage-key prefix (default edge:); ttl, the per-document expiry in seconds (unset = no expiry). Must resolve to a single-node Redis client. |
| cloudflare-kv:// | cloudflare-kv://<API_TOKEN>@<ACCOUNT_ID>/<NAMESPACE_ID> | Write-through to Cloudflare Workers KV; origin reads stay live. The stored values are meant to be read by your edge workers, not back by the origin. Percent-encode reserved characters in the token. |
| dynamodb:// | dynamodb://default/<TABLE>?region=eu-central-1 (or dynamodb://KEY:SECRET@default/<TABLE>?region=…) | One item per document (pk = "{namespace}#{key}"). Speaks the plain DynamoDB HTTP API; no PHP DAX client exists, so reads always hit the base table. Size read capacity accordingly, or front the reads with HTTP caching. An endpoint option overrides the API URL (e.g. DynamoDB Local). |
The store holds two document namespaces: webchat (one
{tenantUuid}/quick-questions document per tenant) and config (one
{tenantUuid}/{section} document per tenant and publicly readable section).
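To sanity-check a DSN before deploying it, the pieces can be pulled apart with a standard URL parser. The sketch below handles the redis:// form from the table; the function name and the returned field names are hypothetical, and the platform's own DSN parser may be stricter.

```python
from urllib.parse import parse_qs, urlsplit


def parse_edge_store_dsn(dsn):
    """Split an EDGE_STORE_DSN into scheme, endpoint and the redis options."""
    parts = urlsplit(dsn)
    options = {name: values[-1] for name, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path.lstrip("/"),           # database index for redis://
        "prefix": options.get("prefix", "edge:"),  # documented default
        "ttl": int(options["ttl"]) if "ttl" in options else None,  # None = no expiry
    }


print(parse_edge_store_dsn("redis://localhost:6379/2?prefix=edge:&ttl=86400"))
```

The prefix and ttl fields are only meaningful for the redis:// and rediss:// schemes; for the other backends they can be ignored.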
When documents (re)publish
Republishing is automatic and event-driven — each trigger enqueues one
message per affected document on the edge_store Doctrine transport, so a
worker must consume it alongside the other queues:
bin/console messenger:consume edge_store
| Trigger | What republishes |
|---|---|
| An admin saves configuration | The affected tenant’s config section document(s) and its quick-questions document. |
| A flow is published or withdrawn | The tenant’s quick-questions document — which prompts carry a triggerable intentName changes with the published set. |
Failed republish messages land in the same global failed transport as engine
messages — inspect and replay them with the
failed message workflow.
Manual rebuild
bin/console edge-store:republish [--namespace=<ns>] [--key=<key>]
Without options it queues a rebuild of every document of every producer;
--namespace= (webchat or config) limits it to one namespace, and
--key= (requires --namespace) to one document. The command only enqueues —
the worker does the writes. Run a full republish after configuring a backend
for the first time, after a data restore, or after flushing the store.
What to watch
- Persistent X-Edge-Source: live on production responses means documents
aren’t being published; check the edge_store worker and the backend
health.
- The store is a cache of published state, never the source of truth. It
is safe to flush; reads fall back to live until the next republish (at the
cost of per-request computation).
PII redaction in the event log
fs_execution_events retains payloads for 90 days. To keep customer data
out of an operator-readable log,
App\FlowEngine\Service\Pii\PiiRedactor rewrites each event payload before
it is serialised. Strings inside the payload are scanned for known PII
shapes and replaced with [REDACTED-<KIND>] markers; non-string leaves
(ints, bools, nulls) and the surrounding array shape pass through
untouched.
What gets redacted
| Kind | Matches |
|---|---|
| EMAIL | RFC 5322-shaped addresses |
| PHONE | E.164 + common formats with separators, 8–15 digits |
| AUTH | Bearer <token> strings |
| TOKEN | Provider prefixes (sk_, pk_, api_, key_, tok_, secret_, access_, auth_) followed by ≥8-char body; covers Stripe, OpenAI, internal keys |
| JWT | Three base64url segments separated by dots, each ≥10 chars |
| AWSKEY | AKIA[A-Z0-9]{16} access-key ids |
The ruleset is intentionally aggressive — over-redaction (a UUID
incorrectly tagged) is cheap; missed PII is not.
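The walk over a payload (rewrite string leaves, pass everything else through) can be sketched as below. Only three kinds are shown, and the regular expressions are simplified assumptions for illustration, not the contents of PiiRedactor::PATTERNS.

```python
import re

# Simplified, hypothetical patterns for three of the kinds in the table
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("AUTH", re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")),
    ("AWSKEY", re.compile(r"AKIA[A-Z0-9]{16}")),
]


def redact(value):
    """Recursively replace PII-shaped substrings in string leaves."""
    if isinstance(value, str):
        for kind, pattern in PATTERNS:
            value = pattern.sub(f"[REDACTED-{kind}]", value)
        return value
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # ints, bools, None pass through untouched
```

The structure of the payload is preserved, so downstream readers of the event log see the same keys and list lengths with only the sensitive substrings replaced.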
Toggling
The redactor reads flow_engine.pii_redaction.enabled (defined in
src/FlowEngine/Resources/config/services.yaml). Override per deploy in
the deploy’s container config or by binding the parameter to false in a
local services.yaml. Disabling redaction is recommended only for single-tenant
dev environments that accept the GDPR / retention tradeoff.
Failure mode to watch for
If a tenant reports that legitimate node output looks redacted, check
whether their data matches one of the patterns above (an internal id with
a sk_ prefix is the typical false positive). Add the exception by
narrowing the regex in PiiRedactor::PATTERNS and adjust the unit tests
in tests/Unit/FlowEngine/Service/Pii/.
Insights (telemetry, goals & analytics)
The admin-facing reporting surface — the report builder, datasets, exports,
pinned dashboards and the scheduling UI — is documented in
Analytics Hub. This section covers the moving parts an
operator runs: the ingest worker, the three Insights cron jobs, and the CLI.
Client telemetry posted to /api/public/v1/chat/events[/batch] is written to
ins_telemetry_events on the request, then goal matching is queued on the
insights Doctrine transport (so widget latency is independent of goal
complexity). Run a worker for it alongside the engine queue:
bin/console messenger:consume insights execution_events
Exhausted matching messages land in the same global failed transport as
engine messages — inspect/replay them with the workflow above.
Analytics refresh
Materialized views back the dashboards — ins_goal_stats_daily,
ins_conversation_stats_daily and ins_token_usage_daily (see
Token usage & cost). RefreshInsightsCronJob
(insights_analytics_refresh, 9 * * * *) calls ins_refresh_analytics(),
which refreshes the conversation-stats and token-usage views CONCURRENTLY (under
a single advisory lock) with a non-concurrent fallback for an as-yet-unseeded view.
The cron connection must run on a BYPASSRLS role (fs_service) so the
cross-tenant aggregation succeeds. Headline conversion totals read the live
ledger, so they’re real-time; the daily series/breakdowns lag by at most the
refresh interval.
Retention purge
PurgeTelemetryCronJob (23 3 * * *) deletes raw telemetry older than
insights.telemetry_retention_days (default 90; override the parameter per
deploy). It is a single global DELETE and must run on the BYPASSRLS role —
on an RLS-scoped connection with no tenant set the policy matches zero rows and
the purge silently does nothing. Goal completions are retained (they’re the
conversion record).
Scheduled report delivery
ReportScheduleCronJob (5 * * * *) emails saved reports that carry a schedule.
Each hour it reads every scheduled report across all tenants (so it too must
run on the BYPASSRLS role), and for each one that is due — matching its UTC hour,
and ISO weekday for weekly schedules — it sets the tenant context, runs the
report in export mode, and emails an HTML summary with a CSV attachment to the
configured recipients. It then stamps last_sent_on so re-running the cron never
double-sends within the day; one report’s failure never aborts the batch.
Delivery needs MAILER_FROM configured. Admins author schedules from a saved
report — see Analytics Hub → Scheduled reports.
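The hourly due check described above can be sketched as a pure function. The schedule field names (frequency, hour_utc, iso_weekday) are hypothetical; only the rules (UTC hour match, ISO weekday for weekly schedules, at most one send per day) come from the text.

```python
from datetime import datetime, timezone


def is_due(schedule, now_utc, last_sent_on):
    """Decide whether a scheduled report should be sent in this cron run."""
    if last_sent_on == now_utc.date():
        return False  # already delivered today: re-running the cron is harmless
    if schedule["hour_utc"] != now_utc.hour:
        return False
    if schedule["frequency"] == "weekly":
        return schedule["iso_weekday"] == now_utc.isoweekday()
    return True


now = datetime(2026, 6, 1, 7, 5, tzinfo=timezone.utc)
daily = {"frequency": "daily", "hour_utc": 7}
print(is_due(daily, now, last_sent_on=None))        # True
print(is_due(daily, now, last_sent_on=now.date()))  # False
```

The last_sent_on guard is what makes the job safe to re-run within the same hour.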
Operator commands
bin/console insights:goals:rematch <tenantId> <goalCode> — replay stored
telemetry through matching so a newly created/edited goal credits events
already captured within its window.
bin/console insights:purge-customer <tenantId> <customerId> — GDPR erasure:
deletes a customer’s telemetry events and goal completions, tenant-scoped.
bin/console insights:provision-defaults <tenantId|slug> (or --all) —
create the default goals plus the starter saved/pinned reports for a tenant.
Idempotent (a tenant that already has goals/reports is left untouched), so it’s
safe to re-run and to use when onboarding a new tenant.
bin/console insights:demo-seed <tenantId> [--days=90] — seed demo
conversations, telemetry, goals and conversions over N days, then refresh the
analytics views, so the Insights dashboards have realistic data. Intended for
dev/staging, not production.
Token usage & cost
The dashboard Token spend widget and the Analytics Hub token-usage dataset
read the ins_token_usage_daily materialized view. It rolls up the per-message
token columns on fs_chat_messages (joined to fs_executions for the flow) by
tenant, day and flow — prompt / completion / total tokens, cost_micros, and
message / conversation counts — over a rolling 90-day window, so reports never
scan the raw message log.
The view is refreshed by the same hourly Insights cron as the other analytics
MVs — ins_refresh_analytics() (see Analytics refresh) — so
token spend lags live activity by at most the refresh interval, and the cron must
run on the BYPASSRLS role. If the widget looks stale, check that
insights_analytics_refresh is firing (it runs via app:cron:run).
Pricing configuration
Cost is estimated at write time, when each agent reply is stored, not in the
MV — the MV only sums cost_micros. Comerix\FlowEngine\Service\Pricing\ModelPricingProvider
prices a call from its token counts using prices in USD per 1,000 tokens:
- A built-in fallback table covers common models (so cost works before any
configuration). These figures are best-effort as of 2026-06 and are not
authoritative — provider list prices change.
- Admins override or extend the table per model in Settings → AI / LLM → Token
pricing → Model prices (the llm/pricing/models config table, a rows
field of Model, Prompt $/1K and Completion $/1K). Configured rows always
win over the built-in defaults; a model that is neither configured nor in the
defaults is costed at zero.
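The lookup order and the per-1,000-token arithmetic can be sketched as follows. The model name and prices are made up for illustration, cost_micros is assumed to mean millionths of a USD, and the rounding mode is an assumption; ModelPricingProvider may differ on all three.

```python
from decimal import ROUND_HALF_UP, Decimal

# Hypothetical built-in fallback: model -> (prompt USD/1K, completion USD/1K)
BUILT_IN = {"example-model": (Decimal("0.50"), Decimal("1.50"))}


def cost_micros(model, prompt_tokens, completion_tokens, configured=None):
    """Estimate the cost of one call in micro-USD; unknown models cost zero."""
    table = {**BUILT_IN, **(configured or {})}  # configured rows win
    if model not in table:
        return 0
    prompt_price, completion_price = table[model]
    usd = (Decimal(prompt_tokens) * prompt_price
           + Decimal(completion_tokens) * completion_price) / 1000
    return int((usd * 1_000_000).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(cost_micros("example-model", 1000, 2000))  # 3500000
print(cost_micros("unlisted-model", 1000, 2000)) # 0
```

Decimal is used rather than float so that small per-message amounts sum without rounding drift.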
Pricing changes are not retroactive
Because cost is captured on the message when the reply is written, editing the
price table only affects future calls. Historical cost_micros reflect the
prices in effect when each message was stored — re-pricing past traffic would
require replaying it.