Training base - Comerix Flow

The Training base is the knowledge base that grounds your AI agents’ answers. It lives at /admin/training (route admin_training_index, menu Knowledge → Training base).

What it’s for

Register sources: websites your AI agents should learn from. Each source is crawled and its text is extracted and indexed, so your flows can retrieve relevant snippets at answer time and respond with grounded, up-to-date context instead of guessing. Each source has a type that determines how its pages are discovered:

Type	What gets indexed
Website	Start at the URL and follow every reachable link on the same domain, up to the configured crawl depth.
Sitemap	Fetch the sitemap and index every URL it lists.
Single page	Index just the one URL you provide — nothing else.

Using a source in a flow: A flow references the training base from an AI agent node to inject the top matches as grounded context. Indexing a source here is what makes that content available for retrieval.

Adding a source

In the Add a source panel:

Choose a source type — Website, Sitemap, or Single page.
Paste the URL (for example https://www.example.com).
Click Save.

The source is created in the scheduled state and queued for crawling. A flash confirms it: “Source example.com queued for crawling.”

Advanced settings

Expand Advanced settings to tune the crawl before you save:

Field	What it does	Default
Crawl depth	How many link hops away from the starting URL to follow. Accepted range is 1–20.	3
Recrawl schedule	How often the source is re-fetched: Manual, Daily, Weekly, or Monthly.	Manual
Include patterns	Comma-separated paths the crawler is allowed to visit. Globs supported (e.g. `/blog/, /docs/`).	—
Exclude patterns	Paths the crawler must skip — useful for cart, checkout, and search pages (e.g. `/cart/, /checkout/, *.pdf`).	—
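
To make the interplay of the two pattern fields concrete, here is a minimal Python sketch of how include and exclude patterns could be applied to a candidate URL. It is an illustration, not the crawler's actual matcher. The function names are hypothetical, and three behaviours are assumptions: a pattern containing glob characters is matched against the whole path, a plain pattern such as `/blog/` is treated as a path prefix, and an exclude match wins over an include match.

```python
from fnmatch import fnmatchcase
from typing import List
from urllib.parse import urlsplit


def parse_patterns(raw: str) -> List[str]:
    """Split a comma-separated field value into trimmed, non-empty patterns."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def matches(path: str, pattern: str) -> bool:
    # Assumption: glob patterns match the whole path; plain patterns are prefixes.
    if any(ch in pattern for ch in "*?["):
        return fnmatchcase(path, pattern)
    return path.startswith(pattern)


def is_allowed(url: str, include: str = "", exclude: str = "") -> bool:
    """Return True if the URL's path passes both pattern fields."""
    path = urlsplit(url).path or "/"
    # Assumption: an exclude match always wins.
    if any(matches(path, p) for p in parse_patterns(exclude)):
        return False
    includes = parse_patterns(include)
    # An empty include field places no restriction on the crawl.
    return not includes or any(matches(path, p) for p in includes)
```

Under these assumptions, `is_allowed("https://www.example.com/files/guide.pdf", exclude="/cart/, /checkout/, *.pdf")` is false, while a `/blog/` page passes an include field of `/blog/, /docs/`.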

URL validation

Every URL is checked before the source is saved:

It must be a well-formed http(s) URL with a host. Otherwise you get “Enter a valid http(s) URL.”
The host must not point inside the platform’s own network — localhost, *.local, *.localhost, and private or reserved IP addresses are rejected with “That host can’t be crawled.” This is a first line of defence against pointing the crawler at an internal address.
The same URL can’t be added twice in one workspace; a duplicate is rejected with “That URL is already in your training base.”
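
The three checks above can be sketched in Python as follows. This is an approximation for illustration, not the platform's validator: the function and constant names are hypothetical, the duplicate check assumes an exact string comparison, and only IP literals are inspected (a hostname that resolves to a private address is not caught here).

```python
import ipaddress
from typing import Iterable, Optional
from urllib.parse import urlsplit

ERR_INVALID = "Enter a valid http(s) URL."
ERR_HOST = "That host can’t be crawled."
ERR_DUPLICATE = "That URL is already in your training base."


def validate_source_url(url: str, existing_urls: Iterable[str] = ()) -> Optional[str]:
    """Return an error message, or None if the URL passes all three checks."""
    url = url.strip()
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:  # e.g. a malformed IPv6 literal
        return ERR_INVALID

    # Check 1: well-formed http(s) URL with a host.
    if parts.scheme not in ("http", "https") or not host:
        return ERR_INVALID

    # Check 2: reject local names and private or reserved IP literals.
    if host == "localhost" or host.endswith((".local", ".localhost")):
        return ERR_HOST
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a regular hostname, not an IP literal
    if ip is not None and (
        ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved
    ):
        return ERR_HOST

    # Check 3: no duplicates in the workspace (exact match is an assumption).
    if url in set(existing_urls):
        return ERR_DUPLICATE
    return None
```

For example, `validate_source_url("http://10.0.0.5/")` returns the host error, while `validate_source_url("https://www.example.com")` returns `None`.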

The source list

The list shows, per source:

Column	Meaning
Source	The host and the full starting URL.
Type	Website, Sitemap, or Single page.
Status	The current crawl state — see below.
Pages	Pages indexed. While crawling, shows `indexed / total` with a progress bar.
Last crawl	When the source last finished crawling, or queued if it hasn’t run yet.
Schedule	Manual, Daily, Weekly, or Monthly.

Three header metrics summarise the workspace: total Sources, total Pages indexed, and the Last crawl time. Use the All / Ready / Crawling / Failed filter chips above the list to narrow it by status.

Statuses

Status	Meaning
Scheduled	Queued and waiting to be picked up by the crawler.
Crawling	A crawler has claimed the source and is fetching pages; the row shows live page counts and a progress bar.
Ready	The crawl finished successfully; the source is indexed and available for retrieval.
Failed	The crawl failed; the row is flagged and the last error is recorded.

Re-crawl and delete

Each row has two management actions, visible to users with the `comerix.training.manage` permission:

Re-crawl resets the source to scheduled, marks it due now, and clears any previous error so the crawler picks it up on its next pass. A flash confirms “Re-crawl queued.” Use this after the source content changes or after a failure.
Delete removes the source from your workspace (“Source removed.”).

A third action opens the source URL in a new browser tab.

How crawling happens

This page only owns the operator UI and the source records it writes — crawling itself is performed by an external crawler service, not by the admin app. The crawler talks to the platform over a dedicated, server-to-server Knowledge-base API under /api/v1/training/. The lifecycle is:

Pull the queue — GET /api/v1/training/queue returns sources that are due to be crawled (across active workspaces). ?limit= caps the page size (default 50, maximum 200).
Claim a source — POST /api/v1/training/sources/{id}/claim moves a due source into the crawling state. The request body carries the owning tenant_id.
Report progress — POST /api/v1/training/sources/{id}/progress streams live page counts (pages_indexed, optional pages_total) while the crawl runs; these drive the progress bar in the list.
Report the result — POST /api/v1/training/sources/{id}/result finalises the crawl, moving the source to ready or failed. On success, the next crawl is automatically scheduled from the source’s recrawl schedule; on failure, the error is stored and no re-crawl is scheduled.
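
The status changes in this lifecycle, together with the Re-crawl action described earlier, can be modelled as a small in-memory sketch. It is not the platform's code: the `Source` class, its field names, and the fixed intervals (in particular 30 days for Monthly) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: fixed intervals; the platform may compute months differently.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


@dataclass
class Source:
    status: str = "scheduled"
    schedule: str = "manual"
    next_crawl_at: Optional[datetime] = None
    last_error: Optional[str] = None


def claim(source: Source) -> None:
    """A crawler claims a due source."""
    source.status = "crawling"


def finish(source: Source, finished_at: datetime, error: Optional[str] = None) -> None:
    """Finalise a crawl as ready or failed."""
    if error is not None:
        # Failure: store the error and schedule no re-crawl.
        source.status = "failed"
        source.last_error = error
        source.next_crawl_at = None
        return
    # Success: schedule the next crawl from the recrawl schedule, if any.
    source.status = "ready"
    source.last_error = None
    interval = INTERVALS.get(source.schedule)
    source.next_crawl_at = finished_at + interval if interval else None


def recrawl(source: Source, now: datetime) -> None:
    """The Re-crawl action: back to scheduled, due now, error cleared."""
    source.status = "scheduled"
    source.next_crawl_at = now
    source.last_error = None
```

In this model a Manual source never receives a next crawl time on success, so it is only crawled again when someone clicks Re-crawl.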

This API is authenticated with a deploy-wide bearer token (ENGINE_API_TOKEN), shared with the rest of the service-to-service surface, and every mutating call is written to the audit log. You don’t need to call it directly — it exists so a crawler can be hosted and scaled independently of the admin app.
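
For readers who are building a crawler, the sketch below shows how the first three calls might be assembled with Python's standard library. Treat it as a starting point under stated assumptions: the helper names and the base URL are hypothetical, JSON request bodies and the standard `Authorization: Bearer` header are assumed, and the result call is left out because its fields are not described on this page. The Engine API reference remains the authority on the exact shapes.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, token: str, method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"  # assumption: JSON bodies
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def queue_request(base_url: str, token: str, limit: int = 50) -> urllib.request.Request:
    # The page size defaults to 50 and is capped at 200.
    limit = max(1, min(limit, 200))
    return build_request(base_url, token, "GET", f"/api/v1/training/queue?limit={limit}")


def claim_request(base_url: str, token: str, source_id: str,
                  tenant_id: str) -> urllib.request.Request:
    # The claim body carries the owning tenant_id.
    path = f"/api/v1/training/sources/{source_id}/claim"
    return build_request(base_url, token, "POST", path, {"tenant_id": tenant_id})


def progress_request(base_url: str, token: str, source_id: str, pages_indexed: int,
                     pages_total: Optional[int] = None) -> urllib.request.Request:
    body = {"pages_indexed": pages_indexed}
    if pages_total is not None:  # pages_total is optional
        body["pages_total"] = pages_total
    path = f"/api/v1/training/sources/{source_id}/progress"
    return build_request(base_url, token, "POST", path, body)
```

Each helper returns a `urllib.request.Request` that can be sent with `urllib.request.urlopen(request)`. The token argument is the value of ENGINE_API_TOKEN, however your deployment exposes it to the crawler.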

Integrating a crawler: The full request and response shapes for these endpoints, plus the auth contract, live in the Engine API reference.

Permissions

Resource	Grants
`comerix.training.view`	View sources and indexed snippets.
`comerix.training.manage`	Add, edit, re-crawl, and delete sources.

Grant these per role on Users, roles & permissions.