The Training base is the knowledge base that grounds your AI agents’
answers. It lives at /admin/training (route admin_training_index, menu
Knowledge → Training base).
What it’s for
Register sources — websites your AI agents should learn from. Each source is
crawled, its text extracted and indexed, so your flows can retrieve relevant
snippets at answer time and respond with grounded, up-to-date context instead of
guessing.
Each source has a type that determines how its pages are discovered:
| Type | What gets indexed |
|---|---|
| Website | Start at the URL and follow every reachable link on the same domain, up to the configured crawl depth. |
| Sitemap | Fetch the sitemap and index every URL it lists. |
| Single page | Index just the one URL you provide — nothing else. |
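
The Website behaviour in the table above can be pictured as a breadth-first walk with a hop limit. The sketch below is illustrative only: it walks an in-memory link graph (`SITE`, a hypothetical stand-in for real page fetching), and it assumes that "same domain" means an identical hostname and that the starting URL counts as depth 0. The real crawler may define both differently.

```python
from collections import deque
from urllib.parse import urlparse


def discover(start_url, links, max_depth):
    """Return URLs reachable from start_url within max_depth hops, same host only."""
    start_host = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # hop limit reached: do not follow links from this page
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if urlparse(nxt).hostname != start_host:
                continue  # assumption: off-host links are never followed
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


# Hypothetical link graph: page URL -> links found on that page.
SITE = {
    "https://www.example.com/": [
        "https://www.example.com/docs",
        "https://other.example.org/",
    ],
    "https://www.example.com/docs": ["https://www.example.com/docs/a"],
    "https://www.example.com/docs/a": ["https://www.example.com/docs/a/deep"],
}
```

With `max_depth=1` this sketch visits the start page and `/docs`; raising the limit to 2 also reaches `/docs/a`. A Single page source corresponds to visiting only the start URL.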
Using a source in a flow
A flow references the training base from an AI agent node to inject the
top matches as grounded context. Indexing a source here is what makes that
content available for retrieval.
Adding a source
In the Add a source panel:
- Choose a source type — Website, Sitemap, or Single page.
- Paste the URL (for example https://www.example.com).
- Click Save.
The source is created in the scheduled state and queued for crawling. A
flash confirms it: “Source example.com queued for crawling.”
Advanced settings
Expand Advanced settings to tune the crawl before you save:
| Field | What it does | Default |
|---|---|---|
| Crawl depth | How many link hops away from the starting URL to follow. Accepted range is 1–20. | 3 |
| Recrawl schedule | How often the source is re-fetched: Manual, Daily, Weekly, or Monthly. | Manual |
| Include patterns | Comma-separated paths the crawler is allowed to visit. Globs supported (e.g. /blog/*, /docs/*). | — |
| Exclude patterns | Paths the crawler must skip — useful for cart, checkout, and search pages (e.g. /cart/*, /checkout/*, *.pdf). | — |
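
To get a feel for how include and exclude patterns interact, here is a minimal sketch using Python's standard `fnmatch` globbing. Several things are assumptions, not documented behaviour: that patterns are matched against the URL path, that an exclude match always wins, that an empty include list allows everything, and that the exclude field is comma-separated like the include field. Note also that `fnmatch`'s `*` matches `/`, which the real crawler's globs may not.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def parse_patterns(field):
    """Split a comma-separated patterns field into a clean list."""
    return [p.strip() for p in field.split(",") if p.strip()]


def is_allowed(url, include, exclude):
    """Decide whether a URL passes the include/exclude filters (assumed semantics)."""
    path = urlparse(url).path or "/"
    if any(fnmatchcase(path, p) for p in exclude):
        return False  # assumption: exclude takes precedence
    if not include:
        return True  # assumption: no include patterns means "allow all"
    return any(fnmatchcase(path, p) for p in include)


include = parse_patterns("/blog/*, /docs/*")
exclude = parse_patterns("/cart/*, /checkout/*, *.pdf")
```

Under these assumptions, `is_allowed("https://www.example.com/blog/post-1", include, exclude)` is true, while `/docs/guide.pdf` and `/cart/items` are both rejected by the exclude list.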
URL validation
Every URL is checked before the source is saved:
- It must be a well-formed http(s) URL with a host. Otherwise you get
“Enter a valid http(s) URL.”
- The host must not point inside the platform’s own network: localhost,
*.local, *.localhost, and private or reserved IP addresses are rejected
with “That host can’t be crawled.” This is a first line of defence against
pointing the crawler at an internal address.
- The same URL can’t be added twice in one workspace; a duplicate is rejected
with “That URL is already in your training base.”
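
The checks above can be approximated as follows. This is a sketch, not the platform's validator: the function name, the exact set of IP categories treated as "private or reserved", and the exact-string duplicate comparison are all assumptions. It also inspects only the literal host, so a public DNS name that resolves to an internal address would pass here.

```python
import ipaddress
from urllib.parse import urlparse

ERR_INVALID = "Enter a valid http(s) URL."
ERR_HOST = "That host can’t be crawled."
ERR_DUPLICATE = "That URL is already in your training base."


def validate_source_url(url, existing_urls=frozenset()):
    """Return an error message, or None when the URL passes all three checks."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:  # e.g. a malformed IPv6 literal
        return ERR_INVALID
    if parsed.scheme not in ("http", "https") or not host:
        return ERR_INVALID

    host = host.lower().rstrip(".")
    if host == "localhost" or host.endswith((".local", ".localhost")):
        return ERR_HOST
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a DNS name rather than an IP literal
    if ip is not None and (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    ):
        return ERR_HOST

    # Assumption: duplicates are detected by exact string match per workspace.
    if url in existing_urls:
        return ERR_DUPLICATE
    return None
```

For instance, `validate_source_url("http://10.0.0.5/")` returns the host error, and `validate_source_url("ftp://example.com")` returns the invalid-URL error.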
The source list
The list shows, per source:
| Column | Meaning |
|---|---|
| Source | The host and the full starting URL. |
| Type | Website, Sitemap, or Single page. |
| Status | The current crawl state — see below. |
| Pages | Pages indexed. While crawling, shows indexed / total with a progress bar. |
| Last crawl | When the source last finished crawling, or queued if it hasn’t run yet. |
| Schedule | Manual, Daily, Weekly, or Monthly. |
Three header metrics summarise the workspace: total Sources, total Pages
indexed, and the Last crawl time. Use the All / Ready / Crawling /
Failed filter chips above the list to narrow it by status.
Statuses
| Status | Meaning |
|---|---|
| Scheduled | Queued and waiting to be picked up by the crawler. |
| Crawling | A crawler has claimed the source and is fetching pages; the row shows live page counts and a progress bar. |
| Ready | The crawl finished successfully; the source is indexed and available for retrieval. |
| Failed | The crawl failed; the row is flagged and the last error is recorded. |
Re-crawl and delete
Each row has two management actions (visible to users with manage permission):
- Re-crawl resets the source to scheduled, marks it due now, and
clears any previous error so the crawler picks it up on its next pass. A flash
confirms “Re-crawl queued.” Use this after the source content changes or
after a failure.
- Delete removes the source from your workspace (“Source removed.”).
A third action opens the source URL in a new browser tab.
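
The four statuses and the Re-crawl action form a small state machine. The sketch below models it with hypothetical event names (`claim`, `finish_ok`, `finish_error`, `recrawl`). The set of states each event is allowed from is an assumption; in particular, this page does not say whether Re-crawl is available while a source is already scheduled or crawling.

```python
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"
    CRAWLING = "crawling"
    READY = "ready"
    FAILED = "failed"


# event -> (states the event is accepted from, resulting state)
# The "accepted from" sets are assumptions made for illustration.
TRANSITIONS = {
    "claim": ({Status.SCHEDULED}, Status.CRAWLING),
    "finish_ok": ({Status.CRAWLING}, Status.READY),
    "finish_error": ({Status.CRAWLING}, Status.FAILED),
    "recrawl": ({Status.READY, Status.FAILED}, Status.SCHEDULED),
}


def apply(status, event):
    """Return the next status, or raise if the event is not valid right now."""
    allowed, target = TRANSITIONS[event]
    if status not in allowed:
        raise ValueError(f"{event!r} is not allowed from {status.value!r}")
    return target
```

A failed source that is re-crawled goes back to scheduled, is claimed into crawling, and then ends in ready or failed again.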
How crawling happens
This page only owns the operator UI and the source records it writes —
crawling itself is performed by an external crawler service, not by the admin
app. The crawler talks to the platform over a dedicated, server-to-server
Knowledge-base API under /api/v1/training/. The lifecycle is:
- Pull the queue: GET /api/v1/training/queue returns sources that are
due to be crawled (across active workspaces). ?limit= caps the page size
(default 50, maximum 200).
- Claim a source: POST /api/v1/training/sources/{id}/claim moves a due
source into the crawling state. The request body carries the owning
tenant_id.
- Report progress: POST /api/v1/training/sources/{id}/progress streams
live page counts (pages_indexed, optional pages_total) while the crawl
runs; these drive the progress bar in the list.
- Report the result: POST /api/v1/training/sources/{id}/result
finalises the crawl, moving the source to ready or failed. On
success, the next crawl is automatically scheduled from the source’s
recrawl schedule; on failure, the error is stored and no re-crawl is
scheduled.
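
The scheduling rule in the last step can be sketched as a small function. The interval lengths are assumptions (in particular, Monthly is approximated as 30 days here), as is the choice to measure from the moment the crawl finished; the platform may compute the next run differently.

```python
from datetime import datetime, timedelta, timezone

# Assumed interval per recrawl schedule; None means "never automatically".
INTERVALS = {
    "manual": None,
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # assumption: a fixed 30-day month
}


def next_due(schedule, finished_at, succeeded):
    """Return when the source is next due, or None if no re-crawl is scheduled."""
    if not succeeded:
        return None  # on failure, no re-crawl is scheduled
    interval = INTERVALS[schedule]
    return None if interval is None else finished_at + interval


finished = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```

With these assumptions, a Weekly source that finished at `finished` is due again seven days later, while a Manual source, or any source whose crawl failed, waits for an operator to click Re-crawl.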
This API is authenticated with a deploy-wide bearer token
(ENGINE_API_TOKEN), shared with the rest of the service-to-service surface,
and every mutating call is written to the audit log. You don’t need to call it
directly — it exists so a crawler can be hosted and scaled independently of the
admin app.
Integrating a crawler
The full request and response shapes for these endpoints, plus the auth
contract, live in the Engine API reference.
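
As a rough orientation before reading that reference, the sketch below builds (but does not send) the first three lifecycle requests with Python's standard library. Only the paths, the `limit` bounds, and the field names `tenant_id`, `pages_indexed`, and `pages_total` come from this page. `BASE_URL`, the helper names, the JSON encoding, and the `Authorization: Bearer` header format are assumptions; the result request is omitted because its body is not described here.

```python
import json
import urllib.request

BASE_URL = "https://platform.example.com"  # hypothetical host


def build_request(method, path, token, body=None):
    """Build an authenticated request; JSON bodies are an assumption."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def queue_request(token, limit=50):
    """GET the due-source queue, clamping limit to the documented 1..200 range."""
    limit = max(1, min(limit, 200))
    return build_request("GET", f"/api/v1/training/queue?limit={limit}", token)


def claim_request(token, source_id, tenant_id):
    """POST a claim for a due source; the body carries the owning tenant_id."""
    path = f"/api/v1/training/sources/{source_id}/claim"
    return build_request("POST", path, token, {"tenant_id": tenant_id})


def progress_request(token, source_id, pages_indexed, pages_total=None):
    """POST live page counts; pages_total is optional."""
    body = {"pages_indexed": pages_indexed}
    if pages_total is not None:
        body["pages_total"] = pages_total
    path = f"/api/v1/training/sources/{source_id}/progress"
    return build_request("POST", path, token, body)
```

A crawler would typically read the token from the `ENGINE_API_TOKEN` environment variable and pass each request to `urllib.request.urlopen`. Check the Engine API reference for the real request and response shapes before relying on any of this.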
Permissions
| Resource | Grants |
|---|---|
| comerix.training.view | View sources and indexed snippets. |
| comerix.training.manage | Add, edit, re-crawl, and delete sources. |
Grant these per role on Users, roles & permissions.