The Training base is the knowledge base that grounds your AI agents’ answers. It lives at /admin/training (route admin_training_index, menu Knowledge → Training base).

What it’s for

Register sources: websites your AI agents should learn from. Each source is crawled and its text is extracted and indexed, so your flows can retrieve relevant snippets at answer time and respond with grounded, up-to-date context instead of guessing. The source type determines how its pages are discovered:

| Type | What gets indexed |
| --- | --- |
| Website | Start at the URL and follow every reachable link on the same domain, up to the configured crawl depth. |
| Sitemap | Fetch the sitemap and index every URL it lists. |
| Single page | Index just the one URL you provide and nothing else. |

Using a source in a flow: A flow references the training base from an AI agent node to inject the top matches as grounded context. Indexing a source here is what makes that content available for retrieval.
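
To illustrate how the Website type might discover pages, here is a minimal breadth-first sketch. It is not the crawler's actual implementation. The `links` map is a hypothetical in-memory stand-in for fetching and parsing pages, "same domain" is assumed to mean an exact host match, and the starting URL is assumed to sit at depth 0.

```python
from collections import deque
from urllib.parse import urlparse


def discover(start_url, links, max_depth):
    """Breadth-first discovery limited to the start URL's host.

    `links` maps a page URL to the URLs it links to (hypothetical
    stand-in for real HTTP fetching and HTML parsing).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for nxt in links.get(url, []):
            # Assumption: "same domain" means the same host as the start URL.
            if urlparse(nxt).netloc == host and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With a depth of 2, pages three hops away from the starting URL are never visited, and links to other hosts are ignored regardless of depth.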

Adding a source

In the Add a source panel:
  1. Choose a source type — Website, Sitemap, or Single page.
  2. Paste the URL (for example https://www.example.com).
  3. Click Save.
The source is created in the scheduled state and queued for crawling. A flash confirms it: “Source example.com queued for crawling.”

Advanced settings

Expand Advanced settings to tune the crawl before you save:

| Field | What it does | Default |
| --- | --- | --- |
| Crawl depth | How many link hops away from the starting URL to follow. Accepted range is 1–20. | 3 |
| Recrawl schedule | How often the source is re-fetched: Manual, Daily, Weekly, or Monthly. | Manual |
| Include patterns | Comma-separated paths the crawler is allowed to visit. Globs supported (e.g. /blog/*, /docs/*). | |
| Exclude patterns | Paths the crawler must skip. Useful for cart, checkout, and search pages (e.g. /cart/*, /checkout/*, *.pdf). | |
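
The following sketch shows one way include and exclude patterns could be evaluated against a URL. It rests on several assumptions that this page does not confirm: patterns are matched against the URL path only, exclude patterns take precedence over include patterns, an empty include list allows every path, and exclude patterns are comma-separated like include patterns. The function names are illustrative.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def parse_patterns(raw):
    """Split a comma-separated pattern field into a clean list."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def is_allowed(url, include, exclude):
    """Decide whether a URL passes the include/exclude globs."""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match always wins.
    if any(fnmatchcase(path, p) for p in exclude):
        return False
    # Assumption: no include patterns means every path is allowed.
    if not include:
        return True
    return any(fnmatchcase(path, p) for p in include)
```

For example, with include patterns `/blog/*, /docs/*` and exclude pattern `*.pdf`, the path `/docs/guide.pdf` is skipped because the exclude match is checked first.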

URL validation

Every URL is checked before the source is saved:
  • It must be a well-formed http(s) URL with a host. Otherwise you get “Enter a valid http(s) URL.”
  • The host must not point inside the platform’s own network — localhost, *.local, *.localhost, and private or reserved IP addresses are rejected with “That host can’t be crawled.” This is a first line of defence against pointing the crawler at an internal address.
  • The same URL can’t be added twice in one workspace; a duplicate is rejected with “That URL is already in your training base.”
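
The three checks above can be sketched roughly as follows. This is an approximation, not the platform's validator: it inspects only the literal host (it does not resolve DNS, so a public hostname that resolves to a private address is not caught), and it treats duplicates as exact string matches within a caller-supplied collection. Both choices are assumptions.

```python
import ipaddress
from urllib.parse import urlparse

INVALID_URL = "Enter a valid http(s) URL."
BLOCKED_HOST = "That host can’t be crawled."
DUPLICATE = "That URL is already in your training base."


def validate_source_url(url, existing_urls=()):
    """Return an error message, or None if the URL passes all checks."""
    url = url.strip()
    try:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
    except ValueError:  # e.g. a malformed IPv6 literal
        return INVALID_URL
    if parsed.scheme not in ("http", "https") or not host:
        return INVALID_URL
    if host == "localhost" or host.endswith((".local", ".localhost")):
        return BLOCKED_HOST
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a hostname, not an IP literal
    if ip is not None and (
        ip.is_private or ip.is_reserved or ip.is_loopback or ip.is_link_local
    ):
        return BLOCKED_HOST
    # Assumption: duplicates are detected by exact string comparison.
    if url in existing_urls:
        return DUPLICATE
    return None
```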

The source list

The list shows, per source:

| Column | Meaning |
| --- | --- |
| Source | The host and the full starting URL. |
| Type | Website, Sitemap, or Single page. |
| Status | The current crawl state (see below). |
| Pages | Pages indexed. While crawling, shows indexed / total with a progress bar. |
| Last crawl | When the source last finished crawling, or queued if it hasn’t run yet. |
| Schedule | Manual, Daily, Weekly, or Monthly. |

Three header metrics summarise the workspace: total Sources, total Pages indexed, and the Last crawl time. Use the All / Ready / Crawling / Failed filter chips above the list to narrow it by status.

Statuses


| Status | Meaning |
| --- | --- |
| Scheduled | Queued and waiting to be picked up by the crawler. |
| Crawling | A crawler has claimed the source and is fetching pages; the row shows live page counts and a progress bar. |
| Ready | The crawl finished successfully; the source is indexed and available for retrieval. |
| Failed | The crawl failed; the row is flagged and the last error is recorded. |
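
One way to picture these statuses is as a small state machine. The transition table below is inferred from the descriptions on this page and is an assumption, not a documented contract. In particular, it assumes a source returns to scheduled only from ready or failed, and it leaves out any transition this page does not describe.

```python
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"
    CRAWLING = "crawling"
    READY = "ready"
    FAILED = "failed"


# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    Status.SCHEDULED: {Status.CRAWLING},             # crawler claims the source
    Status.CRAWLING: {Status.READY, Status.FAILED},  # crawler reports a result
    Status.READY: {Status.SCHEDULED},                # re-crawl queued
    Status.FAILED: {Status.SCHEDULED},               # re-crawl after a failure
}


def transition(current, target):
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```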

Re-crawl and delete

Each row has two actions that are visible only to users with the comerix.training.manage permission:
  • Re-crawl resets the source to scheduled, marks it due now, and clears any previous error so the crawler picks it up on its next pass. A flash confirms “Re-crawl queued.” Use this after the source content changes or after a failure.
  • Delete removes the source from your workspace (“Source removed.”).
A third action opens the source URL in a new browser tab.
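
As a rough model of what Re-crawl does to a source record, the sketch below resets the three pieces of state described above. The `Source` class and its field names (`due_at`, `last_error`) are hypothetical; the real record layout is not documented on this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Source:
    # Hypothetical record shape, for illustration only.
    url: str
    status: str = "scheduled"
    due_at: Optional[datetime] = None
    last_error: Optional[str] = None


def recrawl(source, now=None):
    """Reset the source so the crawler picks it up on its next pass."""
    source.status = "scheduled"
    source.due_at = now or datetime.now(timezone.utc)  # due immediately
    source.last_error = None  # clear any previous failure
    return "Re-crawl queued."
```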

How crawling happens

This page only owns the operator UI and the source records it writes — crawling itself is performed by an external crawler service, not by the admin app. The crawler talks to the platform over a dedicated, server-to-server Knowledge-base API under /api/v1/training/. The lifecycle is:
  1. Pull the queue: GET /api/v1/training/queue returns sources that are due to be crawled (across active workspaces). The ?limit= parameter caps the page size (default 50, maximum 200).
  2. Claim a source: POST /api/v1/training/sources/{id}/claim moves a due source into the crawling state. The request body carries the owning tenant_id.
  3. Report progress: POST /api/v1/training/sources/{id}/progress streams live page counts (pages_indexed, optional pages_total) while the crawl runs; these drive the progress bar in the list.
  4. Report the result: POST /api/v1/training/sources/{id}/result finalises the crawl, moving the source to ready or failed. On success, the next crawl is automatically scheduled from the source’s recrawl schedule; on failure, the error is stored and no re-crawl is scheduled.
This API is authenticated with a deploy-wide bearer token (ENGINE_API_TOKEN), shared with the rest of the service-to-service surface, and every mutating call is written to the audit log. You don’t need to call it directly — it exists so a crawler can be hosted and scaled independently of the admin app.
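
For readers building a crawler, the sketch below shows how the four calls might be assembled with Python's standard library. It only builds request objects and sends nothing. Several details are assumptions rather than documented facts: JSON request bodies, the `Authorization: Bearer` header form, and the `TrainingApiClient` class itself. The result payload is passed through untouched because its shape is not described on this page.

```python
import json
import urllib.request


class TrainingApiClient:
    """Hypothetical helper that builds requests for /api/v1/training/."""

    def __init__(self, base_url, token):
        self.base_url = base_url.rstrip("/")
        self.token = token  # the deploy-wide ENGINE_API_TOKEN value

    def _request(self, method, path, body=None):
        # Assumption: request bodies are JSON.
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(self.base_url + path, data=data, method=method)
        req.add_header("Authorization", f"Bearer {self.token}")
        if data is not None:
            req.add_header("Content-Type", "application/json")
        return req

    def queue(self, limit=50):
        limit = max(1, min(limit, 200))  # documented maximum is 200
        return self._request("GET", f"/api/v1/training/queue?limit={limit}")

    def claim(self, source_id, tenant_id):
        path = f"/api/v1/training/sources/{source_id}/claim"
        return self._request("POST", path, {"tenant_id": tenant_id})

    def progress(self, source_id, pages_indexed, pages_total=None):
        body = {"pages_indexed": pages_indexed}
        if pages_total is not None:
            body["pages_total"] = pages_total
        path = f"/api/v1/training/sources/{source_id}/progress"
        return self._request("POST", path, body)

    def result(self, source_id, payload):
        # The payload shape is defined in the Engine API reference.
        path = f"/api/v1/training/sources/{source_id}/result"
        return self._request("POST", path, payload)
```

Each method returns a `urllib.request.Request`; passing it to `urllib.request.urlopen` would perform the call.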
Integrating a crawler: The full request and response shapes for these endpoints, plus the auth contract, live in the Engine API reference.
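
The scheduling rule in the result step (a next crawl after success, none after failure) could look something like the sketch below. The exact intervals are assumptions: this page does not say how long Monthly is or whether the interval is measured from the finish time, so the sketch uses 30 days and the finish time. It also assumes a Manual schedule never queues an automatic crawl.

```python
from datetime import datetime, timedelta, timezone

# Assumed intervals; Monthly as 30 days is a guess.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


def next_crawl_at(schedule, finished_at, succeeded):
    """Return when the next crawl is due, or None if none is scheduled."""
    if not succeeded:
        return None  # failed crawls are not rescheduled
    key = schedule.lower()
    if key == "manual":
        return None  # assumption: Manual means operator-triggered only
    if key not in INTERVALS:
        raise ValueError(f"unknown schedule: {schedule}")
    return finished_at + INTERVALS[key]
```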

Permissions


| Resource | Grants |
| --- | --- |
| comerix.training.view | View sources and indexed snippets. |
| comerix.training.manage | Add, edit, re-crawl, and delete sources. |

Grant these per role on Users, roles & permissions.
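
As an illustration of how these two resources map to actions, the sketch below checks a user's granted resources against the action being attempted. The action names and the `can` helper are hypothetical, and the sketch assumes manage does not implicitly include view, which this page does not state either way.

```python
# Hypothetical mapping from UI action to the resource it requires.
REQUIRED = {
    "view": "comerix.training.view",
    "add": "comerix.training.manage",
    "edit": "comerix.training.manage",
    "recrawl": "comerix.training.manage",
    "delete": "comerix.training.manage",
}


def can(granted, action):
    """Return True if the granted resources cover the action."""
    # Assumption: manage does not imply view; each is checked separately.
    return REQUIRED[action] in granted
```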