Training on Documents (PDF, TXT, MD, XML, CSV)

Some answers live in a file, not on a published page — a staff handbook, a product manual, a price list, a legal policy. Document training lets you upload those files so PurioChat can learn from them, even when they never appear on your site.

Pro feature. Document training is available in PurioChat Pro.

Where to find it

Open PurioChat → Data Training, switch to the Database tab, and open the Document Manager. From there you can upload files, train them, and manage everything you’ve added.

Supported file types

PurioChat reads each file and pulls out the plain text so the AI can search and quote from it.

Format	Extension	How the text is read
PDF	`.pdf`	Text is extracted from the pages. Image-only, scanned, or encrypted PDFs can’t be read — you’ll see an error if no selectable text is found.
Plain text	`.txt`	Read directly, as-is.
Markdown	`.md`	Read directly, as-is.
XML	`.xml`	Parsed and flattened into readable `tag: value` lines.
CSV	`.csv`	Each row is turned into readable `Header: value` lines.

Each file can be up to 50 MB, and you can select multiple files at once.
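
To make the XML and CSV rows in the table above more concrete, here is a minimal Python sketch of that kind of flattening. It is an illustration only, not PurioChat's actual parser: the function names `csv_to_lines` and `xml_to_lines` are hypothetical, and the exact output format PurioChat produces may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_lines(csv_text: str) -> list[str]:
    """Turn each CSV row into one block of 'Header: value' lines (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    blocks = []
    for row in reader:
        # Skip empty cells so the block contains only meaningful pairs.
        lines = [f"{header}: {value}" for header, value in row.items() if value]
        blocks.append("\n".join(lines))
    return blocks


def xml_to_lines(xml_text: str) -> list[str]:
    """Flatten every element that carries text into a 'tag: value' line (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        f"{element.tag}: {element.text.strip()}"
        for element in root.iter()
        if element.text and element.text.strip()
    ]


if __name__ == "__main__":
    print(csv_to_lines("Product,Price\nWidget,9.99\nGadget,24.50\n"))
    print(xml_to_lines("<item><name>Widget</name><price>9.99</price></item>"))
```

Flattening like this keeps each value next to its label, which is what lets the AI quote a price or a name along with the column or tag it came from.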

Heads up: If your PDF is a scan or photo with no real text behind the image, there’s nothing to extract. Run it through OCR first, or upload a text-based version.

How training works: two steps

Training happens in two steps. This keeps large uploads from timing out and lets you review what you’ve added before spending API credits.

Upload. PurioChat extracts the text from each file, splits it into smaller pieces, and saves them. No embeddings are generated yet, so this step doesn’t call your AI provider.
Train Now. Click Train Now to generate the embeddings that make your documents searchable. PurioChat works in batches (about 10 pieces per batch) so even big files process reliably.
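
The two steps above can be sketched roughly as follows. This is a simplified Python model, not PurioChat's code: `Chunk`, `upload`, `train_now`, and `fake_embed_batch` are hypothetical names, and the stand-in embedding function replaces the real call to your AI provider. The batch size of 10 comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

BATCH_SIZE = 10  # "about 10 pieces per batch", per the description above


@dataclass
class Chunk:
    text: str
    embedding: Optional[list[float]] = None  # None until Train Now has run


def upload(pieces: list[str]) -> list[Chunk]:
    """Step 1: save the pieces. No embeddings, so no AI provider call."""
    return [Chunk(text=piece) for piece in pieces]


def train_now(
    chunks: list[Chunk],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Step 2: embed only the pieces that are still pending, one batch at a time.

    Returns the number of batches that were sent to the provider.
    """
    pending = [chunk for chunk in chunks if chunk.embedding is None]
    batches = 0
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_batch([chunk.text for chunk in batch])
        for chunk, vector in zip(batch, vectors):
            chunk.embedding = vector
        batches += 1
    return batches


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for a real embedding API call (assumption for this sketch)."""
    return [[float(len(text))] for text in texts]


if __name__ == "__main__":
    chunks = upload([f"piece {i}" for i in range(25)])
    print(train_now(chunks, fake_embed_batch))  # 25 pieces -> 3 batches
    print(train_now(chunks, fake_embed_batch))  # nothing pending -> 0 batches
```

Because only pending pieces are picked up, running the second step again after an interruption resumes where it stopped instead of paying for the same embeddings twice.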

Tip: Uploading a file isn’t the same as training it. A document only becomes searchable in chat after you click Train Now.

Screenshot: the Document Manager modal in Data Training, showing the file picker, an uploaded file with a Trained status badge and chunk count, and the Train Now and Delete buttons.

How documents are split and stored

Long files are broken into smaller, overlapping pieces (called “chunks”) so the AI can pinpoint the most relevant passage instead of reading a whole document at once. The split is based on character length: files up to roughly 7,000 characters stay as a single piece, and longer files are divided into pieces of about 3,500 characters each, with a little overlap so context isn’t lost between them.
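
As a rough illustration of the splitting rule described above, here is a minimal Python sketch. The 7,000 and 3,500 character figures come from the text; the 200-character overlap is an assumption chosen for the example (the exact overlap is not stated), and `split_into_chunks` is a hypothetical name, not part of PurioChat.

```python
def split_into_chunks(
    text: str,
    single_piece_limit: int = 7000,  # files up to roughly this size stay whole
    chunk_size: int = 3500,          # approximate size of each piece
    overlap: int = 200,              # assumed value; the real overlap is not documented here
) -> list[str]:
    """Split text into overlapping, character-based pieces."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if len(text) <= single_piece_limit:
        return [text]

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this piece already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    short_doc = "a" * 5000
    long_doc = "".join(str(i % 10) for i in range(10000))
    print(len(split_into_chunks(short_doc)))            # 1
    print([len(c) for c in split_into_chunks(long_doc)])  # [3500, 3500, 3400]
```

Each piece starts slightly before the previous one ends, so a sentence that straddles a boundary still appears whole in at least one piece.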

Each piece is saved as a post in a hidden, non-public custom post type. Your documents live inside WordPress for the AI to use, but they never show up on your front end, in search results, or in your sitemap.

Document statuses

Every file in the Document Manager shows a status:

Trained — all pieces have embeddings and are fully searchable.
Partial — some pieces are trained and some aren’t. Click Train Now again to finish the rest.
Pending — the file is uploaded but no embeddings exist yet. Click Train Now to start.

Each row also shows how many pieces are indexed out of the total, so you can track progress at a glance.
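
The three statuses follow directly from the indexed and total counts on each row. A minimal Python sketch of that mapping, with `document_status` as a hypothetical name used only for illustration:

```python
def document_status(indexed: int, total: int) -> str:
    """Derive a status label from the number of indexed pieces out of the total."""
    if total > 0 and indexed >= total:
        return "Trained"   # every piece has an embedding
    if indexed > 0:
        return "Partial"   # some pieces trained, some still waiting
    return "Pending"       # uploaded, but nothing trained yet


if __name__ == "__main__":
    print(document_status(12, 12))  # Trained
    print(document_status(5, 12))   # Partial
    print(document_status(0, 12))   # Pending
```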

Deleting a document

Use the Delete button next to any file. This removes all of that document’s pieces and their embeddings, so the AI immediately stops using it. To replace a file with an updated version, delete the old one first, then upload and train the new copy.

Good things to upload

Document training works best for reference material that visitors ask about but that isn’t published on a public page:

Company policies, procedures, and staff handbooks
Product manuals and user guides
FAQ documents and internal help articles
Terms and conditions, privacy notices, and other legal documents