How does Onyx work?

Onyx is built on Retrieval-Augmented Generation (RAG) — but goes well beyond a basic implementation. The retrieval pipeline incorporates techniques from Anthropic's engineering research that meaningfully improve answer accuracy over off-the-shelf approaches.

Step 1 — Indexing your content

When you add a connection, Onyx processes that content through a multi-stage pipeline:

  1. Discovering — For web sources, the crawler discovers pages via sitemaps or by following links within your domain. For files, the content is extracted directly.
  2. Extracting — HTML, boilerplate, and navigation noise are stripped down to clean, readable text.
  3. Processing — Documents are split into semantically coherent chunks. Each chunk is then enriched with a generated context summary — a short description of where that chunk sits within the broader document. This context is embedded with the chunk rather than stored separately, so retrieval understands not just what a passage says but what it's about.
  4. Comparing — Each chunk is diffed against what's already indexed to detect changes. Only new or modified chunks are re-embedded, keeping re-index runs fast and cost-efficient.
  5. Embedding — Changed chunks are converted into vector embeddings using OpenAI's text-embedding-3-small model and written to the index alongside their contextual summaries.
  6. Saving — The index is committed. The connection status updates to Indexed once complete.
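The change-detection step above can be sketched with content hashes. This is a minimal illustration under assumptions, not Onyx's actual implementation: the function name, the `(chunk_id, text)` shape, and the choice of SHA-256 are all hypothetical, but the idea is the same — hash each chunk, compare against the previous run, and only re-embed what changed.

```python
import hashlib

def plan_reembedding(chunks, previous_hashes):
    """Return only the chunks whose content changed since the last index run.

    chunks: list of (chunk_id, text) pairs produced by the splitter.
    previous_hashes: {chunk_id: sha256 hex digest} saved from the prior run.
    """
    to_embed, new_hashes = [], {}
    for chunk_id, text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[chunk_id] = digest
        # Unseen or modified chunk: queue it for embedding.
        if previous_hashes.get(chunk_id) != digest:
            to_embed.append((chunk_id, text))
    return to_embed, new_hashes
```

On a first run every chunk is queued; on a re-index of unchanged content the queue comes back empty, which is what keeps re-index runs fast and cheap.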

Step 2 — Answering a question

When a user sends a message in one of your configured Discord channels:

  1. Intent classification — Onyx first checks whether the message is a genuine question or support request. Greetings, reactions, and off-topic messages are silently ignored. You are never charged for filtered messages.
  2. Hybrid search — For genuine questions, Onyx searches using both semantic vector similarity and BM25 lexical matching in parallel, then fuses the two result lists. Anthropic's contextual retrieval research found that pairing embeddings with BM25 (on top of contextual chunk summaries) reduced retrieval failure rates by 49% — the hybrid approach catches the exact terms and identifiers that pure semantic search misses, and vice versa.
  3. Structured context assembly — Retrieved chunks are passed to the language model in a structured format that clearly separates source metadata, chunk context, and raw content. This reduces hallucination risk by giving the model explicit signals about what each piece of information is and where it came from.
  4. Answer generation — The model generates a clear, concise answer anchored to your indexed content and posts it in Discord — inline or in a thread, depending on your channel configuration.
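The fusion step in hybrid search can be sketched with reciprocal rank fusion (RRF), a common way to merge rankings from different retrieval methods. The document doesn't say which fusion method Onyx uses, so treat this as an assumed, illustrative choice; the function name and the conventional `k=60` constant are not from the source.

```python
def reciprocal_rank_fusion(vector_ranked, bm25_ranked, k=60):
    """Fuse two ranked lists of chunk IDs into a single ranking.

    Each chunk scores 1 / (k + rank) per list it appears in, so a chunk
    ranked highly by either method floats toward the top of the fused list.
    """
    scores = {}
    for ranking in (vector_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in both lists — even mid-ranked in each — typically beats a chunk that only one method surfaced, which is exactly the behavior you want from hybrid search.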

Conversation memory

Onyx maintains context across a conversation, so users can ask natural follow-up questions without restating what they're asking about.

In threaded conversations, Onyx has access to the full thread history. For inline channel responses, it looks back at recent messages from the same user within a short time window.

When a follow-up contains an ambiguous reference — "it", "that", "this" — Onyx rewrites the query into a fully standalone search before hitting the index. A question like "how do I configure it?" gets resolved to the actual subject before retrieval, so the right chunks come back.
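The rewriting step can be sketched as a prompt sent to the language model before retrieval. Everything below is illustrative: the function name, the prompt wording, and the `(author, message)` history shape are assumptions, not Onyx's actual prompt. The model's reply — a standalone question — is what gets sent to the index.

```python
def build_rewrite_prompt(history, follow_up):
    """Assemble an instruction asking a language model to resolve ambiguous
    references ("it", "that", "this") into a fully standalone search query.

    history: recent (author, message) pairs from the thread or channel.
    """
    transcript = "\n".join(f"{author}: {text}" for author, text in history)
    return (
        "Rewrite the final user message as a fully standalone question, "
        "replacing ambiguous references with their actual subjects.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final message: {follow_up}\n"
        "Standalone question:"
    )
```

Given a history containing "How do I add a web connection?" and the follow-up "how do I configure it?", the model would be expected to return something like "How do I configure a web connection?", which retrieves the right chunks where the raw follow-up would not.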

AI models

Onyx uses Gemini 2.5 Flash by default — fast, accurate, and the most cost-efficient option available. Starter, Growth, and Scale plan users can switch to a different model from the Configuration page, including models from Google, Anthropic, and OpenAI.

Keeping answers fresh

You can re-index any connection at any time from your dashboard. Deep crawls (full site re-indexes) are subject to a per-plan cooldown; single-page crawls have a daily limit that varies by plan.

If Onyx has gone quiet in your server, see Bot Not Responding for step-by-step troubleshooting.