JarvisBitz Tech

How Retrieval-Augmented Generation Works

Ground AI answers in your real data — with citations, freshness, and zero tolerance for hallucination.

What is RAG?

Retrieve, Augment, Generate

RAG bridges the gap between an LLM’s general knowledge and your specific domain data — producing answers grounded in verifiable sources.

01

Query

User asks a question

A natural-language question or instruction enters the system. The query is analyzed for intent, entities, and scope before routing into the retrieval pipeline.

02

Retrieve

Find relevant documents

03

Augment

Inject context into prompt

04

Generate

Grounded answer + citations
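The four stages above can be sketched end to end in a few lines of Python. Everything here is illustrative: the in-memory document store, the word-overlap retriever, and the stand-in `generate` function are toy placeholders, not a real API.

```python
# Toy end-to-end RAG sketch: Query -> Retrieve -> Augment -> Generate.
# All names (retrieve, augment, generate) are illustrative, not a real library.

DOCS = {
    "doc1": "RAG retrieves documents and injects them into the prompt.",
    "doc2": "Fine-tuning bakes knowledge into model weights.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive word overlap with the query (toy retriever)."""
    q = set(query.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(q & set(DOCS[d].lower().split())))
    return ranked[:k]

def augment(query: str, doc_ids: list[str]) -> str:
    """Assemble the grounded prompt: retrieved context + user query."""
    context = "\n".join(f"[{d}] {DOCS[d]}" for d in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations."

def generate(prompt: str) -> str:
    """Stand-in for the LLM call; a real system would invoke a model here."""
    return f"(model answer grounded in a prompt of {len(prompt)} chars)"

ids = retrieve("How does RAG inject documents into the prompt?")
answer = generate(augment("How does RAG inject documents into the prompt?", ids))
```

A production system would swap each placeholder for a real component: a vector search for `retrieve`, a prompt template for `augment`, and a model call for `generate`.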


Reduces hallucination

Answers are grounded in retrieved documents, not fabricated from training data.

Uses your data

Proprietary knowledge, internal docs, and domain data — without retraining the model.

Provides citations

Every answer traces back to source documents with verifiable references.

Stays current

Knowledge is updated by re-indexing, not by retraining billion-parameter models.

Architecture

Two pipelines, one system

RAG consists of an offline ingestion pipeline that prepares knowledge, and a real-time query pipeline that retrieves and generates.

Ingestion (Offline)
01

Source Documents

PDFs, wikis, APIs, databases

02

Chunk

Semantic splitting with overlap

03

Embed

Dense vector representation

04

Index & Store

Vector DB + metadata

Query (Real-time)
01

Embed Query

Same model as ingestion

02

Similarity Search

Approximate nearest-neighbor (ANN) search over the vector index

03

Re-Rank

Cross-encoder scoring

04

Context Assembly

Deduplicate, truncate, order

→ LLM PROMPT = CONTEXT + QUERY
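A minimal sketch of the query path, under the same toy-embedding assumption as the ingestion sketch: embed the query with the same model, rank stored vectors by cosine similarity, and assemble the context into the prompt. The re-rank stage is omitted for brevity.

```python
# Sketch of the real-time query path: embed -> similarity search -> assemble.
# The embedding MUST match the one used at ingestion time.
import math

def embed(text: str) -> list[float]:
    """Toy letter-frequency embedding; must match the ingestion-time model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb or 1.0)

def search(query: str, index: list[tuple[list[float], str]], k: int = 2) -> list[str]:
    """Exact top-k by cosine similarity; real systems use ANN for scale."""
    qv = embed(query)
    ranked = sorted(index, key=lambda item: -cosine(qv, item[0]))
    return [text for _, text in ranked[:k]]

index = [(embed(t), t) for t in [
    "Vector databases store dense embeddings.",
    "Chunking splits documents with overlap.",
    "Cats are popular pets.",
]]
top = search("How are embeddings stored?", index)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuery: How are embeddings stored?"
```

The final line is the "LLM PROMPT = CONTEXT + QUERY" step: retrieved text goes in front of the user's question before the model ever sees it.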
Core Concept

Vector embeddings explained

Embeddings convert text into numerical vectors where meaning is encoded as position. Similar concepts cluster together in this high-dimensional space.

2D projection of vector space

Similar concepts cluster into groups.

Example clusters: Revenue / Profit / Sales · Dog / Cat / Puppy · Python / Java / TypeScript · Contract / Agreement / Policy (plotted along Dimension 1 and Dimension 2)

How it works

1

Text → Tokens

Input text is tokenized into sub-word units.

2

Tokens → Vector

An embedding model maps each token sequence to a single dense vector.

3

Vector → Space

The vector lives in a high-dimensional space where distance = semantic similarity.

4

Query → Nearest Neighbors

At search time, the query vector finds the closest document vectors.
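The nearest-neighbor step can be illustrated with hand-placed 2D vectors mirroring the cluster projection above. The coordinates are invented for illustration, not real embedding output.

```python
# Hand-placed 2D "embeddings": related words sit near each other, and a
# query vector finds its nearest neighbors by distance. Coordinates are
# made up for illustration only.
import math

VECS = {
    "revenue": (0.9, 0.8), "profit": (0.85, 0.75), "sales": (0.8, 0.85),
    "dog": (-0.8, 0.6), "cat": (-0.75, 0.65),
    "python": (0.1, -0.9), "java": (0.15, -0.85),
}

def nearest(query_vec: tuple[float, float], k: int = 2) -> list[str]:
    """Return the k words whose vectors lie closest to the query vector."""
    return sorted(VECS, key=lambda w: math.dist(query_vec, VECS[w]))[:k]

# A "puppy"-like query vector lands in the animal cluster.
print(nearest((-0.7, 0.6)))  # → ['cat', 'dog']
```

Real embeddings have hundreds or thousands of dimensions, but the geometry is the same: search is literally "find the closest points."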

Embedding models

text-embedding-3-large (OpenAI): 3,072 dimensions
embed-v3 (Cohere): 1,024 dimensions
voyage-large-2 (Voyage AI): 1,536 dimensions
BGE-M3 (BAAI): 1,024 dimensions
Quality Gates

Four quality controls

Every RAG response is measured against these metrics. If any metric falls below its threshold, the system raises an alert before users are affected.

94%

Retrieval Precision

Fraction of retrieved chunks that are genuinely relevant. High precision means the LLM receives clean, on-topic context — minimizing noise that could mislead generation.

91%

Answer Faithfulness

Measures whether every claim in the generated answer is supported by the retrieved source context. Unfaithful answers introduce information not present in the documents.

97%

Citation Accuracy

Percentage of inline citations that correctly point to the source chunk they claim to reference. Incorrect citations erode trust even when the answer itself is correct.

96%

Hallucination Detection

Rate at which the system identifies and flags claims that are not supported by any retrieved source — catching fabrications before they reach the user.
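A faithfulness-style metric can be sketched with a simple keyword check: the fraction of answer sentences whose content words all appear in some retrieved chunk. Production systems typically use an NLI model or an LLM judge; this toy version only shows the shape of the metric.

```python
# Toy faithfulness metric: a sentence counts as "supported" only if every
# one of its words appears somewhere in the retrieved chunks. A real
# evaluator would use entailment models, not string matching.

def faithfulness(answer_sentences: list[str], chunks: list[str]) -> float:
    """Fraction of answer sentences fully covered by the retrieved context."""
    corpus = " ".join(chunks).lower()
    supported = sum(
        all(word in corpus for word in sentence.lower().split())
        for sentence in answer_sentences
    )
    return supported / len(answer_sentences)

chunks = ["the policy covers water damage", "claims must be filed in 30 days"]
answer = ["the policy covers water damage",   # supported by chunk 1
          "claims expire after 90 days"]      # "expire"/"90" not in any chunk
score = faithfulness(answer, chunks)  # 0.5: one of two sentences supported
```

The same pattern extends to the other gates: retrieval precision scores chunks against relevance labels, and citation accuracy checks each inline reference against the chunk it points to.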

Decision Framework

When to use RAG

RAG is one of three strategies for adapting LLMs. Choose based on data freshness requirements, domain complexity, and infrastructure budget.

Recommended

RAG

Retrieve external documents at query time and inject them as context. No model retraining required.

Advantages
Uses latest data · Transparent citations · No training cost · Easy to update
Trade-offs
Retrieval latency · Context window limits · Depends on index quality
BEST FOR: Dynamic knowledge, compliance, Q&A over docs

Fine-Tuning

Train the model on domain data to internalize patterns, style, and knowledge into its weights.

Advantages
Faster inference · Consistent style · Compressed knowledge · Works offline
Trade-offs
Expensive training · Static knowledge · Risk of catastrophic forgetting
BEST FOR: Stable domains, custom tone, high-volume tasks

Prompt Engineering

Craft instructions and examples in the prompt to guide the model without any data pipeline.

Advantages
Zero infrastructure · Instant iteration · No data prep · Universal applicability
Trade-offs
Limited to model knowledge · Inconsistent on edge cases · Token-limited context
BEST FOR: General tasks, prototyping, broad applicability
Core Principle

“No source, no claim.”

If the retriever cannot surface a supporting document, the model is instructed to say so — never to fabricate.
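This principle can be enforced at the prompt level. The sketch below is one way to do it, with illustrative wording that should be tuned per model; `build_prompt` and `REFUSAL` are hypothetical names, not part of any library.

```python
# Sketch of "no source, no claim" enforced in the prompt. If retrieval
# returns nothing, the model is told to refuse rather than improvise.

REFUSAL = "I could not find a supporting source for that."

def build_prompt(query: str, chunks: list[str]) -> str:
    """Grounded prompt with numbered sources, or an explicit refusal path."""
    if not chunks:  # retrieval came back empty: instruct refusal outright
        return f"Say exactly: {REFUSAL}"
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer ONLY from the sources below and cite [n] for every claim. "
        f"If the sources do not support an answer, say: {REFUSAL}\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
```

Pairing this prompt-level rule with the hallucination-detection gate above gives two independent layers: the model is told not to fabricate, and the output is checked anyway.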

Build a RAG system on your data.

Describe your knowledge sources and the AI will map this architecture to your environment.