JarvisBitz Tech

How Large Language Models Work

A technical deep-dive into the architecture that powers modern AI — from tokenization to generation.

What is an LLM?

From text to intelligence

A Large Language Model transforms raw text through five stages — each adding a deeper layer of understanding.

Tokenization

Raw text is split into sub-word tokens using a byte-pair encoding (BPE) vocabulary. "Understanding" might become ["under", "stand", "ing"]. This lets the model handle any word — even ones it has never seen — by composing known fragments.

Technology: BPE · SentencePiece · ~100K vocab
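The greedy longest-match loop below is a toy sketch of subword splitting. Real BPE tokenizers learn their merge rules from corpus statistics; the vocabulary here is hypothetical and chosen only to reproduce the "understanding" example.

```python
# Toy greedy longest-match subword tokenizer (a sketch of the idea
# behind BPE; real tokenizers learn merges from data).
VOCAB = {"under", "stand", "ing", "token", "the", "a"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest known subwords, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("understanding"))  # → ['under', 'stand', 'ing']
```

Because every unknown character falls through to a single-character token, the model can represent any input string, even words it has never seen.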

Tokens

The atomic units of text the model processes. A single word might be one token or several. GPT-4's tokenizer averages roughly ¾ of an English word per token (about 100 tokens per 75 words).

~100K vocabulary size
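That ratio gives a handy back-of-envelope conversion. This helper is a rough rule of thumb, not an exact count; the true ratio varies by tokenizer, language, and content:

```python
def estimate_tokens(word_count: int) -> int:
    """Rough rule of thumb for English text: ~4 tokens per 3 words."""
    return round(word_count * 4 / 3)

print(estimate_tokens(75))    # → 100
print(estimate_tokens(1500))  # → 2000
```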

Embeddings

Dense vector representations that capture the meaning of tokens. Similar concepts cluster together in a high-dimensional space, enabling mathematical reasoning over language.

12,288 dimensions
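"Similar concepts cluster together" is usually measured with cosine similarity between embedding vectors. The 4-dimensional vectors below are made-up toy values (real models use thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-d embeddings, chosen so that animals cluster together.
emb = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

print(cosine(emb["cat"], emb["dog"]))  # high: related concepts
print(cosine(emb["cat"], emb["car"]))  # low: unrelated concepts
```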

Attention

The mechanism that lets the model weigh the relevance of every token to every other token. It's how "it" in a sentence connects back to the noun it refers to, across hundreds of tokens.

128 attention heads
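The core computation is scaled dot-product attention: softmax(QKᵀ/√d)·V. Here is a single-head, pure-Python sketch with toy 2-d vectors (a real model runs many heads in parallel over learned projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(Q[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)  # one weight per token; sums to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Three tokens with hypothetical 2-d query/key/value vectors.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a weighted mix of all value vectors, which is exactly how "it" can pull in information from a noun hundreds of tokens away.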

Context Window

The maximum number of tokens the model can consider at once. Larger windows let the model reason over entire documents but increase compute cost quadratically with standard attention.

128K+ token context
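The quadratic cost is easy to see: standard self-attention scores every token against every other token, so the score matrix has n² entries:

```python
def attention_score_count(n_tokens: int) -> int:
    """Standard self-attention compares every token pair: n^2 scores."""
    return n_tokens * n_tokens

# A 128x longer context needs ~16,000x more attention compute.
ratio = attention_score_count(128_000) // attention_score_count(1_000)
print(ratio)  # → 16384
```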
Architecture

The Transformer architecture

Published in 2017 as “Attention Is All You Need,” the Transformer replaced recurrence with parallelizable self-attention — enabling models to scale to billions of parameters.

Input Embeddings + Positional Encoding

Multi-Head Self-Attention

Add & Layer Norm

Feed-Forward Network (SwiGLU)

Add & Layer Norm

Output Linear + Softmax

× 80–120 layers in frontier models
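The diagram's repeating unit (sub-layer → residual add → layer norm, twice) can be sketched in a few lines. This toy collapses the sequence dimension to a single vector and uses identity functions in place of the real attention and feed-forward sub-layers, purely to show the wiring:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def transformer_block(x, self_attention, feed_forward):
    # Sub-layer 1: self-attention, residual add, then layer norm.
    x = layer_norm([a + b for a, b in zip(x, self_attention(x))])
    # Sub-layer 2: feed-forward network, residual add, then layer norm.
    x = layer_norm([a + b for a, b in zip(x, feed_forward(x))])
    return x

# Identity sub-layers stand in for the real ones.
out = transformer_block([1.0, 2.0, 3.0], lambda x: x, lambda x: x)
print(out)  # zero mean, unit variance after the final norm
```

Stacking 80–120 of these blocks, each with its own learned weights, is what "deep" means in a frontier model.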

Self-Attention Visualized

For the sentence “The cat sat on the mat,” attention from the word “cat” might distribute like this:
The (5%)
cat (40%)
sat (15%)
on (5%)
the (5%)
mat (30%)
Capabilities

What LLMs can do

The same architecture unlocks fundamentally different capabilities depending on how it’s prompted and deployed.

Text Generation

Autoregressive next-token prediction allows LLMs to write fluent prose, dialogue, marketing copy, and structured documents in any style or tone.

The → quick → brown → fox → ...
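Autoregressive generation is a simple loop: predict a distribution over next tokens, pick one, append it, repeat. The table of next-token probabilities below is made up; a real model computes it with a full forward pass over the entire context:

```python
# Hypothetical next-token distributions (a real LLM computes these
# from the whole context on every step).
NEXT = {
    "The":   {"quick": 0.7, "cat": 0.3},
    "quick": {"brown": 0.9, "fox": 0.1},
    "brown": {"fox": 1.0},
    "fox":   {"jumps": 1.0},
}

def generate(prompt: str, steps: int) -> list[str]:
    tokens = [prompt]
    for _ in range(steps):
        dist = NEXT.get(tokens[-1], {})
        if not dist:
            break
        # Greedy decoding: always pick the most probable next token.
        tokens.append(max(dist, key=dist.get))
    return tokens

print(generate("The", 3))  # → ['The', 'quick', 'brown', 'fox']
```

Swapping the greedy `max` for weighted random sampling (with a temperature) is what makes outputs varied rather than deterministic.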

Reasoning & Analysis

Chain-of-thought prompting unlocks multi-step logical reasoning — breaking complex problems into sequential sub-problems the model solves one at a time.

Premise → Step 1 → Step 2 → Conclusion
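In practice, chain-of-thought is often just a prompt wrapper. The template below is one common pattern, not the only one; the exact wording is a stylistic choice:

```python
def chain_of_thought_prompt(question: str) -> str:
    """Wrap a question so the model reasons step by step before answering."""
    return (
        "Solve the problem below. Think step by step, numbering each\n"
        "step, then state the final answer on its own line.\n\n"
        f"Problem: {question}\n\n"
        "Step 1:"
    )

print(chain_of_thought_prompt(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
))
```

Ending the prompt with "Step 1:" nudges the model to continue the reasoning chain rather than jump straight to an answer.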

Code Generation

Trained on billions of lines of code, LLMs can write, debug, refactor, and explain software across dozens of programming languages with context awareness.

fn main() { ... }

Summarization & Extraction

LLMs condense long documents while preserving key information, extract structured data from unstructured text, and identify entities, sentiment, and intent.

10,000 words → 3 key insights
Practical Application

How we use LLMs

Three strategies for adapting LLMs to your domain — from zero-effort prompting to full fine-tuning.

Prompt Engineering

Craft precise instructions and few-shot examples to guide model behavior. Zero infrastructure required — the fastest path to value.

When: General tasks, quick iteration, broad applicability
Effort: Low
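A few-shot prompt is just an instruction followed by worked examples and then the real input. This assembler is one illustrative layout among many (the sentiment task and examples are hypothetical):

```python
def few_shot_prompt(task: str,
                    examples: list[tuple[str, str]],
                    query: str) -> str:
    """Assemble an instruction plus worked examples ahead of the real input."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Great product, works perfectly.", "positive"),
     ("Broke after two days.", "negative")],
    "Shipping was slow but the device is excellent.",
)
print(prompt)
```

The trailing "Output:" leaves the model exactly one natural continuation: the answer in the format the examples established.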

Retrieval-Augmented Generation

Ground the model in your proprietary data by retrieving relevant documents at query time. Keeps knowledge current without retraining.

When: Domain-specific Q&A, knowledge bases, compliance
Effort: Medium
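The RAG pattern has two steps: retrieve relevant documents, then stuff them into the prompt as context. The retriever below ranks by plain word overlap purely for illustration; production systems rank by embedding similarity in a vector index. The documents are made up:

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    return sorted(docs,
                  key=lambda d: len(words(query) & words(d)),
                  reverse=True)[:k]

def rag_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

docs = [
    "Our refund window is 30 days from delivery.",
    "The office is closed on public holidays.",
    "Refund requests require the original receipt.",
]
print(rag_prompt("What is the refund window?", docs))
```

Because the context is fetched at query time, updating the knowledge base updates the answers with no retraining.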

Fine-Tuning

Train the model's weights on your specific data to internalize domain patterns, tone, and formatting that prompting alone cannot achieve.

When: Consistent style, specialized domains, high volume
Effort: High
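Fine-tuning starts with a dataset of example conversations, commonly stored as one JSON object per line (JSONL). The record below shows the widely used chat-style shape; field names vary by provider, and "Acme Corp" and its contents are made up:

```python
import json

# One hypothetical supervised fine-tuning record in chat-style format.
record = {"messages": [
    {"role": "system",
     "content": "You are a support agent for Acme Corp."},
    {"role": "user",
     "content": "How do I reset my password?"},
    {"role": "assistant",
     "content": "Open Settings, choose Security, then select Reset password."},
]}

line = json.dumps(record)  # one such line per example in train.jsonl
print(line)
```

Hundreds to thousands of records like this teach the model the tone and formatting that prompting alone cannot lock in.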

Safety and alignment

Raw capability is not enough. These mechanisms keep LLMs reliable and safe.

RLHF alignment

Reinforcement Learning from Human Feedback steers outputs toward helpful, harmless, and honest responses.

Guardrails & filters

Input/output classifiers detect and block toxic, biased, or policy-violating content in real time.

Red-teaming

Adversarial testing by humans and automated probes surfaces failure modes before production deployment.

Constitutional AI

Self-critique loops let the model revise its own outputs against a set of explicit principles.

Landscape 2025

The model landscape today

Frontier models continue to scale in size, context length, and capability, while inference costs have been falling by roughly 10× year over year.

175B+

Parameters in frontier models

GPT-4, Claude 3.5, Gemini Ultra

128K+

Token context windows

Some models reach 1M+ tokens

<100ms

Inference latency (p50)

With optimized serving infrastructure

95%+

Benchmark accuracy

On standard NLP evaluation suites

See LLMs at work on your use case.

Describe your challenge and the AI will recommend the right model, strategy, and architecture.