How AI inference works.
From prompt to prediction — the real-time architecture behind every AI response.
What is inference?
Training teaches a model to recognize patterns. Inference is when the model uses those patterns to generate a response — in real time, for every request.
Training
Offline, expensive, infrequent
- Weeks to months of GPU time
- Billions of data points processed
- Learns weights and parameters
- Runs once (or periodically)
Inference
Real-time, per-request, continuous
- Milliseconds per request
- Single input → single output
- Uses frozen model weights
- Runs millions of times per day
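The contrast above can be sketched in a few lines: inference is a single forward pass through frozen weights, with no gradients and no updates. A toy two-layer network, where all shapes and weights are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen weights, as produced by training (random placeholders here).
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 4))

def infer(x):
    """One inference request: a single forward pass, no weight updates."""
    h = np.maximum(x @ W1, 0.0)   # hidden layer with ReLU
    logits = h @ W2
    return int(logits.argmax())   # single input -> single output

x = rng.standard_normal(8)
print(infer(x))  # a class index in [0, 4)
```

Because the weights never change, the same input always yields the same output; only the request rate changes, which is why inference runs millions of times a day against one set of weights.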
Time to first token: how fast the model starts responding. Critical for interactive use.
Throughput (tokens per second): total generation bandwidth. Determines how many users you can serve.
Cost per token: GPU-hours per million tokens. The unit economics of AI deployment.
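The first two metrics fall out of a handful of timestamps. A minimal sketch, assuming a client that records when the request was sent, when the first token arrived, and when generation finished; the `RequestTrace` fields are hypothetical names, not any provider's API:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded while streaming a response."""
    sent_at: float          # request left the client
    first_token_at: float   # first token arrived
    done_at: float          # last token arrived
    tokens: int             # tokens generated

def ttft_ms(t: RequestTrace) -> float:
    """Time to first token: perceived responsiveness."""
    return (t.first_token_at - t.sent_at) * 1000

def tokens_per_second(t: RequestTrace) -> float:
    """Generation throughput after the first token."""
    return t.tokens / (t.done_at - t.first_token_at)

trace = RequestTrace(sent_at=0.0, first_token_at=0.25, done_at=2.25, tokens=100)
print(ttft_ms(trace))            # 250.0
print(tokens_per_second(trace))  # 50.0
```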
Inference architecture
The journey from user prompt to generated response. Every millisecond is accounted for.
Request (~1ms): user prompt arrives via API
Load Balancer (~2ms)
GPU Cluster (~5ms)
Model Runtime (~50-500ms)
Response (~1ms)
Quantization: precision vs speed
Reducing numerical precision shrinks model size and speeds up inference — at the cost of subtle quality trade-offs.
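A minimal sketch of the idea, using symmetric per-tensor int8 quantization (one common scheme among several) in NumPy:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)  # 1024 vs 4096 bytes: 4x memory saving
print(err <= scale / 2)    # rounding error bounded by half a quantization step
```

The trade-off is visible in the last line: every weight is rounded to one of 255 levels, so precision loss is bounded but never zero, which is the "subtle quality trade-off" in practice.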
Optimization techniques
Four key techniques that make production inference fast, efficient, and cost-effective.
Model Distillation
Train a smaller "student" model to replicate a large "teacher" model's behavior. The student learns the probability distributions, not just hard labels — capturing nuance at a fraction of the compute.
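The "probability distributions, not just hard labels" idea can be sketched as a loss function: KL divergence between temperature-softened teacher and student distributions. A minimal NumPy sketch with illustrative logit values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between softened teacher and student distributions.
    A temperature T > 1 exposes the teacher's full probability
    distribution, not just its argmax label."""
    p = softmax(teacher_logits, T)  # soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])
print(distillation_loss(teacher, teacher))          # 0.0: perfect mimicry
print(distillation_loss(teacher, np.zeros(3)) > 0)  # True: mismatch is penalized
```

In training, this loss would be minimized over the student's weights, usually blended with the ordinary hard-label loss.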
Speculative Decoding
A small draft model generates candidate tokens quickly. The large model verifies them in parallel — accepting correct guesses and only regenerating wrong ones. Speeds up autoregressive generation without quality loss.
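A toy sketch of the draft-and-verify loop, using stand-in deterministic "models". In the real technique the large model verifies all draft tokens in one batched forward pass; here verification is simulated sequentially, but the key property holds: the output is identical to decoding with the large model alone.

```python
def target_model(context):
    # Stand-in for the large model: a deterministic next-token rule.
    return (sum(context) * 31 + 7) % 50

def draft_model(context):
    # Cheap draft: agrees with the target except when the context sum is even.
    guess = target_model(context)
    return guess if sum(context) % 2 else (guess + 1) % 50

def speculative_step(context, k=4):
    """Draft proposes k tokens; the target accepts the longest correct
    prefix, replaces the first wrong guess, and adds one bonus token
    when everything matched."""
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in draft:
        correct = target_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)   # fix the first wrong guess, stop
            break
    else:
        accepted.append(target_model(ctx))  # all accepted: one bonus token
    return accepted

out = speculative_step([3])
# Sanity check: identical to decoding with the target model alone.
ref, ctx = [], [3]
for _ in range(len(out)):
    t = target_model(ctx)
    ref.append(t)
    ctx.append(t)
print(out == ref)  # True
```

The speedup comes from the accepted prefix: several tokens land per expensive target-model pass instead of one, with zero change to the output.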
Tensor Parallelism
Split model weight matrices across multiple GPUs so each processes a slice simultaneously. Enables running models too large for a single GPU while keeping latency low through synchronized computation.
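The core trick is that a matrix product splits cleanly along the output dimension. A NumPy sketch of column parallelism across two simulated devices, with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # activations (batch=2, hidden=8)
W = rng.standard_normal((8, 6))   # weight matrix "too big" for one device

# Column parallelism: each device holds a vertical slice of W and
# computes its share of the output features simultaneously.
shards = np.split(W, 2, axis=1)            # device 0 and device 1
partials = [x @ shard for shard in shards]

# All-gather: concatenate partial outputs along the feature axis.
y_parallel = np.concatenate(partials, axis=1)

print(np.allclose(y_parallel, x @ W))  # True: same result, half the weights per device
```

The synchronization the card mentions is the all-gather step: each device must exchange its slice of the output before the next layer can run, which is why fast interconnects matter for keeping latency low.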
Flash Attention
Reorders attention computation to maximize GPU memory bandwidth utilization. By tiling and fusing operations, it avoids writing large intermediate matrices to slow HBM — achieving 2-4× speedups on long sequences.
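The heart of the technique is the online softmax: process keys and values in tiles while carrying a running row-max and normalizer, so the full attention matrix is never materialized. A NumPy sketch of the numerics only; the real kernel additionally fuses these steps on-chip to avoid HBM round trips:

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])       # full (n, n) score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Online-softmax attention: only (n, block) tiles ever exist."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                  # running row max
    l = np.zeros(n)                          # running softmax normalizer
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j + block], V[j:j + block]
        S = Q @ Kb.T * scale                 # small (n, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)       # rescale earlier tiles
        l = l * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

Numerically the two functions agree; the speedup in the real kernel comes from keeping each tile in fast on-chip SRAM instead of writing the intermediate matrices to HBM.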
Where to run inference
Three deployment strategies, each with distinct trade-offs in latency, cost, and control.
Cloud API
Managed inference via providers like OpenAI, Anthropic, or Google. Zero infrastructure overhead — pay per token.
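Pay-per-token economics reduce to simple arithmetic. The prices below are purely illustrative, not any provider's actual rates:

```python
# Hypothetical prices in USD per 1M tokens; real provider pricing varies
# by model and changes over time.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Pay-per-token economics: cost scales with usage, not idle GPUs."""
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# A 500-token prompt with a 300-token reply:
print(round(request_cost(500, 300), 6))  # 0.006
```

This is the zero-overhead trade: no cost when traffic is zero, but per-token cost never amortizes the way owned hardware can at high, steady volume.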
Self-Hosted
Run models on your own GPU infrastructure. Full control over data, latency, and cost at scale.
Edge Deployment
Small quantized models running on-device or at the network edge. Ultra-low latency, offline-capable.
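Why quantization matters at the edge is mostly arithmetic: weight memory is parameter count times bits per weight. A sketch, using a hypothetical 3B-parameter model; activations and KV cache add overhead on top:

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory only; runtime state adds more."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 3B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(bits, model_memory_gb(3e9, bits))
# 4-bit quantization shrinks 12 GB of float32 weights to 1.5 GB:
# small enough for a phone or an edge box.
```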
See inference in action.
Watch our AI systems process requests in real time — from prompt to structured output.