How AI inference works.
From prompt to prediction — the real-time architecture behind every AI response.
What is inference?
Training teaches a model to recognize patterns. Inference is when the model uses those patterns to generate a response — in real time, for every request.
Training
Offline, expensive, infrequent
- Weeks to months of GPU time
- Billions of data points processed
- Learns weights and parameters
- Runs once (or periodically)
Inference
Real-time, per-request, continuous
- Milliseconds per request
- Single input → single output
- Uses frozen model weights
- Runs millions of times per day
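The contrast above can be sketched in a few lines: inference is a single forward pass through frozen weights, with no gradients and no updates. A toy two-layer network, where all shapes and weights are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen weights, as produced by training (random placeholders here).
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 4))

def infer(x):
    """One inference request: a single forward pass, no weight updates."""
    h = np.maximum(x @ W1, 0.0)   # hidden layer with ReLU
    logits = h @ W2
    return int(logits.argmax())   # single input -> single output

x = rng.standard_normal(8)
print(infer(x))  # a class index in [0, 4)
```

Because the weights never change, the same input always yields the same output; only the request rate changes, which is why inference runs millions of times a day against one set of weights.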
Time to first token: how fast the model starts responding. Critical for interactive use.
Throughput (tokens per second): total generation bandwidth. Determines how many users you can serve.
Cost per token: GPU-hours per million tokens. The unit economics of AI deployment.
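The first two metrics fall out of a handful of timestamps. A minimal sketch, assuming a client that records when the request was sent, when the first token arrived, and when generation finished; the `RequestTrace` fields are hypothetical names, not any provider's API:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded while streaming a response."""
    sent_at: float          # request left the client
    first_token_at: float   # first token arrived
    done_at: float          # last token arrived
    tokens: int             # tokens generated

def ttft_ms(t: RequestTrace) -> float:
    """Time to first token: perceived responsiveness."""
    return (t.first_token_at - t.sent_at) * 1000

def tokens_per_second(t: RequestTrace) -> float:
    """Generation throughput after the first token."""
    return t.tokens / (t.done_at - t.first_token_at)

trace = RequestTrace(sent_at=0.0, first_token_at=0.25, done_at=2.25, tokens=100)
print(ttft_ms(trace))            # 250.0
print(tokens_per_second(trace))  # 50.0
```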
Inference architecture
The journey from user prompt to generated response. Every millisecond is accounted for.
Request (~1ms): user prompt arrives via API
Load Balancer (~2ms)
GPU Cluster (~5ms)
Model Runtime (~50-500ms)
Response (~1ms)
Quantization: precision vs speed
Reducing numerical precision shrinks model size and speeds up inference — at the cost of subtle quality trade-offs.
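A minimal sketch of the idea, using symmetric per-tensor int8 quantization (one common scheme among several) in NumPy:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)  # 1024 vs 4096 bytes: 4x memory saving
print(err <= scale / 2)    # rounding error bounded by half a quantization step
```

The trade-off is visible in the last line: every weight is rounded to one of 255 levels, so precision loss is bounded but never zero, which is the "subtle quality trade-off" in practice.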
Optimization techniques
Four key techniques that make production inference fast, efficient, and cost-effective.
Model Distillation
Train a smaller "student" model to replicate a large "teacher" model's behavior. The student learns the probability distributions, not just hard labels — capturing nuance at a fraction of the compute.
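The "probability distributions, not just hard labels" idea can be sketched as a loss function: KL divergence between temperature-softened teacher and student distributions. A minimal NumPy sketch with illustrative logit values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between softened teacher and student distributions.
    A temperature T > 1 exposes the teacher's full probability
    distribution, not just its argmax label."""
    p = softmax(teacher_logits, T)  # soft targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])
print(distillation_loss(teacher, teacher))          # 0.0: perfect mimicry
print(distillation_loss(teacher, np.zeros(3)) > 0)  # True: mismatch is penalized
```

In training, this loss would be minimized over the student's weights, usually blended with the ordinary hard-label loss.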
Speculative Decoding
A small draft model generates candidate tokens quickly. The large model verifies them in parallel — accepting correct guesses and only regenerating wrong ones. Speeds up autoregressive generation without quality loss.
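A toy sketch of the draft-and-verify loop, using stand-in deterministic "models". In the real technique the large model verifies all draft tokens in one batched forward pass; here verification is simulated sequentially, but the key property holds: the output is identical to decoding with the large model alone.

```python
def target_model(context):
    # Stand-in for the large model: a deterministic next-token rule.
    return (sum(context) * 31 + 7) % 50

def draft_model(context):
    # Cheap draft: agrees with the target except when the context sum is even.
    guess = target_model(context)
    return guess if sum(context) % 2 else (guess + 1) % 50

def speculative_step(context, k=4):
    """Draft proposes k tokens; the target accepts the longest correct
    prefix, replaces the first wrong guess, and adds one bonus token
    when everything matched."""
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in draft:
        correct = target_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)   # fix the first wrong guess, stop
            break
    else:
        accepted.append(target_model(ctx))  # all accepted: one bonus token
    return accepted

out = speculative_step([3])
# Sanity check: identical to decoding with the target model alone.
ref, ctx = [], [3]
for _ in range(len(out)):
    t = target_model(ctx)
    ref.append(t)
    ctx.append(t)
print(out == ref)  # True
```

The speedup comes from the accepted prefix: several tokens land per expensive target-model pass instead of one, with zero change to the output.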
Tensor Parallelism
Split model weight matrices across multiple GPUs so each processes a slice simultaneously. Enables running models too large for a single GPU while keeping latency low through synchronized computation.
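The core trick is that a matrix product splits cleanly along the output dimension. A NumPy sketch of column parallelism across two simulated devices, with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # activations (batch=2, hidden=8)
W = rng.standard_normal((8, 6))   # weight matrix "too big" for one device

# Column parallelism: each device holds a vertical slice of W and
# computes its share of the output features simultaneously.
shards = np.split(W, 2, axis=1)            # device 0 and device 1
partials = [x @ shard for shard in shards]

# All-gather: concatenate partial outputs along the feature axis.
y_parallel = np.concatenate(partials, axis=1)

print(np.allclose(y_parallel, x @ W))  # True: same result, half the weights per device
```

The synchronization the card mentions is the all-gather step: each device must exchange its slice of the output before the next layer can run, which is why fast interconnects matter for keeping latency low.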
Flash Attention
Reorders attention computation to maximize GPU memory bandwidth utilization. By tiling and fusing operations, it avoids writing large intermediate matrices to slow HBM — achieving 2-4× speedups on long sequences.
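The heart of the technique is the online softmax: process keys and values in tiles while carrying a running row-max and normalizer, so the full attention matrix is never materialized. A NumPy sketch of the numerics only; the real kernel additionally fuses these steps on-chip to avoid HBM round trips:

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])       # full (n, n) score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Online-softmax attention: only (n, block) tiles ever exist."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                  # running row max
    l = np.zeros(n)                          # running softmax normalizer
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j + block], V[j:j + block]
        S = Q @ Kb.T * scale                 # small (n, block) tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        correction = np.exp(m - m_new)       # rescale earlier tiles
        l = l * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

Numerically the two functions agree; the speedup in the real kernel comes from keeping each tile in fast on-chip SRAM instead of writing the intermediate matrices to HBM.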
Where to run inference
Three deployment strategies, each with distinct trade-offs in latency, cost, and control.
Cloud API
Managed inference via providers like OpenAI, Anthropic, or Google. Zero infrastructure overhead — pay per token.
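Pay-per-token economics reduce to simple arithmetic. The prices below are purely illustrative, not any provider's actual rates:

```python
# Hypothetical prices in USD per 1M tokens; real provider pricing varies
# by model and changes over time.
PRICE_PER_M = {"input": 3.00, "output": 15.00}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Pay-per-token economics: cost scales with usage, not idle GPUs."""
    return (input_tokens * PRICE_PER_M["input"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# A 500-token prompt with a 300-token reply:
print(round(request_cost(500, 300), 6))  # 0.006
```

This is the zero-overhead trade: no cost when traffic is zero, but per-token cost never amortizes the way owned hardware can at high, steady volume.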
Self-Hosted
Run models on your own GPU infrastructure. Full control over data, latency, and cost at scale.
Edge Deployment
Small quantized models running on-device or at the network edge. Ultra-low latency, offline-capable.
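Why quantization matters at the edge is mostly arithmetic: weight memory is parameter count times bits per weight. A sketch, using a hypothetical 3B-parameter model; activations and KV cache add overhead on top:

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory only; runtime state adds more."""
    return n_params * bits_per_weight / 8 / 1e9

# A hypothetical 3B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(bits, model_memory_gb(3e9, bits))
# 4-bit quantization shrinks 12 GB of float32 weights to 1.5 GB:
# small enough for a phone or an edge box.
```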
See inference in action.
Watch our AI systems process requests in real time — from prompt to structured output.