JarvisBitz Tech
How AI Works

How AI inference works.

From prompt to prediction — the real-time architecture behind every AI response.

The Moment AI Thinks

What is inference?

Training teaches a model to recognize patterns. Inference is when the model uses those patterns to generate a response — in real time, for every request.

Training

Offline, expensive, infrequent

  • Weeks to months of GPU time
  • Billions of data points processed
  • Learns weights and parameters
  • Runs once (or periodically)
Typical cost: $100K - $10M+

Inference

Real-time, per-request, continuous

  • Milliseconds per request
  • Single input → single output
  • Uses frozen model weights
  • Runs millions of times per day
Typical cost: $0.001 - $0.10 per call
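The split above can be made concrete with a toy model. This is a minimal sketch in pure Python (a one-feature logistic classifier with made-up data, not any production system): training runs once and fits the weights; inference is just a cheap forward pass that reuses those frozen weights on every request.

```python
import math

# --- Training (offline, run once): fit weights to data ---
def train(data, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid prediction
            w -= lr * (p - y) * x                 # gradient step
            b -= lr * (p - y)
    return w, b  # learned parameters, frozen from here on

# --- Inference (online, per request): forward pass only ---
def infer(w, b, x):
    return 1 / (1 + math.exp(-(w * x + b)))

w, b = train([(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)])
print(infer(w, b, 2.5))  # every call reuses the same frozen weights
```

The asymmetry is the point: `train` loops over the whole dataset many times, while `infer` does one pass of arithmetic per request.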
Latency
Time to first token

How fast the model starts responding. Critical for interactive use.

Throughput
Tokens per second

Total generation bandwidth. Determines how many users you can serve.

Cost / Token
Compute economics

GPU-hours per million tokens. The unit economics of AI deployment.
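The cost-per-token arithmetic is simple enough to work out directly. The GPU rental price and throughput below are illustrative assumptions, not quoted figures:

```python
# Hypothetical numbers: a $2.50/hr GPU sustaining 2,000 tokens/sec.
gpu_cost_per_hour = 2.50   # USD, assumed rental price
throughput_tps = 2000      # tokens/sec, assumed sustained rate

tokens_per_hour = throughput_tps * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M tokens")  # ≈ $0.347
```

Doubling throughput halves cost per token, which is why the optimization techniques later in this page translate directly into unit economics.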

Technical Stack

Inference architecture

The journey from user prompt to generated response. Every millisecond is accounted for.

Request

~1ms

User prompt arrives via API

Load Balancer

~2ms

Request routed to an available GPU node

GPU Cluster

~5ms

Request queued and batched onto a GPU

Model Runtime

~50-500ms

Model generates tokens autoregressively

Response

~1ms

Generated tokens streamed back to the user
END-TO-END: ~60-510ms
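Tallying the per-stage budgets above shows where the time actually goes. The stage ranges are the ones quoted in this page, not measurements:

```python
# Per-stage latency budgets (ms), taken from the stages above.
stages = {
    "request": (1, 1),
    "load_balancer": (2, 2),
    "gpu_cluster": (5, 5),
    "model_runtime": (50, 500),  # dominated by token generation
    "response": (1, 1),
}

lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(f"end-to-end: ~{lo}-{hi}ms")  # ~59-509ms; the page rounds to ~60-510ms
```

The model runtime dominates everything else by one to two orders of magnitude, which is why the optimization techniques below target generation, not networking.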

Quantization: precision vs speed

Reducing numerical precision shrinks model size and speeds up inference — at the cost of subtle quality trade-offs.

  • FP32 (32-bit)
  • FP16 (16-bit)
  • INT8 (8-bit)
  • INT4 (4-bit)
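The core idea is easy to sketch. Here is a minimal symmetric per-tensor INT8 quantizer in pure Python (a toy illustration of the technique, not any framework's implementation):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ≈ q * scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.33, -1.0, 0.05, 0.91]      # toy FP32 weights
q, s = quantize_int8(w)           # each weight now fits in 1 byte, not 4
w_hat = dequantize(q, s)
print(q, s)
print([a - b for a, b in zip(w, w_hat)])  # small rounding error per weight
```

The "subtle quality trade-off" is visible in the last line: every weight is off by at most half a scale step, and lower bit widths widen that step.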
Performance Engineering

Optimization techniques

Four key techniques that make production inference fast, efficient, and cost-effective.

3-10× smaller

Model Distillation

Train a smaller "student" model to replicate a large "teacher" model's behavior. The student learns the probability distributions, not just hard labels — capturing nuance at a fraction of the compute.

Model size reduction
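The "probability distributions, not just hard labels" idea reduces to a simple loss. A minimal sketch in pure Python, with an assumed temperature of 2.0 and made-up logits:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy of the student against the teacher's soft targets."""
    p = softmax(teacher_logits, T)  # teacher's full distribution
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [3.2, 1.1, 0.3]  # teacher says "class 0, but 1 is plausible"
student = [2.0, 0.9, 0.1]
print(distillation_loss(teacher, student))
```

A hard label would only say "class 0"; the soft targets also teach the student *how much* more plausible class 0 is than the others, which is the nuance the card describes.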
2-3× faster

Speculative Decoding

A small draft model generates candidate tokens quickly. The large model verifies them in parallel — accepting correct guesses and only regenerating wrong ones. Speeds up autoregressive generation without quality loss.

Generation speed
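The draft-then-verify loop can be sketched end to end. The two "models" below are stand-in functions (a deterministic target and a draft that agrees with it about 80% of the time); the point is the control flow, and that the output is provably identical to decoding with the target alone:

```python
import random

random.seed(0)

def target_next(prefix):
    """Stand-in for the large model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 100

def draft_next(prefix):
    """Stand-in cheap draft model: right ~80% of the time."""
    guess = target_next(prefix)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_decode(prefix, n_tokens, k=4):
    """Draft k tokens cheaply, then verify them in one target pass."""
    out = list(prefix)
    target_calls = 0
    while len(out) - len(prefix) < n_tokens:
        draft = []
        for _ in range(k):                    # cheap sequential draft
            draft.append(draft_next(out + draft))
        target_calls += 1                     # one parallel verify pass
        for t in draft:
            if t == target_next(out):         # accept while draft agrees
                out.append(t)
            else:
                out.append(target_next(out))  # fix first mismatch, redraft
                break
        if len(out) - len(prefix) >= n_tokens:
            out = out[:len(prefix) + n_tokens]
    return out, target_calls

tokens, calls = speculative_decode([1, 2], 16)
print(f"{len(tokens) - 2} tokens in {calls} verify passes")
```

Each verify pass yields at least one token (an accepted draft token or the correction), so the target model runs far fewer times than once per token when the draft is usually right.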
N× memory

Tensor Parallelism

Split model weight matrices across multiple GPUs so each processes a slice simultaneously. Enables running models too large for a single GPU while keeping latency low through synchronized computation.

Linear GPU scaling
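The weight-splitting idea can be shown with plain lists standing in for GPU shards. This is a row-parallel matrix-vector product sketch, not a real multi-device implementation:

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def split_rows(W, n):
    """Shard the weight matrix row-wise across n 'devices'."""
    k = len(W) // n
    return [W[i * k:(i + 1) * k] for i in range(n)]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4x2 weight matrix
x = [1.0, 0.5]

shards = split_rows(W, 2)             # each "GPU" holds half the rows
partials = [matvec(shard, x) for shard in shards]  # run in parallel
y = [v for part in partials for v in part]         # all-gather the slices

print(y, matvec(W, x))  # sharded result matches the single-device result
```

Each shard only needs its slice of the weights in memory, which is what lets a model too large for one GPU fit across several; the price is the synchronization (the gather step) after every sharded layer.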
2-4× faster

Flash Attention

Reorders attention computation to maximize GPU memory bandwidth utilization. By tiling and fusing operations, it avoids writing large intermediate matrices to slow HBM — achieving 2-4× speedups on long sequences.

Attention computation
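The heart of the trick, tiling plus an online softmax, can be sketched for a single query in pure Python. This illustrates the numerics only (real Flash Attention is a fused GPU kernel); the running max/sum lets each K/V block be processed and discarded without ever materializing the full score row:

```python
import math

def naive_attention(q, K, V):
    """Reference: compute all scores, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, V)) / z
            for d in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    """Process K/V in blocks, keeping running stats (online softmax)."""
    m = float("-inf")        # running max of scores seen so far
    z = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running weighted sum of values
    for i in range(0, len(K), block):
        for k, v in zip(K[i:i + block], V[i:i + block]):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            corr = math.exp(m - m_new)  # rescale old stats to the new max
            z = z * corr + math.exp(s - m_new)
            acc = [a * corr + math.exp(s - m_new) * vd
                   for a, vd in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]

q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
print(tiled_attention(q, K, V))
print(naive_attention(q, K, V))  # same values, block by block
```

On a GPU, "discarding each block" means the large intermediate score matrix never has to round-trip through slow HBM, which is where the speedup comes from.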
Deployment Patterns

Where to run inference

Three deployment strategies, each with distinct trade-offs in latency, cost, and control.

Cloud API

Managed inference via providers like OpenAI, Anthropic, or Google. Zero infrastructure overhead — pay per token.

Latency
100-500ms
Cost
$$ per token
Pros
  • Zero ops overhead
  • Always latest models
  • Auto-scaling
Cons
  • Data leaves your network
  • Rate limits
  • Vendor lock-in

Self-Hosted

Run models on your own GPU infrastructure. Full control over data, latency, and cost at scale.

Latency
20-200ms
Cost
$$$ fixed + ops
Pros
  • Data sovereignty
  • No rate limits
  • Custom models
Cons
  • GPU procurement
  • Ops complexity
  • Capacity planning

Edge Deployment

Small quantized models running on-device or at the network edge. Ultra-low latency, offline-capable.

Latency
5-50ms
Cost
$ per device
Pros
  • Ultra-low latency
  • Offline capable
  • Privacy by default
Cons
  • Smaller models only
  • Limited context
  • Update complexity
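The cloud-versus-self-hosted trade-off usually comes down to volume. A back-of-envelope sketch with assumed prices (neither figure is a vendor quote):

```python
# All figures are illustrative assumptions, not vendor quotes.
api_cost_per_1k_tokens = 0.01    # USD, managed API pricing
selfhost_monthly_fixed = 8000.0  # USD/month: GPUs + ops, amortized

def monthly_cost_api(tokens_per_month):
    return tokens_per_month / 1000 * api_cost_per_1k_tokens

def breakeven_tokens():
    """Monthly volume where self-hosting's fixed cost equals API spend."""
    return selfhost_monthly_fixed / api_cost_per_1k_tokens * 1000

print(f"break-even at {breakeven_tokens():,.0f} tokens/month")
# Below that volume the API is cheaper; above it, self-hosting wins.
```

Edge deployment sits outside this curve entirely: its cost scales per device rather than per token, which is why it suits high-volume, low-complexity workloads.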

See inference in action.

Watch our AI systems process requests in real time — from prompt to structured output.