Attention & latency

Why does longer context make AI slower? Self-attention compares every token with every other token, so its compute scales as O(n²) in the number of tokens n: double your tokens and you quadruple the attention compute. Use the slider to feel the math.
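To make the quadratic scaling concrete, here is a minimal Python sketch. It assumes attention cost is purely proportional to n² and ignores every other source of latency. The function name `relative_attention_cost` and the token counts are illustrative and are not taken from the simulator.

```python
def relative_attention_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Attention cost of n_tokens relative to baseline_tokens.

    Assumes pure O(n^2) scaling; real latency has other components too.
    """
    if n_tokens < 0 or baseline_tokens <= 0:
        raise ValueError("n_tokens must be >= 0 and baseline_tokens must be > 0")
    return (n_tokens / baseline_tokens) ** 2


# Doubling the context quadruples the attention cost.
doubled = relative_attention_cost(40_000, 20_000)

# Removing 30% of the tokens (keeping 70%) roughly halves it: 0.7^2 is about 0.49.
compressed = relative_attention_cost(14_000, 20_000)

print(f"doubled: {doubled:.2f}x, compressed: {compressed:.2f}x")
```

Because the cost is quadratic, trimming tokens pays off more than linearly: under this assumption, a 30% reduction in context cuts attention compute by about half.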

[Interactive widget: O(n²) causal attention simulator. Sample reading: 20,000 on the slider, 10% of the window filled, 0.04× attention cost, fast response zone. ContextCrunch compression benefit: 30%.]
Causal-Masked Self-Attention Weight Matrix Map

Each block shows the attention weight that the query for chat turn i places on the key for historical turn j.

Diagonal & Recent (Bright): Focus is naturally strong on the current turn and nearby history.

Causal Mask (Blank): The upper-right triangle (j > i) is masked out. Those scores are set to −∞ before the softmax, so the resulting weights are 0, because a query cannot attend to future messages.

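Past Context (Fading): The number of cells in the grid grows quadratically, but each row's softmax weights must still sum to 1. That fixed budget is spread over more and more keys, so the share left for older history shrinks and those cells fade to dark navy.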
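The masking and fading described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code behind the map: the function name `causal_attention_weights` is hypothetical, and the input scores are all equal so that the effect of the row-wise softmax is easy to see.

```python
import math


def causal_attention_weights(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax over a square score matrix with a causal mask.

    Entry [i][j] is the weight query i places on key j; entries with j > i are 0.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # query i may only see keys 0..i
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)  # future keys get weight 0
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Hypothetical scores: all equal, so each row splits its weight evenly.
n = 4
uniform = causal_attention_weights([[0.0] * n for _ in range(n)])
for row in uniform:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The zeros above the diagonal are the blank causal mask. Reading down the first column, the weight on the oldest key drops from 1.0 to 0.25 as more keys become visible, which is the fading effect. In a real model the scores are not equal, so recent turns can keep a larger share.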