← Back to Learn

Quantization

How do we compress neural network weights and embeddings without losing meaning? The precision tradeoff that makes local ML models practical — and what LLMLingua uses inside ContextCrunch.

Quantization levels — storage comparison d=384 embedding · 100k vocab
32float32 baseline
1,536 bytes/vec · 146MB table
16float16
768 bytes · 2× smaller · ~0.1% loss
8int8 — sweet spot
384 bytes · 4× smaller · ~0.5% loss
4int4
192 bytes · 8× smaller · ~2% loss
1binary
48 bytes · 32× smaller · ~5% loss
Static example — float32 to int8
Original float32:   0.8734
Scale factor:       × 127
Quantized int8:    111
Dequantized:       0.874016
Error:             0.000516 (0.059% of original)
int8 is the production standard — 4x smaller storage, minimal loss, keeps dot-products matching.
Interactive Vector Precision Simulator FLOAT32 (Baseline)

See how a 10-dimensional embedding vector is packed, stored, and dequantized. Toggle between formats to measure standard size, Mean Squared Error (MSE), and cosine similarity retention.

10-Dimensional Reconstructed Vector Graph
40 bytesstored size
0.000000Mean Squared Error
1.00000Cosine Similarity
Precision Bitstream Details
Original Vector: [...]
Packed Stream: [...]
Reconstructed: [...]