HeadlinesBriefing.com

TurboQuant squeezes AI vectors into 2–4 bits without losing accuracy

Hacker News

TurboQuant, an open‑source quantization scheme, compresses the high‑dimensional vectors that power large language models down to 2–4 bits per number without sacrificing accuracy. It achieves this by applying a random rotation that forces each coordinate into a predictable Gaussian‑like distribution, then reusing a single codebook for every vector. The method adds no extra memory for scale factors and requires no additional training or calibration.
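To make the mechanics concrete, here is a minimal rotate-then-quantize sketch in NumPy. Everything in it (the dense Haar-random rotation, the 2-bit codebook levels, the function names) is an illustrative assumption rather than TurboQuant's actual code; the point is that one fixed codebook, chosen once for the Gaussian-like coordinate distribution, serves every vector with no stored scale factors.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes Q uniform over rotations

def quantize(x, rotation, codebook):
    """Rotate x, then snap each coordinate to the nearest codebook entry.
    The same fixed codebook serves every vector, so no per-vector scale
    factors need to be stored next to the codes."""
    z = rotation @ x
    return np.argmin(np.abs(z[:, None] - codebook[None, :]), axis=1).astype(np.uint8)

def dequantize(codes, rotation, codebook):
    """Look up codebook values, then undo the (orthogonal) rotation."""
    return rotation.T @ codebook[codes]

d = 1024
rot = random_rotation(d)
# 4 levels = 2 bits per coordinate (bit-packing omitted here), spaced for a
# Gaussian with std 1/sqrt(d) -- roughly how rotated unit-vector coordinates behave.
codebook = np.array([-1.5, -0.5, 0.5, 1.5]) / np.sqrt(d)

x = np.random.default_rng(1).standard_normal(d)
x /= np.linalg.norm(x)  # unit-norm embedding
x_hat = dequantize(quantize(x, rot, codebook), rot, codebook)
print("relative MSE:", np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

In practice a structured transform such as a randomized Hadamard would likely replace the dense matrix for speed, but the dense version keeps the geometry easy to see.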

Modern transformers store massive KV caches, token embeddings, and attention keys as floating‑point tensors. Prior compression attempts either introduced bias into inner‑product calculations or demanded costly fine‑tuning. TurboQuant’s unbiased estimator preserves expected inner products, meaning attention scores and nearest‑neighbor searches remain faithful. Benchmarks on OpenAI embeddings show mean‑squared error low enough that retrieval results match full‑precision storage while using a fraction of the bandwidth.
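The summary doesn't spell out how unbiasedness is achieved, but a standard construction is dithered (stochastic) rounding: each coordinate rounds up or down with probabilities chosen so the rounded value is correct in expectation, which makes quantized inner products unbiased too. The sketch below illustrates that idea under that assumption; TurboQuant's own estimator may differ in detail.

```python
import numpy as np

def stochastic_round(z, levels, rng):
    """Round each coordinate to one of its two neighboring levels with
    probabilities that make the result unbiased: E[round(z)] = z."""
    z = np.clip(z, levels[0], levels[-1])
    idx = np.clip(np.searchsorted(levels, z) - 1, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    p_hi = (z - lo) / (hi - lo)  # probability of rounding up
    return np.where(rng.random(z.shape) < p_hi, hi, lo)

rng = np.random.default_rng(0)
d = 512
levels = np.linspace(-4, 4, 16) / np.sqrt(d)  # illustrative 4-bit grid

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

# Averaging the quantized inner product over independent roundings recovers
# the true <x, y>, because each individual estimate is unbiased.
est = np.mean([stochastic_round(x, levels, rng) @ y for _ in range(2000)])
print("true:", x @ y, "mean of quantized estimates:", est)
```

Unbiasedness matters here because attention scores and similarity search both reduce to inner products: systematic error would skew every score the same way, while zero-mean noise averages out.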

At its core, the scheme relies on two geometric facts: a random rotation gives every coordinate the same known, Gaussian‑like distribution, and in high dimensions the coordinates of a unit‑norm vector concentrate at magnitudes around 1/√d. This concentration lets a single codebook achieve near‑optimal distortion across all inputs. By turning vector quantization into a plug‑and‑play preprocessing step, TurboQuant offers immediate memory savings for any model that manipulates large embeddings.
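Both facts are easy to verify numerically. The short script below (again illustrative: a dense Haar-random rotation standing in for whatever transform the scheme actually uses) rotates a unit vector and checks that its coordinates look like draws from N(0, 1/d):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Haar-random rotation via QR of a Gaussian matrix.
q, r = np.linalg.qr(rng.standard_normal((d, d)))
rotation = q * np.sign(np.diag(r))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # unit-norm input vector
z = rotation @ x        # rotated coordinates

# Coordinates behave like N(0, 1/d): std ~ 1/sqrt(d), with nearly all of
# them inside a band a few multiples of 1/sqrt(d) wide.
print("1/sqrt(d):                 ", 1 / np.sqrt(d))
print("empirical std:             ", z.std())
print("fraction |z_i| < 3/sqrt(d):", np.mean(np.abs(z) < 3 / np.sqrt(d)))
```

Because that band is known before any data arrives, the handful of quantization levels can be fixed once and reused for every input, which is exactly what removes the need for per-vector calibration.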