
Huawei launches KVarN: high‑capacity KV‑cache for vLLM agents

Hacker News

Huawei's CSL has released KVarN, a KV‑cache quantization backend that runs natively in vLLM. The team claims 3‑5× more cache capacity and up to 1.3× the throughput of FP16 while keeping FP16‑level accuracy, with no calibration step. Users enable it through configuration alone, with no model changes. According to the release, it fits into existing pipelines and supports multi‑GPU deployments for larger workloads.

Existing KV‑cache quantization methods, such as TurboQuant, either trade capacity for speed or sacrifice accuracy, which makes them unattractive for production. KVarN aims to avoid that trade‑off in three steps: it rotates each tile with a Hadamard matrix, normalizes the variance, and then applies asymmetric low‑bit rounding, using 4‑bit keys and 2‑bit values. On Qwen‑3‑32B it reportedly matches FP16 accuracy while providing roughly four times the cache space.
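The three steps named above can be illustrated with a small NumPy sketch. This is not KVarN's implementation: the tile shape, the axis used for variance normalization (per channel across the tile), the per-token scale and zero-point, and all function names are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def quantize_tile(tile: np.ndarray, bits: int):
    """Rotate, normalize variance, then round asymmetrically to `bits` bits."""
    # Step 1: Hadamard rotation spreads outliers across channels.
    rotated = tile @ hadamard(tile.shape[1])
    # Step 2 (assumed layout): per-channel standard deviation over the tile.
    col_std = rotated.std(axis=0, keepdims=True) + 1e-8
    normed = rotated / col_std
    # Step 3 (assumed layout): one scale and zero-point per token row.
    lo = normed.min(axis=1, keepdims=True)
    hi = normed.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((normed - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo, col_std


def dequantize_tile(codes, scale, lo, col_std):
    """Undo the rounding grid, the normalization, and the rotation."""
    rotated = (codes * scale + lo) * col_std
    # The matrix is orthonormal, so its transpose is its inverse.
    return rotated @ hadamard(codes.shape[1]).T


def relative_error(tile: np.ndarray, bits: int) -> float:
    """Frobenius-norm reconstruction error relative to the original tile."""
    restored = dequantize_tile(*quantize_tile(tile, bits))
    return float(np.linalg.norm(tile - restored) / np.linalg.norm(tile))


rng = np.random.default_rng(0)
tile = rng.standard_normal((128, 128))  # synthetic stand-in for a KV tile
key_err = relative_error(tile, bits=4)    # 4-bit path, as used for keys
value_err = relative_error(tile, bits=2)  # 2-bit path, as used for values
```

On synthetic Gaussian data the 2-bit path loses noticeably more precision than the 4-bit path. The sketch says nothing about how real attention keys and values behave, nor about the accuracy results reported for KVarN.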

KVarN ships as a fork of vLLM and can be installed with a single pip command; users set kv_cache_dtype=kvarn_k4v2_g128 and block_size=128 to activate it. Because compute stays in float16, the project says existing hardware sees higher throughput without code changes. The release suggests that aggressive KV‑cache quantization can scale long‑context agents while preserving model quality, which would make it a practical drop‑in for production clusters.
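As a rough illustration of those two settings, the sketch below builds the keyword arguments one would pass to vLLM's `LLM` constructor and estimates the resulting capacity gain. The helper names and the model ID are hypothetical. Reading `g128` as a quantization group size of 128, and assuming each group stores an FP16 scale plus an FP16 zero-point, are guesses and not confirmed details of KVarN's storage format.

```python
def kvarn_engine_kwargs(model: str) -> dict:
    """Settings named in the article; only the KVarN fork is expected to accept this dtype."""
    return {
        "model": model,
        "kv_cache_dtype": "kvarn_k4v2_g128",
        "block_size": 128,
    }


def capacity_vs_fp16(key_bits=4, value_bits=2, group_size=128, metadata_bits=32):
    """Estimated FP16-to-quantized size ratio, averaged over keys and values.

    metadata_bits is the assumed per-group overhead (FP16 scale + FP16 zero-point).
    """
    payload = (key_bits + value_bits) / 2   # average bits per cached element
    overhead = metadata_bits / group_size   # metadata bits amortized per element
    return 16 / (payload + overhead)


kwargs = kvarn_engine_kwargs("your-model-id")  # placeholder model ID
ratio = capacity_vs_fp16()

# With the fork installed, the settings would be passed as:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

Under these assumptions the estimate comes to just under 5×, at the top of the claimed 3‑5× range. Any extra per-tile metadata or allocator overhead would pull the real figure lower.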