HeadlinesBriefing.com

Gemma 4 and Rivals Pack KV-Sharing Tricks for Long Contexts

Hacker News

Google's Gemma 4 suite dropped in April with a pair of architectural tricks aimed at cutting KV cache bloat in long-context scenarios. The E2B and E4B variants reuse key and value tensors across transformer layers instead of computing and caching a separate set for every layer, which saves roughly 2.7 GB for the smaller E2B model and about 6 GB for E4B at 128K context. A second design choice, per-layer embeddings, adds more granular control over how each layer processes tokens.
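To see where savings of that size can come from, here is a back-of-the-envelope sketch of KV cache arithmetic. The function name `kv_cache_bytes` and every number in `config` are illustrative assumptions, not Gemma 4's published configuration, and the sketch ignores details such as sliding-window layers that real models may use.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, num_shared_layers=0):
    """Bytes needed to cache keys and values for one sequence.

    Layers counted in num_shared_layers reuse another layer's K/V
    tensors, so they add nothing to the cache.
    """
    if not 0 <= num_shared_layers < num_layers:
        raise ValueError("num_shared_layers must be in [0, num_layers)")
    layers_with_own_cache = num_layers - num_shared_layers
    # Factor of 2: one tensor for keys, one for values.
    per_layer = 2 * num_kv_heads * head_dim * context_len * bytes_per_elem
    return layers_with_own_cache * per_layer


# Hypothetical configuration chosen only for illustration.
config = dict(
    num_layers=32,
    num_kv_heads=2,
    head_dim=128,
    context_len=128 * 1024,  # 128K tokens
    bytes_per_elem=2,        # 16-bit cache entries
)

baseline = kv_cache_bytes(**config)
shared = kv_cache_bytes(**config, num_shared_layers=16)

print(f"baseline cache: {baseline / 1e9:.2f} GB")
print(f"with sharing:   {shared / 1e9:.2f} GB")
print(f"saved:          {(baseline - shared) / 1e9:.2f} GB")
```

The takeaway is that the cache grows linearly with both context length and the number of layers that keep their own keys and values, so every layer that borrows a cache removes a fixed slice of memory at a given context size.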

This wave of efficiency moves didn't start with Google. DeepSeek V4 layers in mHC and compressed attention, while ZAYA1-8B deploys compressed convolutional attention. Laguna XS.2 budgets attention per layer. The common thread: reasoning models and agent workflows hold tokens longer, so KV cache size, memory traffic, and attention cost have become the real bottlenecks for open-weight LLMs chasing long-context capability.

Most of these changes look minor on an architecture diagram, but several are intricate under the hood. Google's cross-layer KV sharing reuses tensors across up to 20 consecutive layers in the E2B variant, trading a bit of model capacity for measurable memory savings. Developers are clearly betting that approximation tricks at the transformer level matter more than raw parameter counts.
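A toy sketch can make the cross-layer reuse concrete. It assumes one possible sharing layout, in which the final block of layers all read the cache written by the last layer that still computes its own keys and values; whether Gemma 4 uses exactly this layout is not stated here. `kv_source_layers`, `SharedKVCache`, and `toy_project_kv` are hypothetical names, and strings stand in for real tensors.

```python
def kv_source_layers(num_layers, num_shared_layers):
    """Map each layer to the layer whose cached K/V it reads.

    Assumed layout: the last num_shared_layers layers reuse the K/V
    of the last layer that computes its own.
    """
    first_shared = num_layers - num_shared_layers
    if first_shared < 1:
        raise ValueError("at least one layer must compute its own K/V")
    return [i if i < first_shared else first_shared - 1
            for i in range(num_layers)]


class SharedKVCache:
    """Stores K/V only for layers that own them; other layers borrow."""

    def __init__(self, source_layers):
        self.source_layers = source_layers
        self.entries = {}  # owning layer index -> list of (key, value)

    def append(self, layer, key, value):
        if self.source_layers[layer] != layer:
            raise ValueError(f"layer {layer} borrows K/V and cannot write")
        self.entries.setdefault(layer, []).append((key, value))

    def read(self, layer):
        # Borrowing layers are redirected to the owning layer's history.
        return self.entries[self.source_layers[layer]]


def toy_project_kv(layer, token_id):
    # Stand-in for the real key/value projections (hypothetical).
    return f"k{layer}:{token_id}", f"v{layer}:{token_id}"


source_layers = kv_source_layers(num_layers=6, num_shared_layers=3)
cache = SharedKVCache(source_layers)

for token_id in range(4):  # decode four tokens
    for layer, source in enumerate(source_layers):
        if source == layer:
            cache.append(layer, *toy_project_kv(layer, token_id))
        history = cache.read(layer)  # attention would consume this

print("source layer per layer:", source_layers)
print("layers holding a cache:", sorted(cache.entries))
```

In this layout six layers run attention but only three hold cache entries, which is the capacity-for-memory trade described above: the borrowing layers no longer get keys and values tailored to their own depth.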