HeadlinesBriefing.com

Reducing Transformer Memory with QKV Projection Sharing

Hacker News

Researchers examined whether transformers actually need three separate projections for query, key, and value. By testing weight-sharing constraints on vision tasks and language models, they found that sharing projections does not necessarily hurt performance. The study challenges the standard QKV architecture by suggesting that some projections are redundant in the attention operation.

Sharing a single projection for keys and values while keeping queries separate (written Q-K=V) provides the most benefit. In language modeling tests, this approach achieved a 50% KV cache reduction with only a 3.1% degradation in perplexity. This works because keys and values often occupy similar representational spaces, allowing them to share weights without breaking the model's internal logic.
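To make the 50% figure concrete, here is a minimal sketch of a single attention head with a tied key/value projection, written in plain Python. It is an illustration under assumptions, not the authors' implementation: the class name TiedKVHead, the weight names W_q and W_kv, the dimensions, and the random initialization are all hypothetical. The point it shows is that each past token contributes one cached vector, which is read once as a key and once as a value.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TiedKVHead:
    """One attention head with a separate W_q and a single shared W_kv.

    All names here are hypothetical and chosen for illustration only.
    """

    def __init__(self, d_model, d_head, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0, 0.1) for _ in range(d_model)]
                    for _ in range(d_head)]

        self.W_q = init()
        self.W_kv = init()  # one projection used for both keys and values
        self.d_head = d_head
        self.cache = []     # one vector per past token, not a (key, value) pair

    def step(self, x):
        """Process one new token vector x and return the head output."""
        q = matvec(self.W_q, x)
        self.cache.append(matvec(self.W_kv, x))  # cached once
        scale = math.sqrt(self.d_head)
        # The cached vectors act as keys when computing the logits...
        weights = softmax(
            [sum(a * b for a, b in zip(q, kv)) / scale for kv in self.cache]
        )
        # ...and as values when forming the weighted sum.
        return [sum(w * kv[i] for w, kv in zip(weights, self.cache))
                for i in range(self.d_head)]


head = TiedKVHead(d_model=8, d_head=4)
rng = random.Random(1)
for _ in range(5):
    out = head.step([rng.gauss(0, 1) for _ in range(8)])

floats_tied = len(head.cache) * head.d_head  # what this head stores
floats_standard = 2 * floats_tied            # separate K and V would double it
```

After five decoding steps the cache holds five vectors, half of what a head with separate key and value tensors would store for the same tokens.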

Combining K=V sharing with Grouped-Query Attention (GQA-4) cuts KV cache usage by 87.5%. Paired with Multi-Query Attention (MQA), the reduction reaches 96.9%. These memory savings make large models more viable for on-device inference. The authors conclude that weight tying in attention is an effective way to optimize edge deployment.
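The reported percentages can be reproduced with simple arithmetic. The sketch below assumes a baseline of standard multi-head attention with 16 query heads, and reads GQA-4 as 4 key/value heads. Neither detail is stated in this summary; both are assumptions that happen to reproduce the quoted figures. The function name kv_cache_reduction is hypothetical.

```python
def kv_cache_reduction(n_query_heads, n_kv_heads, tie_kv):
    """Fraction of KV cache saved relative to standard multi-head attention.

    Baseline: one key tensor and one value tensor per query head.
    """
    baseline = 2 * n_query_heads
    tensors_per_kv_head = 1 if tie_kv else 2  # tying K and V stores one tensor
    return 1 - (tensors_per_kv_head * n_kv_heads) / baseline


H = 16  # assumed number of query heads; not given in the summary

reductions = {
    "MHA + K=V": kv_cache_reduction(H, 16, tie_kv=True),
    "GQA-4 + K=V": kv_cache_reduction(H, 4, tie_kv=True),
    "MQA + K=V": kv_cache_reduction(H, 1, tie_kv=True),
}

for name, saved in reductions.items():
    print(name, saved)
```

Under these assumptions the three savings are 1/2, 7/8, and 31/32 of the baseline cache, which match the 50%, 87.5%, and 96.9% figures quoted above (31/32 is 0.96875).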

The systematic evaluation also shows that tying queries and keys while keeping values separate (Q=K-V) fails because it breaks attention directionality, unlike the successful Q-K=V variant. The team tested these findings on 1.2B parameter models trained on 10B tokens. The research points to a practical path for reducing memory overhead in production AI.
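One way to read the directionality point: if queries and keys come from the same projection, the raw attention logits form a symmetric matrix, so token i scores token j exactly as j scores i. The summary does not spell out the authors' argument, so treat the sketch below as an illustration of that reading only. It ignores positional encodings and masking, and the helper names and dimensions are hypothetical.

```python
import random


def project(W, X):
    """Apply projection W (d_head x d_model) to each token vector in X."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in W] for tok in X]


def logits(Q, K):
    """Raw attention logits: result[i][j] is the dot product of q_i and k_j."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


def is_symmetric(S, tol=1e-12):
    """True if S[i][j] equals S[j][i] for every pair of positions."""
    n = len(S)
    return all(abs(S[i][j] - S[j][i]) <= tol
               for i in range(n) for j in range(n))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


X = rand_matrix(5, 8)    # 5 tokens, model width 8 (arbitrary sizes)
W_q = rand_matrix(4, 8)  # query projection
W_k = rand_matrix(4, 8)  # separate key projection

tied = logits(project(W_q, X), project(W_q, X))    # Q = K: same projection
untied = logits(project(W_q, X), project(W_k, X))  # standard: separate Q and K

print("tied symmetric:", is_symmetric(tied))
print("untied symmetric:", is_symmetric(untied))
```

With separate projections the logits are generally asymmetric, which lets one token attend strongly to another without the reverse holding. Tying keys to values leaves that asymmetry intact, which is consistent with Q-K=V being the variant that works.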