HeadlinesBriefing.com

Glitches in Transformer Models Revealed

Towards Data Science

Transformer models, the backbone of many foundation models, suffer from a persistent problem: high-norm artifacts. Identified in Vision Transformers (ViTs), these artifacts are characterized by unusually high L2 norms and often appear in low-information areas of images. They can cause significant performance drops in tasks like object detection, and can reduce model performance by up to 20% in specific tasks such as unsupervised object discovery.
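To make "unusually high L2 norms" concrete, here is a minimal sketch of how such tokens might be flagged. It is not the method used in the research: the helper name `find_high_norm_tokens`, the median-plus-MAD threshold, the cutoff `k`, and the toy embeddings are all assumptions chosen for illustration.

```python
import math
import statistics


def l2_norm(vec):
    """Euclidean (L2) norm of one token embedding."""
    return math.sqrt(sum(x * x for x in vec))


def find_high_norm_tokens(tokens, k=5.0):
    """Return indices of tokens whose norm exceeds median + k * MAD.

    The median/MAD rule and the default k are illustrative assumptions,
    not a threshold taken from the article or the underlying papers.
    """
    norms = [l2_norm(t) for t in tokens]
    med = statistics.median(norms)
    mad = statistics.median([abs(n - med) for n in norms])
    threshold = med + k * max(mad, 1e-12)  # guard against a zero MAD
    return [i for i, n in enumerate(norms) if n > threshold]


# Toy "patch embeddings": four ordinary tokens and one with a very large norm.
tokens = [
    [1.0, 0.5, -0.5],
    [0.8, -0.6, 0.4],
    [40.0, -35.0, 20.0],
    [0.9, 0.4, 0.6],
    [1.1, -0.3, 0.2],
]
outliers = find_high_norm_tokens(tokens)
```

On real models the embeddings would come from a ViT's output layer, and a sensible threshold would have to be chosen per model.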

The root cause of these artifacts is debated. One theory holds that they result from the model's attempt to process global information; another, that they are a byproduct of the Softmax function in attention mechanisms. Recent research has proposed solutions, including the addition of 'register' tokens, which provide a dedicated space for global information. However, these solutions often require retraining, which has prompted the development of post-hoc methods that can fix existing models.
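The two ideas in this paragraph can be illustrated with a toy sketch. Softmax always produces weights that sum to 1, so attention mass has to land somewhere even when no token is a good match; an extra register slot gives that mass a place to go. The scores below and the helper names `append_registers` and `strip_registers` are hypothetical, and real register tokens are learned embeddings inside a trained model, not hand-set values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def append_registers(patch_tokens, register_tokens):
    """Extend the input sequence with extra register slots."""
    return patch_tokens + register_tokens


def strip_registers(tokens, num_registers):
    """Discard the register slots so only patch tokens are used downstream."""
    return tokens[: len(tokens) - num_registers]


# A query that matches no patch well: every score is low, yet the
# weights still sum to 1, so the mass is spread over the patches.
patch_scores = [-10.0, -10.0, -10.0]
weights_without_register = softmax(patch_scores)

# With one register slot (assumed score 0.0), nearly all of the mass
# moves to the register instead of the patch tokens.
weights_with_register = softmax(patch_scores + [0.0])

# Registers are added to the sequence on the way in and dropped on the way out.
sequence = append_registers(["p0", "p1", "p2"], ["reg0"])
patches_only = strip_registers(sequence, num_registers=1)
```

Because the registers have to be learned along with the rest of the model, adding them generally means retraining, which is the limitation the paragraph above points to.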

Recent advances include Denoising Vision Transformers (DVT), which clean output tokens post-hoc, and modifications to the self-attention architecture. These innovations are already being incorporated into models like Qwen3-Next. Because these artifacts are not unique to ViTs and also affect Large Language Models (LLMs), the solutions developed could have broader implications across the AI community.

Understanding these artifacts and how to mitigate them is essential for getting the full potential out of transformer models. As research progresses, more efficient and effective fixes should improve the reliability and performance of AI models across a range of applications.