HeadlinesBriefing.com

Anthropic Breaks LLM Black‑Box Myth with Circuit Tracing

Hacker News

Anthropic’s 2025 paper, On the Biology of a Large Language Model, argues that LLMs need not be treated as opaque black boxes. The study advances mechanistic interpretability by training a second, replacement model to reproduce the base model’s MLP outputs using sparsely active features. Many of these features correspond to human‑readable concepts such as “Texas” or “the Olympics” and can be read off as the model processes a prompt.
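The idea of sparse features standing in for an MLP output can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the weights are hand-picked rather than learned, and the feature labels, `ENCODER`, `DECODER`, and `THRESHOLD` are all hypothetical names chosen for the example.

```python
# Toy sketch of a sparse "replacement" layer. All weights and labels are
# invented for illustration; a real replacement model learns them from data.

FEATURE_NAMES = ["Texas", "the Olympics", "capital city"]

# Encoder rows read the MLP input; decoder rows write each feature's
# contribution to the reconstructed MLP output.
ENCODER = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
DECODER = [
    [0.5, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.5, 0.5],
]
THRESHOLD = 0.5  # activations at or below this value are zeroed out


def encode(mlp_input):
    """Return sparse feature activations (dot product, then threshold)."""
    activations = []
    for row in ENCODER:
        pre = sum(w * x for w, x in zip(row, mlp_input))
        activations.append(pre if pre > THRESHOLD else 0.0)
    return activations


def decode(activations):
    """Reconstruct the MLP output as a weighted sum of decoder rows."""
    dim = len(DECODER[0])
    return [
        sum(a * DECODER[i][d] for i, a in enumerate(activations))
        for d in range(dim)
    ]


def active_features(activations):
    """List the human-readable labels of features that fired."""
    return [FEATURE_NAMES[i] for i, a in enumerate(activations) if a > 0.0]


acts = encode([2.0, 0.1, 1.0, 0.3])
print(active_features(acts))  # ['Texas', 'capital city']
print(decode(acts))           # [1.0, 0.0, 1.5, 2.5]
```

Only two of the three features fire for this input, so the reconstructed output can be explained entirely in terms of a short list of named concepts.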

Researchers then trace how these concepts interact during a forward pass, in effect drawing a graph of the computation. The method shows that models can perform genuine multi‑step reasoning: asking for the capital of the state containing Dallas activates Dallas, then Texas, then Austin. Such stepwise inference resembles symbolic reasoning and shows that LLMs build semantic chains internally while generating text.
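The Dallas example can be pictured as a small weighted graph in which each edge records how strongly one concept drives another. The sketch below is a simplification under stated assumptions: the `EDGES` table and its weights are invented for illustration, and following the single strongest edge at each step is a stand-in for the far richer graphs the researchers build.

```python
# Hypothetical concept graph. Edge weights are made up for illustration
# and do not come from the paper.
EDGES = {
    "Dallas": {"Texas": 0.9, "sports team": 0.2},
    "Texas": {"Austin": 0.8, "Houston": 0.3},
    "sports team": {},
    "Austin": {},
    "Houston": {},
}


def strongest_path(start):
    """Follow the highest-weight outgoing edge until a node has none.

    Assumes the graph is acyclic, as a single forward pass would be.
    """
    path = [start]
    node = start
    while EDGES.get(node):
        successors = EDGES[node]
        node = max(successors, key=successors.get)
        path.append(node)
    return path


print(strongest_path("Dallas"))  # ['Dallas', 'Texas', 'Austin']
```

The intermediate "Texas" node is the point of the example: the answer is reached through a step the prompt never states.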

Similar analysis extends beyond language models. In 2022, DeepMind showed that AlphaZero’s hidden layers encode human chess concepts such as “in check” or “pinning”, even though the system learned through self‑play without any human game data. These findings suggest that mechanistic decoding can uncover domain‑specific knowledge across AI systems, offering researchers and developers a window into how models generalize.

Beyond curiosity, understanding internal reasoning lets engineers steer behavior and spot missteps. For example, Claude 3.5 Haiku’s integer‑addition routine splits the task into parallel pathways, an algorithm unlike the one humans are taught that still delivers correct answers. By mapping such hidden strategies, teams can adjust training to favor more efficient, transparent methods, which supports safety and fairness goals.
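To make "parallel pathways" concrete, here is a loose Python analogy. The article does not spell out what each pathway computes, so the split used here (one pathway for a rough magnitude, one for the exact ones digit) is an assumption made for illustration, as are all function names. It is not a description of the model's actual circuit.

```python
# Loose analogy for addition via two parallel pathways.
# Assumes non-negative integers; the pathway split is hypothetical.

def approximate_pathway(a, b):
    """Rough magnitude: add a to b rounded to the nearest ten."""
    b_rounded = ((b + 5) // 10) * 10
    return a + b_rounded


def ones_digit_pathway(a, b):
    """Exact last digit of the sum, computed from the last digits alone."""
    return (a % 10 + b % 10) % 10


def combine(estimate, ones_digit):
    """Pick the number near the estimate that ends in the right digit.

    Rounding b shifts it by at most 5 down or 4 up, so the true sum lies
    in [estimate - 5, estimate + 4]. That window holds ten consecutive
    integers, so exactly one of them has the required ones digit.
    """
    for candidate in range(estimate - 5, estimate + 5):
        if candidate % 10 == ones_digit:
            return candidate
    raise ValueError("no candidate matched the ones digit")


def add(a, b):
    return combine(approximate_pathway(a, b), ones_digit_pathway(a, b))


print(add(36, 59))  # 95
```

Neither pathway carries digits the way schoolbook addition does, yet combining a coarse estimate with a precise last digit still pins down the exact answer.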