
Anthropic maps LLM persona space to stabilize Assistant character

Hacker News: Front Page
A new paper from Anthropic researchers maps how large language models organize internal character representations. By analyzing neural activations across models such as Llama 3.3 70B, they identified an "Assistant Axis" that organizes a persona space: one end corresponds to helpful, professional archetypes, the other to more fantastical roles, revealing a structure that generalizes across models.

The research shows this axis exists even before post-training, suggesting the Assistant character inherits traits from human helper archetypes such as therapists and consultants. Steering models along the axis demonstrated a causal link: pushing activations away from the Assistant end made models adopt alternative identities and fabricate backstories, while capping that drift stabilized behavior and prevented harmful outputs.
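To make the steering idea concrete, here is a minimal numpy sketch. It is not the paper's code: the axis derivation (difference of mean activations between assistant-like and roleplay prompts), the toy data, and the names `steer`, `assistant_acts`, and `roleplay_acts` are all illustrative assumptions about how such a direction could be computed and applied.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): derive a persona
# axis as the difference of mean hidden-state activations between two
# prompt sets, then steer an activation along it.

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Toy stand-ins for activations collected on assistant-like vs. roleplay prompts
assistant_acts = rng.normal(loc=0.5, scale=1.0, size=(100, d))
roleplay_acts = rng.normal(loc=-0.5, scale=1.0, size=(100, d))

# "Assistant Axis": normalized difference of the two mean activations
axis = assistant_acts.mean(axis=0) - roleplay_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def steer(activation: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along the axis: positive alpha moves toward the
    Assistant direction, negative alpha away from it."""
    return activation + alpha * axis

h = rng.normal(size=d)
# Because axis is unit-norm, steering by alpha shifts the projection by alpha.
print(h @ axis, steer(h, 2.0) @ axis)
```

In a real model the activations would come from a chosen residual-stream layer, and the steered vector would be written back during the forward pass; the sketch only shows the vector arithmetic.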

This work provides a concrete tool for monitoring and controlling model personas. By tracking neural activity along the Assistant Axis, developers can detect when a model begins to drift from its intended character. The activation-capping technique offers a practical method to constrain behavior, addressing a key challenge in deploying reliable and safe AI assistants.
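The activation-capping idea can be sketched in a few lines. This is a hedged illustration, not the authors' method: the function name `cap_drift`, the floor threshold, and the toy data are hypothetical, and a real deployment would apply the correction inside the model's forward pass rather than to a standalone vector.

```python
import numpy as np

def cap_drift(activation: np.ndarray, axis: np.ndarray, floor: float) -> np.ndarray:
    """If the projection onto the unit-norm persona axis falls below
    `floor`, add just enough of the axis to restore it; otherwise the
    activation passes through unchanged. (Illustrative sketch.)"""
    proj = float(activation @ axis)
    if proj < floor:
        activation = activation + (floor - proj) * axis
    return activation

# Toy demonstration with a unit axis along the first coordinate
axis = np.zeros(8)
axis[0] = 1.0
drifted = -3.0 * axis + np.arange(8) * 0.1  # projection well below the floor
capped = cap_drift(drifted, axis, floor=0.5)
print(capped @ axis)  # restored to the floor; other coordinates untouched
```

Monitoring is the same projection without the correction: tracking `activation @ axis` over a conversation flags drift before it is large enough to need capping.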