HeadlinesBriefing.com

Study Reveals Hidden Censorship in So‑Called Uncensored LLMs

Hacker News

Researchers at the Hacker News Research Workbench released a benchmark called the “flinch,” which measures how strongly safety-filtered pretrained models mute charged words without outright refusing. In their Polymarket test, they fine-tuned a LoRA on an “uncensored” model to trade word-based markets, yet the model kept softening the target term. Even a refusal-ablated Qwen 3.5-9B model marketed as “heretic” still down-weighted taboo words, calling into question what “uncensored” really means.
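A flinch of this kind can be approximated by comparing a model's log-probability for a charged target word against a matched neutral control word in the same carrier sentence. The sketch below shows only the metric, with hypothetical log-probabilities standing in for real model scores; the function names and numbers are illustrative assumptions, not the researchers' actual code.

```python
# Sketch of a "flinch" score: how much a model down-weights a target
# word relative to a neutral control word in the same carrier sentence.
# In a real audit the log-probabilities would come from an LM; here
# they are hypothetical so the metric itself is self-contained.

def flinch_score(logprob_target: float, logprob_control: float) -> float:
    """Positive score = the target word is suppressed relative to control."""
    return logprob_control - logprob_target

def mean_flinch(pairs):
    """Average flinch over (target_logprob, control_logprob) pairs,
    e.g. one pair per carrier sentence."""
    scores = [flinch_score(t, c) for t, c in pairs]
    return sum(scores) / len(scores)

# Hypothetical log-probs for one charged word in four carrier sentences:
pairs = [(-9.2, -6.1), (-8.7, -6.5), (-10.0, -7.2), (-9.5, -6.9)]
score = mean_flinch(pairs)
```

Averaging over several carrier sentences keeps the score from being dominated by one sentence where the word happens to be unusual for non-safety reasons.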

The team probed 1,117 charged words in four carrier sentences each, grouping the words into six categories such as anti-China content, slurs, and violence. Open-data models like EleutherAI’s Pythia-12B and AllenAI’s OLMo-2-13B formed a low-flinch baseline, while commercial releases from Google (Gemma-2-9B, Gemma-4-31B) and Alibaba (Qwen-base) showed markedly larger profiles on the six-category hexagonal radar, indicating stronger suppression of the target terms.
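The probe setup described above amounts to substituting each word into several neutral carrier templates and tagging each probe with its category, so per-category profiles can be aggregated later. The carrier templates and category names in this sketch are assumptions for illustration, not the study's actual probe set.

```python
from itertools import product

# Illustrative reconstruction of the probe construction: every charged
# word is slotted into each neutral "carrier" sentence, and probes keep
# their category label so six-way flinch profiles can be computed.
# These templates and categories are placeholders, not the real ones.

CARRIERS = [
    "The word {w} appeared in the transcript.",
    "She wrote {w} on the whiteboard.",
    "The market resolves YES if {w} is printed.",
    "He repeated {w} twice.",
]

def build_probes(words_by_category):
    """Expand {category: [words]} into (category, word, sentence) probes."""
    probes = []
    for category, words in words_by_category.items():
        for word, template in product(words, CARRIERS):
            probes.append((category, word, template.format(w=word)))
    return probes

# Two placeholder words in one category -> 2 words x 4 carriers = 8 probes.
probes = build_probes({"violence": ["riot", "bomb"]})
```

With 1,117 words and four carriers, the same expansion yields 4,468 probes.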

Applying an “abliteration” step, which removes the activation direction tied to refusal, produced a model that no longer blocks content but still exhibits a measurable flinch across the same probes. The result suggests that even advertised “uncensored” models retain implicit bias filters, so developers cannot rely on the label alone to guarantee free expression of sensitive language.
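The core of such an ablation step is linear algebra: project the refusal direction out of each hidden state, h' = h − (h·d̂)d̂, leaving a vector orthogonal to d̂. The numpy sketch below shows only that projection, with random vectors standing in for a real activation and a real refusal direction (which would be estimated from contrastive refusal/compliance prompts).

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of `hidden` along the refusal direction:
    h' = h - (h . d_hat) * d_hat, so h' has zero projection onto d_hat."""
    d_hat = direction / np.linalg.norm(direction)
    return hidden - (hidden @ d_hat) * d_hat

rng = np.random.default_rng(0)
h = rng.normal(size=64)   # stand-in for one hidden-state vector
d = rng.normal(size=64)   # stand-in for the estimated refusal direction
h_ablated = ablate_direction(h, d)

# The ablated state now has (numerically) no component along d.
residual = abs(h_ablated @ (d / np.linalg.norm(d)))
```

The study's point is that this removes the *refusal* direction but not the separate, subtler down-weighting of charged tokens, which the flinch probes still detect.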

The findings urge practitioners to audit model outputs rather than rely on marketing labels. Developers seeking truly open‑ended generation must consider additional mitigation strategies or accept that some level of implicit censorship persists even in “uncensored” releases. The benchmark also provides a reusable probe for future comparisons, giving the community a concrete metric to track how far future models drift from the open‑data baseline.
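Tracking drift from the open-data baseline reduces to comparing per-category flinch profiles; a mean absolute gap is one simple choice of distance. The category names and numbers below are invented for illustration, not measurements from the study.

```python
# Sketch of the proposed audit metric: distance between a candidate
# model's per-category flinch profile and the open-data baseline.
# Profiles and values here are hypothetical.

def profile_drift(model_profile: dict, baseline_profile: dict) -> float:
    """Mean absolute per-category gap between two flinch profiles."""
    categories = model_profile.keys() & baseline_profile.keys()
    gaps = [abs(model_profile[c] - baseline_profile[c]) for c in categories]
    return sum(gaps) / len(gaps)

baseline = {"slurs": 0.4, "violence": 0.3}    # hypothetical Pythia/OLMo-style numbers
candidate = {"slurs": 2.1, "violence": 1.5}   # hypothetical commercial release
drift = profile_drift(candidate, baseline)
```

Re-running the same probes on each new release and reporting this single number is what makes the benchmark reusable as a community metric.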