HeadlinesBriefing.com

Cross‑linguistic vocabularies follow universal statistical patterns

Hacker News

A team spanning Fudan, Harvard, and Stony Brook University has mapped vocabulary evolution across 22 languages using word‑embedding vectors and spatial statistics. By converting millions of words into points in a 300‑dimensional semantic space, the researchers could compare frequency clusters and hierarchical structures from medieval corpora to modern text. Their findings appear in Proceedings of the Royal Society B.

The analysis revealed that high‑frequency terms consistently group together, forming “popular” semantic neighborhoods that persist across unrelated tongues. The speed of word clustering follows a similar hierarchical profile in every language, and new lexemes tend to appear in bursts alongside contemporaneous peers, echoing the punctuated patterns observed in biological evolution. Remarkably, word counts obey Taylor's law, a power law in which the variance scales as a power of the mean.
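For readers unfamiliar with Taylor's law, the relationship can be written as variance = a · mean^b, so the exponent b is the slope of a straight line on log-log axes. The snippet below is a minimal sketch of how such an exponent could be estimated. It assumes each group is a list of counts for one word across several text samples. That grouping, along with the function name `taylor_exponent`, is an illustrative assumption and not a description of the authors' procedure.

```python
import math
import statistics


def taylor_exponent(groups):
    """Fit log(variance) = log(a) + b * log(mean) by ordinary least squares.

    `groups` is a list of count lists (hypothetical layout: one list per word,
    one entry per text sample). Returns the pair (a, b).
    """
    xs, ys = [], []
    for counts in groups:
        mean = statistics.fmean(counts)
        var = statistics.pvariance(counts)
        # Groups with zero mean or zero variance have no logarithm; skip them.
        if mean > 0 and var > 0:
            xs.append(math.log(mean))
            ys.append(math.log(var))

    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = sxy / sxx                      # slope on log-log axes
    a = math.exp(y_bar - b * x_bar)    # prefactor recovered from the intercept
    return a, b
```

An exponent near 1 would indicate Poisson-like fluctuations, while larger values indicate that frequent items fluctuate disproportionately. The summary above does not report which value the study found.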

To reproduce these patterns, the authors built a stochastic model that couples a cumulative‑advantage mechanism with a von Mises–Fisher distribution, generating synthetic vocabularies that match both Zipf‑type frequency curves and the newly identified semantic‑time relationships. By showing that a simple statistical process can capture cross‑linguistic evolution, the work opens doors for AI‑driven historical linguistics and comparative cultural studies.
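To make the two ingredients concrete, here is a toy sketch of one way they could be combined. It is not the authors' model. The coupling rule is an assumption made for illustration: a new word is placed near a parent word that was itself chosen in proportion to its frequency. The parameter values (`new_word_prob`, `kappa`, `dim`) and all function names are hypothetical. Directions on the unit sphere are drawn with Wood's (1994) rejection sampler for the von Mises–Fisher distribution, and the dimension is kept small (8 instead of 300) so the example runs quickly.

```python
import math
import random


def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from a von Mises-Fisher distribution.

    `mu` must be a unit vector of length >= 2; `kappa` >= 0 controls how
    tightly samples concentrate around `mu`. Uses Wood's rejection scheme.
    """
    d = len(mu)
    b = (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * math.log(1.0 - x0 ** 2)
    while True:
        z = rng.betavariate((d - 1) / 2.0, (d - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        if kappa * w + (d - 1) * math.log(1.0 - x0 * w) - c >= math.log(u):
            break
    # Uniform direction orthogonal to mu: project a Gaussian vector, normalize.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(gi * mi for gi, mi in zip(g, mu))
    t = [gi - dot * mi for gi, mi in zip(g, mu)]
    norm = math.sqrt(sum(ti * ti for ti in t))
    s = math.sqrt(max(0.0, 1.0 - w * w))
    return [w * mi + s * ti / norm for mi, ti in zip(mu, t)]


def simulate_vocabulary(steps, new_word_prob=0.1, kappa=50.0, dim=8, seed=0):
    """Grow a synthetic vocabulary; returns (vectors, counts).

    Assumed rule (illustrative only): at each step an existing word is picked
    with probability proportional to its count (cumulative advantage). With
    probability `new_word_prob` it spawns a new word near itself in semantic
    space; otherwise its own count is incremented.
    """
    rng = random.Random(seed)
    vectors = [[1.0] + [0.0] * (dim - 1)]  # arbitrary seed word
    counts = [1]
    for _ in range(steps):
        parent = rng.choices(range(len(counts)), weights=counts)[0]
        if rng.random() < new_word_prob:
            vectors.append(sample_vmf(vectors[parent], kappa, rng))
            counts.append(1)
        else:
            counts[parent] += 1
    return vectors, counts
```

Calling `simulate_vocabulary(2000)` returns one unit vector and one count per synthetic word. Sorting the counts in descending order and plotting them against rank on log-log axes is a quick way to check whether the frequencies look heavy-tailed, which is what this kind of rich-get-richer rule is expected to produce.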