HeadlinesBriefing.com

talkie: A 1930s-era AI that predicts the future from the past

Hacker News

A new research project has released talkie-1930-13b-base, a 13-billion-parameter language model trained exclusively on pre-1931 English text. It is a "vintage" LM: its 260B-token corpus draws only on books, newspapers, scientific journals, patents, and case law published through the end of 1930. Because these sources have entered the US public domain, the corpus can be used without restriction.

The team tested the vintage model against a "modern twin" with identical architecture trained on FineWeb data. Given in-context examples of Python code, the 1930 model learned to write simple programs despite having no knowledge of digital computers; for instance, it implemented a rotation-cipher decoding function by recognizing that decoding is the inverse of encoding. The researchers also examined how the model would have predicted events after its training cutoff, finding that surprisal rose notably through the 1950s and 1960s before plateauing.
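The article does not reproduce the model's actual output, but the task it describes is a standard one. A minimal sketch of a rotation (Caesar) cipher pair, where the decoder is simply the encoder run with the negated shift, might look like this (function names are illustrative, not from the project):

```python
def rot_encode(text: str, k: int) -> str:
    """Shift each letter forward by k positions, wrapping within the alphabet.
    Non-letter characters are passed through unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr(base + (ord(ch) - base + k) % 26))
        else:
            out.append(ch)
    return ''.join(out)


def rot_decode(text: str, k: int) -> str:
    """Decoding is the inverse of encoding: shift back by k (forward by -k)."""
    return rot_encode(text, -k)
```

The key insight attributed to the model, that a decoder need not be written from scratch but can reuse the encoder with an inverted shift, is exactly what `rot_decode` does: `rot_decode(rot_encode(s, k), k)` returns `s` for any shift `k`.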

This contamination-free setup enables experiments on LMs' ability to generalize beyond their training data that are impossible with modern corpora. The team plans to scale up to a GPT-3-class model and release it this summer, and estimates that the historical corpus could grow to over a trillion tokens.