talkie: An AI model trained on pre-1931 data

News
Wednesday, 29 April 2026 at 07:00
In April 2026, researchers led by Nick Levine unveiled a striking new AI model: talkie-1930, a language model trained exclusively on texts from before 1931. The project, which also involved David Duvenaud and Alec Radford, shows how artificial intelligence behaves without modern knowledge or internet data.
The model offers a rare window into AI development and immediately raises questions about data, bias, and the future of artificial intelligence.

What is talkie-1930, and why does it matter?

Talkie-1930 is a “vintage language model” that uses only historical texts. It was trained on 260 billion tokens from books, newspapers, and documents published before 1931, leaving it unaware of modern events or technologies.
That approach makes the model fundamentally different from today’s AI systems. While modern models lean on internet-scale data, Talkie shows how AI operates without that influence. It’s a valuable experiment for researchers seeking to understand how data shapes AI outputs.
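As a rough illustration of what "only texts from before 1931" implies for data preparation, here is a minimal sketch of a publication-date filter. The `Document` type, its fields, and the rule of dropping undated documents are assumptions made for this example; the article does not describe the team's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional

CUTOFF_YEAR = 1931  # keep only documents published strictly before this year


@dataclass
class Document:
    # Hypothetical record type, not the project's real schema.
    title: str
    year: Optional[int]  # publication year, None when unknown
    text: str


def is_vintage(doc: Document, cutoff: int = CUTOFF_YEAR) -> bool:
    """True only when the publication year is known and before the cutoff."""
    return doc.year is not None and doc.year < cutoff


def filter_corpus(docs: List[Document], cutoff: int = CUTOFF_YEAR) -> List[Document]:
    """Drop documents that are dated at or after the cutoff, or not dated at all."""
    return [d for d in docs if is_vintage(d, cutoff)]


docs = [
    Document("Old newspaper", 1925, "..."),
    Document("Cutoff-year book", 1931, "..."),
    Document("Undated pamphlet", None, "..."),
]
print([d.title for d in filter_corpus(docs)])
```

Dropping undated documents is the conservative choice in this sketch: a single mislabeled modern text would undermine the point of a strict cutoff.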

What makes this model technically notable?

With 13 billion parameters, Talkie is the largest model of its kind. Architecturally it resembles modern systems, but its training data is entirely different.
Key technical takeaways:
  • It underperforms on factual knowledge compared to modern AI
  • The gap halves when “anachronistic” questions are removed
  • It shows surprisingly strong language skills despite limited data
  • It can handle simple programming tasks when shown worked examples
These results suggest language competence can be decoupled from up-to-date knowledge. That matters for AI development in domains where reliability and control are critical, such as government and education.
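The comparison behind the second takeaway can be sketched as follows: measure the accuracy gap between a modern baseline and the vintage model on all questions, then again after dropping questions that require post-1930 knowledge. The `Result` type, the `anachronistic` flag, and the toy results are all made up for illustration; they are not the project's data or evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Result:
    question_id: str
    anachronistic: bool    # hypothetical flag: answer requires post-1930 knowledge
    vintage_correct: bool  # did the vintage model answer correctly?
    modern_correct: bool   # did the modern baseline answer correctly?


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def accuracy_gap(results: List[Result]) -> float:
    """Modern accuracy minus vintage accuracy over the given questions."""
    return (accuracy(r.modern_correct for r in results)
            - accuracy(r.vintage_correct for r in results))


# Toy, made-up results for illustration only.
results = [
    Result("q1", False, True, True),
    Result("q2", False, True, True),
    Result("q3", False, True, True),
    Result("q4", False, False, True),
    Result("q5", True, False, True),
    Result("q6", True, False, True),
    Result("q7", True, False, True),
    Result("q8", True, False, True),
]

gap_all = accuracy_gap(results)
gap_filtered = accuracy_gap([r for r in results if not r.anachronistic])
print(f"gap on all questions: {gap_all:.3f}")
print(f"gap without anachronistic questions: {gap_filtered:.3f}")
```

In this toy set the gap shrinks once the anachronistic questions are removed, because those are exactly the questions a pre-1931 model has no way of answering.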

Why are ‘vintage’ AI models compelling?

Vintage models offer a controlled testbed for AI research. Because their training data stops before 1931, benchmarks and answers written later cannot have leaked into it. That sidesteps “data contamination,” the well-known issue where models regurgitate answers from their training sets.
This opens up new possibilities:
  • Cleaner evaluation of generalization
  • Insight into how AI “discovers” new knowledge
  • Comparisons across datasets and time periods
Researchers use Talkie to test whether a model can predict or reconstruct later inventions from earlier knowledge, such as the theories of Albert Einstein or early computer science concepts.
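To make the contamination problem concrete, here is a minimal sketch of a common heuristic: measure how many word n-grams of a benchmark item also appear verbatim in a training text. This is a generic check, not a method attributed to the Talkie team; the function names and the choice of 5-grams are assumptions for the example.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, training_text: str, n: int = 5) -> float:
    """Share of the item's n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


item = "what is the capital of france today"
leaked = "quiz answers: what is the capital of france today paris"
clean = "the capital was moved several times during the war"
print(overlap_fraction(item, leaked))
print(overlap_fraction(item, clean))
```

A high overlap suggests the model may have seen the item during training. With a strict historical cutoff, benchmark items written after the cutoff should score near zero by construction.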

What are the limits and risks?

Talkie’s limitations are immediately visible. The model mirrors early 20th-century norms and values. That means:
  • Gender roles often appear traditional
  • Social inequality is implicitly normalized
  • Modern perspectives are entirely absent
Data quality also plays a major role. Historical texts are often digitized via OCR, which introduces errors and can reduce performance to roughly 30 percent of what clean text would allow.
These constraints underline how deeply AI depends on its training data.
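OCR noise of the kind mentioned above is often quantified with a character error rate (CER): the edit distance between the OCR output and a clean reference, divided by the reference length. The sketch below is a generic, dependency-free version; it is not the project's measurement, and the sample strings are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the clean reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "the cat"
ocr_output = "tbe cat"  # invented example of a typical OCR confusion (h read as b)
print(character_error_rate(reference, ocr_output))
```

Tracking a metric like this across OCR engines would be one way to check whether better OCR actually yields cleaner training text.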

What’s next for the project?

The team aims to scale Talkie quickly. They’re working on:
  • Larger datasets, potentially exceeding 1 trillion tokens
  • Better OCR technology for historical texts
  • Multilingual expansion
  • New evaluation methods for AI predictive capabilities
The end goal is a model at roughly GPT-3.5 level, built entirely on historical data.

Conclusion: Looking back to push AI forward

Talkie-1930 suggests that AI progress isn’t just about more data; it’s also about different data. By mining the past, researchers gain sharper insight into how language models work, where bias emerges, and how AI evolves.
For the Netherlands, it’s a chance to assess AI more critically and strategically: not just what the technology can do, but what it learns from the world we feed it.