Talkie: 13B LLM with Pre-1931 Knowledge Only
the gist
Talkie is a 13B model trained on 260B tokens of pre-1931 text to test AI reasoning without modern web contamination, probing whether it can learn code and predict the future from scratch.
Vintage Model Design Eliminates Contamination
Talkie is a 13 billion parameter language model trained exclusively on 260 billion tokens of historical English text from old newspapers, patents, scientific journals, and books, with a cutoff at the end of 1930 (set by US copyright). This creates a contamination-free baseline for distinguishing AI reasoning from memorization, since modern models like ChatGPT, Claude, and Gemini ingest web data that includes Reddit threads and AI-generated content. A vintage model like Talkie knows nothing of post-1931 events: it has never heard of World War 2, interprets 'the internet' as something to do with internal revenue tax, and speaks in period slang such as 'bosch rot', 'fudge', and 'humbug' rather than modern idiom.
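A minimal sketch of the cutoff filtering this implies, assuming each corpus document carries a publication date; the field names and schema here are illustrative, not the project's actual pipeline:

```python
from datetime import date

CUTOFF = date(1931, 1, 1)  # keep only material dated through the end of 1930

def keep_document(doc: dict) -> bool:
    """Keep a corpus document only if its recorded publication date falls before the cutoff.

    Assumes a hypothetical schema where each document carries an ISO-format
    'published' field, e.g. {"text": "...", "published": "1928-06-14"}.
    """
    return date.fromisoformat(doc["published"]) < CUTOFF

corpus = [
    {"text": "Patent filing on an improved carburetor", "published": "1928-06-14"},
    {"text": "Newspaper column on television broadcasts", "published": "1935-03-02"},
]
pre_1931 = [doc for doc in corpus if keep_document(doc)]  # keeps only the 1928 patent
```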
Reasoning Tests via Code Learning and Forecasting
Researchers give Talkie few-shot Python examples even though it has no concept of a computer (it reads the word as referring to a human computator). Talkie passes basic HumanEval Python tests a handful of times out of 100 attempts by generating genuinely new one-line functions, for example swapping addition for subtraction in a decode function (sketched below), which shows it grasps the inverse relationship rather than reciting memorized code. Forecasting tests score the 'surprisingness' of post-1931 New York Times 'On This Day' events, showing spikes in the 1950s-60s; performance improves with model size but decays over longer horizons. Asked about the future, Talkie predicts no further European wars and praises a certain Austrian man as an 'extraordinary personality' for running German administration efficiently.
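A sketch of the sort of one-line inverse completion described above; the encode_shift/decode_shift pair is an illustration modeled on HumanEval-style tasks, not necessarily the exact problem the researchers used:

```python
def encode_shift(s):
    """Shift every lowercase letter of s forward by 5 places in the alphabet."""
    return "".join(chr(((ord(ch) - ord("a") + 5) % 26) + ord("a")) for ch in s)

# A passing one-line completion swaps the addition for a subtraction -- it only
# works if the model grasps that decoding must invert the encoding step:
def decode_shift(s):
    """Undo encode_shift: shift every lowercase letter of s back by 5 places."""
    return "".join(chr(((ord(ch) - ord("a") - 5) % 26) + ord("a")) for ch in s)

assert decode_shift(encode_shift("hello")) == "hello"
```

One plausible reading of the 'surprisingness' score (the exact metric is not spelled out here) is the average negative log-probability the model assigns to an event headline; a toy version with made-up numbers:

```python
def surprisal_per_token(token_logprobs: list[float]) -> float:
    """Average negative log-probability (nats per token) the model assigns to an event description.

    Higher values mean the model finds the event more surprising; scoring
    'On This Day' headlines year by year traces how surprise grows after 1931.
    """
    return -sum(token_logprobs) / len(token_logprobs)

print(surprisal_per_token([-2.1, -0.4, -5.3, -1.7]))  # ~2.38 nats per token, illustrative values
```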
Training Challenges and Mitigations
Temporal leakage occurs: Talkie knows the 1936 US president and his policies, likely from misdated scans or editorial footnotes, and data filtering continues to be refined to remove such cases. OCR on pre-1931 documents reaches about 30% of the quality of human-transcribed text, improved to roughly 70% with rule-based fixes, and a purpose-built vintage OCR system is in development. Post-training uses custom data drawn from 1930s etiquette manuals, letter-writing guides, cookbooks, dictionaries, encyclopedias, poetry, and fables to teach instruction-following and conversation via reinforcement learning. Because no 1930s-era judge exists, Claude 3.5 Sonnet judges the outputs, which leaks modern style such as listicles; a future vintage judge model may resolve this.
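A minimal sketch of what rule-based OCR cleanup can look like; the substitution table below is a hypothetical illustration of common period-scan errors, not the project's actual rule set:

```python
import re

# Hypothetical substitution rules for typical OCR confusions in 1920s-30s scans;
# a real rule set would be tuned against human-transcribed reference pages.
OCR_RULES = [
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),  # zero misread as 'o' inside a word
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),  # one misread as 'l' inside a word
    (re.compile(r"ﬁ"), "fi"),                    # unglue the 'fi' ligature
    (re.compile(r"\btle\b"), "the"),             # 'tle' misread of 'the' (crude, for illustration)
]

def clean_ocr(text: str) -> str:
    """Apply the substitution rules in order to a raw OCR transcription."""
    for pattern, replacement in OCR_RULES:
        text = pattern.sub(replacement, text)
    return text

print(clean_ocr("t1e c0mmittee met ﬁrst on Tuesday"))
# -> "the committee met first on Tuesday"
```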
Future Scale and Research Value
The researchers aim to train a GPT-3-level vintage model on a trillion tokens of historical text. Such models can test independent idea generation by probing whether they produce post-1931 patterns or papers without ever having seen them.