Truth as a Compression Artifact in Language Model Training
Konstantin Krestnikov

TL;DR
This paper investigates why language models trained on contradictory data tend to prefer correct answers. By analyzing how the compressibility of errors shapes predictions, it finds that models favor the most compressible answer rather than the truth as such.
Contribution
It introduces the Compression--Consistency Principle: a model's preference for correct answers depends on how structurally coherent the false information is, not on truth itself.
Findings
When errors are random, models extract the correct signal, with accuracy rising from 65% to 85% as model size increases.
When errors follow a coherent alternative rule system, accuracy drops to chance (~45--51%): the model cannot distinguish the false system from the truth.
A single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it, which points to structural coherence of the false information as the deciding factor.
Abstract
Why do language models trained on contradictory data prefer correct answers? In controlled experiments with small transformers (3.5M--86M parameters), we show that this preference tracks the compressibility structure of errors rather than truth per se. We train GPT-2 style models on corpora where each mathematical problem appears with both correct and incorrect solutions -- a denoising design that directly models conflicting information about the same fact. When errors are random, models extract the correct signal with accuracy scaling from 65% to 85% with model size. When errors follow a coherent alternative rule system, accuracy drops to chance (~45--51%): the model cannot distinguish the false system from truth. A multi-rule experiment reveals a sharp crossover: a single coherent alternative rule eliminates truth bias entirely, but adding a second competing rule restores most of it…
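The denoising design described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's actual setup: two-digit addition as the task, an "always off by one" rule as the coherent alternative, and the hypothetical helper names `build_pairs`, `render`, and `paired_accuracy`. The last function assumes accuracy is measured as the fraction of problems where a scorer prefers the correct solution over the incorrect one.

```python
import random

def build_pairs(n, mode, seed=0):
    """Return n tuples (a, b, right, wrong) for toy addition problems.

    mode="random":   the wrong answer has an unpredictable nonzero offset.
    mode="coherent": the wrong answer follows one fixed alternative rule
                     (here, hypothetically, always right + 1).
    """
    rng = random.Random(seed)
    offsets = [d for d in range(-9, 10) if d != 0]
    pairs = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        right = a + b
        if mode == "random":
            wrong = right + rng.choice(offsets)
        elif mode == "coherent":
            wrong = right + 1
        else:
            raise ValueError(f"unknown mode: {mode}")
        pairs.append((a, b, right, wrong))
    return pairs

def render(pairs, seed=0):
    """Emit one training line per solution, so every problem appears
    once with its correct answer and once with its incorrect answer."""
    lines = []
    for a, b, right, wrong in pairs:
        lines.append(f"{a}+{b}={right}")
        lines.append(f"{a}+{b}={wrong}")
    random.Random(seed).shuffle(lines)
    return lines

def paired_accuracy(pairs, score):
    """Fraction of problems where score(a, b, answer) is strictly higher
    for the correct answer than for the incorrect one."""
    wins = sum(score(a, b, right) > score(a, b, wrong)
               for a, b, right, wrong in pairs)
    return wins / len(pairs)
```

In practice `score` would be something like a trained model's log-probability of the answer given the problem. A scorer that cannot tell the two solutions apart beyond guessing would land near 0.5 on this measure, which is the sense in which the coherent-rule condition is described as chance level.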
