Arrows of Time for Large Language Models
Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

TL;DR
This paper investigates the time asymmetry in large language models, revealing a consistent difference in their ability to predict next versus previous tokens, and provides a theoretical explanation for this phenomenon.
Contribution
It uncovers a subtle time asymmetry in LLMs' predictive abilities and offers a novel information-theoretic framework explaining its emergence due to sparsity and complexity.
Findings
Empirical evidence of time asymmetry in LLMs' perplexity scores.
Theoretical explanation linking asymmetry to sparsity and computational complexity.
Consistency of asymmetry across modalities and model sizes.
Abstract
We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
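The information-theoretic claim in the abstract can be checked on a toy example. The sketch below is not the authors' code: it builds a small random joint distribution over short token sequences (the vocabulary size, sequence length, and helper names `make_joint`, `marginal`, and `chain_log_loss` are all illustrative assumptions) and computes the expected log-loss of the exact chain-rule factorization in the forward and the backward direction. Both should equal the entropy of the joint distribution, since each factorization multiplies back to the same joint probability.

```python
import itertools
import math
import random


def make_joint(vocab_size, length, seed=0):
    """Random strictly positive joint distribution over token sequences (toy assumption)."""
    rng = random.Random(seed)
    seqs = list(itertools.product(range(vocab_size), repeat=length))
    weights = [rng.random() + 0.1 for _ in seqs]
    z = sum(weights)
    return {s: w / z for s, w in zip(seqs, weights)}


def marginal(joint, positions):
    """Marginal distribution of the tokens at the given positions."""
    out = {}
    for seq, p in joint.items():
        key = tuple(seq[i] for i in positions)
        out[key] = out.get(key, 0.0) + p
    return out


def chain_log_loss(joint, order):
    """Expected total log-loss (in nats) of the exact chain rule, predicting positions in `order`."""
    total = 0.0
    for k in range(len(order)):
        context, target = order[:k], order[:k + 1]
        p_ctx = marginal(joint, context)   # p(tokens already seen)
        p_both = marginal(joint, target)   # p(tokens already seen, next token in `order`)
        for seq, p in joint.items():
            c = tuple(seq[i] for i in context)
            b = tuple(seq[i] for i in target)
            # -log p(next token in `order` | tokens already seen), weighted by p(seq)
            total -= p * math.log(p_both[b] / p_ctx[c])
    return total


joint = make_joint(vocab_size=3, length=4)
forward = chain_log_loss(joint, [0, 1, 2, 3])    # predict the next token
backward = chain_log_loss(joint, [3, 2, 1, 0])   # predict the previous token
entropy = -sum(p * math.log(p) for p in joint.values())

print(f"forward  = {forward:.6f}")
print(f"backward = {backward:.6f}")
print(f"entropy  = {entropy:.6f}")
```

The three printed values agree up to floating-point error because the conditionals here are exact. A trained model only approximates these conditionals, so any gap between its forward and backward log-perplexity reflects how learnable each direction is, which is the asymmetry the abstract attributes to sparsity and computational complexity.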
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
