Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies
Xiaoliang Luo, Xinyi Xu, Michael Ramscar, Bradley C. Love

TL;DR
This paper establishes a theoretical foundation for probability invariance in large language models and empirically shows that real models exhibit ordering biases, highlighting potential inconsistencies in their learned distributions.
Contribution
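It proves that, for any well-defined probability distribution, sequence perplexity is invariant to the order in which the sequence is factorized (forward, backward, or any permutation of the tokens), and it demonstrates that trained LLMs deviate from this invariance due to positional biases.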
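The invariance claim follows from the chain rule of probability. The sketch below is not the paper's proof; the notation is assumed for illustration: x = (x_1, ..., x_T) is a token sequence, P is the joint distribution, and σ is any permutation of the positions. Forward order corresponds to σ(t) = t and backward order to σ(t) = T + 1 - t.

```latex
% Assumed notation: x = (x_1, ..., x_T) is a token sequence, P is the joint
% distribution, and \sigma is any permutation of {1, ..., T}.
\begin{align}
P(x) &= \prod_{t=1}^{T} P\bigl(x_{\sigma(t)} \mid x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\bigr), \\
\mathrm{PPL}_{\sigma}(x)
  &= \exp\Bigl(-\tfrac{1}{T} \sum_{t=1}^{T}
       \log P\bigl(x_{\sigma(t)} \mid x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\bigr)\Bigr)
   = \exp\Bigl(-\tfrac{1}{T} \log P(x)\Bigr).
\end{align}
```

The right-hand side does not depend on σ, so every factorization order yields the same perplexity whenever the conditionals come from one consistent joint distribution.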
Findings
Theoretical proof of permutation invariance in sequence perplexity.
Empirical evidence of ordering biases in trained GPT-2 models.
Identification of self-attention differences as a source of bias.
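The theoretical side of these findings can be checked numerically on a toy example. The following is a minimal sketch, not the paper's code: it builds a small random joint distribution (an assumption made for illustration), derives the exact conditionals for every ordering by marginalization, and computes the perplexity of one sequence under each ordering. All function names here are hypothetical.

```python
import itertools
import math
import random


def make_joint(vocab_size, length, seed=0):
    """Build a random joint distribution over all sequences of a fixed length."""
    rng = random.Random(seed)
    seqs = list(itertools.product(range(vocab_size), repeat=length))
    weights = [rng.random() + 1e-3 for _ in seqs]  # strictly positive weights
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, fixed):
    """Probability that the positions in `fixed` take the given values."""
    return sum(
        p for s, p in joint.items()
        if all(s[i] == v for i, v in fixed.items())
    )


def perplexity(joint, seq, order):
    """Perplexity of `seq` via the chain rule applied along `order`."""
    log_p = 0.0
    seen = {}
    for pos in order:
        denom = marginal(joint, seen)        # P(tokens revealed so far)
        seen[pos] = seq[pos]
        numer = marginal(joint, seen)        # P(revealed tokens plus the new one)
        log_p += math.log(numer / denom)     # log of the exact conditional
    return math.exp(-log_p / len(seq))


def perplexities_by_order(joint, seq):
    """Perplexity of `seq` under every permutation of its positions."""
    positions = range(len(seq))
    return {
        order: perplexity(joint, seq, order)
        for order in itertools.permutations(positions)
    }


if __name__ == "__main__":
    joint = make_joint(vocab_size=3, length=4)
    seq = (2, 0, 1, 1)
    values = perplexities_by_order(joint, seq).values()
    spread = max(values) - min(values)
    print(f"orderings: {len(values)}, spread in perplexity: {spread:.2e}")
```

With exact conditionals the spread across orderings is zero up to floating-point error. A trained model only approximates these conditionals, which is where the ordering-dependent deviations described above can enter.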
Abstract
Can autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders? We prove formally that for any well-defined probability distribution, sequence perplexity is invariant under any factorization, including forward, backward, or arbitrary permutations. This result establishes a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols, we show that prior studies examining ordering effects suffer from critical methodological flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted orders on scientific text. We find systematic deviations from theoretical invariance across all orderings with arbitrary permutations strongly deviating from both forward and backward models, which largely (but not…
Taxonomy
Topics: Topic Modeling · Language and cultural evolution · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · Linear Layer · Weight Decay
