Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies

Xiaoliang Luo; Xinyi Xu; Michael Ramscar; Bradley C. Love

arXiv:2505.08739 · cs.CL · May 14, 2025

TL;DR

This paper establishes a theoretical foundation for probability invariance in large language models and empirically shows that real models exhibit ordering biases, highlighting potential inconsistencies in their learned distributions.

Contribution

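It proves that, for any well-defined probability distribution, sequence perplexity is invariant under any factorization order (forward, backward, or an arbitrary permutation of token positions), and it demonstrates that trained LLMs deviate from this invariance due to positional biases.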
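
The invariance claim follows from the chain rule: multiplying the conditionals in any order telescopes to the same joint probability, so the perplexity cannot depend on the order. The sketch below checks this numerically on a toy example. It is not the authors' code; the vocabulary, sequence length, random joint distribution, and all function names are assumptions made for illustration, and an exact joint table stands in for a trained model.

```python
import itertools
import math
import random

VOCAB = (0, 1, 2)   # toy vocabulary (hypothetical)
LENGTH = 4          # toy sequence length (hypothetical)


def random_joint(seed=0):
    """Build an arbitrary normalized joint distribution over all sequences."""
    rng = random.Random(seed)
    seqs = list(itertools.product(VOCAB, repeat=LENGTH))
    weights = [rng.random() + 1e-3 for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, fixed):
    """Probability that the positions in `fixed` (position -> token) hold those tokens."""
    return sum(p for s, p in joint.items()
               if all(s[i] == t for i, t in fixed.items()))


def log_prob_by_order(joint, seq, order):
    """Sum the log conditionals of `seq`, revealing positions in the given order."""
    fixed, total = {}, 0.0
    for pos in order:
        before = marginal(joint, fixed)
        fixed[pos] = seq[pos]
        after = marginal(joint, fixed)
        # log P(x_pos | tokens revealed so far)
        total += math.log(after / before)
    return total


def perplexity(joint, seq, order):
    """Per-token perplexity of `seq` under the factorization given by `order`."""
    return math.exp(-log_prob_by_order(joint, seq, order) / len(seq))


joint = random_joint()
seq = (2, 0, 1, 1)
orders = {
    "forward": [0, 1, 2, 3],
    "backward": [3, 2, 1, 0],
    "permuted": [2, 0, 3, 1],
}
ppl = {name: perplexity(joint, seq, order) for name, order in orders.items()}
for name, value in ppl.items():
    print(f"{name:9s} perplexity = {value:.12f}")
```

The three values agree up to floating-point error because each one reduces to the joint probability of the sequence raised to the power -1/LENGTH. A trained model only approximates its conditionals, which is where the empirical deviations reported in the paper can arise.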

Findings

1. Theoretical proof of permutation invariance in sequence perplexity.
2. Empirical evidence of ordering biases in trained GPT-2 models.
3. Identification of self-attention differences as a source of bias.

Abstract

Can autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders? We prove formally that for any well-defined probability distribution, sequence perplexity is invariant under any factorization, including forward, backward, or arbitrary permutations. This result establishes a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols, we show that prior studies examining ordering effects suffer from critical methodological flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted orders on scientific text. We find systematic deviations from theoretical invariance across all orderings with arbitrary permutations strongly deviating from both forward and backward models, which largely (but not…

Code & Models

Repositories

braingpt-lovelab/backwards (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Language and cultural evolution · Natural Language Processing Techniques

Methods: Attention Is All You Need · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · Linear Layer · Weight Decay