Probability Distributions Computed by Autoregressive Transformers

Andy Yang; Anej Svete; Jiaoda Li; Anthony Widjaja Lin; Jonathan Rawski; Ryan Cotterell; David Chiang

arXiv:2510.27118·cs.CL·February 23, 2026



Open Access · 3 Reviews

TL;DR

This paper characterizes the probability distributions that autoregressive transformer language models can express, revealing how their expressivity changes when used as probabilistic models versus language recognizers.

Contribution

It provides a theoretical analysis of the functions and distributions that transformer language models can represent, highlighting differences from non-probabilistic transformers.

Findings

01

Autoregressive use can increase transformer expressivity.

02

Probabilistic transformers can break certain equivalences.

03

Transformers' capabilities as language models are systematically characterized.

Abstract

Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
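The recognizer versus language-model distinction in the abstract can be illustrated with a minimal sketch. It uses the standard autoregressive factorization, in which the probability of a string is the product of its next-symbol probabilities times the probability of ending the string. The toy model `next_symbol_probs`, the `EOS` symbol, and the recognizer `recognizes` are hypothetical and chosen only for illustration. They are not constructions from the paper, whose formal definitions may differ.

```python
from typing import Callable, Dict, Sequence

EOS = "<eos>"  # hypothetical end-of-string symbol (an assumption of this sketch)

NextSymbolModel = Callable[[Sequence[str]], Dict[str, float]]


def next_symbol_probs(prefix: Sequence[str]) -> Dict[str, float]:
    """Hypothetical toy model: the distribution depends only on the last symbol."""
    if not prefix or prefix[-1] == "b":
        return {"a": 0.5, "b": 0.25, EOS: 0.25}
    return {"a": 0.25, "b": 0.5, EOS: 0.25}


def string_probability(string: Sequence[str], model: NextSymbolModel) -> float:
    """Language-model view: multiply next-symbol probabilities, then the EOS probability."""
    prob = 1.0
    for i, symbol in enumerate(string):
        prob *= model(string[:i]).get(symbol, 0.0)
    return prob * model(string).get(EOS, 0.0)


def recognizes(string: Sequence[str]) -> bool:
    """Recognizer view: a Boolean accept/reject decision (toy rule: string ends in 'a')."""
    return len(string) > 0 and string[-1] == "a"


if __name__ == "__main__":
    for s in (["a"], ["a", "b"], ["b", "a"]):
        print(s, recognizes(s), string_probability(s, next_symbol_probs))
```

The recognizer returns a single bit per string, while the language model assigns each string a real-valued weight. That is the sense in which the two settings can differ in expressivity.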

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

I should start by cautioning that my expertise with the topics discussed here is limited, though I agree that understanding the expressivity of transformer models is an important research direction, given their role in powering LLMs and other modern AI advancements, even close to a decade after their initial introduction. In my opinion, the strongest merit of the work is that of showing how results obtained in the Boolean classifier setup do not necessarily port to the autoregressive setup, whi

Weaknesses

While the main result is, in my opinion, that of showing a discrepancy between the Boolean classifier and autoregressive setups, most of the paper is devoted to proving that many equivalence results hold in both settings, thus somewhat reducing the novelty of most contributions in the work. The broken equivalence also seems to apply to rather peculiar configurations (subsets of LTL). Additionally, while considering the autoregressive setup is a step toward making the analysis more practically relevant,

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Angle is novel and interesting. I agree with the authors that there is a lack of literature on the topic of language *modelling* with Transformers.
- This work improves our understanding of the interplay of expressive power between i) Boolean and Real-valued models and ii) classifier and autoregressive models. Given the equivalencies drawn by the authors, the theoretical results have deep implications about many families of models.
- The authors provide extremely rigorous proofs and reductions

Weaknesses

- Although complete and precise in its writing style, I find the paper is very dense and not written in a way where key insights are easy to find/extract. See comments for actionable feedback.
- The paper has no experimental validation of theoretical claims it makes. It would be nice to at least have minimal experiments to support the results.
- This work makes few connections to practical settings, such as how their claims might account for empirical shortcomings of LLMs, and it does not discus

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Tightly written theoretical work with clear formal contributions.
* Addresses a meaningful gap in theory: expressive power of transformers as generative language models.
* Strong formal rigor, with proofs and precise definitions throughout the paper.

Weaknesses

* **Purely theoretical:** While the theoretical contributions are solid, there are no experimental results nor concrete applied examples to illustrate relevance for real-world transformer LMs. Given the conference venue, this limits perceived impact.
* **Accessibility:** The paper assumes familiarity with temporal logics, weighted automata, and semirings. This is appropriate for a specialized logic or theoretical CS audience, but is demanding for the general machine-learning readership at ICLR.


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis