Transformers Can Represent $n$-gram Language Models

Anej Svete; Ryan Cotterell

arXiv:2404.14994·cs.CL·June 21, 2024

TL;DR

This paper shows that transformer language models with hard or sparse attention can exactly represent any $n$-gram language model, which gives a concrete lower bound on their capacity to represent probability distributions over strings.

Contribution

The paper establishes that transformer LMs with hard or sparse attention can exactly represent any $n$-gram LM, moving the analysis of their representational capacity from language acceptance to probability distributions over strings.

Findings

01 Transformer LMs with hard or sparse attention can exactly represent any $n$-gram LM.

02 This result gives a concrete lower bound on the probabilistic representational capacity of transformer LMs.

03 The result is a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.

Abstract

Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
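To make the claim in the abstract more concrete, here is a minimal sketch of the underlying intuition. It is not the paper's construction. It assumes a toy trigram LM over a two-symbol alphabet, and it models "hard attention" as an argmax over position scores, so that each head retrieves exactly one of the previous $n-1$ symbols. All names (`toy_ngram_table`, `hard_attention_head`, and so on) and the toy probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product

BOS, EOS = "<s>", "</s>"
ALPHABET = ["a", "b"]
N = 3  # assumed order of the toy n-gram LM (a trigram model)


def toy_ngram_table(n=N):
    """Hypothetical n-gram LM: maps each (n-1)-symbol history to a
    distribution over the next symbol (including EOS)."""
    table = {}
    for history in product([BOS] + ALPHABET, repeat=n - 1):
        # Arbitrary positive weights that depend on the history.
        w_a = 1 + sum(1 for s in history if s == "a")
        w_b = 1 + sum(1 for s in history if s == "b")
        w_eos = 1
        total = w_a + w_b + w_eos
        table[history] = {"a": w_a / total, "b": w_b / total, EOS: w_eos / total}
    return table


def ngram_string_prob(string, table, n=N):
    """Probability of a string under the n-gram LM, by direct lookup."""
    padded = [BOS] * (n - 1) + list(string) + [EOS]
    prob = 1.0
    for t in range(n - 1, len(padded)):
        history = tuple(padded[t - n + 1 : t])
        prob *= table[history][padded[t]]
    return prob


def hard_attention_head(prefix, offset):
    """One idealized hard-attention head: scores peak at the position
    `offset` steps before the symbol being predicted, and the head
    returns the symbol at the argmax position only."""
    t = len(prefix)  # index of the symbol to be predicted
    scores = [-abs((t - offset) - j) for j in range(t)]
    best = max(range(t), key=lambda j: scores[j])  # hard attention = argmax
    return prefix[best]


def attention_next_symbol_dist(prefix, table, n=N):
    """Combine n-1 heads (offsets n-1, ..., 1) to recover the history,
    then read the next-symbol distribution from the n-gram table."""
    padded = [BOS] * (n - 1) + list(prefix)
    history = tuple(hard_attention_head(padded, k) for k in range(n - 1, 0, -1))
    return table[history]


def attention_string_prob(string, table, n=N):
    """Probability of a string using the attention-based lookup."""
    symbols = list(string) + [EOS]
    prob = 1.0
    for t, symbol in enumerate(symbols):
        prob *= attention_next_symbol_dist(symbols[:t], table, n)[symbol]
    return prob
```

In this sketch, `attention_string_prob` and `ngram_string_prob` assign the same probability to every string, because the $n-1$ heads together recover exactly the history that the $n$-gram LM conditions on. The final table lookup stands in for whatever output layer a transformer would use, which is an assumption of the sketch rather than a detail taken from the abstract.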
