What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

TL;DR
This paper asks what information the embeddings of autoregressive models should encode. It links the next-token prediction objective to predictive sufficient statistics and shows empirically that transformers encode the latent generating distributions this analysis predicts.
Contribution
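It connects autoregressive training to predictive sufficient statistics and identifies three settings in which the optimal content of embeddings can be characterized: independent identically distributed data, latent state models, and discrete hypothesis spaces. The analysis is supported by empirical evidence.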
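As a rough statement of the idea, a statistic T of the observed sequence is predictively sufficient when conditioning on it gives the same next-observation distribution as conditioning on the full sequence. The symbols x_1, ..., x_n and T are notation assumed here for illustration and are not taken from the paper.

```latex
p(x_{n+1} \mid x_1, \dots, x_n) = p\bigl(x_{n+1} \mid T(x_1, \dots, x_n)\bigr)
```

Under this reading, an embedding that retains T loses nothing that matters for next-token prediction.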
Findings
Transformer embeddings encode the relevant latent quantities: sufficient statistics and posterior distributions.
Embeddings perform well in out-of-distribution scenarios.
Transformers do not rely on token memorization in the studied settings.
Abstract
Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given…
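The i.i.d. setting can be illustrated with a small sketch. It assumes a Bernoulli sequence with a Beta prior on the bias, a textbook example chosen here and not necessarily one of the paper's experiments. The function names are hypothetical. The point is that the next-observation prediction computed from the whole sequence matches the one computed from the sufficient statistic (number of ones, sequence length), so an embedding needs only that pair.

```python
from fractions import Fraction


def predictive_from_sequence(seq, alpha=1, beta=1):
    """P(next = 1 | seq), updating a Beta(alpha, beta) prior one observation at a time."""
    a, b = alpha, beta
    for x in seq:
        if x == 1:
            a += 1  # one more observed 1
        else:
            b += 1  # one more observed 0
    return Fraction(a, a + b)


def predictive_from_statistic(ones, n, alpha=1, beta=1):
    """Same prediction, computed only from the sufficient statistic (ones, n)."""
    return Fraction(alpha + ones, alpha + beta + n)


# Two sequences with different orderings but the same counts.
seq_a = [1, 1, 0, 0, 1]
seq_b = [0, 1, 0, 1, 1]

print(predictive_from_sequence(seq_a))   # 4/7
print(predictive_from_sequence(seq_b))   # 4/7
print(predictive_from_statistic(3, 5))   # 4/7
```

Both sequences yield the same prediction because they share the same counts, which is what it means for the counts to summarize everything the sequence says about the next observation.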
Taxonomy
Topics: Topic Modeling
