A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks
Nikunj Saunshi, Sadhika Malladi, Sanjeev Arora

TL;DR
This paper provides a mathematical explanation for why autoregressive language models excel at downstream tasks like text classification, linking next word prediction to classification performance through formal analysis and empirical validation.
Contribution
It formalizes the connection between language modeling and classification, showing that near-optimal language models learn features useful for classification with quantifiable error bounds.
Findings
Language modeling benefits downstream classification tasks.
Models with low cross-entropy learn linearly solvable features.
A new objective function improves classification performance.
Abstract
Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Sentiment Analysis and Opinion Mining
