TL;DR
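This paper investigates how language models (LMs) internally represent modal categories (whether a sentence describes something possible, impossible, or nonsensical) and finds that these representations align with human judgments of the same categories, shedding light on modal categorization in both models and humans.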
Contribution
The study identifies linear modal difference vectors in LMs that reliably distinguish modal categories, correlate with human judgments, and emerge in a consistent order as models become more competent.
Findings
Modal difference vectors emerge in a consistent order as models become more competent.
These vectors can predict human judgments of modal categories.
LM representations reflect human-like distinctions of possibility and impossibility.
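To make the idea of a linear "difference vector" concrete, here is a minimal sketch. It is not the paper's procedure: it assumes the direction is the difference between class-mean activations, and it uses synthetic Gaussian vectors in place of real LM hidden states. All names (`modal_difference_vector`, `synthetic_activations`, the cluster centers, the dimensionality) are hypothetical and chosen only for illustration.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def modal_difference_vector(acts_a, acts_b):
    """Difference of class means: one simple way to get a linear direction.

    Assumption: the paper may construct its vectors differently.
    """
    mu_a = mean_vector(acts_a)
    mu_b = mean_vector(acts_b)
    return [a - b for a, b in zip(mu_a, mu_b)]


def project(vec, direction):
    """Dot product of a vector with the direction."""
    return sum(v * d for v, d in zip(vec, direction))


def synthetic_activations(rng, center, n, noise=0.5):
    """Stand-in for LM hidden states: Gaussian noise around a class center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


rng = random.Random(0)
dim = 8
possible_center = [1.0] * dim      # hypothetical cluster for "possible" sentences
impossible_center = [-1.0] * dim   # hypothetical cluster for "impossible" sentences

# "Training" activations used to estimate the direction.
possible = synthetic_activations(rng, possible_center, 50)
impossible = synthetic_activations(rng, impossible_center, 50)
direction = modal_difference_vector(possible, impossible)

# Decision threshold: projection of the midpoint between the two class means.
midpoint = [
    (p + q) / 2
    for p, q in zip(mean_vector(possible), mean_vector(impossible))
]
threshold = project(midpoint, direction)


def classify(vec):
    """Label a vector by which side of the threshold its projection falls on."""
    return "possible" if project(vec, direction) > threshold else "impossible"


# Held-out synthetic activations to check that the direction separates the classes.
held_out = [(v, "possible") for v in synthetic_activations(rng, possible_center, 50)]
held_out += [(v, "impossible") for v in synthetic_activations(rng, impossible_center, 50)]
accuracy = sum(classify(v) == label for v, label in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

With real models, the synthetic vectors would be replaced by hidden states extracted for sentences from each modal category, and the projection onto the direction could then be compared against human ratings.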
Abstract
Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations, which we call modal difference vectors, that discriminate between modal categories within a variety of LMs. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through…