Do Language Models Encode Knowledge of Linguistic Constraint Violations?
Hardy, Sebastian Padó

TL;DR
This study investigates whether large language models encode specific representations of linguistic constraint violations and finds limited evidence supporting a unified detection mechanism within current models.
Contribution
The paper introduces an unsupervised framework for detecting violation-specific features in LLMs and evaluates whether such features are present across a range of linguistic phenomena.
Findings
The three falsification criteria are not jointly satisfied across phenomena.
No features are consistently shared across all categories.
Some phenomena show partial evidence of violation-specific features.
Abstract
Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the…
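The abstract does not give the formula for the sensitivity score, so the following is only a minimal sketch of what such a score could look like, not the authors' definition. It assumes non-negative sparse autoencoder feature activations, already averaged per sentence, and uses a normalized difference of means; the function names `sensitivity_scores` and `top_candidates` and the toy data are hypothetical.

```python
import numpy as np

def sensitivity_scores(acts_violated, acts_wellformed, eps=1e-8):
    """Per-feature contrast between violated and well-formed inputs.

    Both arguments are (n_sentences, n_features) arrays of non-negative
    SAE feature activations. This normalized difference of means is an
    assumed stand-in for the paper's score, which the abstract does not
    specify. Values lie roughly in [-1, 1]; values near 1 mean the
    feature fires almost only on violated inputs.
    """
    mean_v = acts_violated.mean(axis=0)
    mean_w = acts_wellformed.mean(axis=0)
    return (mean_v - mean_w) / (mean_v + mean_w + eps)

def top_candidates(scores, k=3):
    """Indices of the k features with the highest sensitivity score."""
    order = np.argsort(-scores)
    return order[:k].tolist()

# Toy activations: 2 sentences per condition, 3 features (made-up numbers).
acts_violated = np.array([[0.0, 2.0, 1.0],
                          [0.0, 4.0, 1.0]])
acts_wellformed = np.array([[1.0, 0.0, 1.0],
                            [3.0, 0.0, 1.0]])

scores = sensitivity_scores(acts_violated, acts_wellformed)
candidates = top_candidates(scores, k=1)
```

In this toy setup the second feature fires only on violated inputs and is ranked first, the first feature fires only on well-formed inputs, and the third is indifferent to the contrast. A ranking like this only nominates candidates; the paper's falsification criteria would still need to be applied to them.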
