Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions

Wesley Scivetti; Melissa Torgbi; Austin Blodgett; Mollie Shichman; Taylor Hudson; Claire Bonial; Harish Tayyar Madabushi

arXiv:2501.04661 · cs.CL · August 14, 2025

TL;DR

This paper introduces a diagnostic evaluation using Construction Grammar to assess whether large language models can generalize semantic understanding beyond common training data, revealing significant limitations in their ability to distinguish meanings in syntactically identical constructions.

Contribution

The study develops a novel dataset based on Construction Grammar to systematically evaluate semantic generalization in large language models, highlighting their failure to distinguish meanings in similar syntactic forms.

Findings

01. Models perform over 40% worse on the semantic task that requires distinguishing divergent meanings in syntactically identical constructions.

02. State-of-the-art models struggle to generalize constructional semantics.

03. The dataset and evaluation framework are publicly available.

Abstract

The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions:…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
