Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives
Ruchira Dhar, Qiwei Peng, Anders Søgaard

TL;DR
This paper investigates how large language models handle adjective-noun compositionality. It finds a gap between the models' internal representations and their actual task performance, which underscores the need for contrastive evaluation methods.
Contribution
It introduces a dual evaluation approach combining functional and representational analyses to better understand LLMs' compositional abilities.
Findings
LLMs develop compositional internal representations.
Models often fail to translate representations into task success.
Contrastive evaluation is crucial for comprehensive assessment.
Abstract
Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
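To make the contrast between the two setups concrete, the following is a minimal toy sketch, not the authors' method. It assumes a simple additive composition test (adjective vector plus noun vector compared to the phrase vector by cosine similarity) as a stand-in for the representational analysis, and plain answer accuracy as a stand-in for the prompt-based functional assessment. All function names, vectors, and answers here are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def additive_composition_score(adj_vec, noun_vec, phrase_vec):
    """Representational proxy (assumed): how close adj + noun is to the phrase vector."""
    composed = [a + n for a, n in zip(adj_vec, noun_vec)]
    return cosine(composed, phrase_vec)


def functional_accuracy(predictions, gold):
    """Functional proxy (assumed): fraction of prompt answers matching the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data: the phrase vector is exactly the sum of its parts,
# yet only half of the prompted answers are correct.
adj = [1.0, 0.0, 0.0]
noun = [0.0, 1.0, 0.0]
phrase = [1.0, 1.0, 0.0]

rep_score = additive_composition_score(adj, noun, phrase)
func_acc = functional_accuracy(
    ["yes", "no", "no", "yes"],    # hypothetical model answers
    ["yes", "yes", "yes", "yes"],  # hypothetical gold answers
)

print(f"representational score: {rep_score:.2f}")
print(f"functional accuracy:    {func_acc:.2f}")
```

In this toy setting the representational score is high while functional accuracy is only moderate, which mirrors the kind of divergence the abstract describes. Reporting both numbers side by side is one simple way to read "contrastive evaluation".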
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
