AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models
Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio

TL;DR
This paper introduces a new dataset of naturally occurring sentences containing multiword expressions (MWEs) in English and Portuguese, and evaluates language models' ability to detect idiomatic usage and to represent sentences containing idioms. It highlights room for improvement, especially in zero-shot scenarios.
Contribution
It provides a novel dataset of sentences containing MWEs, manually classified into fine-grained meanings in two languages, and assesses language models' performance on idiomaticity detection and representation tasks.
Findings
Models perform well in one-shot and few-shot scenarios for idiom detection.
Zero-shot performance on idiom detection is significantly limited.
Fine-tuning improves the representation of sentences containing MWEs.
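To make the zero-shot, one-shot, and few-shot distinction concrete, here is a minimal sketch of how such splits could be built. It assumes that "zero-shot" means the MWEs in the test set never appear in the training data, and that "one-shot" and "few-shot" mean one or a few training examples per test MWE. That reading, the function name `split_by_setting`, and the toy sentences are illustrative assumptions, not the paper's actual protocol or data.

```python
from collections import defaultdict

# Toy examples invented for illustration: (mwe, sentence, label).
EXAMPLES = [
    ("big fish", "He is a big fish in local politics.", "idiomatic"),
    ("big fish", "We caught a big fish at the lake.", "literal"),
    ("big fish", "She became a big fish in the industry.", "idiomatic"),
    ("night owl", "My sister is a night owl.", "idiomatic"),
    ("night owl", "A night owl hooted outside.", "literal"),
]


def split_by_setting(examples, test_mwes, shots):
    """Split examples into (train, test) for a given number of shots.

    Examples whose MWE is not in test_mwes always go to train.
    For each MWE in test_mwes, the first `shots` examples go to train
    and the rest go to test. shots=0 gives the assumed zero-shot
    setting, shots=1 one-shot, and a small shots > 1 few-shot.
    """
    train, test = [], []
    moved = defaultdict(int)  # test-MWE examples already placed in train
    for mwe, sentence, label in examples:
        if mwe not in test_mwes:
            train.append((mwe, sentence, label))
        elif moved[mwe] < shots:
            train.append((mwe, sentence, label))
            moved[mwe] += 1
        else:
            test.append((mwe, sentence, label))
    return train, test


if __name__ == "__main__":
    for shots in (0, 1):
        train, test = split_by_setting(EXAMPLES, {"big fish"}, shots)
        print(shots, len(train), len(test))
```

Under this reading, the zero-shot setting forces a model to generalize idiomaticity to expressions it has never been trained on, which is where the findings report the weakest performance.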
Abstract
Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
