AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

Harish Tayyar Madabushi; Edward Gow-Smith; Carolina Scarton; Aline Villavicencio

arXiv:2109.04413 · cs.CL · September 10, 2021

TL;DR

This paper introduces a new dataset of sentences containing multiword expressions (MWEs) in English and Portuguese. It evaluates language models' ability to detect idiomatic usage and to represent sentences containing idioms, and it highlights room for improvement, especially in zero-shot scenarios.

Contribution

The paper provides a novel dataset of naturally occurring sentences containing MWEs, manually classified by meaning, in English and Portuguese, and assesses language models' performance on idiomaticity detection and sentence representation tasks.

Findings

01 Models perform well in one-shot and few-shot scenarios for idiom detection.

02 Zero-shot performance on idiom detection is significantly limited.

03 Fine-tuning improves sentence representation of MWEs.

Abstract

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments…

Code & Models

Repositories

h-tayyarmadabushi/astitchinlanguagemodels
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications