Is Less More? Quality, Quantity and Context in Idiom Processing with Natural Language Models
Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas Pickard, Dylan Phelps, Aline Villavicencio

TL;DR
This paper investigates how data quality and quantity affect language models' ability to process idiomatic expressions, emphasizing the importance of dataset quality for context-aware models and the role of data quantity otherwise.
Contribution
It introduces the NCSSB datasets for idiomaticity detection and analyzes the impact of data quality and quantity on model performance with different contextual strategies.
Findings
Dataset quality significantly improves context-enriched model performance.
Data quantity influences models without contextual information.
Contextual information enhances idiomaticity detection accuracy.
Abstract
Compositionality in language models presents a problem when processing idiomatic expressions, as their meaning often cannot be directly derived from their individual parts. Although fine-tuning and other optimization strategies can be used to improve representations of idiomatic expressions, this depends on the availability of relevant data. We present the Noun Compound Synonym Substitution in Books (NCSSB) datasets, which are created by substitution of synonyms of potentially idiomatic English noun compounds in public domain book texts. We explore the trade-off between data quantity and quality when training models for idiomaticity detection, in conjunction with contextual information obtained locally (from the surrounding sentences) or externally (through language resources). Performance on an idiomaticity detection task indicates that dataset quality is a stronger factor for…
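The abstract describes two mechanical steps: swapping a phrase for its counterpart in book text, and attaching local context from the surrounding sentences. The sketch below illustrates both on a toy passage. It is not the authors' pipeline: the function names, the naive sentence splitter, the one-sentence context window, the example phrase pair, and the direction of the substitution (synonym replaced by the noun compound) are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def substitute_with_context(text, source, target, window=1):
    """Build one example per sentence that contains `source`.

    Each example holds the sentence with `target` swapped in, plus up to
    `window` neighbouring sentences on each side as local context.
    """
    sentences = split_sentences(text)
    pattern = re.compile(r"\b" + re.escape(source) + r"\b", re.IGNORECASE)
    examples = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            examples.append({
                "previous": sentences[max(0, i - window):i],
                "target": pattern.sub(target, sentence),
                "next": sentences[i + 1:i + 1 + window],
            })
    return examples


# Hypothetical passage and phrase pair, not taken from the NCSSB data.
passage = (
    "She arrived before dawn. "
    "Everyone called her the hard worker of the team. "
    "Nobody was surprised."
)
examples = substitute_with_context(passage, "hard worker", "eager beaver")
```

A context-free variant would keep only the `target` field of each example, which is one way to picture the contrast the paper draws between models with and without contextual information.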
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Linguistics and terminology studies
