CoAM: Corpus of All-Type Multiword Expressions
Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe

TL;DR
This paper introduces CoAM, a dataset of 1.3K sentences in which multiword expressions (MWEs) are annotated and tagged with types such as Noun and Verb, enabling fine-grained evaluation and error analysis of MWE identification methods.
Contribution
The paper contributes CoAM, a corpus of all-type MWE annotations and the first MWE identification dataset whose MWEs are tagged with types, together with a new annotation interface built with the authors' interface generator.
Findings
A fine-tuned large language model outperforms MWEasWSD on MWE identification.
Verb MWEs are easier to identify than Noun MWEs.
CoAM enables detailed error analysis based on MWE types.
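The type-based analysis mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical data layout that is not taken from the paper: gold MWEs are (sentence id, token indices, type) triples, and system predictions are untyped (sentence id, token indices) pairs, so only recall can be broken down by type. The function name, the exact-match criterion, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def recall_by_type(gold, predicted_spans):
    """Exact-match recall per gold MWE type.

    gold: iterable of (sentence_id, token_indices, mwe_type)
    predicted_spans: iterable of (sentence_id, token_indices), untyped
    Both layouts are hypothetical, chosen only for this sketch.
    """
    # Normalize predictions so token order does not affect matching.
    predicted = {(sid, frozenset(idx)) for sid, idx in predicted_spans}
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sid, idx, mwe_type in gold:
        totals[mwe_type] += 1
        # A gold MWE counts as found only if all its tokens match exactly.
        if (sid, frozenset(idx)) in predicted:
            hits[mwe_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Toy data invented for illustration; it does not come from CoAM.
gold = [
    (0, [1, 2], "Verb"),
    (0, [5, 6], "Noun"),
    (1, [0, 3], "Verb"),  # discontinuous MWE
    (1, [4, 5], "Noun"),
]
predicted = [(0, [1, 2]), (1, [0, 3]), (1, [4, 5]), (1, [7, 8])]

print(recall_by_type(gold, predicted))
```

On this toy input both Verb MWEs are recovered and one of the two Noun MWEs is missed, which is the kind of per-type difference such a breakdown makes visible.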
Abstract
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset of MWE identification, CoAM's MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
