CoAM: Corpus of All-Type Multiword Expressions

Yusuke Ide; Joshua Tanner; Adam Nohejl; Jacob Hoffman; Justin Vasselli; Hidetaka Kamigaito; Taro Watanabe

arXiv:2412.18151 · cs.CL · July 11, 2025

TL;DR

This paper introduces CoAM, a comprehensive and high-quality dataset of 1.3K sentences with multiword expressions tagged with types, enabling detailed evaluation and analysis of MWE identification methods.

Contribution

The paper contributes CoAM, a dataset covering all types of MWEs and the first MWE identification dataset whose MWEs are tagged with MWE types, together with a new annotation interface, advancing MWE research and evaluation.

Findings

01  A fine-tuned large language model outperforms MWEasWSD on MWE identification.

02  Verb MWEs are easier to identify than Noun MWEs.

03  CoAM enables detailed error analysis based on MWE types.
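
To make the idea of type-based error analysis concrete, here is a minimal sketch of computing recall per MWE type. It is not the authors' evaluation code: the function name `per_type_recall`, the tuple layout, the exact-match criterion, and the toy data are all assumptions made for illustration. Only the type labels Noun and Verb come from the paper's description.

```python
from collections import defaultdict


def per_type_recall(gold, predicted):
    """Compute recall separately for each gold MWE type.

    gold:      list of (sentence_id, token_indices, mwe_type)
    predicted: list of (sentence_id, token_indices)
    A gold MWE counts as found only if a prediction covers exactly the
    same token indices in the same sentence (an assumed criterion).
    """
    predicted_spans = {(sid, frozenset(idx)) for sid, idx in predicted}
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sid, idx, mwe_type in gold:
        totals[mwe_type] += 1
        if (sid, frozenset(idx)) in predicted_spans:
            hits[mwe_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Toy data, invented for this sketch (not taken from CoAM).
gold = [
    (0, [1, 2], "Verb"),
    (0, [5, 6], "Noun"),
    (1, [0, 3], "Verb"),  # discontinuous MWE: tokens 0 and 3
    (1, [4, 5], "Noun"),
]
predicted = [(0, [1, 2]), (1, [0, 3]), (1, [4, 5]), (1, [7, 8])]

recall = per_type_recall(gold, predicted)
print(recall)
```

The sketch reports recall rather than precision per type because it assumes predictions carry no type label, so a false positive such as the last prediction cannot be attributed to any type.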

Abstract

Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences. To enhance data quality, it was constructed through a multi-step process consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset for MWE identification, CoAM's MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible…

Code & Models

Datasets

yusuke196/CoAM
dataset · 6 downloads

Videos

CoAM: Corpus of All-Type Multiword Expressions · underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling