TL;DR
This paper introduces a boundary-based text partitioning algorithm for multiword expression segmentation, achieving state-of-the-art results across 19 languages with a fast, scalable, and open-source approach.
Contribution
The novel use of boundary tokens in a bottom-up text partitioning model significantly improves MWE segmentation performance across multiple languages and domains.
Findings
Outperforms recent MWE segmentation methods on shared-task data
Effective across 19 languages and various domains
Fast and scalable implementation with open-source software
Abstract
This work presents a fine-grained text-chunking algorithm designed for the task of multiword expression (MWE) segmentation. As a lexical class, MWEs include a wide variety of idioms, whose automatic identification is a necessity for the handling of colloquial language. This algorithm's core novelty is its use of non-word tokens, i.e., boundaries, in a bottom-up strategy. Leveraging boundaries refines token-level information, forging high-level performance from relatively basic data. The generality of this model's feature space allows for its application across languages and domains. Experiments spanning 19 different languages exhibit a broadly applicable, state-of-the-art model. Evaluation against recent shared-task data places text partitioning as the overall best-performing MWE segmentation algorithm, covering all MWE classes and multiple English domains (including user-generated…
