Boundary-based MWE segmentation with text partitioning

Jake Ryland Williams

arXiv:1608.02025 · cs.CL · June 12, 2017

TL;DR

This paper introduces a boundary-based text partitioning algorithm for multiword expression segmentation, achieving state-of-the-art results across 19 languages with a fast, scalable, and open-source approach.

Contribution

The novel use of boundary tokens in a bottom-up text partitioning model significantly improves MWE segmentation performance across multiple languages and domains.

Findings

1. Outperforms recent MWE segmentation methods on shared-task data
2. Effective across 19 languages and various domains
3. Fast and scalable implementation with open-source software

Abstract

This work presents a fine-grained text-chunking algorithm designed for the task of multiword expression (MWE) segmentation. As a lexical class, MWEs include a wide variety of idioms, whose automatic identification is a necessity for the handling of colloquial language. This algorithm's core novelty is its use of non-word tokens, i.e., boundaries, in a bottom-up strategy. Leveraging boundaries refines token-level information, forging high-level performance from relatively basic data. The generality of this model's feature space allows for its application across languages and domains. Experiments spanning 19 different languages exhibit a broadly applicable, state-of-the-art model. Evaluation against recent shared-task data places text partitioning as the overall best-performing MWE segmentation algorithm, covering all MWE classes and multiple English domains (including user-generated…
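The boundary idea in the abstract can be illustrated with a toy sketch. This is not the paper's model and not the API of the partitioner repository: the function names `tokenize`, `train`, and `partition`, the (left word, boundary, right word) context, and the threshold `q=0.5` are all assumptions made for illustration. The sketch treats each non-word token between two words as a boundary, estimates from gold segmentations how often that boundary is bound (inside an MWE), and then builds segments bottom-up by joining words across boundaries that are bound more often than not.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+|\W+")  # alternating word / non-word (boundary) tokens
WORD = re.compile(r"\w+")


def tokenize(text):
    """Split text into word tokens and non-word boundary tokens."""
    return TOKEN.findall(text)


def train(gold):
    """Count bound vs. total occurrences of each boundary context.

    gold: list of sentences, each a list of segments separated by single
    spaces. Assumes (for this toy) that every segment starts and ends with
    a word token.
    """
    bound, total = Counter(), Counter()
    for segments in gold:
        for k, seg in enumerate(segments):
            toks = tokenize(seg)
            # Boundaries inside a segment are bound.
            for i in range(1, len(toks) - 1, 2):
                key = (toks[i - 1].lower(), toks[i], toks[i + 1].lower())
                bound[key] += 1
                total[key] += 1
            # The boundary between this segment and the next is broken.
            if k + 1 < len(segments):
                nxt = tokenize(segments[k + 1])
                total[(toks[-1].lower(), " ", nxt[0].lower())] += 1
    return bound, total


def partition(text, bound, total, q=0.5):
    """Segment text by binding boundaries whose bound rate exceeds q."""
    toks = tokenize(text)
    segments, current = [], ""
    for i, tok in enumerate(toks):
        if WORD.fullmatch(tok):
            current += tok
            continue
        left = toks[i - 1].lower() if i > 0 else ""
        right = toks[i + 1].lower() if i + 1 < len(toks) else ""
        key = (left, tok, right)
        p = bound[key] / total[key] if total[key] else 0.0
        if current and p > q:
            current += tok  # bind: the boundary stays inside the segment
        else:
            if current:
                segments.append(current)
            current = ""
            if tok.strip():  # keep punctuation as its own segment
                segments.append(tok.strip())
    if current:
        segments.append(current)
    return segments
```

With hypothetical training data such as `[["He", "kicked the bucket", "yesterday"], ["She", "kicked the bucket", "too"], ["kicked", "the", "ball"]]`, calling `partition("They kicked the bucket, sadly", bound, total)` returns `["They", "kicked the bucket", ",", "sadly"]`. Unseen boundaries default to broken, which keeps the toy conservative.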

Code & Models

Repositories

jakerylandwilliams/partitioner (official)